Replace missing values with column mean

RMissing DataImputation

R Problem Overview


I am not sure how to loop over each column to replace the NA values with the column mean. When I am trying to replace for one column using the following, it works well.

Column1[is.na(Column1)] <- round(mean(Column1, na.rm = TRUE))

The code for looping over columns is not working:

for(i in 1:ncol(data)){
    data[i][is.na(data[i])] <- round(mean(data[i], na.rm = TRUE))
}

the values are not replaced. Can someone please help me with this?

R Solutions


Solution 1 - R

A relatively simple modification of your code should solve the issue:

for(i in 1:ncol(data)){
  data[is.na(data[,i]), i] <- mean(data[,i], na.rm = TRUE)
}

Solution 2 - R

If DF is your data frame of numeric columns:

library(zoo)
na.aggregate(DF)

ADDED:

Using only the base of R define a function which does it for one column and then lapply to every column:

NA2mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))
replace(DF, TRUE, lapply(DF, NA2mean))

The last line could be replaced with the following if it's OK to overwrite the input:

DF[] <- lapply(DF, NA2mean)

Solution 3 - R

There is also quick solution using the imputeTS package:

library(imputeTS)
na_mean(yourDataFrame)

Solution 4 - R

dplyr's mutate_all or mutate_at could be useful here:

library(dplyr)                                                             
                                                                           
set.seed(10)                                                               
df <- data.frame(a = sample(c(NA, 1:3)    , replace = TRUE, 10),           
                 b = sample(c(NA, 101:103), replace = TRUE, 10),                            
                 c = sample(c(NA, 201:203), replace = TRUE, 10))                            
                                                                           
df         
                                                            
#>     a   b   c
#> 1   2 102 203
#> 2   1 102 202
#> 3   1  NA 203
#> 4   2 102 201
#> 5  NA 101 201
#> 6  NA 101 202
#> 7   1  NA 203
#> 8   1 101  NA
#> 9   2 101 203
#> 10  1 103 201
                                                                           
df %>% mutate_all(~ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x))          

#>        a       b        c
#> 1  2.000 102.000 203.0000
#> 2  1.000 102.000 202.0000
#> 3  1.000 101.625 203.0000
#> 4  2.000 102.000 201.0000
#> 5  1.375 101.000 201.0000
#> 6  1.375 101.000 202.0000
#> 7  1.000 101.625 203.0000
#> 8  1.000 101.000 202.1111
#> 9  2.000 101.000 203.0000
#> 10 1.000 103.000 201.0000
                                                                           
df %>% mutate_at(vars(a, b),~ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x))

#>        a       b   c
#> 1  2.000 102.000 203
#> 2  1.000 102.000 202
#> 3  1.000 101.625 203
#> 4  2.000 102.000 201
#> 5  1.375 101.000 201
#> 6  1.375 101.000 202
#> 7  1.000 101.625 203
#> 8  1.000 101.000  NA
#> 9  2.000 101.000 203
#> 10 1.000 103.000 201

Solution 5 - R

To add to the alternatives, using @akrun's sample data, I would do the following:

d1[] <- lapply(d1, function(x) { 
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
})
d1

Solution 6 - R

lapply can be used instead of a for loop.

d1[] <- lapply(d1, function(x) ifelse(is.na(x), mean(x, na.rm = TRUE), x))

This doesn't really have any advantages over the for loop, though maybe it's easier if you have non-numeric columns as well, in which case

d1[sapply(d1, is.numeric)] <- lapply(d1[sapply(d1, is.numeric)], function(x) ifelse(is.na(x), mean(x, na.rm = TRUE), x))

is almost as easy.

Solution 7 - R

You could also try:

 cM <- colMeans(d1, na.rm=TRUE)
 indx <- which(is.na(d1), arr.ind=TRUE)
 d1[indx] <- cM[indx[,2]]
 d1  

###data set.seed(42) d1 <- as.data.frame(matrix(sample(c(NA,0:5), 5*10, replace=TRUE), ncol=10))

Solution 8 - R

A one-liner using tidyr's replace_na is

library(tidyr)
replace_na(mtcars,as.list(colMeans(mtcars,na.rm=T)))

If your df has columns that are non-numeric, this takes a little bit more work than a one-liner.

mean_to_fill <- select_if(ungroup(df), is.numeric) %>%
 colMeans(na.rm=T)

bind_cols(select(df, group1, group2, group3),
          select_if(ungroup(df), is.numeric) %>% 
            tidyr::replace_na(as.list(mean_to_fill))
          ) 

Solution 9 - R

Go simply with Zoo, it will simply replace all NA values with mean of the column values:

library(zoo)
na.aggregate(data) 

Solution 10 - R

# Lets say I have a dataframe , df as following -
df <- data.frame(a=c(2,3,4,NA,5,NA),b=c(1,2,3,4,NA,NA))

# create a custom function
fillNAwithMean <- function(x){
    na_index <- which(is.na(x))        
    mean_x <- mean(x, na.rm=T)
    x[na_index] <- mean_x
    return(x)
}

(df <- apply(df,2,fillNAwithMean))
   a   b
2.0 1.0
3.0 2.0
4.0 3.0
3.5 4.0
5.0 2.5
3.5 2.5

Solution 11 - R

Similar to the answer pointed out by @Thomas, This can also be done using ifelse() method of R:

for(i in 1:ncol(data)){
  data[,i]=ifelse(is.na(data[,i]),
                  ave(data[,i],FUN=function(y) mean(y, na.rm = TRUE)),
                  data[,i])
}

where, Arguments to ifelse(TEST, YES , NO) are:-

TEST- logical condition to be checked

YES- executed if the condition is True

NO- else when the condition is False

and ave(x, ..., FUN = mean) is method in R used for calculating averages of subsets of x[]

Solution 12 - R

With the data.table package you could use the set() function and loop over the columns and replace the NAs or whatever you like with an aggregate or value of your choice (here: mean):

require(data.table)

# data
dt = copy(iris[ ,-5])
setDT(dt)
dt[1:4, Sepal.Length := NA] # introduce NAs

# replace NAs with mean (or whatever function you like)
for (j in seq_along(names(dt))) {
  set(dt,
      i = which(is.na(dt[[j]])),
      j = j, 
      value = mean(dt[[j]], na.rm = TRUE))
}

Solution 13 - R

R has a function just for this purpose: replace_na

meanvals <- as.list(colMeans(data, na.rm = TRUE))
data <- replace_na(data, meanvals)

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionNikitaView Question on Stackoverflow
Solution 1 - RThomasView Answer on Stackoverflow
Solution 2 - RG. GrothendieckView Answer on Stackoverflow
Solution 3 - RSteffen MoritzView Answer on Stackoverflow
Solution 4 - RzackView Answer on Stackoverflow
Solution 5 - RA5C1D2H2I1M1N2O1R2T1View Answer on Stackoverflow
Solution 6 - RIstaView Answer on Stackoverflow
Solution 7 - RakrunView Answer on Stackoverflow
Solution 8 - RMarcus RittView Answer on Stackoverflow
Solution 9 - RpariView Answer on Stackoverflow
Solution 10 - RshekharView Answer on Stackoverflow
Solution 11 - RAseem YadavView Answer on Stackoverflow
Solution 12 - RandscharView Answer on Stackoverflow
Solution 13 - RDenis KazakovView Answer on Stackoverflow