Combine column to remove NA's

RMergeNa

R Problem Overview


I have some columns in R and for each row there will only ever be a value in one of them, the rest will be NA's. I want to combine these into one column with the non-NA value. Does anyone know of an easy way of doing this. For example I could have as follows:

data <- data.frame('a' = c('A','B','C','D','E'),
                   'x' = c(1,2,NA,NA,NA),
                   'y' = c(NA,NA,3,NA,NA),
                   'z' = c(NA,NA,NA,4,5))

So I would have

'a' 'x' 'y' 'z'  
 A   1   NA  NA  
 B   2   NA  NA  
 C  NA   3   NA  
 D  NA   NA  4  
 E  NA   NA  5

And I would to get

 'a' 'mycol'  
  A   1  
  B   2  
  C   3  
  D   4  
  E   5  

The names of the columns containing NA changes depending on code earlier in the query so I won't be able to call the column names explicitly, but I have the column names of the columns which contains NA's stored as a vector e.g. in this example cols <- c('x','y','z'), so could call the columns using data[, cols].

Any help would be appreciated.

Thanks

R Solutions


Solution 1 - R

A dplyr::coalesce based solution could be as:

data %>% mutate(mycol = coalesce(x,y,z)) %>%
         select(a, mycol)
#   a mycol
# 1 A     1
# 2 B     2
# 3 C     3
# 4 D     4
# 5 E     5 

Data

data <- data.frame('a' = c('A','B','C','D','E'),
                 'x' = c(1,2,NA,NA,NA),
                 'y' = c(NA,NA,3,NA,NA),
                 'z' = c(NA,NA,NA,4,5))

Solution 2 - R

You can use unlist to turn the columns into one vector. Afterwards, na.omit can be used to remove the NAs.

cbind(data[1], mycol = na.omit(unlist(data[-1])))

   a mycol
x1 A     1
x2 B     2
y3 C     3
z4 D     4
z5 E     5

Solution 3 - R

Here's a more general (but even simpler) solution which extends to all column types (factors, characters etc.) with non-ordered NA's. The strategy is simply to merge the non-NA values of other columns into your merged column using is.na for indexing:

data$m = data$x  # your new merged column start with x
data$m[!is.na(data$y)] = data$y[!is.na(data$y)]  # merge with y
data$m[!is.na(data$z)] = data$z[!is.na(data$z)]  # merge with z

> data
  a  x  y  z m
1 A  1 NA NA 1
2 B  2 NA NA 2
3 C NA  3 NA 3
4 D NA NA  4 4
5 E NA NA  5 5

Note that this will overwrite existing values in m if there are several non-NA values in the same row. If you have a lot of columns you could automate this by looping over colnames(data).

Solution 4 - R

I would use rowSums() with the na.rm = TRUE argument:

cbind.data.frame(a=data$a, mycol = rowSums(data[, -1], na.rm = TRUE))

which gives:

> cbind.data.frame(a=data$a, mycol = rowSums(data[, -1], na.rm = TRUE))
  a mycol
1 A     1
2 B     2
3 C     3
4 D     4
5 E     5

You have to call the method directly (cbind.data.frame) as the first argument above is not a data frame.

Solution 5 - R

Something like this ?

data.frame(a=data$a, mycol=apply(data[,-1],1,sum,na.rm=TRUE))

gives :

  a mycol
1 A     1
2 B     2
3 C     3
4 D     4
5 E     5

Solution 6 - R

max works too. Also works on strings vectors.

cbind(data[1], mycol=apply(data[-1], 1, max, na.rm=T))

Solution 7 - R

In a related link (https://stackoverflow.com/questions/13673894/suppress-nas-in-paste/31508774#31508774) I present a version of paste with a na.rm option (with the unfortunate name of paste5).

With this the code becomes

cols <- c("x", "y", "z")
cbind.data.frame(a = data$a, mycol = paste2(data[, cols], na.rm = TRUE))

The output of paste5 is a character, which works if you have character data otherwise you'll need to coerce to the type you want.

Solution 8 - R

Though this is not the OP case, it seems some people like the approach based on sums, how about thinking in mean and mode, to make the answer more universal. This answer matches the title, which is what many people will find.

data <- data.frame('a' = c('A','B','C','D','E'),
                   'x' = c(1,2,NA,NA,9),
                   'y' = c(NA,6,3,NA,5),
                   'z' = c(NA,NA,NA,4,5))

splitdf<-split(data[,c(2:4)], seq(nrow(data[,c(2:4)])))

data$mean<-unlist(lapply(splitdf, function(x)  mean(unlist(x), na.rm=T) ) )
data$mode<-unlist(lapply(splitdf, function(x)  {
  tab <- tabulate(match(x, na.omit(unique(unlist(x) )))); 
                  paste(na.omit(unique(unlist(x) ))[tab == max(tab) ], collapse = ", " )}) )

data
  a  x  y  z     mean mode
1 A  1 NA NA 1.000000    1
2 B  2  6 NA 4.000000 2, 6
3 C NA  3 NA 3.000000    3
4 D NA NA  4 4.000000    4
5 E  9  5  5 6.333333    5

Solution 9 - R

One possibility using dplyr and tidyr could be:

data %>%
 gather(variables, mycol, -1, na.rm = TRUE) %>%
 select(-variables)

   a mycol
1  A     1
2  B     2
8  C     3
14 D     4
15 E     5

Here it transforms the data from wide to long format, excluding the first column from this operation and removing the NAs.

Solution 10 - R

If you want to stick with base,

data <- data.frame('a' = c('A','B','C','D','E'),'x' = c(1,2,NA,NA,NA),'y' = c(NA,NA,3,NA,NA),'z' = c(NA,NA,NA,4,5))
data[is.na(data)]<-","
data$mycol<-paste0(data$x,data$y,data$z)
data$mycol <- gsub(',','',data$mycol)

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
Questionuser1165199View Question on Stackoverflow
Solution 1 - RMKRView Answer on Stackoverflow
Solution 2 - RSven HohensteinView Answer on Stackoverflow
Solution 3 - RJonas LindeløvView Answer on Stackoverflow
Solution 4 - RGavin SimpsonView Answer on Stackoverflow
Solution 5 - RjubaView Answer on Stackoverflow
Solution 6 - RdrollixView Answer on Stackoverflow
Solution 7 - RJWillimanView Answer on Stackoverflow
Solution 8 - RFerroaoView Answer on Stackoverflow
Solution 9 - RtmfmnkView Answer on Stackoverflow
Solution 10 - RBharadwaj A KView Answer on Stackoverflow