Cleaning `Inf` values from an R dataframe

RDataframedata.table

R Problem Overview


In R, I have an operation which creates some Inf values when I transform a dataframe.

I would like to turn these Inf values into NA values. The code I have is slow for large data, is there a faster way of doing this?

Say I have the following dataframe:

dat <- data.frame(a=c(1, Inf), b=c(Inf, 3), d=c("a","b"))

The following works in a single case:

 dat[,1][is.infinite(dat[,1])] = NA

So I generalized it with following loop

cf_DFinf2NA <- function(x)
{
	for (i in 1:ncol(x)){
		  x[,i][is.infinite(x[,i])] = NA
	}
	return(x)
}

But I don't think that this is really using the power of R.

R Solutions


Solution 1 - R

Option 1

Use the fact that a data.frame is a list of columns, then use do.call to recreate a data.frame.

do.call(data.frame,lapply(DT, function(x) replace(x, is.infinite(x),NA)))

Option 2 -- data.table

You could use data.table and set. This avoids some internal copying.

DT <- data.table(dat)
invisible(lapply(names(DT),function(.name) set(DT, which(is.infinite(DT[[.name]])), j = .name,value =NA)))

Or using column numbers (possibly faster if there are a lot of columns):

for (j in 1:ncol(DT)) set(DT, which(is.infinite(DT[[j]])), j, NA)

Timings

# some `big(ish)` data
dat <- data.frame(a = rep(c(1,Inf), 1e6), b = rep(c(Inf,2), 1e6), 
                  c = rep(c('a','b'),1e6),d = rep(c(1,Inf), 1e6),  
                  e = rep(c(Inf,2), 1e6))
# create data.table
library(data.table)
DT <- data.table(dat)

# replace (@mnel)
system.time(na_dat <- do.call(data.frame,lapply(dat, function(x) replace(x, is.infinite(x),NA))))
## user  system elapsed 
#  0.52    0.01    0.53 

# is.na (@dwin)
system.time(is.na(dat) <- sapply(dat, is.infinite))
# user  system elapsed 
# 32.96    0.07   33.12 

# modified is.na
system.time(is.na(dat) <- do.call(cbind,lapply(dat, is.infinite)))
#  user  system elapsed 
# 1.22    0.38    1.60 


# data.table (@mnel)
system.time(invisible(lapply(names(DT),function(.name) set(DT, which(is.infinite(DT[[.name]])), j = .name,value =NA))))
# user  system elapsed 
# 0.29    0.02    0.31 

data.table is the quickest. Using sapply slows things down noticeably.

Solution 2 - R

Use sapply and is.na<-

> dat <- data.frame(a=c(1, Inf), b=c(Inf, 3), d=c("a","b"))
> is.na(dat) <- sapply(dat, is.infinite)
> dat
   a  b d
1  1 NA a
2 NA  3 b

Or you can use (giving credit to @mnel, whose edit this is),

> is.na(dat) <- do.call(cbind,lapply(dat, is.infinite))

which is significantly faster.

Solution 3 - R

Here is a dplyr/tidyverse solution using the na_if() function:

dat %>% mutate_if(is.numeric, list(~na_if(., Inf)))

Note that this only replaces positive infinity with NA. Need to repeat if negative infinity values also need to be replaced.

dat %>% mutate_if(is.numeric, list(~na_if(., Inf))) %>% 
  mutate_if(is.numeric, list(~na_if(., -Inf)))

Solution 4 - R

[<- with mapply is a bit faster than sapply.

> dat[mapply(is.infinite, dat)] <- NA

With mnel's data, the timing is

> system.time(dat[mapply(is.infinite, dat)] <- NA)
#   user  system elapsed 
# 15.281   0.000  13.750 

Solution 5 - R

There is very simple solution to this problem in the hablar package:

library(hablar)

dat %>% rationalize()

Which return a data frame with all Inf are converted to NA.

Timings compared to some above solutions. Code: library(hablar) library(data.table)

dat <- data.frame(a = rep(c(1,Inf), 1e6), b = rep(c(Inf,2), 1e6), 
                  c = rep(c('a','b'),1e6),d = rep(c(1,Inf), 1e6),  
                  e = rep(c(Inf,2), 1e6))
DT <- data.table(dat)

system.time(dat[mapply(is.infinite, dat)] <- NA)
system.time(dat[dat==Inf] <- NA)
system.time(invisible(lapply(names(DT),function(.name) set(DT, which(is.infinite(DT[[.name]])), j = .name,value =NA))))
system.time(rationalize(dat))

Result:

> system.time(dat[mapply(is.infinite, dat)] <- NA)
   user  system elapsed 
  0.125   0.039   0.164 
> system.time(dat[dat==Inf] <- NA)
   user  system elapsed 
  0.095   0.010   0.108 
> system.time(invisible(lapply(names(DT),function(.name) set(DT, which(is.infinite(DT[[.name]])), j = .name,value =NA))))
   user  system elapsed 
  0.065   0.002   0.067 
> system.time(rationalize(dat))
   user  system elapsed 
  0.058   0.014   0.072 
> 

Seems like data.table is faster than hablar. But has longer syntax.

Solution 6 - R

Feng Mai has a tidyverse answer above to get negative and positive infinities:

dat %>% mutate_if(is.numeric, list(~na_if(., Inf))) %>% 
  mutate_if(is.numeric, list(~na_if(., -Inf)))

This works well, but a word of warning is not to swap in abs(.) here to do both lines at once as is proposed in an upvoted comment. It will look like it works, but changes all negative values in the dataset to positive! You can confirm with this:

data(iris)
#The last line here is bad - it converts all negative values to positive
iris %>% 
  mutate_if(is.numeric, ~scale(.)) %>%
  mutate(infinities = Sepal.Length / 0) %>%
  mutate_if(is.numeric, list(~na_if(abs(.), Inf)))

For one line, this works:

  mutate_if(is.numeric, ~ifelse(abs(.) == Inf,NA,.))

Solution 7 - R

Inside a dplyr pipe chain, you can do this.

%>% mutate_all(.,.funs = function(x){ifelse(is.infinite(x),NA,x)}) %>%

I find it simple, elegant and fast.

Solution 8 - R

There are many answers already, but would like to add that for me this tidyverse solution always worked well:

%>% mutate_all(function(x) ifelse(is.nan(x) | is.infinite(x), NA, x)) %>%

Solution 9 - R

Another solution:

    dat <- data.frame(a = rep(c(1,Inf), 1e6), b = rep(c(Inf,2), 1e6), 
                      c = rep(c('a','b'),1e6),d = rep(c(1,Inf), 1e6),  
                      e = rep(c(Inf,2), 1e6))
    system.time(dat[dat==Inf] <- NA)

#   user  system elapsed
#  0.316   0.024   0.340

Solution 10 - R

Also, if someone need the Infs' coordinates, can do this:

library(rlist)
list.clean(apply(df, 2, function(x){which(is.infinite(x))}), function(x) length(x) == 0L, TRUE)

Result:

$colname1
[1] row1 row2 ...
$colname2
[2] row1 row2 ... 

With this information, you can replace the Inf values in particular places with the mean, median, or whatever operator that you want.

For example (for element 01):

repInf = list.clean(apply(df, 2, function(x){which(is.infinite(x))}), function(x) length(x) == 0L, TRUE)
df[repInf[[1]], names(repInf)[[1]]] = median or mean(is.finite(df[ ,names(repInf)[[1]]]), na.rm = TRUE)

In loop:

for (nonInf in 1:length(repInf)) {
df[repInf[[nonInf]], names(repInf)[[nonInf]]] = mean(is.finite(df[ , names(repInf)[[nonInf]]]))
}

Solution 11 - R

You may also use the handy replace_na function: https://tidyr.tidyverse.org/reference/replace_na.html

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionricardoView Question on Stackoverflow
Solution 1 - RmnelView Answer on Stackoverflow
Solution 2 - RIRTFMView Answer on Stackoverflow
Solution 3 - RFeng MaiView Answer on Stackoverflow
Solution 4 - RRich ScrivenView Answer on Stackoverflow
Solution 5 - RdavsjobView Answer on Stackoverflow
Solution 6 - RMark E.View Answer on Stackoverflow
Solution 7 - RSagarView Answer on Stackoverflow
Solution 8 - RToWiiView Answer on Stackoverflow
Solution 9 - RStudentView Answer on Stackoverflow
Solution 10 - RMirko Bozanic LealView Answer on Stackoverflow
Solution 11 - RGang SuView Answer on Stackoverflow