Cleaning `Inf` values from an R dataframe
R Problem Overview
In R, I have an operation which creates some Inf values when I transform a dataframe.
I would like to turn these Inf values into NA values. The code I have is slow for large data. Is there a faster way of doing this?
Say I have the following dataframe:
dat <- data.frame(a=c(1, Inf), b=c(Inf, 3), d=c("a","b"))
The following works in a single case:
dat[,1][is.infinite(dat[,1])] = NA
So I generalized it with the following loop:
cf_DFinf2NA <- function(x)
{
for (i in 1:ncol(x)){
x[,i][is.infinite(x[,i])] = NA
}
return(x)
}
But I don't think that this is really using the power of R.
R Solutions
Solution 1 - R
Option 1
Use the fact that a data.frame is a list of columns, then use do.call to recreate a data.frame.
do.call(data.frame, lapply(dat, function(x) replace(x, is.infinite(x), NA)))
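As a minimal sketch of Option 1 on a small data frame (the name clean and the -Inf entry are assumptions added for illustration), the following shows that is.infinite() catches both signs and leaves the character column alone:

```r
# Small example data: two numeric columns with Inf / -Inf and one character column
dat <- data.frame(a = c(1, Inf), b = c(-Inf, 3), d = c("a", "b"))

# Rebuild the data frame column by column. is.infinite() is FALSE for
# character vectors, so non-numeric columns pass through unchanged.
clean <- do.call(data.frame,
                 lapply(dat, function(x) replace(x, is.infinite(x), NA)))

print(clean)
```

Note that this returns a new data frame; dat itself is not modified.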
data.table
Option 2 -- You could use data.table and set. This avoids some internal copying.
library(data.table)
DT <- data.table(dat)
invisible(lapply(names(DT),function(.name) set(DT, which(is.infinite(DT[[.name]])), j = .name,value =NA)))
Or using column numbers (possibly faster if there are a lot of columns):
for (j in 1:ncol(DT)) set(DT, which(is.infinite(DT[[j]])), j, NA)
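The by-reference behaviour of set can be checked with a small sketch like the one below. It assumes the data.table package is installed; the example data and the name address_before are illustrative only.

```r
library(data.table)

# Small example table with Inf, -Inf and a character column
DT <- data.table(a = c(1, Inf), b = c(-Inf, 3), d = c("a", "b"))

# Remember where the table lives in memory before the update
address_before <- address(DT)

# Update each column in place; for the character column which() returns
# integer(0), so set() has nothing to change there
for (j in seq_along(DT)) {
  set(DT, i = which(is.infinite(DT[[j]])), j = j, value = NA)
}

print(DT)
```

Because set modifies DT in place, there is no need to assign the result back to a variable.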
Timings
# some `big(ish)` data
dat <- data.frame(a = rep(c(1,Inf), 1e6), b = rep(c(Inf,2), 1e6),
c = rep(c('a','b'),1e6),d = rep(c(1,Inf), 1e6),
e = rep(c(Inf,2), 1e6))
# create data.table
library(data.table)
DT <- data.table(dat)
# replace (@mnel)
system.time(na_dat <- do.call(data.frame,lapply(dat, function(x) replace(x, is.infinite(x),NA))))
# user system elapsed
# 0.52 0.01 0.53
# is.na (@dwin)
system.time(is.na(dat) <- sapply(dat, is.infinite))
# user system elapsed
# 32.96 0.07 33.12
# modified is.na
system.time(is.na(dat) <- do.call(cbind,lapply(dat, is.infinite)))
# user system elapsed
# 1.22 0.38 1.60
# data.table (@mnel)
system.time(invisible(lapply(names(DT),function(.name) set(DT, which(is.infinite(DT[[.name]])), j = .name,value =NA))))
# user system elapsed
# 0.29 0.02 0.31
data.table is the quickest. Using sapply slows things down noticeably.
Solution 2 - R
Use sapply and is.na<-
> dat <- data.frame(a=c(1, Inf), b=c(Inf, 3), d=c("a","b"))
> is.na(dat) <- sapply(dat, is.infinite)
> dat
   a  b d
1  1 NA a
2 NA  3 b
Or you can use (giving credit to @mnel, whose edit this is),
> is.na(dat) <- do.call(cbind,lapply(dat, is.infinite))
which is significantly faster.
Solution 3 - R
Here is a dplyr/tidyverse solution using the na_if() function:
dat %>% mutate_if(is.numeric, list(~na_if(., Inf)))
Note that this only replaces positive infinity with NA. Repeat the call with -Inf if negative infinity values also need to be replaced.
dat %>% mutate_if(is.numeric, list(~na_if(., Inf))) %>%
mutate_if(is.numeric, list(~na_if(., -Inf)))
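As an alternative sketch (not part of the answer above), both signs can be handled in one pass by testing with is.infinite() instead of comparing against Inf. This assumes dplyr 1.0.0 or later, where across() and where() are available; the example data and the name cleaned are illustrative.

```r
library(dplyr)

# Example data with both positive and negative infinity
dat <- data.frame(a = c(1, Inf), b = c(-Inf, 3), d = c("a", "b"))

# is.infinite() is TRUE for both Inf and -Inf, so one pass is enough
cleaned <- dat %>%
  mutate(across(where(is.numeric), ~ replace(.x, is.infinite(.x), NA)))

print(cleaned)
```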
Solution 4 - R
[<- with mapply is a bit faster than sapply.
> dat[mapply(is.infinite, dat)] <- NA
With mnel's data, the timing is
> system.time(dat[mapply(is.infinite, dat)] <- NA)
# user system elapsed
# 15.281 0.000 13.750
Solution 5 - R
There is a very simple solution to this problem in the hablar package:
library(hablar)
dat %>% rationalize()
This returns a data frame with all Inf values converted to NA.
Timings compared to some of the solutions above. Code:
library(hablar)
library(data.table)
dat <- data.frame(a = rep(c(1,Inf), 1e6), b = rep(c(Inf,2), 1e6),
c = rep(c('a','b'),1e6),d = rep(c(1,Inf), 1e6),
e = rep(c(Inf,2), 1e6))
DT <- data.table(dat)
system.time(dat[mapply(is.infinite, dat)] <- NA)
system.time(dat[dat==Inf] <- NA)
system.time(invisible(lapply(names(DT),function(.name) set(DT, which(is.infinite(DT[[.name]])), j = .name,value =NA))))
system.time(rationalize(dat))
Result:
> system.time(dat[mapply(is.infinite, dat)] <- NA)
user system elapsed
0.125 0.039 0.164
> system.time(dat[dat==Inf] <- NA)
user system elapsed
0.095 0.010 0.108
> system.time(invisible(lapply(names(DT),function(.name) set(DT, which(is.infinite(DT[[.name]])), j = .name,value =NA))))
user system elapsed
0.065 0.002 0.067
> system.time(rationalize(dat))
user system elapsed
0.058 0.014 0.072
In these timings data.table is slightly faster than hablar (0.067 vs 0.072 seconds elapsed), but its syntax is longer.
Solution 6 - R
Feng Mai has a tidyverse answer above to get negative and positive infinities:
dat %>% mutate_if(is.numeric, list(~na_if(., Inf))) %>%
mutate_if(is.numeric, list(~na_if(., -Inf)))
This works well, but a word of warning: do not swap in abs(.) here to do both lines at once, as is proposed in an upvoted comment. It will look like it works, but it changes all negative values in the dataset to positive! You can confirm with this:
library(dplyr)
data(iris)
#The last line here is bad - it converts all negative values to positive
iris %>%
mutate_if(is.numeric, ~scale(.)) %>%
mutate(infinities = Sepal.Length / 0) %>%
mutate_if(is.numeric, list(~na_if(abs(.), Inf)))
For one line, this works:
mutate_if(is.numeric, ~ifelse(abs(.) == Inf,NA,.))
Solution 7 - R
Inside a dplyr pipe chain, you can do this.
%>% mutate_all(.,.funs = function(x){ifelse(is.infinite(x),NA,x)}) %>%
I find it simple, elegant and fast.
Solution 8 - R
There are many answers already, but I would like to add that this tidyverse solution has always worked well for me:
%>% mutate_all(function(x) ifelse(is.nan(x) | is.infinite(x), NA, x)) %>%
Solution 9 - R
Another solution:
dat <- data.frame(a = rep(c(1,Inf), 1e6), b = rep(c(Inf,2), 1e6),
c = rep(c('a','b'),1e6),d = rep(c(1,Inf), 1e6),
e = rep(c(Inf,2), 1e6))
system.time(dat[dat==Inf] <- NA)
# user system elapsed
# 0.316 0.024 0.340
Solution 10 - R
Also, if you need the coordinates of the Inf values, you can do this:
library(rlist)
list.clean(apply(df, 2, function(x){which(is.infinite(x))}), function(x) length(x) == 0L, TRUE)
Result:
$colname1
[1] row1 row2 ...
$colname2
[2] row1 row2 ...
With this information, you can replace the Inf values in particular places with the mean, median, or whatever operator that you want.
For example (for the first element):
repInf = list.clean(apply(df, 2, function(x){which(is.infinite(x))}), function(x) length(x) == 0L, TRUE)
col = names(repInf)[[1]]
# mean() of the finite values only; swap in median() if preferred
df[repInf[[1]], col] = mean(df[[col]][is.finite(df[[col]])])
In a loop:
for (i in seq_along(repInf)) {
  col = names(repInf)[[i]]
  df[repInf[[i]], col] = mean(df[[col]][is.finite(df[[col]])])
}
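The same idea can be sketched in base R without rlist, which may be easier to test on a small example. The data frame below and the choice of the mean as the replacement are assumptions for illustration:

```r
# Example data: numeric columns with Inf / -Inf and one character column
df <- data.frame(x = c(1, Inf, 3), y = c(-Inf, 2, 4), label = c("a", "b", "c"))

for (col in names(df)) {
  values <- df[[col]]
  if (!is.numeric(values)) next          # skip non-numeric columns
  bad <- is.infinite(values)             # positions of Inf and -Inf
  if (any(bad)) {
    # Replace with the mean of the finite values in the same column
    df[[col]][bad] <- mean(values[is.finite(values)])
  }
}

print(df)
```

Subsetting with is.finite() before calling mean() matters here: mean(is.finite(x)) would instead give the proportion of finite values.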
Solution 11 - R
Once the Inf values have been converted to NA, you may also use the handy replace_na function from tidyr to substitute a value of your choice: https://tidyr.tidyverse.org/reference/replace_na.html