dplyr summarise_each with na.rm
R Problem Overview
Is there a way to instruct dplyr to use summarise_each with na.rm=TRUE? I would like to take the mean of variables with summarise_each("mean") but I don't know how to tell it to ignore missing values.
R Solutions
Solution 1 - R
Following the links in the doc, it seems you can use funs(mean(., na.rm = TRUE)):
library(dplyr)
by_species <- iris %>% group_by(Species)
by_species %>% summarise_each(funs(mean(., na.rm = TRUE)))
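Because iris contains no missing values, the effect of na.rm is not visible in the example above. A minimal sketch of the same idea follows, assuming dplyr 1.0.0 or later, where summarise_each and funs are deprecated in favour of across with a lambda. The airquality dataset (shipped with base R) is a substitution made here, not part of the original answer, and the names with_na and without_na are illustrative.

```r
library(dplyr)

# airquality ships with base R; its Ozone and Solar.R columns contain NAs
by_month <- airquality %>% group_by(Month)

# Default behaviour: any NA in a group makes that group's mean NA
with_na <- by_month %>%
  summarise(across(everything(), mean))

# Lambda form: na.rm = TRUE is handed to mean() for every summarised column
without_na <- by_month %>%
  summarise(across(everything(), ~ mean(.x, na.rm = TRUE)))
```

Comparing with_na$Ozone against without_na$Ozone should show NA entries in the first and numeric means in the second.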
Solution 2 - R
Update
The current dplyr version strongly suggests using across instead of the more specialized functions such as summarise_all.
Translating the syntax below (naming the functions in a named list) into across could look like this:
library(dplyr)
ggplot2::msleep %>%
select(vore, sleep_total, sleep_rem) %>%
group_by(vore) %>%
summarise(across(everything(), .fns = list(mean = mean, max = max, sd = sd), na.rm = TRUE))
#> # A tibble: 5 x 7
#> vore sleep_total_mean sleep_total_max sleep_total_sd sleep_rem_mean
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 carni 10.4 19.4 4.67 2.29
#> 2 herbi 9.51 16.6 4.88 1.37
#> 3 inse~ 14.9 19.9 5.92 3.52
#> 4 omni 10.9 18 2.95 1.96
#> 5 <NA> 10.2 13.7 3.00 1.88
#> # ... with 2 more variables: sleep_rem_max <dbl>, sleep_rem_sd <dbl>
Older answer
summarise_each is deprecated now; here is an option with summarise_all.
- One can still specify na.rm = TRUE within the funs argument (cf. @flodel's answer: just replace summarise_each with summarise_all).
- But you can also add na.rm = TRUE after the funs argument.
That is useful when you want to call more than one function, e.g.:
Edit
The funs() argument is now (soft) deprecated, thanks to a comment by @Mikko. One can use the suggestions given by the warning; see the code below. na.rm can still be specified as an additional argument within summarise_all.
I used ggplot2::msleep because it contains NAs and shows this better.
library(dplyr)
ggplot2::msleep %>%
select(vore, sleep_total, sleep_rem) %>%
group_by(vore) %>%
summarise_all(funs(mean, max, sd), na.rm = TRUE)
#> Warning: funs() is soft deprecated as of dplyr 0.8.0
#> Please use a list of either functions or lambdas:
#>
#> # Simple named list:
#> list(mean = mean, median = median)
#>
#> # Auto named with `tibble::lst()`:
#> tibble::lst(mean, median)
#>
#> # Using lambdas
#> list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
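Following the first suggestion in the warning, funs(mean, max, sd) can be swapped for a named list while na.rm = TRUE stays as an additional argument. This is a sketch under assumptions: airquality stands in for ggplot2::msleep so that only dplyr is needed, and the name by_month_stats is illustrative.

```r
library(dplyr)

# Named list instead of funs(); na.rm = TRUE is forwarded to each function
by_month_stats <- airquality %>%
  select(Month, Ozone, Solar.R) %>%
  group_by(Month) %>%
  summarise_all(list(mean = mean, max = max, sd = sd), na.rm = TRUE)
```

The list names become column suffixes, so the result should contain columns such as Ozone_mean and Solar.R_sd.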
Solution 3 - R
Take for instance the mtcars data set:
library(dplyr)
You can always use summarise to avoid long syntax:
mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE),
            sd_mpg = sd(mpg, na.rm = TRUE))
Solution 4 - R
I don't know if my answer will add something to the previous comments. Hopefully yes.
In my case, I had a database from an experiment with two groups (control, exp) with different levels for a specific variable (day) and I wanted to get a summary of mean and sd of another variable (weight) for each group for specific levels of the variable day.
Here is an example of my database:
> animal group day weight
> 1.1 "control" 73 NA
> 1.2 "control" 73 NA
> 3.1 "control" 73 NA
> 9.2 "control" 73 25.2
> 9.3 "control" 73 23.4
> 9.4 "control" 73 25.8
> 2.1 "exp" 73 NA
> 2.2 "exp" 73 NA
> 10.1 "exp" 73 24.4
> 10.2 "exp" 73 NA
> 10.3 "exp" 73 24.6
So, for instance, in this case I wanted to get the mean and sd of the weight on day 73 for each of the groups (control, exp), omitting the NAs.
I did this with this command:
data[data$day=="73",] %>% group_by(group) %>% summarise(mean(weight[group == "exp"], na.rm=T),sd(weight[group == "exp"], na.rm=T))
data[data$day=="73",] %>% group_by(group) %>% summarise(mean(weight[group == "control"], na.rm=T),sd(weight[group == "control"], na.rm=T))
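The two commands above each subset weight by one group inside an already grouped summary, so the row for the other group should come back as NaN/NA. Since group_by(group) already splits the data, a single pipeline can return both rows at once. The sketch below rebuilds the example rows by hand; treating animal as a character column, and the names weight_summary, mean_weight and sd_weight, are assumptions made for illustration.

```r
library(dplyr)

# Example rows copied from the table above (animal stored as text here)
data <- data.frame(
  animal = c("1.1", "1.2", "3.1", "9.2", "9.3", "9.4",
             "2.1", "2.2", "10.1", "10.2", "10.3"),
  group  = c(rep("control", 6), rep("exp", 5)),
  day    = 73,
  weight = c(NA, NA, NA, 25.2, 23.4, 25.8, NA, NA, 24.4, NA, 24.6)
)

# One pipeline: keep day 73, then summarise each group while dropping NAs
weight_summary <- data %>%
  filter(day == 73) %>%
  group_by(group) %>%
  summarise(mean_weight = mean(weight, na.rm = TRUE),
            sd_weight   = sd(weight, na.rm = TRUE))
```

With the rows shown, this gives a mean weight of 24.8 for control and 24.5 for exp.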
Solution 5 - R
The summarise_at function in dplyr summarises a dataset over specific columns and lets you remove NAs for each function applied. Take the iris dataset and compute the mean and median for the variables from Sepal.Length to Petal.Width:
library(dplyr)
summarise_at(iris, vars(Sepal.Length:Petal.Width), funs(mean, median), na.rm = TRUE)
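In dplyr 1.0.0 and later, the same summary can be written with across. This is a hedged sketch rather than the answer's own code: the lambdas carry na.rm = TRUE themselves, because passing extra arguments through the ... of across is deprecated as of dplyr 1.1.0, and iris_summary is an illustrative name.

```r
library(dplyr)

# Mean and median of every column from Sepal.Length to Petal.Width
iris_summary <- iris %>%
  summarise(across(
    Sepal.Length:Petal.Width,
    list(mean   = ~ mean(.x, na.rm = TRUE),
         median = ~ median(.x, na.rm = TRUE))
  ))
```

The result is a single row with one column per variable and function, named like Sepal.Length_mean and Petal.Width_median.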