dplyr summarise_each with na.rm

R, dplyr

R Problem Overview


Is there a way to instruct dplyr to use summarise_each with na.rm=TRUE? I would like to take the mean of variables with summarise_each("mean") but I don't know how to specify it to ignore missing values.

R Solutions


Solution 1 - R

Following the links in the doc, it seems you can use funs(mean(., na.rm = TRUE)):

library(dplyr)
by_species <- iris %>% group_by(Species)
by_species %>% summarise_each(funs(mean(., na.rm = TRUE)))

Solution 2 - R

Update

Current dplyr versions strongly recommend across() over the scoped variants (summarise_all, summarise_at, etc.).

Translating the below syntax (naming the functions in a named list) into across could look like this:

library(dplyr)
ggplot2::msleep %>% 
  select(vore, sleep_total, sleep_rem) %>%
  group_by(vore) %>%
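  # Pass na.rm inside each lambda; forwarding it through across(...) is deprecated as of dplyr 1.1.0
  summarise(across(
    everything(),
    .fns = list(
      mean = ~ mean(.x, na.rm = TRUE),
      max  = ~ max(.x, na.rm = TRUE),
      sd   = ~ sd(.x, na.rm = TRUE)
    )
  ))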

#> # A tibble: 5 x 7
#>   vore  sleep_total_mean sleep_total_max sleep_total_sd sleep_rem_mean
#>   <chr>            <dbl>           <dbl>          <dbl>          <dbl>
#> 1 carni            10.4             19.4           4.67           2.29
#> 2 herbi             9.51            16.6           4.88           1.37
#> 3 inse~            14.9             19.9           5.92           3.52
#> 4 omni             10.9             18             2.95           1.96
#> 5 <NA>             10.2             13.7           3.00           1.88
#> # ... with 2 more variables: sleep_rem_max <dbl>, sleep_rem_sd <dbl>
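To make the effect of na.rm visible on a small scale, here is a minimal sketch assuming dplyr 1.0.0 or later (for across()). The toy data frame and its column names g and x are hypothetical and only serve the illustration.

```r
library(dplyr)

# Hypothetical toy data: group "a" contains one missing value
toy <- tibble(g = c("a", "a", "b"), x = c(1, NA, 3))

# Without na.rm, any group that contains an NA yields NA
without_na_rm <- toy %>%
  group_by(g) %>%
  summarise(across(everything(), mean))

# With na.rm = TRUE inside the lambda, the NA is dropped before averaging
with_na_rm <- toy %>%
  group_by(g) %>%
  summarise(across(everything(), ~ mean(.x, na.rm = TRUE)))
```

Here without_na_rm$x is NA for group "a", while with_na_rm$x is 1 for "a" and 3 for "b".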


Older answer

summarise_each is now deprecated; here is an option with summarise_all.

  • You can still specify na.rm = TRUE within the funs argument (cf. @flodel's answer: just replace summarise_each with summarise_all).
  • You can also add na.rm = TRUE after the funs argument, which is useful when you want to call more than one function.

Edit

funs() is now soft-deprecated (thanks to @Mikko's comment). You can use the suggestions given by the warning, shown in the output below. na.rm can still be passed as an additional argument to summarise_all.

The example uses ggplot2::msleep because it contains NAs and shows this better.

library(dplyr)

ggplot2::msleep %>% 
  select(vore, sleep_total, sleep_rem) %>%
  group_by(vore) %>%
  summarise_all(funs(mean, max, sd), na.rm = TRUE)
#> Warning: funs() is soft deprecated as of dplyr 0.8.0
#> Please use a list of either functions or lambdas: 
#> 
#>   # Simple named list: 
#>   list(mean = mean, median = median)
#> 
#>   # Auto named with `tibble::lst()`: 
#>   tibble::lst(mean, median)
#> 
#>   # Using lambdas
#>   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
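Following the first suggestion in that warning, the same call might be written with a named list instead of funs(). This is a sketch, assuming ggplot2 is installed (for the msleep data) and a dplyr version that still provides summarise_all; the variable name res is only for illustration.

```r
library(dplyr)

res <- ggplot2::msleep %>%
  select(vore, sleep_total, sleep_rem) %>%
  group_by(vore) %>%
  # A named list replaces funs(); na.rm = TRUE is forwarded to every function
  summarise_all(list(mean = mean, max = max, sd = sd), na.rm = TRUE)
```

The list names become column suffixes, e.g. sleep_total_mean and sleep_rem_sd.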

Solution 3 - R

Take, for instance, the mtcars data set:

library(dplyr)

You can always use summarise to avoid long syntax:

mtcars %>%
  group_by(cyl) %>% 
  # Spell out TRUE: the shorthand T is an ordinary variable that can be reassigned
  summarise(mean_mpg = mean(mpg, na.rm = TRUE),
            sd_mpg = sd(mpg, na.rm = TRUE))

Solution 4 - R

I don't know if my answer will add something to the previous comments. Hopefully yes.

In my case, I had a database from an experiment with two groups (control, exp) with different levels for a specific variable (day) and I wanted to get a summary of mean and sd of another variable (weight) for each group for specific levels of the variable day.

Here is an example of my database:

animal  group      day  weight
1.1     "control"  73   NA
1.2     "control"  73   NA
3.1     "control"  73   NA
9.2     "control"  73   25.2
9.3     "control"  73   23.4
9.4     "control"  73   25.8
2.1     "exp"      73   NA
2.2     "exp"      73   NA
10.1    "exp"      73   24.4
10.2    "exp"      73   NA
10.3    "exp"      73   24.6

So, for instance, in this case I wanted to get the mean and sd of the weight on day 73 for each of the groups (control, exp), omitting the NAs.

I did this with this command:

# group_by() already splits the rows by group, so a single call covers both control and exp
data %>%
  filter(day == 73) %>%
  group_by(group) %>%
  summarise(mean_weight = mean(weight, na.rm = TRUE), sd_weight = sd(weight, na.rm = TRUE))
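To try this on the sample rows shown earlier, here is a self-contained sketch. The data frame is rebuilt by hand from the example; storing animal as a character column and the result name res are assumptions made for illustration.

```r
library(dplyr)

# Sample rows copied from the example; animal IDs stored as character (an assumption)
data <- tibble(
  animal = c("1.1", "1.2", "3.1", "9.2", "9.3", "9.4",
             "2.1", "2.2", "10.1", "10.2", "10.3"),
  group  = c(rep("control", 6), rep("exp", 5)),
  day    = 73,
  weight = c(NA, NA, NA, 25.2, 23.4, 25.8, NA, NA, 24.4, NA, 24.6)
)

res <- data %>%
  filter(day == 73) %>%      # keep only the day of interest
  group_by(group) %>%        # one summary row per group
  summarise(mean_weight = mean(weight, na.rm = TRUE),
            sd_weight   = sd(weight, na.rm = TRUE))
# mean_weight: control = 24.8, exp = 24.5
```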

Solution 5 - R

The summarise_at function in dplyr summarises a dataset at specific columns and lets you remove NAs for each function applied. Take the iris dataset and compute the mean and median for the variables from Sepal.Length to Petal.Width.

library(dplyr)
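# A named list replaces the deprecated funs(); na.rm = TRUE is passed on to both functions
summarise_at(iris, vars(Sepal.Length:Petal.Width), list(mean = mean, median = median), na.rm = TRUE)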
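For newer dplyr versions, the same summary might be written with across(). This is a sketch, assuming dplyr 1.0.0 or later; the result name res is only for illustration.

```r
library(dplyr)

res <- iris %>%
  summarise(across(
    Sepal.Length:Petal.Width,
    # na.rm is set inside each lambda rather than forwarded through across()
    list(mean   = ~ mean(.x, na.rm = TRUE),
         median = ~ median(.x, na.rm = TRUE))
  ))
```

The result has one row and eight columns, named by the pattern column_function, e.g. Sepal.Length_mean.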

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type    Original Author       Original Content on Stackoverflow
Question        paljenczy             View Question on Stackoverflow
Solution 1 - R  flodel                View Answer on Stackoverflow
Solution 2 - R  tjebo                 View Answer on Stackoverflow
Solution 3 - R  Ikuro                 View Answer on Stackoverflow
Solution 4 - R  Konstantinos Armaos   View Answer on Stackoverflow
Solution 5 - R  Flip                  View Answer on Stackoverflow