Standard evaluation in dplyr: summarise a variable given as a character string


R Problem Overview


UPDATE July 2020:

dplyr 1.0 has changed pretty much everything about this question as well as all of the answers. See the dplyr programming vignette here:

https://cran.r-project.org/web/packages/dplyr/vignettes/programming.html

The new way to refer to a column whose name is stored in a character vector is to use the .data pronoun from rlang and subset it with [[ ]], as you would a list or data frame in base R.

library(dplyr)

key <- "v3"
val <- "v2"
drp <- "v1"

df <- tibble(v1 = 1:5, v2 = 6:10, v3 = c(rep("A", 3), rep("B", 2)))

df %>% 
    select(-matches(drp)) %>% 
    group_by(.data[[key]]) %>% 
    summarise(total = sum(.data[[val]], na.rm = TRUE))

#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 2 x 2
#>   v3    total
#>   <chr> <int>
#> 1 A        21
#> 2 B        19

If your code is in a package function, you can add the roxygen2 tag @importFrom rlang .data to avoid R CMD check notes about undefined global variables.
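
As a rough sketch of how the .data approach might be wrapped for reuse, the helper below takes the column names as strings. The function name sum_by and its argument names are hypothetical, chosen only for illustration; they are not part of dplyr. It assumes dplyr 1.0 or later, since it uses the .groups argument.

```r
library(dplyr)

# Hypothetical helper: group by the column named in `key` and sum the
# column named in `val`. Both arguments are character strings.
sum_by <- function(data, key, val) {
  data %>%
    group_by(.data[[key]]) %>%
    summarise(total = sum(.data[[val]], na.rm = TRUE), .groups = "drop")
}

df <- tibble(v1 = 1:5, v2 = 6:10, v3 = c(rep("A", 3), rep("B", 2)))
result <- sum_by(df, key = "v3", val = "v2")
result
```

Passing .groups = "drop" returns an ungrouped tibble and silences the regrouping message shown above.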

ORIGINAL QUESTION:

I want to refer to an unknown column name inside a summarise. The standard evaluation functions introduced in dplyr 0.3 allow column names to be referenced using variables, but this doesn't appear to work when you call a base R function within e.g. a summarise.

library(dplyr)
 
key <- "v3"
val <- "v2"
drp <- "v1"
 
df <- data_frame(v1 = 1:5, v2 = 6:10, v3 = c(rep("A", 3), rep("B", 2)))

The df looks like this:

> df
Source: local data frame [5 x 3]

  v1 v2 v3
1  1  6  A
2  2  7  A
3  3  8  A
4  4  9  B
5  5 10  B

I want to drop v1, group by v3, and sum v2 for each group:

df %>% select(-matches(drp)) %>% group_by_(key) %>% summarise_(sum(val, na.rm = TRUE))

Error in sum(val, na.rm = TRUE) : invalid 'type' (character) of argument

The NSE version of select() works fine, since it can match a character string. The SE version of group_by() works fine, since it can now accept variables as arguments and evaluate them. However, I haven't found a way to achieve similar results when using base R functions inside dplyr functions.
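
The error above can be reproduced without dplyr at all. A minimal sketch, assuming only base R: the argument to summarise_() is evaluated as ordinary R code before dplyr sees it, so sum() is handed the string "v2" rather than the column.

```r
val <- "v2"

# `val` holds the string "v2", so sum() receives a character vector,
# not the v2 column, and fails before any column lookup happens.
msg <- tryCatch(
  sum(val, na.rm = TRUE),
  error = function(e) conditionMessage(e)
)
msg
```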

Things that don't work:

df %>% group_by_(key) %>% summarise_(sum(get(val), na.rm = TRUE))
Error in get(val) : object 'v2' not found

df %>% group_by_(key) %>% summarise_(sum(eval(as.symbol(val)), na.rm = TRUE))
Error in eval(expr, envir, enclos) : object 'v2' not found

I've checked out several related questions, but none of the proposed solutions have worked for me so far.

R Solutions


Solution 1 - R

Please note that this answer applies only to dplyr versions before 0.7.0.

> [dplyr 0.7.0] has a new approach to non-standard evaluation (NSE) called tidyeval. It is described in detail in vignette("programming").


The dplyr vignette on non-standard evaluation is helpful here. In the section "Mixing constants and variables" you will find that the function interp from the lazyeval package can be used, and the advice to "[u]se as.name if you have a character string that gives a variable name":

library(lazyeval)
df %>%
  select(-matches(drp)) %>%
  group_by_(key) %>%
  summarise_(sum_val = interp(~sum(var, na.rm = TRUE), var = as.name(val)))
#   v3 sum_val
# 1  A      21
# 2  B      19
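
To see what interp() hands to summarise_(), it may help to build the formula on its own. This is a small sketch that needs only the lazyeval package; the variable name f is arbitrary.

```r
library(lazyeval)

val <- "v2"

# interp() substitutes the symbol `v2` for the placeholder `var`
# inside the one-sided formula.
f <- interp(~sum(var, na.rm = TRUE), var = as.name(val))
f
```

The result is still a formula, with the placeholder replaced by the column name, which is why summarise_() can evaluate it against the data.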

Solution 2 - R

With the release of the rlang package and the 0.7.0 update to dplyr, this is now fairly simple.

When you want to use a character string (e.g. "v1") as a variable name, you just:

  1. Convert the string to a symbol using sym() from the rlang package
  2. In your function call, write !! in front of the symbol

For instance, you'd do the following:

my_var <- "Sepal.Length"
my_sym <- sym(my_var)
summarize(iris, Mean = mean(!!my_sym))

More compactly, you can combine both steps in a single expression by calling sym() directly after !! inside the function call.

For instance, you could write:

my_var <- "Sepal.Length"
summarize(iris, mean(!!sym(my_var)))


To return to your original example, you could do the following:

library(dplyr)
library(rlang)

key <- "v3"
val <- "v2"
drp <- "v1"

# tibble() replaces the deprecated data_frame()
df <- tibble(v1 = 1:5, v2 = 6:10, v3 = c(rep("A", 3), rep("B", 2)))

df %>% 
  # NOTE: we don't have to do anything to `drp`
  # since the matches() function expects a character string
  select(-matches(drp)) %>% 
  group_by(!!sym(key)) %>% 
  summarise(sum(!!sym(val), na.rm = TRUE))
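
The same idea extends to several column names at once. The sketch below assumes dplyr 0.7.0 or later and uses syms() with the splice operator !!! to group by more than one column; the data frame and the names keys, g1, g2 and v are made up for illustration.

```r
library(dplyr)

df <- tibble(
  g1 = c("A", "A", "B", "B"),
  g2 = c("x", "y", "x", "x"),
  v  = 1:4
)

keys <- c("g1", "g2")  # grouping columns, stored as strings
val  <- "v"            # column to sum, stored as a string

result <- df %>%
  group_by(!!!syms(keys)) %>%   # syms() makes a list of symbols; !!! splices it
  summarise(total = sum(!!sym(val), na.rm = TRUE)) %>%
  ungroup()
result
```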


Alternative Syntax

With the release of rlang version 0.4.0, you can use the following syntax:

my_var <- "Sepal.Length"
my_sym <- sym(my_var)
summarize(iris, Mean = mean({{ my_sym }}))

Instead of writing !!my_sym, you can write {{ my_sym }}. This is arguably clearer, but the string must be converted to a symbol before it goes inside the braces: you can write !!sym(my_var), but not {{ sym(my_var) }}. Note that {{ }} is designed mainly for forwarding function arguments; using it on an ordinary variable, as above, works but is not its primary purpose.
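
The {{ }} operator is more commonly seen inside a function, where it forwards an argument given as a bare column name. A minimal sketch, assuming rlang 0.4.0 or later and a matching dplyr; the helper name sum_by_col is hypothetical.

```r
library(dplyr)

# Hypothetical helper: `key` and `val` are passed as bare column names,
# not as strings.
sum_by_col <- function(data, key, val) {
  data %>%
    group_by({{ key }}) %>%
    summarise(total = sum({{ val }}, na.rm = TRUE))
}

df <- tibble(v1 = 1:5, v2 = 6:10, v3 = c(rep("A", 3), rep("B", 2)))
result <- sum_by_col(df, v3, v2)
result
```

When the column names arrive as strings instead, the sym() or .data[[ ]] forms shown elsewhere on this page are the better fit.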

Additional details

Of all the official documentation explaining how the usage of sym() and !! works, these seem to be the most accessible:

  1. dplyr vignette: Programming with dplyr

  2. The section of Hadley Wickham's book 'Advanced R' on metaprogramming

Solution 3 - R

Pass the .dots argument a list of strings. The strings can be constructed with paste or sprintf, or with string interpolation from the gsubfn package by writing fn$list in place of list, as done here:

library(gsubfn)
df %>% 
   group_by_(key) %>% 
   summarise_(.dots = fn$list(mean = "mean($val)", sd = "sd($val)"))

giving:

Source: local data frame [2 x 3]

  v3 mean        sd
1  A  7.0 1.0000000
2  B  9.5 0.7071068
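
For comparison, here is a sketch of the sprintf route mentioned above. Because summarise_() is deprecated in current dplyr, this version parses the strings with rlang::parse_expr() and splices them into summarise(); that substitution is an adaptation for illustration, not part of the original answer.

```r
library(dplyr)

key <- "v3"
val <- "v2"
df <- tibble(v1 = 1:5, v2 = 6:10, v3 = c(rep("A", 3), rep("B", 2)))

# Build the expression strings with sprintf() instead of fn$list().
strs <- c(mean = sprintf("mean(%s)", val), sd = sprintf("sd(%s)", val))

# Parse each string into a call; lapply() keeps the names.
exprs <- lapply(strs, rlang::parse_expr)

result <- df %>%
  group_by(.data[[key]]) %>%
  summarise(!!!exprs)   # named list elements become named columns
result
```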

Solution 4 - R

New dplyr update:

The new functionality in dplyr can help with this. Instead of strings, the variables that need non-standard evaluation are captured as quosures with quo(), and the quoting is undone with the !! operator. For more on these see this vignette. When this answer was written, this required the development version of dplyr; the functionality was later released in dplyr 0.7.0.

library(dplyr) #0.5.0.9004+
key <- quo(v3)
val <- quo(v2)
drp <- "v1"

df <- data_frame(v1 = 1:5, v2 = 6:10, v3 = c(rep("A", 3), rep("B", 2)))
df %>% select(-matches(drp)) %>% 
  group_by(!!key) %>% 
  summarise(sum(!!val, na.rm = TRUE))
# # A tibble: 2 × 2
#      v3 `sum(v2, na.rm = TRUE)`
#   <chr>                   <int>
# 1     A                      21
# 2     B                      19
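
The summary column above ends up with the unwieldy name `sum(v2, na.rm = TRUE)`. One possible way to give it a name derived from the quosure is sketched below, assuming a reasonably recent rlang that provides as_label(); the name out_name is arbitrary.

```r
library(dplyr)

key <- quo(v3)
val <- quo(v2)

df <- tibble(v1 = 1:5, v2 = 6:10, v3 = c(rep("A", 3), rep("B", 2)))

# Build the output name from the quoted column, here "sum_v2".
out_name <- paste0("sum_", rlang::as_label(val))

result <- df %>%
  group_by(!!key) %>%
  summarise(!!out_name := sum(!!val, na.rm = TRUE))  # := allows !! on the left
result
```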

Solution 5 - R

This answer repeats the July 2020 update at the top of the question: with dplyr 1.0, refer to a column whose name is stored in a character vector through the .data pronoun, for example group_by(.data[[key]]) and summarise(total = sum(.data[[val]], na.rm = TRUE)). See that update for the full example and its output.

Attributions

All content for this solution is sourced from the original question on Stack Overflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stack Overflow
Question | Ajar | View Question on Stack Overflow
Solution 1 - R | Henrik | View Answer on Stack Overflow
Solution 2 - R | bschneidr | View Answer on Stack Overflow
Solution 3 - R | G. Grothendieck | View Answer on Stack Overflow
Solution 4 - R | Pierre L | View Answer on Stack Overflow
Solution 5 - R | Ajar | View Answer on Stack Overflow