dplyr: how to reference columns by column index rather than column name using mutate?
RDplyrR Problem Overview
Using dplyr, you can do something like this:
iris %>% head %>% mutate(sum=Sepal.Length + Sepal.Width)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species sum
1 5.1 3.5 1.4 0.2 setosa 8.6
2 4.9 3.0 1.4 0.2 setosa 7.9
3 4.7 3.2 1.3 0.2 setosa 7.9
4 4.6 3.1 1.5 0.2 setosa 7.7
5 5.0 3.6 1.4 0.2 setosa 8.6
6 5.4 3.9 1.7 0.4 setosa 9.3
But above, I referenced the columns by their column names. How can I use 1
and 2
, which are the column indices to achieve the same result?
Here I have the following, but I feel it's not as elegant.
iris %>% head %>% mutate(sum=apply(select(.,1,2),1,sum))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species sum
1 5.1 3.5 1.4 0.2 setosa 8.6
2 4.9 3.0 1.4 0.2 setosa 7.9
3 4.7 3.2 1.3 0.2 setosa 7.9
4 4.6 3.1 1.5 0.2 setosa 7.7
5 5.0 3.6 1.4 0.2 setosa 8.6
6 5.4 3.9 1.7 0.4 setosa 9.3
R Solutions
Solution 1 - R
You can try:
iris %>% head %>% mutate(sum = .[[1]] + .[[2]])
Sepal.Length Sepal.Width Petal.Length Petal.Width Species sum
1 5.1 3.5 1.4 0.2 setosa 8.6
2 4.9 3.0 1.4 0.2 setosa 7.9
3 4.7 3.2 1.3 0.2 setosa 7.9
4 4.6 3.1 1.5 0.2 setosa 7.7
5 5.0 3.6 1.4 0.2 setosa 8.6
6 5.4 3.9 1.7 0.4 setosa 9.3
Solution 2 - R
I'm a bit late to the game, but my personal strategy in cases like this is to write my own tidyverse-compliant function that will do exactly what I want. By tidyverse-compliant, I mean that the first argument of the function is a data frame and that the output is a vector that can be added to the data frame.
sum_cols <- function(x, col1, col2){
x[[col1]] + x[[col2]]
}
iris %>%
head %>%
mutate(sum = sum_cols(x = ., col1 = 1, col2 = 2))
Solution 3 - R
What do you think about this version?
Inspired by @SavedByJesus's answer.
applySum <- function(df, ...) {
assertthat::assert_that(...length() > 0, msg = "one or more column indexes are required")
mutate(df, Sum = apply(as.data.frame(df[, c(...)]), 1, sum))
}
iris %>%
head(2) %>%
applySum(1, 2)
#
### output
#
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sum
1 5.1 3.5 1.4 0.2 setosa 8.6
2 4.9 3.0 1.4 0.2 setosa 7.9
#
### you can select and sum more then two columns by the same function
#
iris %>%
head(2) %>%
applySum(1, 2, 3, 4)
#
### output
#
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sum
1 5.1 3.5 1.4 0.2 setosa 10.2
2 4.9 3.0 1.4 0.2 setosa 9.5
Solution 4 - R
To address the issue that @pluke is asking about in the comments, dplyr doesn't really support column index.
Not a perfect solution, but you can use base R to get around this
iris[1] <- iris[1] + iris[2]
Solution 5 - R
An alternative to reusing .
in mutate
that will respect grouping is to use dplyr::cur_data_all()
. From help(cur_data_all)
>cur_data_all() gives the current data for the current group (including grouping variables)
Consider the following:
iris %>% group_by(Species) %>% mutate(sum = .[[1]] + .[[2]]) %>% head
#Error: Problem with `mutate()` column `sum`.
#ℹ `sum = .[[1]] + .[[2]]`.
#ℹ `sum` must be size 50 or 1, not 150.
#ℹ The error occurred in group 1: Species = setosa.
If instead you use cur_data_all()
, it works without issue:
iris %>% mutate(sum = select(cur_data_all(),1) + select(cur_data_all(),2)) %>% head()
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length
#1 5.1 3.5 1.4 0.2 setosa 8.6
#2 4.9 3.0 1.4 0.2 setosa 7.9
#3 4.7 3.2 1.3 0.2 setosa 7.9
#4 4.6 3.1 1.5 0.2 setosa 7.7
#5 5.0 3.6 1.4 0.2 setosa 8.6
#6 5.4 3.9 1.7 0.4 setosa 9.3
The same approach works with the extract operator ([[
).
iris %>% mutate(sum = cur_data()[[1]] + cur_data()[[2]]) %>% head()
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species sum
#1 5.1 3.5 1.4 0.2 setosa 8.6
#2 4.9 3.0 1.4 0.2 setosa 7.9
#3 4.7 3.2 1.3 0.2 setosa 7.9
#4 4.6 3.1 1.5 0.2 setosa 7.7
#5 5.0 3.6 1.4 0.2 setosa 8.6
#6 5.4 3.9 1.7 0.4 setosa 9.3