How to select the rows with maximum values in each group with dplyr?
RDplyrPlyrGreatest N-per-GroupR Problem Overview
I would like to select a row with maximum value in each group with dplyr.
Firstly I generate some random data to show my question
set.seed(1)
df <- expand.grid(list(A = 1:5, B = 1:5, C = 1:5))
df$value <- runif(nrow(df))
In plyr, I could use a custom function to select this row.
library(plyr)
ddply(df, .(A, B), function(x) x[which.max(x$value),])
In dplyr, I am using this code to get the maximum value, but not the rows with maximum value (Column C in this case).
library(dplyr)
df %>% group_by(A, B) %>%
summarise(max = max(value))
How could I achieve this? Thanks for any suggestion.
sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_Australia.1252 LC_CTYPE=English_Australia.1252
[3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C
[5] LC_TIME=English_Australia.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] dplyr_0.2 plyr_1.8.1
loaded via a namespace (and not attached):
[1] assertthat_0.1.0.99 parallel_3.1.0 Rcpp_0.11.1
[4] tools_3.1.0
R Solutions
Solution 1 - R
Try this:
result <- df %>%
group_by(A, B) %>%
filter(value == max(value)) %>%
arrange(A,B,C)
Seems to work:
identical(
as.data.frame(result),
ddply(df, .(A, B), function(x) x[which.max(x$value),])
)
#[1] TRUE
As pointed out in the comments, slice
may be preferred here as per @RoyalITS' answer below if you strictly only want 1 row per group. This answer will return multiple rows if there are multiple with an identical maximum value.
Solution 2 - R
df %>% group_by(A,B) %>% slice(which.max(value))
Solution 3 - R
You can use top_n
df %>% group_by(A, B) %>% top_n(n=1)
This will rank by the last column (value
) and return the top n=1
rows.
Currently, you can't change the this default without causing an error (See https://github.com/hadley/dplyr/issues/426)
Solution 4 - R
This more verbose solution provides greater control on what happens in case of duplicate maximum value (in this example, it will take one of the corresponding rows randomly)
library(dplyr)
df %>% group_by(A, B) %>%
mutate(the_rank = rank(-value, ties.method = "random")) %>%
filter(the_rank == 1) %>% select(-the_rank)
Solution 5 - R
More generally, I think you might want to get "top" of the rows that are sorted within a given group.
For the case of where a single value is max'd out, you have essentially sorted by only one column. However, it's often useful to hierarchically sort by multiple columns (for example: a date column and a time-of-day column).
# Answering the question of getting row with max "value".
df %>%
# Within each grouping of A and B values.
group_by( A, B) %>%
# Sort rows in descending order by "value" column.
arrange( desc(value) ) %>%
# Pick the top 1 value
slice(1) %>%
# Remember to ungroup in case you want to do further work without grouping.
ungroup()
# Answering an extension of the question of
# getting row with the max value of the lowest "C".
df %>%
# Within each grouping of A and B values.
group_by( A, B) %>%
# Sort rows in ascending order by C, and then within that by
# descending order by "value" column.
arrange( C, desc(value) ) %>%
# Pick the one top row based on the sort
slice(1) %>%
# Remember to ungroup in case you want to do further work without grouping.
ungroup()
Solution 6 - R
For me, it helped to count the number of values per group. Copy the count table into a new object. Then filter for the max of the group based on the first grouping characteristic. For example:
count_table <- df %>%
group_by(A, B) %>%
count() %>%
arrange(A, desc(n))
count_table %>%
group_by(A) %>%
filter(n == max(n))
or
count_table %>%
group_by(A) %>%
top_n(1, n)