Aggregate a dataframe on a given column and display another column

RAggregatePlyrGreatest N-per-Group

R Problem Overview


I have a dataframe in R of the following form:

> head(data)
  Group Score Info
1     1     1    a
2     1     2    b
3     1     3    c
4     2     4    d
5     2     3    e
6     2     1    f

I would like to aggregate it following the Score column using the max function

> aggregate(data$Score, list(data$Group), max)

  Group.1         x
1       1         3
2       2         4

But I also would like to display the Info column associated to the maximum value of the Score column for each group. I have no idea how to do this. My desired output would be:

  Group.1         x        y
1       1         3        c
2       2         4        d

Any hint?

R Solutions


Solution 1 - R

A base R solution is to combine the output of aggregate() with a merge() step. I find the formula interface to aggregate() a little more useful than the standard interface, partly because the names on the output are nicer, so I'll use that:

The aggregate() step is

maxs <- aggregate(Score ~ Group, data = dat, FUN = max)

and the merge() step is simply

merge(maxs, dat)

This gives us the desired output:

R> maxs <- aggregate(Score ~ Group, data = dat, FUN = max)
R> merge(maxs, dat)
  Group Score Info
1     1     3    c
2     2     4    d

You could, of course, stick this into a one-liner (the intermediary step was more for exposition):

merge(aggregate(Score ~ Group, data = dat, FUN = max), dat)

The main reason I used the formula interface is that it returns a data frame with the correct names for the merge step; these are the names of the columns from the original data set dat. We need to have the output of aggregate() have the correct names so that merge() knows which columns in the original and aggregated data frames match.

The standard interface gives odd names, whichever way you call it:

R> aggregate(dat$Score, list(dat$Group), max)
  Group.1 x
1       1 3
2       2 4
R> with(dat, aggregate(Score, list(Group), max))
  Group.1 x
1       1 3
2       2 4

We can use merge() on those outputs, but we need to do more work telling R which columns match up.

Solution 2 - R

First, you split the data using split:

split(z,z$Group)

Than, for each chunk, select the row with max Score:

lapply(split(z,z$Group),function(chunk) chunk[which.max(chunk$Score),])

Finally reduce back to a data.frame do.calling rbind:

do.call(rbind,lapply(split(z,z$Group),function(chunk) chunk[which.max(chunk$Score),]))

Result:

  Group Score Info
1     1     3    c
2     2     4    d

One line, no magic spells, fast, result has good names =)

Solution 3 - R

Here is a solution using the plyr package.

The following line of code essentially tells ddply to first group your data by Group, and then within each group returns a subset where the Score equals the maximum score in that group.

library(plyr)
ddply(data, .(Group), function(x)x[x$Score==max(x$Score), ])

  Group Score Info
1     1     3    c
2     2     4    d

And, as @SachaEpskamp points out, this can be further simplified to:

ddply(df, .(Group), function(x)x[which.max(x$Score), ])

(which also has the advantage that which.max will return multiple max lines, if there are any).

Solution 4 - R

The plyr package can be used for this. With the ddply() function you can split a data frame on one or more columns and apply a function and return a data frame, then with the summarize() function you can use the columns of the splitted data frame as variables to make the new data frame/;

dat <- read.table(textConnection('Group Score Info
1     1     1    a
2     1     2    b
3     1     3    c
4     2     4    d
5     2     3    e
6     2     1    f'))

library("plyr")

ddply(dat,.(Group),summarize,
    Max = max(Score),
    Info = Info[which.max(Score)])
  Group Max Info
1     1   3    c
2     2   4    d

Solution 5 - R

A late answer, but and approach using data.table

library(data.table)
DT <- data.table(dat)

DT[, .SD[which.max(Score),], by = Group]

Or, if it is possible to have more than one equally highest score

DT[, .SD[which(Score == max(Score)),], by = Group]

Noting that (from ?data.table

> .SD is a data.table containing the Subset of x's Data for each group, excluding the group column(s)

Solution 6 - R

To add to Gavin's answer: prior to the merge, it is possible to get aggregate to use proper names when not using the formula interface:

aggregate(data[,"score", drop=F], list(group=data$group), mean) 

Solution 7 - R

This is how I baseically think of the problem.

my.df <- data.frame(group = rep(c(1,2), each = 3), 
		score = runif(6), info = letters[1:6])
my.agg <- with(my.df, aggregate(score, list(group), max))
my.df.split <- with(my.df, split(x = my.df, f = group))
my.agg$info <- unlist(lapply(my.df.split, FUN = function(x) {
			x[which(x$score == max(x$score)), "info"]
		}))

> my.agg
  Group.1         x info
1       1 0.9344336    a
2       2 0.7699763    e

Solution 8 - R

I don't have a high enough reputation to comment on Gavin Simpson's answer, but I wanted to warn that there seems to be a difference in the default treatment of missing values between the standard syntax and the formula syntax for aggregate.

#Create some data with missing values 
a<-data.frame(day=rep(1,5),hour=c(1,2,3,3,4),val=c(1,NA,3,NA,5))
  day hour val
1   1    1   1
2   1    2  NA
3   1    3   3
4   1    3  NA
5   1    4   5

#Standard syntax
aggregate(a$val,by=list(day=a$day,hour=a$hour),mean,na.rm=T)
  day hour   x
1   1    1   1
2   1    2 NaN
3   1    3   3
4   1    4   5

#Formula syntax.  Note the index for hour 2 has been silently dropped.
aggregate(val ~ hour + day,data=a,mean,na.rm=T)
  hour day val
1    1   1   1
2    3   1   3
3    4   1   5

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
Questionjul635View Question on Stackoverflow
Solution 1 - RGavin SimpsonView Answer on Stackoverflow
Solution 2 - RmbqView Answer on Stackoverflow
Solution 3 - RAndrieView Answer on Stackoverflow
Solution 4 - RSacha EpskampView Answer on Stackoverflow
Solution 5 - RmnelView Answer on Stackoverflow
Solution 6 - RDanView Answer on Stackoverflow
Solution 7 - RRoman LuštrikView Answer on Stackoverflow
Solution 8 - RJohnView Answer on Stackoverflow