Name columns within aggregate in R

RRenameAggregate

R Problem Overview


I know I can rename columns after I aggregate the data:

blubb <- aggregate(dat$two ~ dat$one, ...)
colnames(blubb) <- c("One", "Two")

Nothing wrong with that. But is there a way to aggregate and name the columns in one go? Sort of like:

blubb <- aggregate( ... , cols = c("One", "Two"))

It would be escpecially nice (and typo-proof) to somehow catch the original column names and do like:

blubb <- aggregate( ... , cols = c(name_of_dat$one, name_of_dat$two."_Mean"))

R Solutions


Solution 1 - R

You can use setNames as in:

blubb <- setNames(aggregate(dat$two ~ dat$one, ...), c("One", "Two"))

Alternatively, you can bypass the slick formula method, and use syntax like:

blubb <- aggregate(list(One = dat$one), list(Two = dat$two), ...)

Update

This update is to just help get you started on deriving a solution on your own.

If you inspect the code for stats:::aggregate.formula, you'll see the following lines towards the end:

if (is.matrix(mf[[1L]])) {
    lhs <- as.data.frame(mf[[1L]])
    names(lhs) <- as.character(m[[2L]][[2L]])[-1L]
    aggregate.data.frame(lhs, mf[-1L], FUN = FUN, ...)
}
else aggregate.data.frame(mf[1L], mf[-1L], FUN = FUN, ...)

If all that you want to do is append the function name to the variable that was aggregated, perhaps you can change that to something like:

if (is.matrix(mf[[1L]])) {
  lhs <- as.data.frame(mf[[1L]])
  names(lhs) <- as.character(m[[2L]][[2L]])[-1L]
  myOut <- aggregate.data.frame(lhs, mf[-1L], FUN = FUN, ...)
  colnames(myOut) <- c(names(mf[-1L]), 
                       paste(names(lhs), deparse(substitute(FUN)), sep = "."))
}
else {
  myOut <- aggregate.data.frame(mf[1L], mf[-1L], FUN = FUN, ...)
  colnames(myOut) <- c(names(mf[-1L]), 
                       paste(strsplit(gsub("cbind\\(|\\)|\\s", "", 
                                           names(mf[1L])), ",")[[1]],
                             deparse(substitute(FUN)), sep = "."))
} 
myOut

This basically captures the value entered for FUN by using deparse(substitute(FUN)), so you can probably modify the function to accept a custom suffix, or perhaps even a vector of suffixes. This can probably be improved a bit with some work, but I'm not going to do it!

Here is a Gist with this concept applied, creating a function named "myAgg".

Here is some sample output of just the resulting column names:

> names(myAgg(weight ~ feed, data = chickwts, mean))
[1] "feed"        "weight.mean"
> names(myAgg(breaks ~ wool + tension, data = warpbreaks, sum))
[1] "wool"       "tension"    "breaks.sum"
> names(myAgg(weight ~ feed, data = chickwts, FUN = function(x) mean(x^2)))
[1] "feed"                         "weight.function(x) mean(x^2)"

Notice that only the aggregated variable name changes. But notice also that if you use a custom function, you'll end up with a really strange column name!

Solution 2 - R

The answer to your first question is yes. You can certainly include the column names in the aggregate function. Using the names from your example above:

blubb <- aggregate(dat,list(One=dat$One,Two=dat$Two),sum)

I like the part about possibly pulling in the original column names automatically. If I figure it out I'll post it.

Solution 3 - R

In case you prefer writing aggregates as formula the documentation shows the usage of cbind. And cbind allows you to name its arguments, which are used by aggregate.

aggregate(cbind(SLength = Sepal.Length) ~ cbind(Type = Species),
  data = iris, mean)
#  Type SLength
#1    1   5.006
#2    2   5.936
#3    3   6.588

But cbind replaces factors by their internal codes and when used on the right side of ~ it keeps the cbind call in the result. To avoid this the new names can already be given in the input data set using transform or within

aggregate(SLength ~ Type, transform(iris, SLength = Sepal.Length,
   Type = Species), mean)
#        Type SLength
#1     setosa   5.006
#2 versicolor   5.936
#3  virginica   6.588

or

aggregate(cbind(SLength = Sepal.Length) ~ Type, transform(iris, Type = Species),
  mean)
#        Type SLength
#1     setosa   5.006
#2 versicolor   5.936
#3  virginica   6.588

The advantage of using cbind or data.frame compared to list is that not all columns need to be given a (new) name. Aggregation of more than one column by more than one grouping factor could be done like:

aggregate(cbind("Miles/gallon" = mpg, Weight = wt, hp) ~ Cylinders + Carburetors
          + gear, transform(mtcars, Cylinders = cyl, Carburetors = carb), mean)
#   Cylinders Carburetors gear Miles/gallon  Weight    hp
#1          4           1    3        21.50 2.46500  97.0
#2          6           1    3        19.75 3.33750 107.5
#...

and if you want to use more than one function:

aggregate(cbind(cases=ncases, ncontrols) ~ alc + tobgp, transform(esoph,
  alc=alcgp), FUN = function(x) c("mean" = mean(x), "median" = median(x)))
#         alc    tobgp cases.mean cases.median ncontrols.mean ncontrols.median
#1  0-39g/day 0-9g/day  1.5000000    1.0000000      42.000000        44.000000
#2      40-79 0-9g/day  5.6666667    4.0000000      24.166667        29.000000
#...

which adds to the colname the used aggregate-function.

Some examples to aggregate mpg renamed to Miles/gallon and hp by cyl renamed to Cylinders and gear using mtcars resulting in:

#  Cylinders gear Miles/gallon       hp
#1         4    3       21.500  97.0000
#2         6    3       19.750 107.5000
#3         8    3       15.050 194.1667
#4         4    4       26.925  76.0000
#5         6    4       19.750 116.5000
#6         4    5       28.200 102.0000
#7         6    5       19.700 175.0000
#8         8    5       15.400 299.5000
aggregate(cbind("Miles/gallon" = mpg, hp) ~ Cylinders + gear,
          transform(mtcars, Cylinders = cyl, Carburetors = carb), mean)

with(mtcars, aggregate(cbind("Miles/gallon" = mpg, hp),
                       data.frame(Cylinders = cyl, gear), mean))

with(mtcars, aggregate(data.frame("Miles/gallon" = mpg, hp),
                       data.frame(Cylinders = cyl, gear), mean))

Solution 4 - R

w <- data.frame(Funding<-"Fully Insured",Region="North East",claim_count=rnbinom(1000, 300.503572818, mu= 0.5739467))
x <- data.frame(Funding<-"Fully Insured",Region="South East",claim_count=rnbinom(1000, 1000, mu= 0.70000000))
y <- data.frame(Funding<-"Self Insured",Region="North East",claim_count=rnbinom(1000, 400, mu= 0.80000000))
z <- data.frame(Funding<-"Self Insured",Region="South East",claim_count=rnbinom(1000, 700, mu= 1.70000000))
names(w)<-c("Funding","Region","claim_count")
names(x)<-c("Funding","Region","claim_count")
names(y)<-c("Funding","Region","claim_count")
names(z)<-c("Funding","Region","claim_count")
my_df <- rbind(w,x,y,z)
my_df2<-with(my_df, aggregate(x=claim_count, by=list(Funding,Region), FUN=sum))
colnames(my_df2)<-colnames(my_df)

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
Questionuser1322720View Question on Stackoverflow
Solution 1 - RA5C1D2H2I1M1N2O1R2T1View Answer on Stackoverflow
Solution 2 - Rorville jacksonView Answer on Stackoverflow
Solution 3 - RGKiView Answer on Stackoverflow
Solution 4 - Rrwinkel2000View Answer on Stackoverflow