How to sum a variable by group

RDataframeAggregateR Faq

R Problem Overview


I have a data frame with two columns. First column contains categories such as "First", "Second", "Third", and the second column has numbers that represent the number of times I saw the specific groups from "Category".

For example:

Category     Frequency
First        10
First        15
First        5
Second       2
Third        14
Third        20
Second       3

I want to sort the data by Category and sum all the Frequencies:

Category     Frequency
First        30
Second       5
Third        34

How would I do this in R?

R Solutions


Solution 1 - R

Using aggregate:

aggregate(x$Frequency, by=list(Category=x$Category), FUN=sum)
  Category  x
1    First 30
2   Second  5
3    Third 34

In the example above, multiple dimensions can be specified in the list. Multiple aggregated metrics of the same data type can be incorporated via cbind:

aggregate(cbind(x$Frequency, x$Metric2, x$Metric3) ...

(embedding @thelatemail comment), aggregate has a formula interface too

aggregate(Frequency ~ Category, x, sum)

Or if you want to aggregate multiple columns, you could use the . notation (works for one column too)

aggregate(. ~ Category, x, sum)

or tapply:

tapply(x$Frequency, x$Category, FUN=sum)
 First Second  Third 
    30      5     34 

Using this data:

x <- data.frame(Category=factor(c("First", "First", "First", "Second",
                                      "Third", "Third", "Second")), 
                    Frequency=c(10,15,5,2,14,20,3))

Solution 2 - R

You can also use the dplyr package for that purpose:

library(dplyr)
x %>% 
  group_by(Category) %>% 
  summarise(Frequency = sum(Frequency))

#Source: local data frame [3 x 2]
#
#  Category Frequency
#1    First        30
#2   Second         5
#3    Third        34

Or, for multiple summary columns (works with one column too):

x %>% 
  group_by(Category) %>% 
  summarise(across(everything(), sum))

Here are some more examples of how to summarise data by group using dplyr functions using the built-in dataset mtcars:

# several summary columns with arbitrary names
mtcars %>% 
  group_by(cyl, gear) %>%                            # multiple group columns
  summarise(max_hp = max(hp), mean_mpg = mean(mpg))  # multiple summary columns

# summarise all columns except grouping columns using "sum" 
mtcars %>% 
  group_by(cyl) %>% 
  summarise(across(everything(), sum))

# summarise all columns except grouping columns using "sum" and "mean"
mtcars %>% 
  group_by(cyl) %>% 
  summarise(across(everything(), list(mean = mean, sum = sum)))

# multiple grouping columns
mtcars %>% 
  group_by(cyl, gear) %>% 
  summarise(across(everything(), list(mean = mean, sum = sum)))

# summarise specific variables, not all
mtcars %>% 
  group_by(cyl, gear) %>% 
  summarise(across(c(qsec, mpg, wt), list(mean = mean, sum = sum)))

# summarise specific variables (numeric columns except grouping columns)
mtcars %>% 
  group_by(gear) %>% 
  summarise(across(where(is.numeric), list(mean = mean, sum = sum)))

For more information, including the %>% operator, see the introduction to dplyr.

Solution 3 - R

The answer provided by rcs works and is simple. However, if you are handling larger datasets and need a performance boost there is a faster alternative:

library(data.table)
data = data.table(Category=c("First","First","First","Second","Third", "Third", "Second"), 
                  Frequency=c(10,15,5,2,14,20,3))
data[, sum(Frequency), by = Category]
#    Category V1
# 1:    First 30
# 2:   Second  5
# 3:    Third 34
system.time(data[, sum(Frequency), by = Category] )
# user    system   elapsed 
# 0.008     0.001     0.009 

Let's compare that to the same thing using data.frame and the above above:

data = data.frame(Category=c("First","First","First","Second","Third", "Third", "Second"),
                  Frequency=c(10,15,5,2,14,20,3))
system.time(aggregate(data$Frequency, by=list(Category=data$Category), FUN=sum))
# user    system   elapsed 
# 0.008     0.000     0.015 

And if you want to keep the column this is the syntax:

data[,list(Frequency=sum(Frequency)),by=Category]
#    Category Frequency
# 1:    First        30
# 2:   Second         5
# 3:    Third        34

The difference will become more noticeable with larger datasets, as the code below demonstrates:

data = data.table(Category=rep(c("First", "Second", "Third"), 100000),
                  Frequency=rnorm(100000))
system.time( data[,sum(Frequency),by=Category] )
# user    system   elapsed 
# 0.055     0.004     0.059 
data = data.frame(Category=rep(c("First", "Second", "Third"), 100000), 
                  Frequency=rnorm(100000))
system.time( aggregate(data$Frequency, by=list(Category=data$Category), FUN=sum) )
# user    system   elapsed 
# 0.287     0.010     0.296 

For multiple aggregations, you can combine lapply and .SD as follows

data[, lapply(.SD, sum), by = Category]
#    Category Frequency
# 1:    First        30
# 2:   Second         5
# 3:    Third        34

Solution 4 - R

You can also use the by() function:

x2 <- by(x$Frequency, x$Category, sum)
do.call(rbind,as.list(x2))

Those other packages (plyr, reshape) have the benefit of returning a data.frame, but it's worth being familiar with by() since it's a base function.

Solution 5 - R

Several years later, just to add another simple base R solution that isn't present here for some reason- xtabs

xtabs(Frequency ~ Category, df)
# Category
# First Second  Third 
#    30      5     34 

Or if you want a data.frame back

as.data.frame(xtabs(Frequency ~ Category, df))
#   Category Freq
# 1    First   30
# 2   Second    5
# 3    Third   34

Solution 6 - R

library(plyr)
ddply(tbl, .(Category), summarise, sum = sum(Frequency))

Solution 7 - R

If x is a dataframe with your data, then the following will do what you want:

require(reshape)
recast(x, Category ~ ., fun.aggregate=sum)

Solution 8 - R

While I have recently become a convert to dplyr for most of these types of operations, the sqldf package is still really nice (and IMHO more readable) for some things.

Here is an example of how this question can be answered with sqldf

x <- data.frame(Category=factor(c("First", "First", "First", "Second",
                                  "Third", "Third", "Second")), 
                Frequency=c(10,15,5,2,14,20,3))

sqldf("select 
          Category
          ,sum(Frequency) as Frequency 
       from x 
       group by 
          Category")

##   Category Frequency
## 1    First        30
## 2   Second         5
## 3    Third        34

Solution 9 - R

Just to add a third option:

require(doBy)
summaryBy(Frequency~Category, data=yourdataframe, FUN=sum)

EDIT: this is a very old answer. Now I would recommend the use of group_by and summarise from dplyr, as in @docendo answer.

Solution 10 - R

Another solution that returns sums by groups in a matrix or a data frame and is short and fast:

rowsum(x$Frequency, x$Category)

Solution 11 - R

I find ave very helpful (and efficient) when you need to apply different aggregation functions on different columns (and you must/want to stick on base R) :

e.g.

Given this input :

DF <-                
data.frame(Categ1=factor(c('A','A','B','B','A','B','A')),
           Categ2=factor(c('X','Y','X','X','X','Y','Y')),
           Samples=c(1,2,4,3,5,6,7),
           Freq=c(10,30,45,55,80,65,50))

> DF
  Categ1 Categ2 Samples Freq
1      A      X       1   10
2      A      Y       2   30
3      B      X       4   45
4      B      X       3   55
5      A      X       5   80
6      B      Y       6   65
7      A      Y       7   50

we want to group by Categ1 and Categ2 and compute the sum of Samples and mean of Freq.
Here's a possible solution using ave :

# create a copy of DF (only the grouping columns)
DF2 <- DF[,c('Categ1','Categ2')]

# add sum of Samples by Categ1,Categ2 to DF2 
# (ave repeats the sum of the group for each row in the same group)
DF2$GroupTotSamples <- ave(DF$Samples,DF2,FUN=sum)

# add mean of Freq by Categ1,Categ2 to DF2 
# (ave repeats the mean of the group for each row in the same group)
DF2$GroupAvgFreq <- ave(DF$Freq,DF2,FUN=mean)

# remove the duplicates (keep only one row for each group)
DF2 <- DF2[!duplicated(DF2),]

Result :

> DF2
  Categ1 Categ2 GroupTotSamples GroupAvgFreq
1      A      X               6           45
2      A      Y               9           40
3      B      X               7           50
6      B      Y               6           65

Solution 12 - R

Since dplyr 1.0.0, the across() function could be used:

df %>%
 group_by(Category) %>%
 summarise(across(Frequency, sum))

  Category Frequency
  <chr>        <int>
1 First           30
2 Second           5
3 Third           34

If interested in multiple variables:

df %>%
 group_by(Category) %>%
 summarise(across(c(Frequency, Frequency2), sum))

  Category Frequency Frequency2
  <chr>        <int>      <int>
1 First           30         55
2 Second           5         29
3 Third           34        190

And the selection of variables using select helpers:

df %>%
 group_by(Category) %>%
 summarise(across(starts_with("Freq"), sum))

  Category Frequency Frequency2 Frequency3
  <chr>        <int>      <int>      <dbl>
1 First           30         55        110
2 Second           5         29         58
3 Third           34        190        380

Sample data:

df <- read.table(text = "Category Frequency Frequency2 Frequency3
                 1    First        10         10         20
                 2    First        15         30         60
                 3    First         5         15         30
                 4   Second         2          8         16
                 5    Third        14         70        140
                 6    Third        20        120        240
                 7   Second         3         21         42",
                 header = TRUE,
                 stringsAsFactors = FALSE)

Solution 13 - R

You could use the function group.sum from package Rfast.

Category <- Rfast::as_integer(Category,result.sort=FALSE) # convert character to numeric. R's as.numeric produce NAs.
result <- Rfast::group.sum(Frequency,Category)
names(result) <- Rfast::Sort(unique(Category)
# 30 5 34

Rfast has many group functions and group.sum is one of them.

Solution 14 - R

using cast instead of recast (note 'Frequency' is now 'value')

df  <- data.frame(Category = c("First","First","First","Second","Third","Third","Second")
                  , value = c(10,15,5,2,14,20,3))

install.packages("reshape")

result<-cast(df, Category ~ . ,fun.aggregate=sum)

to get:

Category (all)
First     30
Second    5
Third     34

Solution 15 - R

library(tidyverse)

x <- data.frame(Category= c('First', 'First', 'First', 'Second', 'Third', 'Third', 'Second'), 
           Frequency = c(10, 15, 5, 2, 14, 20, 3))

count(x, Category, wt = Frequency)

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
Questionuser5243421View Question on Stackoverflow
Solution 1 - RrcsView Answer on Stackoverflow
Solution 2 - RtalatView Answer on Stackoverflow
Solution 3 - RasieiraView Answer on Stackoverflow
Solution 4 - RShaneView Answer on Stackoverflow
Solution 5 - RDavid ArenburgView Answer on Stackoverflow
Solution 6 - RlearnrView Answer on Stackoverflow
Solution 7 - RRob HyndmanView Answer on Stackoverflow
Solution 8 - RjoemienkoView Answer on Stackoverflow
Solution 9 - RdalloliogmView Answer on Stackoverflow
Solution 10 - RKarolis KoncevičiusView Answer on Stackoverflow
Solution 11 - RdigEmAllView Answer on Stackoverflow
Solution 12 - RtmfmnkView Answer on Stackoverflow
Solution 13 - RManos PapadakisView Answer on Stackoverflow
Solution 14 - RGrant ShannonView Answer on Stackoverflow
Solution 15 - Ruser12703198View Answer on Stackoverflow