Mean per group in a data.frame

RAggregate

R Problem Overview


I have a data.frame and I need to calculate the mean per group (i.e. per Month, below).

Name     Month  Rate1     Rate2
Aira       1      12        23
Aira       2      18        73
Aira       3      19        45
Ben        1      53        19
Ben        2      22        87
Ben        3      19        45
Cat        1      22        87
Cat        2      67        43
Cat        3      45        32

My desired output is like below, where the values for Rate1 and Rate2 are the group means. Please disregard the value, I have made it up for the example.

Name       Rate1       Rate2
Aira        23.21       12.2
Ben         45.23       43.9
Cat         33.22       32.2

R Solutions


Solution 1 - R

This type of operation is exactly what aggregate was designed for:

d <- read.table(text=
'Name     Month  Rate1     Rate2
Aira       1      12        23
Aira       2      18        73
Aira       3      19        45
Ben        1      53        19
Ben        2      22        87
Ben        3      19        45
Cat        1      22        87
Cat        2      67        43
Cat        3      45        32', header=TRUE)

aggregate(d[, 3:4], list(d$Name), mean)

  Group.1    Rate1    Rate2
1    Aira 16.33333 47.00000
2     Ben 31.33333 50.33333
3     Cat 44.66667 54.00000

Here we aggregate columns 3 and 4 of data.frame d, grouping by d$Name, and applying the mean function.


Or, using a formula interface:

aggregate(. ~ Name, d[-2], mean)

Solution 2 - R

Or use group_by & summarise_at from the dplyr package:

library(dplyr)

d %>%
  group_by(Name) %>%
  summarise_at(vars(-Month), funs(mean(., na.rm=TRUE)))

# A tibble: 3 x 3
  Name  Rate1 Rate2
  <fct> <dbl> <dbl>
1 Aira   16.3  47.0
2 Ben    31.3  50.3
3 Cat    44.7  54.0

See ?summarise_at for the many ways to specify the variables to act on. Here, vars(-Month) says all variables except Month.


In more recent versions of tidyverse/dplyr, using summarise(across(...)) is preferred to summarise_at:

d %>% 
  group_by(Name) %>%
  summarise(across(-Month, mean, na.rm = TRUE))

Solution 3 - R

You can also use package plyr, which is somehow more versatile:

library(plyr)

ddply(d, .(Name), summarize,  Rate1=mean(Rate1), Rate2=mean(Rate2))

  Name    Rate1    Rate2
1 Aira 16.33333 47.00000
2  Ben 31.33333 50.33333
3  Cat 44.66667 54.00000

Solution 4 - R

A third great alternative is using the package data.table, which also has the class data.frame, but operations like you are looking for are computed much faster.

library(data.table)
mydt <- structure(list(Name = c("Aira", "Aira", "Aira", "Ben", "Ben", "Ben", "Cat", "Cat", "Cat"), Month = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), Rate1 = c(15.6396600443877, 2.15649279424609, 6.24692918928743, 2.37658797276116, 34.7500663272292, 3.28750138697048, 29.3265553981065, 17.9821839334431, 10.8639802575958), Rate2 = c(17.1680489538369, 5.84231656330206, 8.54330866437461, 5.88415184986176, 3.02064294862551, 17.2053351400752, 16.9552950199166, 2.56058000170089, 15.7496228048122)), .Names = c("Name", "Month", "Rate1", "Rate2"), row.names = c(NA, -9L), class = c("data.table", "data.frame"))

Now to take the mean of Rate1 and Rate2 for all 3 months, for each person (Name): First, decide which columns you want to take the mean of

colstoavg <- names(mydt)[3:4]

Now we use lapply to take the mean over the columns we want to avg (colstoavg)

mydt.mean <- mydt[,lapply(.SD,mean,na.rm=TRUE),by=Name,.SDcols=colstoavg]

 mydt.mean
   Name     Rate1     Rate2
1: Aira  8.014361 10.517891
2:  Ben 13.471385  8.703377
3:  Cat 19.390907 11.755166

Solution 5 - R

Here are a variety of ways to do this in base R including an alternative aggregate approach. The examples below return means per month, which I think is what you requested. Although, the same approach could be used to return means per person:

Using ave:

my.data <- read.table(text = '
     Name     Month  Rate1     Rate2
     Aira       1      12        23
     Aira       2      18        73
     Aira       3      19        45
     Ben        1      53        19
     Ben        2      22        87
     Ben        3      19        45
     Cat        1      22        87
     Cat        2      67        43
     Cat        3      45        32
', header = TRUE, stringsAsFactors = FALSE, na.strings = 'NA')

Rate1.mean <- with(my.data, ave(Rate1, Month, FUN = function(x) mean(x, na.rm = TRUE)))
Rate2.mean <- with(my.data, ave(Rate2, Month, FUN = function(x) mean(x, na.rm = TRUE)))

my.data <- data.frame(my.data, Rate1.mean, Rate2.mean)
my.data

Using by:

my.data <- read.table(text = '
     Name     Month  Rate1     Rate2
     Aira       1      12        23
     Aira       2      18        73
     Aira       3      19        45
     Ben        1      53        19
     Ben        2      22        87
     Ben        3      19        45
     Cat        1      22        87
     Cat        2      67        43
     Cat        3      45        32
', header = TRUE, stringsAsFactors = FALSE, na.strings = 'NA')

by.month <- as.data.frame(do.call("rbind", by(my.data, my.data$Month, FUN = function(x) colMeans(x[,3:4]))))
colnames(by.month) <- c('Rate1.mean', 'Rate2.mean')
by.month <- cbind(Month = rownames(by.month), by.month)

my.data <- merge(my.data, by.month, by = 'Month')
my.data

Using lapply and split:

my.data <- read.table(text = '
     Name     Month  Rate1     Rate2
     Aira       1      12        23
     Aira       2      18        73
     Aira       3      19        45
     Ben        1      53        19
     Ben        2      22        87
     Ben        3      19        45
     Cat        1      22        87
     Cat        2      67        43
     Cat        3      45        32
', header = TRUE, stringsAsFactors = FALSE, na.strings = 'NA')

ly.mean <- lapply(split(my.data, my.data$Month), function(x) c(Mean = colMeans(x[,3:4])))
ly.mean <- as.data.frame(do.call("rbind", ly.mean))
ly.mean <- cbind(Month = rownames(ly.mean), ly.mean)

my.data <- merge(my.data, ly.mean, by = 'Month')
my.data

Using sapply and split:

my.data <- read.table(text = '
     Name     Month  Rate1     Rate2
     Aira       1      12        23
     Aira       2      18        73
     Aira       3      19        45
     Ben        1      53        19
     Ben        2      22        87
     Ben        3      19        45
     Cat        1      22        87
     Cat        2      67        43
     Cat        3      45        32
', header = TRUE, stringsAsFactors = FALSE, na.strings = 'NA')
my.data

sy.mean <- t(sapply(split(my.data, my.data$Month), function(x) colMeans(x[,3:4])))
colnames(sy.mean) <- c('Rate1.mean', 'Rate2.mean')
sy.mean <- data.frame(Month = rownames(sy.mean), sy.mean, stringsAsFactors = FALSE)
my.data <- merge(my.data, sy.mean, by = 'Month')
my.data

Using aggregate:

my.data <- read.table(text = '
     Name     Month  Rate1     Rate2
     Aira       1      12        23
     Aira       2      18        73
     Aira       3      19        45
     Ben        1      53        19
     Ben        2      22        87
     Ben        3      19        45
     Cat        1      22        87
     Cat        2      67        43
     Cat        3      45        32
', header = TRUE, stringsAsFactors = FALSE, na.strings = 'NA')

my.summary <- with(my.data, aggregate(list(Rate1, Rate2), by = list(Month), 
                   FUN = function(x) { mon.mean = mean(x, na.rm = TRUE) } ))

my.summary <- do.call(data.frame, my.summary)
colnames(my.summary) <- c('Month', 'Rate1.mean', 'Rate2.mean')
my.summary

my.data <- merge(my.data, my.summary, by = 'Month')
my.data

EDIT: June 28, 2020

Here I use aggregate to obtain the column means of an entire matrix by group where group is defined in an external vector:

my.group <- c(1,2,1,2,2,3,1,2,3,3)

my.data <- matrix(c(   1,    2,    3,    4,    5,
                      10,   20,   30,   40,   50,
                       2,    4,    6,    8,   10,
                      20,   30,   40,   50,   60,
                      20,   18,   16,   14,   12,
                    1000, 1100, 1200, 1300, 1400,
                       2,    3,    4,    3,    2,
                      50,   40,   30,   20,   10,
                    1001, 2001, 3001, 4001, 5001,
                    1000, 2000, 3000, 4000, 5000), nrow = 10, ncol = 5, byrow = TRUE)
my.data

my.summary <- aggregate(list(my.data), by = list(my.group), FUN = function(x) { my.mean = mean(x, na.rm = TRUE) } )
my.summary
#  Group.1          X1       X2          X3       X4          X5
#1       1    1.666667    3.000    4.333333    5.000    5.666667
#2       2   25.000000   27.000   29.000000   31.000   33.000000
#3       3 1000.333333 1700.333 2400.333333 3100.333 3800.333333

Solution 6 - R

I describe two ways to do this, one based on data.table and the other based on reshape2 package . The data.table way already has an answer, but I have tried to make it cleaner and more detailed.

The data is like this:

 d <- structure(list(Name = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 
3L, 3L), .Label = c("Aira", "Ben", "Cat"), class = "factor"), 
    Month = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), Rate1 = c(12L, 
    18L, 19L, 53L, 22L, 19L, 22L, 67L, 45L), Rate2 = c(23L, 73L, 
    45L, 19L, 87L, 45L, 87L, 43L, 32L)), .Names = c("Name", "Month", 
"Rate1", "Rate2"), class = "data.frame", row.names = c(NA, -9L
))
head(d)
  Name Month Rate1 Rate2
1 Aira     1    12    23
2 Aira     2    18    73
3 Aira     3    19    45
4  Ben     1    53    19
5  Ben     2    22    87
6  Ben     3    19    45


library("reshape2")
mym <- melt(d, id = c("Name"))
res <- dcast(mym, Name ~ variable, mean)
res
#Name Month    Rate1    Rate2
#1 Aira     2 16.33333 47.00000
#2  Ben     2 31.33333 50.33333
#3  Cat     2 44.66667 54.00000

Using data.table:

# At first, I convert the data.frame to data.table and then I group it 
setDT(d)
d[, .(Rate1 = mean(Rate1), Rate2 = mean(Rate2)), by = .(Name)]
#   Name    Rate1    Rate2
#1: Aira 16.33333 47.00000
#2:  Ben 31.33333 50.33333
#3:  Cat 44.66667 54.00000

There is another way of doing it by avoiding to write many argument for j in data.table using a .SD

d[, lapply(.SD, mean), by = .(Name)]
#   Name Month    Rate1    Rate2
#1: Aira     2 16.33333 47.00000
#2:  Ben     2 31.33333 50.33333
#3:  Cat     2 44.66667 54.00000

if we only want to have Rate1 and Rate2 then we can use the .SDcols as follows:

d[, lapply(.SD, mean), by = .(Name), .SDcols = 3:4]
#  Name    Rate1    Rate2
#1: Aira 16.33333 47.00000
#2:  Ben 31.33333 50.33333
#3:  Cat 44.66667 54.00000

Solution 7 - R

You could also use the generic function cbind() and lm() without the intercept:

cbind(lm(d$Rate1~-1+d$Name)$coef,lm(d$Rate2~-1+d$Name)$coef)
>               [,1]     [,2]
>d$NameAira 16.33333 47.00000
>d$NameBen  31.33333 50.33333
>d$NameCat  44.66667 54.00000

Solution 8 - R

You can also accomplish this using the sqldf package as shown below:

library(sqldf)

x <- read.table(text='Name     Month  Rate1     Rate2
Aira       1      12        23
                Aira       2      18        73
                Aira       3      19        45
                Ben        1      53        19
                Ben        2      22        87
                Ben        3      19        45
                Cat        1      22        87
                Cat        2      67        43
                Cat        3      45        32', header=TRUE)

sqldf("
select 
  Name
  ,avg(Rate1) as Rate1_float
  ,avg(Rate2) as Rate2_float
  ,avg(Rate1) as Rate1
  ,avg(Rate2) as Rate2
from x
group by 
  Name
")

#  Name Rate1_float Rate2_float Rate1 Rate2
#1 Aira    16.33333    47.00000    16    47
#2  Ben    31.33333    50.33333    31    50
#3  Cat    44.66667    54.00000    44    54

I am a recent convert to dplyr as shown in other answers, but sqldf is nice as most data analysts/data scientists/developers have at least some fluency in SQL. In this way, I think it tends to make for more universally readable code than dplyr or other solutions presented above.

UPDATE: In responding to the comment below, I attempted to update the code as shown above. However, the behavior was not as I expected. It seems that the column definition (i.e. int vs float) is only carried through when the column alias matches the original column name. When you specify a new name, the aggregate column is returned without rounding.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionIantheView Question on Stackoverflow
Solution 1 - RjbaumsView Answer on Stackoverflow
Solution 2 - RSam FirkeView Answer on Stackoverflow
Solution 3 - RZbynekView Answer on Stackoverflow
Solution 4 - RduHaasView Answer on Stackoverflow
Solution 5 - RMark MillerView Answer on Stackoverflow
Solution 6 - Ruser6376316View Answer on Stackoverflow
Solution 7 - RBeckyView Answer on Stackoverflow
Solution 8 - RjoemienkoView Answer on Stackoverflow