Add count of unique / distinct values by group to the original data
RCountUniqueAggregateR Problem Overview
I wish to count the number of unique values by grouping of a second variable, and then add the count to the existing data.frame as a new column. For example, if the existing data frame looks like this:
color type
1 black chair
2 black chair
3 black sofa
4 green sofa
5 green sofa
6 red sofa
7 red plate
8 blue sofa
9 blue plate
10 blue chair
I want to add for each color
, the count of unique types
that are present in the data:
color type unique_types
1 black chair 2
2 black chair 2
3 black sofa 2
4 green sofa 1
5 green sofa 1
6 red sofa 2
7 red plate 2
8 blue sofa 3
9 blue plate 3
10 blue chair 3
I was hoping to use ave
, but can't seem to find a straightforward method that doesn't require many lines. I have >100,000 rows, so am also not sure how important efficiency is.
It's somewhat similar to this issue: https://stackoverflow.com/questions/7450600/how-do-i-count-aggregate-values-from-a-data-frame-and-reincorporate-them-into-th
R Solutions
Solution 1 - R
Here's a solution with the dplyr package - it has n_distinct()
as a wrapper for length(unique())
.
df %>%
group_by(color) %>%
mutate(unique_types = n_distinct(type))
Solution 2 - R
Using ave
(since you ask for it specifically):
within(df, { count <- ave(type, color, FUN=function(x) length(unique(x)))})
Make sure that type
is character vector and not factor.
Since you also say your data is huge and that speed/performance may therefore be a factor, I'd suggest a data.table
solution as well.
require(data.table)
setDT(df)[, count := uniqueN(type), by = color] # v1.9.6+
# if you don't want df to be modified by reference
ans = as.data.table(df)[, count := uniqueN(type), by = color]
uniqueN
was implemented in v1.9.6
and is a faster equivalent of length(unique(.))
. In addition it also works with data.frames/data.tables.
Other solutions:
Using plyr:
require(plyr)
ddply(df, .(color), mutate, count = length(unique(type)))
Using aggregate
:
agg <- aggregate(data=df, type ~ color, function(x) length(unique(x)))
merge(df, agg, by="color", all=TRUE)
Solution 3 - R
This can be also achieved in a vectorized without by group operations by combining unique
with table
or tabulate
If df$color
is factor
, then
Either
table(unique(df)$color)[as.character(df$color)]
# black black black green green red red blue blue blue
# 2 2 2 1 1 2 2 3 3 3
Or
tabulate(unique(df)$color)[as.integer(df$color)]
# [1] 2 2 2 1 1 2 2 3 3 3
If df$color
is character
then just
table(unique(df)$color)[df$color]
If df$color
is an integer
then just
tabulate(unique(df)$color)[df$color]