Multiple aggregations of the same column using pandas GroupBy.agg()
PythonPandasDataframeAggregatePandas GroupbyPython Problem Overview
Is there a pandas built-in way to apply two different aggregating functions f1, f2
to the same column df["returns"]
, without having to call agg()
multiple times?
Example dataframe:
import pandas as pd
import datetime as dt
import numpy as np
pd.np.random.seed(0)
df = pd.DataFrame({
"date" : [dt.date(2012, x, 1) for x in range(1, 11)],
"returns" : 0.05 * np.random.randn(10),
"dummy" : np.repeat(1, 10)
})
The syntactically wrong, but intuitively right, way to do it would be:
# Assume `f1` and `f2` are defined for aggregating.
df.groupby("dummy").agg({"returns": f1, "returns": f2})
Obviously, Python doesn't allow duplicate keys. Is there any other manner for expressing the input to agg()
? Perhaps a list of tuples [(column, function)]
would work better, to allow multiple functions applied to the same column? But agg()
seems like it only accepts a dictionary.
Is there a workaround for this besides defining an auxiliary function that just applies both of the functions inside of it? (How would this work with aggregation anyway?)
Python Solutions
Solution 1 - Python
You can simply pass the functions as a list:
In [20]: df.groupby("dummy").agg({"returns": [np.mean, np.sum]})
Out[20]:
mean sum
dummy
1 0.036901 0.369012
or as a dictionary:
In [21]: df.groupby('dummy').agg({'returns':
{'Mean': np.mean, 'Sum': np.sum}})
Out[21]:
returns
Mean Sum
dummy
1 0.036901 0.369012
To avoid deprecation warining:
df.groupby('dummy').agg(Mean=('returns', np.mean),
Sum=('returns', np.sum))
Solution 2 - Python
TLDR; Pandas groupby.agg
has a new, easier syntax for specifying (1) aggregations on multiple columns, and (2) multiple aggregations on a column. So, to do this for pandas >= 0.25, use
df.groupby('dummy').agg(Mean=('returns', 'mean'), Sum=('returns', 'sum'))
Mean Sum
dummy
1 0.036901 0.369012
OR
df.groupby('dummy')['returns'].agg(Mean='mean', Sum='sum')
Mean Sum
dummy
1 0.036901 0.369012
#Pandas >= 0.25: Named Aggregation
Pandas has changed the behavior of GroupBy.agg
in favour of a more intuitive syntax for specifying named aggregations. See the 0.25 docs section on Enhancements as well as relevant GitHub issues GH18366 and GH26512.
From the documentation,
> To support column-specific aggregation with control over the output
> column names, pandas accepts the special syntax in GroupBy.agg()
,
> known as “named aggregation”, where
>
> - The keywords are the output column names
> - The values are tuples whose first element is the column to select and the second element is the aggregation to apply to that column.
> Pandas provides the pandas.NamedAgg namedtuple with the fields
> ['column', 'aggfunc'] to make it clearer what the arguments are. As
> usual, the aggregation can be a callable or a string alias.
You can now pass a tuple via keyword arguments. The tuples follow the format of (<colName>, <aggFunc>)
.
import pandas as pd
pd.__version__
# '0.25.0.dev0+840.g989f912ee'
# Setup
df = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
'height': [9.1, 6.0, 9.5, 34.0],
'weight': [7.9, 7.5, 9.9, 198.0]
})
df.groupby('kind').agg(
max_height=('height', 'max'), min_weight=('weight', 'min'),)
max_height min_weight
kind
cat 9.5 7.9
dog 34.0 7.5
Alternatively, you can use pd.NamedAgg
(essentially a namedtuple) which makes things more explicit.
df.groupby('kind').agg(
max_height=pd.NamedAgg(column='height', aggfunc='max'),
min_weight=pd.NamedAgg(column='weight', aggfunc='min')
)
max_height min_weight
kind
cat 9.5 7.9
dog 34.0 7.5
It is even simpler for Series, just pass the aggfunc to a keyword argument.
df.groupby('kind')['height'].agg(max_height='max', min_height='min')
max_height min_height
kind
cat 9.5 9.1
dog 34.0 6.0
Lastly, if your column names aren't valid python identifiers, use a dictionary with unpacking:
df.groupby('kind')['height'].agg(**{'max height': 'max', ...})
Pandas < 0.25
In more recent versions of pandas leading upto 0.24, if using a dictionary for specifying column names for the aggregation output, you will get a FutureWarning
:
df.groupby('dummy').agg({'returns': {'Mean': 'mean', 'Sum': 'sum'}})
# FutureWarning: using a dict with renaming is deprecated and will be removed
# in a future version
Using a dictionary for renaming columns is deprecated in v0.20. On more recent versions of pandas, this can be specified more simply by passing a list of tuples. If specifying the functions this way, all functions for that column need to be specified as tuples of (name, function) pairs.
df.groupby("dummy").agg({'returns': [('op1', 'sum'), ('op2', 'mean')]})
returns
op1 op2
dummy
1 0.328953 0.032895
Or,
df.groupby("dummy")['returns'].agg([('op1', 'sum'), ('op2', 'mean')])
op1 op2
dummy
1 0.328953 0.032895
Solution 3 - Python
Would something like this work:
In [7]: df.groupby('dummy').returns.agg({'func1' : lambda x: x.sum(), 'func2' : lambda x: x.prod()})
Out[7]:
func2 func1
dummy
1 -4.263768e-16 -0.188565