What is the difference between pandas agg and apply function?
Problem Overview
I can't figure out the difference between the Pandas .aggregate and .apply functions.
Take the following as an example: I load a dataset, do a groupby, define a simple function, and use either .agg or .apply.
As you can see, the print statements within my function produce the same output with .agg and with .apply. The result, on the other hand, is different. Why is that?
import pandas as pd
iris = pd.read_csv('iris.csv')
by_species = iris.groupby('Species')
def f(x):
    # Show what kind of object each call receives, then return a constant
    print(type(x))
    print(x.head(3))
    return 1
Using apply:
by_species.apply(f)
#<class 'pandas.core.frame.DataFrame'>
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#0 5.1 3.5 1.4 0.2 setosa
#1 4.9 3.0 1.4 0.2 setosa
#2 4.7 3.2 1.3 0.2 setosa
#<class 'pandas.core.frame.DataFrame'>
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#0 5.1 3.5 1.4 0.2 setosa
#1 4.9 3.0 1.4 0.2 setosa
#2 4.7 3.2 1.3 0.2 setosa
#<class 'pandas.core.frame.DataFrame'>
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#50 7.0 3.2 4.7 1.4 versicolor
#51 6.4 3.2 4.5 1.5 versicolor
#52 6.9 3.1 4.9 1.5 versicolor
#<class 'pandas.core.frame.DataFrame'>
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#100 6.3 3.3 6.0 2.5 virginica
#101 5.8 2.7 5.1 1.9 virginica
#102 7.1 3.0 5.9 2.1 virginica
#Out[33]:
#Species
#setosa 1
#versicolor 1
#virginica 1
#dtype: int64
Using agg:
by_species.agg(f)
#<class 'pandas.core.frame.DataFrame'>
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#0 5.1 3.5 1.4 0.2 setosa
#1 4.9 3.0 1.4 0.2 setosa
#2 4.7 3.2 1.3 0.2 setosa
#<class 'pandas.core.frame.DataFrame'>
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#50 7.0 3.2 4.7 1.4 versicolor
#51 6.4 3.2 4.5 1.5 versicolor
#52 6.9 3.1 4.9 1.5 versicolor
#<class 'pandas.core.frame.DataFrame'>
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#100 6.3 3.3 6.0 2.5 virginica
#101 5.8 2.7 5.1 1.9 virginica
#102 7.1 3.0 5.9 2.1 virginica
#Out[34]:
# Sepal.Length Sepal.Width Petal.Length Petal.Width
#Species
#setosa 1 1 1 1
#versicolor 1 1 1 1
#virginica 1 1 1 1
Python Solutions
Solution 1 - Python
apply applies the function to each group (your Species). Your function returns 1, so you end up with 1 value for each of the 3 groups.
agg aggregates each column (feature) for each group, so you end up with one value per column per group.
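The difference in result shape can be reproduced on a small frame. This is a minimal sketch, not the asker's data: the frame, the column names, and the helper functions `f_apply` and `f_agg` are made up for illustration, and the numeric columns are selected explicitly so that the grouping column is not passed to the function. In recent pandas releases the recorded types should show a DataFrame per group for apply and a Series per column per group for agg.

```python
import pandas as pd

# Small made-up frame standing in for the iris data.
df = pd.DataFrame({
    "species": ["a", "a", "b", "b"],
    "length": [1.0, 2.0, 3.0, 4.0],
    "width": [0.1, 0.2, 0.3, 0.4],
})
grouped = df.groupby("species")[["length", "width"]]

# Record the type of object each function call receives.
apply_inputs, agg_inputs = [], []

def f_apply(x):
    apply_inputs.append(type(x).__name__)
    return 1

def f_agg(x):
    agg_inputs.append(type(x).__name__)
    return 1

per_group = grouped.apply(f_apply)   # one value per group
per_column = grouped.agg(f_agg)      # one value per column per group

print(per_group.shape)
print(per_column.shape)
print(set(apply_inputs), set(agg_inputs))
```

With two groups and two selected columns, `per_group` has one entry per group while `per_column` has one cell for every group and column pair.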
Do read the groupby docs; they're quite helpful. There are also a bunch of tutorials floating around the web.
Solution 2 - Python
(Note: These comparisons are relevant for DataFrameGroupBy objects.)
Some plausible advantages of using .agg() compared to .apply(), for DataFrameGroupBy objects, would be:
- .agg() gives the flexibility of applying multiple functions at once, or passing a list of functions to each column.
- It also allows applying different functions to different columns of the dataframe at once.
That means you have a lot of control over each column with each operation.
Here is the link for more details: http://pandas.pydata.org/pandas-docs/version/0.13.1/groupby.html
However, apply takes a single function at a time, so you might have to call it repeatedly to run different operations on the same column.
Here are some example comparisons of .apply() vs .agg() for DataFrameGroupBy objects:
Given the following dataframe:
In [261]: df = pd.DataFrame({"name":["Foo", "Baar", "Foo", "Baar"], "score_1":[5,10,15,10], "score_2" :[10,15,10,25], "score_3" : [10,20,30,40]})
In [262]: df
Out[262]:
name score_1 score_2 score_3
0 Foo 5 10 10
1 Baar 10 15 20
2 Foo 15 10 30
3 Baar 10 25 40
Let's first see the operations using .apply():
In [263]: df.groupby(["name", "score_1"])["score_2"].apply(lambda x : x.sum())
Out[263]:
name score_1
Baar 10 40
Foo 5 10
15 10
Name: score_2, dtype: int64
In [264]: df.groupby(["name", "score_1"])["score_2"].apply(lambda x : x.min())
Out[264]:
name score_1
Baar 10 15
Foo 5 10
15 10
Name: score_2, dtype: int64
In [265]: df.groupby(["name", "score_1"])["score_2"].apply(lambda x : x.mean())
Out[265]:
name score_1
Baar 10 20.0
Foo 5 10.0
15 10.0
Name: score_2, dtype: float64
Now look at the same kind of operations done in a single call with .agg():
In [276]: df.groupby(["name", "score_1"]).agg({"score_3" :[np.sum, np.min, np.mean, np.max], "score_2":lambda x : x.mean()})
Out[276]:
score_2 score_3
<lambda> sum amin mean amax
name score_1
Baar 10 20 60 20 30 40
Foo 5 10 10 10 10 10
15 10 30 30 30 30
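As a hedged variant of the call above, the functions can also be given as string names, which avoids the numpy import and, in newer pandas versions, labels the columns `min` and `max` rather than `amin` and `amax`. The sketch below rebuilds the same dataframe; the variable name `summary` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Foo", "Baar", "Foo", "Baar"],
    "score_1": [5, 10, 15, 10],
    "score_2": [10, 15, 10, 25],
    "score_3": [10, 20, 30, 40],
})

# Several functions for score_3, a single one for score_2, all in one call.
summary = df.groupby(["name", "score_1"]).agg(
    {"score_3": ["sum", "min", "mean", "max"], "score_2": "mean"}
)
print(summary)
```

Individual cells can then be read with a (row key, column key) pair, for example `summary.loc[("Baar", 10), ("score_3", "sum")]`.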
So, .agg() can be really handy for handling DataFrameGroupBy objects, as compared to .apply(). But if you are handling plain DataFrame objects rather than DataFrameGroupBy objects, then apply() can be very useful, as it can apply a function along either axis of the dataframe.
(For example, axis = 0, which is the default, passes each column to the function, and axis = 1 passes each row, when dealing with plain DataFrame objects.)
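The axis behaviour on a plain DataFrame can be sketched as follows. The frame and the variable names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"score_1": [5, 10, 15], "score_2": [10, 15, 10]})

# axis=0 (the default): the function receives each column as a Series.
col_sums = df.apply(lambda col: col.sum())

# axis=1: the function receives each row as a Series.
row_sums = df.apply(lambda row: row.sum(), axis=1)

print(col_sums)
print(row_sums)
```

Here `col_sums` is indexed by column name and `row_sums` by the original row index.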
Solution 3 - Python
The main difference between apply and aggregate is:
apply():
cannot be applied with a dict of per-column functions to multiple groups together
For apply(), we have to call get_group() first
Raises an error: iris.groupby('Species').apply({'Sepal.Length':['min','max'],'Sepal.Width':['mean','min']})
Works fine: iris.groupby('Species').get_group('setosa').apply({'Sepal.Length':['min','max'],'Sepal.Width':['mean','min']})
# because the functions are applied to a single data frame
agg():
can be applied to multiple groups together
For agg(), we do not have to call get_group()
iris.groupby('Species').agg({'Sepal.Length':['min','max'],'Sepal.Width':['mean','min']})
iris.groupby('Species').get_group('versicolor').agg({'Sepal.Length':['min','max'],'Sepal.Width':['mean','min']})
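The calls above can be tried without iris.csv. This is a minimal sketch on a tiny stand-in frame whose values are made up; the names `spec`, `all_groups`, `one_group`, and `apply_error` are illustrative only. The failing call is wrapped in a broad `except`, since the exact exception may vary by pandas version (a dict is not callable, so a TypeError is the likely outcome).

```python
import pandas as pd

# Tiny stand-in for iris.csv; the values are made up for illustration.
iris = pd.DataFrame({
    "Species": ["setosa", "setosa", "versicolor", "versicolor"],
    "Sepal.Length": [5.1, 4.9, 7.0, 6.4],
    "Sepal.Width": [3.5, 3.0, 3.2, 3.2],
})
spec = {"Sepal.Length": ["min", "max"], "Sepal.Width": ["mean", "min"]}

# agg accepts the dict for all groups at once ...
all_groups = iris.groupby("Species").agg(spec)

# ... and also for a single group selected with get_group().
one_group = iris.groupby("Species").get_group("setosa").agg(spec)

# Passing the same dict to groupby.apply fails.
try:
    iris.groupby("Species").apply(spec)
    apply_error = None
except Exception as exc:
    apply_error = type(exc).__name__

print(all_groups)
print(one_group)
print(apply_error)
```

Note that `all_groups` has one row per species, while `one_group` has one row per function name.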
Solution 4 - Python
When using apply on a groupby, I have found that .apply will return the grouped columns. There is a note in the documentation (pandas.pydata.org/pandas-docs/stable/groupby.html):
>"...Thus the grouped columns(s) may be included in the output as well as set the indices."
.aggregate will not return the grouped columns.
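One way to check this on your own installation is sketched below. Whether the function passed to .apply sees the grouping column has changed across pandas releases, and some releases emit a warning about it, so the snippet only records what it observes rather than assuming an answer. The frame and the helper `sees_key` are made up for illustration.

```python
import warnings

import pandas as pd

df = pd.DataFrame({"name": ["Foo", "Baar", "Foo"], "score": [5, 10, 15]})

def sees_key(group):
    # True if the grouping column is part of the frame passed to the function.
    return "name" in group.columns

with warnings.catch_warnings():
    # Some pandas releases warn when apply operates on the grouping column.
    warnings.simplefilter("ignore")
    apply_sees_key = df.groupby("name").apply(sees_key)

# With agg, the grouping column becomes the index, not a result column.
agg_result = df.groupby("name").agg("sum")

print(apply_sees_key)
print(list(agg_result.columns))
```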