Pandas groupby cumulative sum

PythonPandasPandas Groupby

Python Problem Overview


I would like to add a cumulative sum column to my Pandas dataframe so that:

name | day       | no
-----|-----------|----
Jack | Monday    | 10
Jack | Tuesday   | 20
Jack | Tuesday   | 10
Jack | Wednesday | 50
Jill | Monday    | 40
Jill | Wednesday | 110

becomes:

Jack | Monday     | 10  | 10
Jack | Tuesday    | 30  | 40
Jack | Wednesday  | 50  | 90
Jill | Monday     | 40  | 40
Jill | Wednesday  | 110 | 150

I tried various combos of df.groupby and df.agg(lambda x: cumsum(x)) to no avail.

Python Solutions


Solution 1 - Python

This should do it, need groupby() twice:

df.groupby(['name', 'day']).sum() \
  .groupby(level=0).cumsum().reset_index()

Explanation:

print(df)
   name        day   no
0  Jack     Monday   10
1  Jack    Tuesday   20
2  Jack    Tuesday   10
3  Jack  Wednesday   50
4  Jill     Monday   40
5  Jill  Wednesday  110

# sum per name/day
print( df.groupby(['name', 'day']).sum() )
                 no
name day           
Jack Monday      10
     Tuesday     30
     Wednesday   50
Jill Monday      40
      Wednesday  110

# cumulative sum per name/day
print( df.groupby(['name', 'day']).sum() \
         .groupby(level=0).cumsum() )
				 no
name day           
Jack Monday      10
	 Tuesday     40
	 Wednesday   90
Jill Monday      40
	 Wednesday  150

The dataframe resulting from the first sum is indexed by 'name' and by 'day'. You can see it by printing

df.groupby(['name', 'day']).sum().index 

When computing the cumulative sum, you want to do so by 'name', corresponding to the first index (level 0).

Finally, use reset_index to have the names repeated.

df.groupby(['name', 'day']).sum().groupby(level=0).cumsum().reset_index()

   name        day   no
0  Jack     Monday   10
1  Jack    Tuesday   40
2  Jack  Wednesday   90
3  Jill     Monday   40
4  Jill  Wednesday  150

Solution 2 - Python

Modification to @Dmitry's answer. This is simpler and works in pandas 0.19.0:

print(df) 

 name        day   no
0  Jack     Monday   10
1  Jack    Tuesday   20
2  Jack    Tuesday   10
3  Jack  Wednesday   50
4  Jill     Monday   40
5  Jill  Wednesday  110

df['no_csum'] = df.groupby(['name'])['no'].cumsum()

print(df)
   name        day   no  no_csum
0  Jack     Monday   10       10
1  Jack    Tuesday   20       30
2  Jack    Tuesday   10       40
3  Jack  Wednesday   50       90
4  Jill     Monday   40       40
5  Jill  Wednesday  110      150

Solution 3 - Python

This works in pandas 0.16.2

In[23]: print df
        name          day   no
0      Jack       Monday    10
1      Jack      Tuesday    20
2      Jack      Tuesday    10
3      Jack    Wednesday    50
4      Jill       Monday    40
5      Jill    Wednesday   110
In[24]: df['no_cumulative'] = df.groupby(['name'])['no'].apply(lambda x: x.cumsum())
In[25]: print df
        name          day   no  no_cumulative
0      Jack       Monday    10             10
1      Jack      Tuesday    20             30
2      Jack      Tuesday    10             40
3      Jack    Wednesday    50             90
4      Jill       Monday    40             40
5      Jill    Wednesday   110            150

Solution 4 - Python

you should use

df['cum_no'] = df.no.cumsum()

http://pandas.pydata.org/pandas-docs/version/0.19.2/generated/pandas.DataFrame.cumsum.html

Another way of doing it

import pandas as pd
df = pd.DataFrame({'C1' : ['a','a','a','b','b'],
           'C2' : [1,2,3,4,5]})
df['cumsum'] = df.groupby(by=['C1'])['C2'].transform(lambda x: x.cumsum())
df

enter image description here

Solution 5 - Python

Instead of df.groupby(by=['name','day']).sum().groupby(level=[0]).cumsum() (see above) you could also do a df.set_index(['name', 'day']).groupby(level=0, as_index=False).cumsum()

  • df.groupby(by=['name','day']).sum() is actually just moving both columns to a MultiIndex
  • as_index=False means you do not need to call reset_index afterwards

Solution 6 - Python

data.csv:

name,day,no
Jack,Monday,10
Jack,Tuesday,20
Jack,Tuesday,10
Jack,Wednesday,50
Jill,Monday,40
Jill,Wednesday,110

Code:

import numpy as np
import pandas as pd

df = pd.read_csv('data.csv')
print(df)
df = df.groupby(['name', 'day'])['no'].sum().reset_index()
print(df)
df['cumsum'] = df.groupby(['name'])['no'].apply(lambda x: x.cumsum())
print(df)

Output:

   name        day   no
0  Jack     Monday   10
1  Jack    Tuesday   20
2  Jack    Tuesday   10
3  Jack  Wednesday   50
4  Jill     Monday   40
5  Jill  Wednesday  110
   name        day   no
0  Jack     Monday   10
1  Jack    Tuesday   30
2  Jack  Wednesday   50
3  Jill     Monday   40
4  Jill  Wednesday  110
   name        day   no  cumsum
0  Jack     Monday   10      10
1  Jack    Tuesday   30      40
2  Jack  Wednesday   50      90
3  Jill     Monday   40      40
4  Jill  Wednesday  110     150

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
Questionkc2819View Question on Stackoverflow
Solution 1 - PythonCT ZhuView Answer on Stackoverflow
Solution 2 - PythonvjaykyView Answer on Stackoverflow
Solution 3 - PythonDmitry AndreevView Answer on Stackoverflow
Solution 4 - PythonsushmitView Answer on Stackoverflow
Solution 5 - PythonChristophView Answer on Stackoverflow
Solution 6 - PythonAaj KaalView Answer on Stackoverflow