Group by index + column in pandas

Python Problem Overview

I have a DataFrame with two fields, user_id and item_bought, where user_id is the index of the DataFrame. I want to group by both user_id and item_bought and get the item-wise count for each user.

How do I do that?

Python Solutions

Solution 1 - Python

From version 0.20.1 it is simpler:

>Strings passed to DataFrame.groupby() as the by parameter may now reference either column names or index level names

import numpy as np
import pandas as pd

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]

index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])

df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 3, 3],
                   'B': np.arange(8)}, index=index)

print(df)

              A  B
first second      
bar   one     1  0
      two     1  1
baz   one     1  2
      two     1  3
foo   one     2  4
      two     2  5
qux   one     3  6
      two     3  7
      
print(df.groupby(['second', 'A']).sum())
          B
second A   
one    1  2
       2  4
       3  6
two    1  4
       2  5
       3  7

Solution 2 - Python

This should work:

>>> df = pd.DataFrame(np.random.randint(0,5,(6, 2)), columns=['col1','col2'])
>>> df['ind1'] = list('AAABCC')
>>> df['ind2'] = range(6)
>>> df.set_index(['ind1','ind2'], inplace=True)
>>> df

           col1  col2
ind1 ind2            
A    0        3     2
     1        2     0
     2        2     3
B    3        2     4
C    4        3     1
     5        0     0


>>> df.groupby([df.index.get_level_values(0),'col1']).count()

           col2
ind1 col1      
A    2        2
     3        1
B    2        1
C    0        1
     3        1

I had the same problem when grouping by one of the levels of a MultiIndex. With a MultiIndex you cannot use df.index.levels[0], because it holds only the distinct values of that index level and will most likely have a different length than the DataFrame.

See http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.get_level_values.html for details: get_level_values will "Return vector of label values for requested level, equal to the length of the index".
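
The difference between levels and get_level_values can be checked with a small sketch. The values below are fixed by hand (the random data above will differ from run to run), so treat them as an assumption made for illustration.

```python
import pandas as pd

# Hand-picked data shaped like the example above.
idx = pd.MultiIndex.from_arrays(
    [list("AAABCC"), list(range(6))], names=["ind1", "ind2"]
)
df = pd.DataFrame({"col1": [3, 2, 2, 2, 3, 0]}, index=idx)

# levels holds only the distinct labels of a level (A, B, C) ...
n_distinct = len(df.index.levels[0])

# ... while get_level_values returns one label per row,
# which is the length groupby needs.
per_row = df.index.get_level_values("ind1")

result = df.groupby([per_row, "col1"]).size()
print(n_distinct, len(per_row))
print(result)
```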

Solution 3 - Python

import pandas as pd
import numpy as np

In [11]: df = pd.DataFrame()

In [12]: df['user_id'] = ['b','b','b','c']

In [13]: df['item_bought'] = ['x','x','y','y']

In [14]: df['ct'] = 1

In [15]: df
Out[15]:
  user_id item_bought  ct
0       b           x   1
1       b           x   1
2       b           y   1
3       c           y   1

In [16]: pd.pivot_table(df,values='ct',index=['user_id','item_bought'],aggfunc=np.sum)

Out[16]:

user_id  item_bought
b        x              2
         y              1
c        y              1
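
For comparison, here is a minimal sketch of the same aggregation written with groupby, using the same sample values. Passing aggfunc as the string "sum" is a choice made here, not part of the original answer, and the last line shows that size() produces the same counts without the helper column ct.

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": ["b", "b", "b", "c"],
    "item_bought": ["x", "x", "y", "y"],
    "ct": 1,  # helper column of ones, as in the answer above
})

keys = ["user_id", "item_bought"]

# pivot_table with a list of index keys and no columns argument
# aggregates like a groupby on those keys.
pivoted = pd.pivot_table(df, values="ct", index=keys, aggfunc="sum")
grouped = df.groupby(keys)["ct"].sum()

# Counting rows per group does not need the helper column at all.
sized = df.groupby(keys).size()
print(sized)
```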

Solution 4 - Python

I had the same problem: I imported a bunch of data and wanted to group by a field that was the index. I didn't have a MultiIndex or any of that jazz, and neither do you.

I figured the problem was that the field I wanted is the index, so at first I just reset the index, but that gave me a useless index field that I didn't want. So now I do the following (two levels of grouping):

grouped = df.reset_index().groupby(by=['Field1','Field2'])

Then I can use 'grouped' in a bunch of ways for different reports:

grouped[['Field3','Field4']].agg([np.mean, np.std])

This was what I wanted: the mean and standard deviation of Field3 and Field4, grouped by Field1 (the index) and Field2.
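
A runnable sketch of this pattern, with invented values standing in for Field1 to Field4 (the answer does not show its data). The string names "mean" and "std" are used here in place of np.mean and np.std, which is an assumption about what reads best on recent pandas versions.

```python
import pandas as pd

# Hypothetical data: Field1 is the index, the rest are columns.
df = pd.DataFrame(
    {
        "Field2": ["p", "p", "q", "q"],
        "Field3": [1.0, 2.0, 3.0, 5.0],
        "Field4": [10.0, 20.0, 30.0, 50.0],
    },
    index=pd.Index(["a", "a", "b", "b"], name="Field1"),
)

# reset_index turns Field1 into a regular column so it can be a key.
grouped = df.reset_index().groupby(by=["Field1", "Field2"])

# One row per (Field1, Field2); columns are (field, statistic) pairs.
report = grouped[["Field3", "Field4"]].agg(["mean", "std"])
print(report)
```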

For you, if you just want the count of items per user in one simple line using groupby, the code could be:

df.reset_index().groupby(by=['user_id']).count()

If you want to do more, you can (like me) create 'grouped' and then use that. As a beginner, I find it easier to follow that way.

Please note that reset_index is not in place by default: it returns a new DataFrame, so it will not mess up your original DataFrame.
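
To get the item-wise count the question asks for, the same approach could be extended with item_bought as a second key. This is a sketch on made-up data; it also checks that the original frame keeps user_id as its index afterwards.

```python
import pandas as pd

# Hypothetical sample data: user_id is the index, item_bought is a column.
df = pd.DataFrame(
    {"item_bought": ["x", "x", "y", "y"]},
    index=pd.Index(["b", "b", "b", "c"], name="user_id"),
)

# reset_index returns a copy with user_id as a column; df is untouched.
counts = df.reset_index().groupby(by=["user_id", "item_bought"]).size()
print(counts)

# The original frame still has user_id as its index.
print(df.index.name, list(df.columns))
```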

Content Type	Original Author	Original Content on Stackoverflow
Question	vumaasha	View Question on Stackoverflow
Solution 1 - Python	jezrael	View Answer on Stackoverflow
Solution 2 - Python	kekert	View Answer on Stackoverflow
Solution 3 - Python	howMuchCheeseIsTooMuchCheese	View Answer on Stackoverflow
Solution 4 - Python	Burgertron	View Answer on Stackoverflow