How to make a pandas crosstab with percentages?

Python, Pandas, Crosstab

Python Problem Overview


Given a dataframe with different categorical variables, how do I return a cross-tabulation with percentages instead of frequencies?

import numpy as np
import pandas as pd

df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 6,
                   'B' : ['A', 'B', 'C'] * 8,
                   'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 4,
                   'D' : np.random.randn(24),
                   'E' : np.random.randn(24)})


pd.crosstab(df.A,df.B)


B      A  B  C
A
one    4  4  4
three  2  2  2
two    2  2  2

Using the margins option in crosstab to compute row and column totals gets close enough that it seems this should be possible with an aggfunc or groupby, but I can't work out how. The desired output is:

B        A    B    C
A
one    .33  .33  .33
three  .33  .33  .33
two    .33  .33  .33
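The margins idea mentioned in the question can be made to work. The following is a minimal sketch, not part of the original question: it rebuilds only the A and B columns of the frame, and the names counts and row_share are illustrative. It takes the totals that margins=True adds and divides each row by its own total.

```python
import pandas as pd

# Same A and B columns as the question's frame (C, D and E are not needed here).
df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 6,
                   'B': ['A', 'B', 'C'] * 8})

# margins=True appends an 'All' row and an 'All' column holding the totals.
counts = pd.crosstab(df.A, df.B, margins=True)

# Divide every row by its own total, then drop the now-redundant 'All' column.
row_share = counts.div(counts['All'], axis=0).drop(columns='All')
print(row_share)
```

The 'All' row is kept in the result; it shows the share of each B value over the whole frame.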

Python Solutions


Solution 1 - Python

From Pandas 0.18.1 onwards, there's a normalize option:

In [1]: pd.crosstab(df.A,df.B, normalize='index')
Out[1]:

B             A         B         C
A
one    0.333333  0.333333  0.333333
three  0.333333  0.333333  0.333333
two    0.333333  0.333333  0.333333

You can normalize over 'all' (the grand total), 'index' (each row), or 'columns' (each column).

More details are available in the documentation.
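To make the three normalize modes concrete, here is a small sketch on the A and B columns of the question's frame. The variable names by_row, by_col and overall are illustrative only.

```python
import pandas as pd

# Same A and B columns as the question's frame.
df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 6,
                   'B': ['A', 'B', 'C'] * 8})

by_row = pd.crosstab(df.A, df.B, normalize='index')    # each row sums to 1
by_col = pd.crosstab(df.A, df.B, normalize='columns')  # each column sums to 1
overall = pd.crosstab(df.A, df.B, normalize='all')     # whole table sums to 1

print(by_col)
```

With this data, 'one' accounts for half of every column under normalize='columns', because it appears twice as often as 'two' or 'three'.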

Solution 2 - Python

pd.crosstab(df.A, df.B).apply(lambda r: r/r.sum(), axis=1)

Basically you just have the function that does row/row.sum(), and you use apply with axis=1 to apply it by row.

(If doing this in Python 2, you should use from __future__ import division to make sure division always returns a float.)
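The same apply pattern should also give column percentages if the function is applied down each column instead. A sketch under that assumption, using only the A and B columns of the question's frame (col_share is an illustrative name):

```python
import pandas as pd

# Same A and B columns as the question's frame.
df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 6,
                   'B': ['A', 'B', 'C'] * 8})

# axis=0 passes each column to the lambda, so every column is divided by its own sum.
col_share = pd.crosstab(df.A, df.B).apply(lambda c: c / c.sum(), axis=0)
print(col_share)
```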

Solution 3 - Python

We can show it as percentages by multiplying by 100:

pd.crosstab(df.A,df.B, normalize='index')\
    .round(4)*100

B          A      B      C
A                         
one    33.33  33.33  33.33
three  33.33  33.33  33.33
two    33.33  33.33  33.33

I've rounded for readability.
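A possible variation, offered as a sketch rather than as the answer's own method: multiplying first and rounding last avoids the floating-point noise that scaling an already rounded value can reintroduce, and a percent sign can be appended for display. The names pct and labels are illustrative.

```python
import pandas as pd

# Same A and B columns as the question's frame.
df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 6,
                   'B': ['A', 'B', 'C'] * 8})

# Scale to percent first, then round, so rounding is the final numeric step.
pct = pd.crosstab(df.A, df.B, normalize='index').mul(100).round(2)

# Display-only version: the cells become strings and are no longer numeric.
labels = pct.astype(str) + '%'
print(labels)
```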

Solution 4 - Python

If you're looking for a percentage of the total, you can divide by the len of the df instead of the row sum:

pd.crosstab(df.A, df.B).apply(lambda r: r/len(df), axis=1)
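Dividing by len(df) matches normalize='all' only when neither column has missing values, since crosstab leaves rows with NaN keys out of the counts by default. The sketch below (illustrative names, A and B columns only) compares the two and shows a denominator taken from the table itself.

```python
import pandas as pd

# Same A and B columns as the question's frame (no missing values).
df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 6,
                   'B': ['A', 'B', 'C'] * 8})

counts = pd.crosstab(df.A, df.B)

by_len = counts / len(df)                             # the approach described above
by_norm = pd.crosstab(df.A, df.B, normalize='all')    # built-in equivalent
by_total = counts / counts.to_numpy().sum()           # denominator taken from the counted rows

print(by_norm)
```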

Solution 5 - Python

Another option is to use div rather than apply:

In [11]: res = pd.crosstab(df.A, df.B)

Divide by the sum over the index:

In [12]: res.sum(axis=1)
Out[12]: 
A
one      12
three     6
two       6
dtype: int64

As above, on Python 2 you need to guard against integer division (I use astype('float')):

In [13]: res.astype('float').div(res.sum(axis=1), axis=0)
Out[13]: 
B             A         B         C
A                                  
one    0.333333  0.333333  0.333333
three  0.333333  0.333333  0.333333
two    0.333333  0.333333  0.333333
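On Python 3 the float cast should not be needed, because div performs true division on integer counts. A short sketch assuming Python 3 and the A and B columns of the question's frame (shares is an illustrative name):

```python
import pandas as pd

# Same A and B columns as the question's frame.
df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 6,
                   'B': ['A', 'B', 'C'] * 8})

res = pd.crosstab(df.A, df.B)

# axis=0 aligns the row totals with the row index, so each row is divided by its own sum.
shares = res.div(res.sum(axis=1), axis=0)
print(shares.dtypes)
```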

Solution 6 - Python

Normalizing over the index gives row percentages directly. Pass normalize="index" to pd.crosstab().

Attributions

All content for this solution is sourced from the original question on Stack Overflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type         Original Author                  Original Content on Stack Overflow
Question             Brian Keegan                     View Question on Stack Overflow
Solution 1 - Python  Harry                            View Answer on Stack Overflow
Solution 2 - Python  BrenBarn                         View Answer on Stack Overflow
Solution 3 - Python  gabra                            View Answer on Stack Overflow
Solution 4 - Python  howMuchCheeseIsTooMuchCheese     View Answer on Stack Overflow
Solution 5 - Python  Andy Hayden                      View Answer on Stack Overflow
Solution 6 - Python  Shivam Aranya                    View Answer on Stack Overflow