Get the mean across multiple Pandas DataFrames

PythonRNumpyPandas

Python Problem Overview


I'm generating a number of dataframes with the same shape, and I want to compare them to one another. I want to be able to get the mean and median across the dataframes.

         Source.0  Source.1  Source.2  Source.3
cluster                                        
0        0.001182  0.184535  0.814230  0.000054
1        0.000001  0.160490  0.839508  0.000001
2        0.000001  0.173829  0.826114  0.000055
3        0.000432  0.180065  0.819502  0.000001
4        0.000152  0.157041  0.842694  0.000113
5        0.000183  0.174142  0.825674  0.000001
6        0.000001  0.151556  0.848405  0.000038
7        0.000771  0.177583  0.821645  0.000001
8        0.000001  0.202059  0.797939  0.000001
9        0.000025  0.189537  0.810410  0.000028
10       0.006142  0.003041  0.493912  0.496905
11       0.003739  0.002367  0.514216  0.479678
12       0.002334  0.001517  0.529041  0.467108
13       0.003458  0.000001  0.532265  0.464276
14       0.000405  0.005655  0.527576  0.466364
15       0.002557  0.003233  0.507954  0.486256
16       0.004161  0.000001  0.491271  0.504568
17       0.001364  0.001330  0.528311  0.468996
18       0.002886  0.000001  0.506392  0.490721
19       0.001823  0.002498  0.509620  0.486059

         Source.0  Source.1  Source.2  Source.3
cluster                                        
0        0.000001  0.197108  0.802495  0.000396
1        0.000001  0.157860  0.842076  0.000063
2        0.094956  0.203057  0.701662  0.000325
3        0.000001  0.181948  0.817841  0.000210
4        0.000003  0.169680  0.830316  0.000001
5        0.000362  0.177194  0.822443  0.000001
6        0.000001  0.146807  0.852924  0.000268
7        0.001087  0.178994  0.819564  0.000354
8        0.000001  0.202182  0.797333  0.000485
9        0.000348  0.181399  0.818252  0.000001
10       0.003050  0.000247  0.506777  0.489926
11       0.004420  0.000001  0.513927  0.481652
12       0.006488  0.001396  0.527197  0.464919
13       0.001510  0.000001  0.525987  0.472502
14       0.000001  0.000001  0.520737  0.479261
15       0.000001  0.001765  0.515658  0.482575
16       0.000001  0.000001  0.492550  0.507448
17       0.002855  0.000199  0.526535  0.470411
18       0.000001  0.001952  0.498303  0.499744
19       0.001232  0.000001  0.506612  0.492155

Then I want to get the mean of these two dataframes.

What is the easiest way to do this?

Just to clarify I want to get the mean for each particular cell when the indexes and columns of all the dataframes are exactly the same.

So in the example I gave, the average for [0,Source.0] would be (0.001182 + 0.000001) / 2 = 0.0005915.

Python Solutions


Solution 1 - Python

Assuming the two dataframes have the same columns, you could just concatenate them and compute your summary stats on the concatenated frames:

import numpy as np
import pandas as pd

# some random data frames
df1 = pd.DataFrame(dict(x=np.random.randn(100), y=np.random.randint(0, 5, 100)))
df2 = pd.DataFrame(dict(x=np.random.randn(100), y=np.random.randint(0, 5, 100)))

# concatenate them
df_concat = pd.concat((df1, df2))

print df_concat.mean()
# x   -0.163044
# y    2.120000
# dtype: float64

print df_concat.median()
# x   -0.192037
# y    2.000000
# dtype: float64

Update

If you want to compute stats across each set of rows with the same index in the two datasets, you can use .groupby() to group the data by row index, then apply the mean, median etc.:

by_row_index = df_concat.groupby(df_concat.index)
df_means = by_row_index.mean()

print df_means.head()
#           x    y
# 0 -0.850794  1.5
# 1  0.159038  1.5
# 2  0.083278  1.0
# 3 -0.540336  0.5
# 4  0.390954  3.5

This method will work even when your dataframes have unequal numbers of rows - if a particular row index is missing in one of the two dataframes, the mean/median will be computed on the single existing row.

Solution 2 - Python

I go similar as @ali_m, but since you want one mean per row-column combination, I conclude differently:

df1 = pd.DataFrame(dict(x=np.random.randn(100), y=np.random.randint(0, 5, 100)))
df2 = pd.DataFrame(dict(x=np.random.randn(100), y=np.random.randint(0, 5, 100)))
df = pd.concat([df1, df2])
foo = df.groupby(level=1).mean()
foo.head()

          x    y
0  0.841282  2.5
1  0.716749  1.0
2 -0.551903  2.5
3  1.240736  1.5
4  1.227109  2.0

Solution 3 - Python

As per Niklas' comment, the solution to the question is panel.mean(axis=0).

As a more complete example:

import pandas as pd
import numpy as np

dfs = {}
nrows = 4
ncols = 3
for i in range(4):
    dfs[i] = pd.DataFrame(np.arange(i, nrows*ncols+i).reshape(nrows, ncols),
                          columns=list('abc'))
    print('DF{i}:\n{df}\n'.format(i=i, df=dfs[i]))
    
panel = pd.Panel(dfs)
print('Mean of stacked DFs:\n{df}'.format(df=panel.mean(axis=0)))

Will give the following output:

DF0:
   a   b   c
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11

DF1:
    a   b   c
0   1   2   3
1   4   5   6
2   7   8   9
3  10  11  12

DF2:
    a   b   c
0   2   3   4
1   5   6   7
2   8   9  10
3  11  12  13

DF3:
    a   b   c
0   3   4   5
1   6   7   8
2   9  10  11
3  12  13  14

Mean of stacked DFs:
      a     b     c
0   1.5   2.5   3.5
1   4.5   5.5   6.5
2   7.5   8.5   9.5
3  10.5  11.5  12.5

Solution 4 - Python

Here is a solution first unstack both dataframes so they are series with multiindexes(cluster, colnames)... then you can use Series addition and division, which automattically do the operation on the indexes, finally unstack them... here it is in code...

averages = (df1.stack()+df2.stack())/2
averages = averages.unstack()

And your done...

Or for more general purposes...

dfs = [df1,df2]
averages = pd.concat([each.stack() for each in dfs],axis=1)\
             .apply(lambda x:x.mean(),axis=1)\
             .unstack()

Solution 5 - Python

You can simply assign a label to each frame, call it group and then concat and groupby to do what you want:

In [57]: df = DataFrame(np.random.randn(10, 4), columns=list('abcd'))

In [58]: df2 = df.copy()

In [59]: dfs = [df, df2]

In [60]: df
Out[60]:
        a       b       c       d
0  0.1959  0.1260  0.1464  0.1631
1  0.9344 -1.8154  1.4529 -0.6334
2  0.0390  0.4810  1.1779 -1.1799
3  0.3542  0.3819 -2.0895  0.8877
4 -2.2898 -1.0585  0.8083 -0.2126
5  0.3727 -0.6867 -1.3440 -1.4849
6 -1.1785  0.0885  1.0945 -1.6271
7 -1.7169  0.3760 -1.4078  0.8994
8  0.0508  0.4891  0.0274 -0.6369
9 -0.7019  1.0425 -0.5476 -0.5143

In [61]: for i, d in enumerate(dfs):
   ....:     d['group'] = i
   ....:

In [62]: dfs[0]
Out[62]:
        a       b       c       d  group
0  0.1959  0.1260  0.1464  0.1631      0
1  0.9344 -1.8154  1.4529 -0.6334      0
2  0.0390  0.4810  1.1779 -1.1799      0
3  0.3542  0.3819 -2.0895  0.8877      0
4 -2.2898 -1.0585  0.8083 -0.2126      0
5  0.3727 -0.6867 -1.3440 -1.4849      0
6 -1.1785  0.0885  1.0945 -1.6271      0
7 -1.7169  0.3760 -1.4078  0.8994      0
8  0.0508  0.4891  0.0274 -0.6369      0
9 -0.7019  1.0425 -0.5476 -0.5143      0

In [63]: final = pd.concat(dfs, ignore_index=True)

In [64]: final
Out[64]:
         a       b       c       d  group
0   0.1959  0.1260  0.1464  0.1631      0
1   0.9344 -1.8154  1.4529 -0.6334      0
2   0.0390  0.4810  1.1779 -1.1799      0
3   0.3542  0.3819 -2.0895  0.8877      0
4  -2.2898 -1.0585  0.8083 -0.2126      0
5   0.3727 -0.6867 -1.3440 -1.4849      0
6  -1.1785  0.0885  1.0945 -1.6271      0
..     ...     ...     ...     ...    ...
13  0.3542  0.3819 -2.0895  0.8877      1
14 -2.2898 -1.0585  0.8083 -0.2126      1
15  0.3727 -0.6867 -1.3440 -1.4849      1
16 -1.1785  0.0885  1.0945 -1.6271      1
17 -1.7169  0.3760 -1.4078  0.8994      1
18  0.0508  0.4891  0.0274 -0.6369      1
19 -0.7019  1.0425 -0.5476 -0.5143      1

[20 rows x 5 columns]

In [65]: final.groupby('group').mean()
Out[65]:
           a       b       c       d
group
0     -0.394 -0.0576 -0.0682 -0.4339
1     -0.394 -0.0576 -0.0682 -0.4339

Here, each group is the same, but that's only because df == df2.

Alternatively, you can throw the frames into a Panel:

In [69]: df = DataFrame(np.random.randn(10, 4), columns=list('abcd'))

In [70]: df2 = DataFrame(np.random.randn(10, 4), columns=list('abcd'))

In [71]: panel = pd.Panel({0: df, 1: df2})

In [72]: panel
Out[72]:
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 10 (major_axis) x 4 (minor_axis)
Items axis: 0 to 1
Major_axis axis: 0 to 9
Minor_axis axis: a to d

In [73]: panel.mean()
Out[73]:
        0       1
a  0.3839  0.2956
b  0.1855 -0.3164
c -0.1167 -0.0627
d -0.2338 -0.0450

Solution 6 - Python

With Pandas version 1.3.4 this works for me:

import numpy as np
df1 = pd.DataFrame(dict(x=np.random.randn(100), y=np.random.randint(0, 5, 100), z=np.random.randint(-3, 2, 100)))
df2 = pd.DataFrame(dict(x=np.random.randn(100), y=np.random.randint(0, 2, 100), z=np.random.randint(-1, 2, 100)))
pd.concat([df1, df2]).groupby(level=0).mean()

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionTimView Question on Stackoverflow
Solution 1 - Pythonali_mView Answer on Stackoverflow
Solution 2 - PythonFooBarView Answer on Stackoverflow
Solution 3 - Pythonuser394430View Answer on Stackoverflow
Solution 4 - PythonZJSView Answer on Stackoverflow
Solution 5 - PythonPhillip CloudView Answer on Stackoverflow
Solution 6 - PythonamcView Answer on Stackoverflow