Can pandas groupby aggregate into a list, rather than sum, mean, etc?

PythonPandas

Python Problem Overview


I've had success using the groupby function to sum or average a given variable by groups, but is there a way to aggregate into a list of values, rather than to get a single result? (And would this still be called aggregation?)

I am not entirely sure this is the approach I should be taking anyhow, so below is an example of the transformation I'd like to make, with toy data.

That is, if the data look something like this:

    A    B    C  
    1    10   22
    1    12   20
    1    11   8
    1    10   10
    2    11   13
    2    12   10 
    3    14   0

What I am trying to end up with is something like the following. I am not totally sure whether this can be done through groupby aggregating into lists, and am rather lost as to where to go from here.

Hypothetical output:

     A    B    C  New1  New2  New3  New4  New5  New6
    1    10   22  12    20    11    8     10    10
    2    11   13  12    10 
    3    14   0

Perhaps I should be pursuing pivots instead? The order by which the data are put into columns does not matter - all columns B through New6 in this example are equivalent. All suggestions/corrections are much appreciated.

Python Solutions


Solution 1 - Python

I used the following

grouped = df.groupby('A')

df = grouped.aggregate(lambda x: tuple(x))

df['grouped'] = df['B'] + df['C']

Solution 2 - Python

I am answering the question as stated in its title and first sentence: the following aggregates values to lists:

df.groupby('A').aggregate(lambda tdf: tdf.unique().tolist())

Below this is demonstrated in a simple example:

import pandas as pd

df = pd.DataFrame( {'A' : [1, 1, 1, 1, 2, 2, 3], 'B' : [10, 12, 11, 10, 11, 12, 14], 'C' : [22, 20,     8, 10, 13, 10, 0]})
print df

df2 = df.groupby('A').aggregate(lambda tdf: tdf.unique().tolist())
print df2
# Old version:
# df2=df.groupby(['A']).apply(lambda tdf: pd.Series(  dict([[vv,tdf[vv].unique().tolist()] for vv in tdf if vv not in ['A']])  )) 

The output is as follows:

In [3]: run tmp
   A   B   C
0  1  10  22
1  1  12  20
2  1  11   8
3  1  10  10
4  2  11  13
5  2  12  10
6  3  14   0

[7 rows x 3 columns]
              B                C
A                               
1  [10, 12, 11]  [22, 20, 8, 10]
2      [11, 12]         [13, 10]
3          [14]              [0]

[3 rows x 2 columns]

Solution 3 - Python

Here is a one liner

# if list of unique items is desired, use set
df.groupby('A',as_index=False)['B'].aggregate(lambda x: set(x))

# if duplicate items are okay, use list
df.groupby('A',as_index=False)['B'].aggregate(lambda x: list(x))

Solution 4 - Python

Similar solution, but fairly transparent (I think). you can get full list or unique lists.

df = pd.DataFrame({'A':[1,1,2,2,2,3,3,3,4,5], 
                   'B':[6,7, 8,8,9, 9,9,10,11,12], 
                   'C':['foo']*10})

df
Out[24]: 
   A   B    C
0  1   6  foo
1  1   7  foo
2  2   8  foo
3  2   8  foo
4  2   9  foo
5  3   9  foo
6  3   9  foo
7  3  10  foo
8  4  11  foo
9  5  12  foo

list_agg = df.groupby(by='A').agg({'B':lambda x: list(x), 
                                   'C':lambda x: tuple(x)})

list_agg
Out[26]: 
                 C           B
A                             
1       (foo, foo)      [6, 7]
2  (foo, foo, foo)   [8, 8, 9]
3  (foo, foo, foo)  [9, 9, 10]
4           (foo,)        [11]
5           (foo,)        [12]

unique_list_agg = df.groupby(by='A').agg({'B':lambda x: list(pd.unique(x)), 
                                          'C':lambda x: tuple(pd.unique(x))})

unique_list_agg
Out[28]: 
        C        B
A                 
1  (foo,)   [6, 7]
2  (foo,)   [8, 9]
3  (foo,)  [9, 10]
4  (foo,)     [11]
5  (foo,)     [12]

Solution 5 - Python

my solution is a bit longer than you may expect, I'm sure it could be shortened, but:

g = df.groupby("A").apply(lambda x: pd.concat((x["B"], x["C"])))
k = g.reset_index()
k["i"] = k1.index
k["rn"] = k1.groupby("A")["i"].rank()
k.pivot_table(rows="A", cols="rn", values=0)

# output
# rn   1   2   3   4   5   6
# A                         
# 1   10  12  11  22  20   8
# 2   10  11  10  13 NaN NaN
# 3   14  10 NaN NaN NaN NaN

A bit of explanation. First line, g = df.groupby("A").apply(lambda x: pd.concat((x["B"], x["C"]))). This one group df by A and then put columns B and C into one column:

A   
1  0    10
   1    12
   2    11
   0    22
   1    20
   2     8
2  3    10
   4    11
   3    10
   4    13
3  5    14
   5    10

Then k = g.reset_index(), creating sequential index, result is:

    A  level_1   0
0   1        0  10
1   1        1  12
2   1        2  11
3   1        0  22
4   1        1  20
5   1        2   8
6   2        3  10
7   2        4  11
8   2        3  10
9   2        4  13
10  3        5  14
11  3        5  10

Now I want to move this index into column (I'd like to hear how I can make a sequential column without resetting index), k["i"] = k1.index:

    A  level_1   0   i
0   1        0  10   0
1   1        1  12   1
2   1        2  11   2
3   1        0  22   3
4   1        1  20   4
5   1        2   8   5
6   2        3  10   6
7   2        4  11   7
8   2        3  10   8
9   2        4  13   9
10  3        5  14  10
11  3        5  10  11

Now, k["rn"] = k1.groupby("A")["i"].rank() will add row_number inside each A (like row_number() over(partition by A order by i) in SQL:

    A  level_1   0   i  rn
0   1        0  10   0   1
1   1        1  12   1   2
2   1        2  11   2   3
3   1        0  22   3   4
4   1        1  20   4   5
5   1        2   8   5   6
6   2        3  10   6   1
7   2        4  11   7   2
8   2        3  10   8   3
9   2        4  13   9   4
10  3        5  14  10   1
11  3        5  10  11   2

And finally, just pivoting with k.pivot_table(rows="A", cols="rn", values=0):

rn   1   2   3   4   5   6
A                         
1   10  12  11  22  20   8
2   10  11  10  13 NaN NaN
3   14  10 NaN NaN NaN NaN

Solution 6 - Python

I have been struggling with the exact same issues, and the answer is that yes you can use grouby to obtain lists. I am not 100% sure I am doing this in the most pythonic way, but here for what its worth is my attempt to get at your question. You can create lists of the data contained in the bygroups like this:

import pandas as pd
import numpy as np
from itertools import chain

Data = {'A' : [1, 1, 1, 1, 2, 2, 3], 'B' : [10, 12, 11, 10, 11, 12, 14], 'C' : [22, 20,     8, 10, 13, 10, 0]}
DF = pd.DataFrame(Data)
DFGrouped = DF.groupby('A')

OutputLists = []

for group in DFGrouped:
    AList = list(group[1].A)
    BList = list(group[1].B)
    CList = list(group[1].C)
    print list(group[1].A)
    print list(group[1].B)
    print list(group[1].C)
    ZIP =  zip(BList, CList)
    print ZIP
    OutputLists.append(list(chain(*ZIP)))

OutputLists

This outputs your data in a list of lists, in the way that I think you want it. You can then make it a data frame. The above print statements are for illustrative purposes only clearly. The most efficient (in terms of code) way to do this using my method is as follows:

import pandas as pd
import numpy as np
from itertools import chain

Data = {'A' : [1, 1, 1, 1, 2, 2, 3], 'B' : [10, 12, 11, 10, 11, 12, 14], 'C' : [22, 20, 8, 10, 13, 10, 0]}
DF = pd.DataFrame(Data)
DFGrouped = DF.groupby('A')
OutputLists = []
for group in DFGrouped:
    ZIPPED = zip(group[1].B, group[1].C)
    OutputLists.append(list(chain(*ZIPPED)))
OutputLists

The key to getting lists out of grouped data as far as I can tell is to recognise that the data themselves are stored in group[1] for each group in your grouped data.

hope this helps!

Solution 7 - Python

df2 = df.groupby('A').aggregate(lambda tdf: tdf.unique().tolist())

This seems to work perfect, but the resultant dataframe has two layers of columns and df.columns shows only one column in the dataframe. To correct this, use:

df2_copy=df2.copy()
df2_copy = df2_copy.reset_index(col_level=0)

You can view the column levels using: df2_copy.columns=df2_copy.columns.get_level_values(0)

the df2_copy() should solve this.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionM.A.KlineView Question on Stackoverflow
Solution 1 - Pythonuser2623954View Answer on Stackoverflow
Solution 2 - PythonCPBLView Answer on Stackoverflow
Solution 3 - PythonmuonView Answer on Stackoverflow
Solution 4 - PythonData-phileView Answer on Stackoverflow
Solution 5 - PythonRoman PekarView Answer on Stackoverflow
Solution 6 - PythonWoody PrideView Answer on Stackoverflow
Solution 7 - PythonSummmerFortView Answer on Stackoverflow