Python Pandas How to assign groupby operation results back to columns in parent dataframe?

PythonGroup ByDataframePandas

Python Problem Overview


I have the following data frame in IPython, where each row is a single stock:

In [261]: bdata
Out[261]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 21210 entries, 0 to 21209
Data columns:
BloombergTicker      21206  non-null values
Company              21210  non-null values
Country              21210  non-null values
MarketCap            21210  non-null values
PriceReturn          21210  non-null values
SEDOL                21210  non-null values
yearmonth            21210  non-null values
dtypes: float64(2), int64(1), object(4)

I want to apply a groupby operation that computes cap-weighted average return across everything, per each date in the "yearmonth" column.

This works as expected:

In [262]: bdata.groupby("yearmonth").apply(lambda x: (x["PriceReturn"]*x["MarketCap"]/x["MarketCap"].sum()).sum())
Out[262]:
yearmonth
201204      -0.109444
201205      -0.290546

But then I want to sort of "broadcast" these values back to the indices in the original data frame, and save them as constant columns where the dates match.

In [263]: dateGrps = bdata.groupby("yearmonth")

In [264]: dateGrps["MarketReturn"] = dateGrps.apply(lambda x: (x["PriceReturn"]*x["MarketCap"]/x["MarketCap"].sum()).sum())
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/mnt/bos-devrnd04/usr6/home/espears/ws/Research/Projects/python-util/src/util/<ipython-input-264-4a68c8782426> in <module>()
----> 1 dateGrps["MarketReturn"] = dateGrps.apply(lambda x: (x["PriceReturn"]*x["MarketCap"]/x["MarketCap"].sum()).sum())

TypeError: 'DataFrameGroupBy' object does not support item assignment

I realize this naive assignment should not work. But what is the "right" Pandas idiom for assigning the result of a groupby operation into a new column on the parent dataframe?

In the end, I want a column called "MarketReturn" than will be a repeated constant value for all indices that have matching date with the output of the groupby operation.

One hack to achieve this would be the following:

marketRetsByDate  = dateGrps.apply(lambda x: (x["PriceReturn"]*x["MarketCap"]/x["MarketCap"].sum()).sum())

bdata["MarketReturn"] = np.repeat(np.NaN, len(bdata))

for elem in marketRetsByDate.index.values:
    bdata["MarketReturn"][bdata["yearmonth"]==elem] = marketRetsByDate.ix[elem]

But this is slow, bad, and unPythonic.

Python Solutions


Solution 1 - Python

In [97]: df = pandas.DataFrame({'month': np.random.randint(0,11, 100), 'A': np.random.randn(100), 'B': np.random.randn(100)})

In [98]: df.join(df.groupby('month')['A'].sum(), on='month', rsuffix='_r')
Out[98]:
           A         B  month       A_r
0  -0.040710  0.182269      0 -0.331816
1  -0.004867  0.642243      1  2.448232
2  -0.162191  0.442338      4  2.045909
3  -0.979875  1.367018      5 -2.736399
4  -1.126198  0.338946      5 -2.736399
5  -0.992209 -1.343258      1  2.448232
6  -1.450310  0.021290      0 -0.331816
7  -0.675345 -1.359915      9  2.722156

Solution 2 - Python

While I'm still exploring all of the incredibly smart ways that apply concatenates the pieces it's given, here's another way to add a new column in the parent after a groupby operation.

In [236]: df
Out[236]: 
  yearmonth    return
0    201202  0.922132
1    201202  0.220270
2    201202  0.228856
3    201203  0.277170
4    201203  0.747347

In [237]: def add_mkt_return(grp):
   .....:     grp['mkt_return'] = grp['return'].sum()
   .....:     return grp
   .....: 

In [238]: df.groupby('yearmonth').apply(add_mkt_return)
Out[238]: 
  yearmonth    return  mkt_return
0    201202  0.922132    1.371258
1    201202  0.220270    1.371258
2    201202  0.228856    1.371258
3    201203  0.277170    1.024516
4    201203  0.747347    1.024516

Solution 3 - Python

As a general rule when using groupby(), if you use the .transform() function pandas will return a table with the same length as your original. When you use other functions like .sum() or .first() then pandas will return a table where each row is a group.

I'm not sure how this works with apply but implementing elaborate lambda functions with transform can be fairly tricky so the strategy that I find most helpful is to create the variables I need, place them in the original dataset and then do my operations there.

If I understand what you're trying to do correctly first you can calculate the total market cap for each group:

bdata['group_MarketCap'] = bdata.groupby('yearmonth')['MarketCap'].transform('sum')

This will add a column called "group_MarketCap" to your original data which would contain the sum of market caps for each group. Then you can calculate the weighted values directly:

bdata['weighted_P'] = bdata['PriceReturn'] * (bdata['MarketCap']/bdata['group_MarketCap'])

And finally you would calculate the weighted average for each group using the same transform function:

bdata['MarketReturn'] = bdata.groupby('yearmonth')['weighted_P'].transform('sum')

I tend to build my variables this way. Sometimes you can pull off putting it all in a single command but that doesn't always work with groupby() because most of the time pandas needs to instantiate the new object to operate on it at the full dataset scale (i.e. you can't add two columns together if one doesn't exist yet).

Hope this helps :)

Solution 4 - Python

May I suggest the transform method (instead of aggregate)? If you use it in your original example it should do what you want (the broadcasting).

Solution 5 - Python

I did not find a way to make assignment to the original dataframe. So I just store the results from the groups and concatenate them. Then we sort the concatenated dataframe by index to get the original order as the input dataframe. Here is a sample code:

In [10]: df = pd.DataFrame({'month': np.random.randint(0,11, 100), 'A': np.random.randn(100), 'B': np.random.randn(100)})

In [11]: df.head()
Out[11]:
   month         A         B
0      4 -0.029106 -0.904648
1      2 -2.724073  0.492751
2      7  0.732403  0.689530
3      2  0.487685 -1.017337
4      1  1.160858 -0.025232

In [12]: res = []

In [13]: for month, group in df.groupby('month'):
    ...:     new_df = pd.DataFrame({
    ...:         'A^2+B': group.A ** 2 + group.B,
    ...:         'A+B^2': group.A + group.B**2
    ...:     })
    ...:     res.append(new_df)
    ...:

In [14]: res = pd.concat(res).sort_index()

In [15]: res.head()
Out[15]:
      A^2+B     A+B^2
0 -0.903801  0.789282
1  7.913327 -2.481270
2  1.225944  1.207855
3 -0.779501  1.522660
4  1.322360  1.161495

This method is pretty fast and extensible. You can derive any feature here.

> Note: If the dataframe is too large, concat may cause you MMO error.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionelyView Question on Stackoverflow
Solution 1 - PythonWouter OvermeireView Answer on Stackoverflow
Solution 2 - PythonGarrettView Answer on Stackoverflow
Solution 3 - PythonseeiespiView Answer on Stackoverflow
Solution 4 - PythonWes McKinneyView Answer on Stackoverflow
Solution 5 - PythonHan ZhangView Answer on Stackoverflow