group by pandas dataframe and select latest in each group

PythonPandasGroup ByPandas Groupby

Python Problem Overview


How to group values of pandas dataframe and select the latest(by date) from each group?

For example, given a dataframe sorted by date:

	id     product   date
0   220    6647     2014-09-01 
1   220    6647     2014-09-03 
2   220    6647     2014-10-16
3   826    3380     2014-11-11
4   826    3380     2014-12-09
5   826    3380     2015-05-19
6   901    4555     2014-09-01
7   901    4555     2014-10-05
8   901    4555     2014-11-01

grouping by id or product, and selecting the earliest gives:

	id     product   date
2   220    6647     2014-10-16
5   826    3380     2015-05-19
8   901    4555     2014-11-01

Python Solutions


Solution 1 - Python

You can also use tail with groupby to get the last n values of the group:

df.sort_values('date').groupby('id').tail(1)

	id	product	date
2	220	6647	2014-10-16
8	901	4555	2014-11-01
5	826	3380	2015-05-19

Solution 2 - Python

use idxmax in groupby and slice df with loc

df.loc[df.groupby('id').date.idxmax()]

    id  product       date
2  220     6647 2014-10-16
5  826     3380 2015-05-19
8  901     4555 2014-11-01

Solution 3 - Python

I had a similar problem and ended up using drop_duplicates rather than groupby.

It seems to run significatively faster on large datasets when compared with other methods suggested above.

df.sort_values(by="date").drop_duplicates(subset=["id"], keep="last")

    id  product        date
2  220     6647  2014-10-16
8  901     4555  2014-11-01
5  826     3380  2015-05-19

Solution 4 - Python

Given a dataframe sorted by date, you can obtain what you ask for in a number of ways:

Like this:

df.groupby(['id','product']).last()

like this:

df.groupby(['id','product']).nth(-1)

or like this:

df.groupby(['id','product']).max()

If you don't want id and product to appear as index use groupby(['id', 'product'], as_index=False). Alternatively use:

df.groupby(['id','product']).tail(1)

Solution 5 - Python

To use .tail() as an aggregation method and keep your grouping intact:

df.sort_values('date').groupby('id').apply(lambda x: x.tail(1))

        id	product	date
id				
220	2	220	6647	2014-10-16
826	5	826	3380	2015-05-19
901	8	901	4555	2014-11-01

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionDevExView Question on Stackoverflow
Solution 1 - PythonnipyView Answer on Stackoverflow
Solution 2 - PythonpiRSquaredView Answer on Stackoverflow
Solution 3 - PythonDamien MarlierView Answer on Stackoverflow
Solution 4 - PythonSandu UrsuView Answer on Stackoverflow
Solution 5 - PythonKristin QView Answer on Stackoverflow