Remove duplicates by columns A, keeping the row with the highest value in column B

Python Problem Overview

I have a dataframe with repeat values in column A. I want to drop duplicates, keeping the row with the highest value in column B.

So this:

Should turn into this:

I'm guessing there's probably an easy way to do this—maybe as easy as sorting the DataFrame before dropping duplicates—but I don't know groupby's internal logic well enough to figure it out. Any suggestions?

Python Solutions

Solution 1 - Python

This takes the last. Not the maximum though:

In [10]: df.drop_duplicates(subset='A', keep="last")
Out[10]: 
   A   B
1  1  20
3  2  40
4  3  10

You can do also something like:

In [12]: df.groupby('A', group_keys=False).apply(lambda x: x.loc[x.B.idxmax()])
Out[12]: 
   A   B
A       
1  1  20
2  2  40
3  3  10

Solution 2 - Python

The top answer is doing too much work and looks to be very slow for larger data sets. apply is slow and should be avoided if possible. ix is deprecated and should be avoided as well.

df.sort_values('B', ascending=False).drop_duplicates('A').sort_index()

   A   B
1  1  20
3  2  40
4  3  10

Or simply group by all the other columns and take the max of the column you need. df.groupby('A', as_index=False).max()

Solution 3 - Python

Simplest solution:

To drop duplicates based on one column:

df = df.drop_duplicates('column_name', keep='last')

To drop duplicates based on multiple columns:

df = df.drop_duplicates(['col_name1','col_name2','col_name3'], keep='last')

Solution 4 - Python

I would sort the dataframe first with Column B descending, then drop duplicates for Column A and keep first

df = df.sort_values(by='B', ascending=False)
df = df.drop_duplicates(subset='A', keep="first")

without any groupby

Solution 5 - Python

Try this:

df.groupby(['A']).max()

Solution 6 - Python

I was brought here by a link from a duplicate question.

For just two columns, wouldn't it be simpler to do:

df.groupby('A')['B'].max().reset_index()

And to retain a full row (when there are more columns, which is what the "duplicate question" that brought me here was asking):

df.loc[df.groupby(...)[column].idxmax()]

For example, to retain the full row where 'C' takes its max, for each group of ['A', 'B'], we would do:

out = df.loc[df.groupby(['A', 'B')['C'].idxmax()]

When there are relatively few groups (i.e., lots of duplicates), this is faster than the drop_duplicates() solution (less sorting):

Setup:

n = 1_000_000
df = pd.DataFrame({
    'A': np.random.randint(0, 20, n),
    'B': np.random.randint(0, 20, n),
    'C': np.random.uniform(size=n),
    'D': np.random.choice(list('abcdefghijklmnopqrstuvwxyz'), size=n),
})

(Adding sort_index() to ensure equal solution):

%timeit df.loc[df.groupby(['A', 'B'])['C'].idxmax()].sort_index()
# 101 ms ± 98.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit df.sort_values(['C', 'A', 'B'], ascending=False).drop_duplicates(['A', 'B']).sort_index()
# 667 ms ± 784 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

Solution 7 - Python

I think in your case you don't really need a groupby. I would sort by descending order your B column, then drop duplicates at column A and if you want you can also have a new nice and clean index like that:

df.sort_values('B', ascending=False).drop_duplicates('A').sort_index().reset_index(drop=True)

Solution 8 - Python

You can try this as well

df.drop_duplicates(subset='A', keep='last')

I referred this from https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html

Solution 9 - Python

Here's a variation I had to solve that's worth sharing: for each unique string in columnA I wanted to find the most common associated string in columnB.

df.groupby('columnA').agg({'columnB': lambda x: x.mode().any()}).reset_index()

The .any() picks one if there's a tie for the mode. (Note that using .any() on a Series of ints returns a boolean rather than picking one of them.)

For the original question, the corresponding approach simplifies to

df.groupby('columnA').columnB.agg('max').reset_index().

Solution 10 - Python

Easiest way to do this:

# First you need to sort this DF as Column A as ascending and column B as descending 
# Then you can drop the duplicate values in A column 
# Optional - you can reset the index and get the nice data frame again
# I'm going to show you all in one step. 

d = {'A': [1,1,2,3,1,2,3,1], 'B': [30, 40,50,42,38,30,25,32]}
df = pd.DataFrame(data=d)
df

	A	B
0	1	30
1	1	40
2	2	50
3	3	42
4	1	38
5	2	30
6	3	25
7	1	32


df = df.sort_values(['A','B'], ascending =[True,False]).drop_duplicates(['A']).reset_index(drop=True)

df

	A	B
0	1	40
1	2	50
2	3	42

Solution 11 - Python

When already given posts answer the question, I made a small change by adding the column name on which the max() function is applied for better code readability.

df.groupby('A', as_index=False)['B'].max()

Solution 12 - Python

this also works:

a=pd.DataFrame({'A':a.groupby('A')['B'].max().index,'B':a.groupby('A')       ['B'].max().values})

Solution 13 - Python

I am not going to give you the whole answer (I don't think you're looking for the parsing and writing to file part anyway), but a pivotal hint should suffice: use python's set() function, and then sorted() or .sort() coupled with .reverse():

>>> a=sorted(set([10,60,30,10,50,20,60,50,60,10,30]))
>>> a
[10, 20, 30, 50, 60]
>>> a.reverse()
>>> a
[60, 50, 30, 20, 10]

Content Type	Original Author	Original Content on Stackoverflow
Question	Abe	View Question on Stackoverflow
Solution 1 - Python	Wes McKinney	View Answer on Stackoverflow
Solution 2 - Python	Ted Petrou	View Answer on Stackoverflow
Solution 3 - Python	Gil Baggio	View Answer on Stackoverflow
Solution 4 - Python	Nobel	View Answer on Stackoverflow
Solution 5 - Python	eumiro	View Answer on Stackoverflow
Solution 6 - Python	Pierre D	View Answer on Stackoverflow
Solution 7 - Python	whateveros	View Answer on Stackoverflow
Solution 8 - Python	Venkat	View Answer on Stackoverflow
Solution 9 - Python	mistaben	View Answer on Stackoverflow
Solution 10 - Python	rra	View Answer on Stackoverflow
Solution 11 - Python	Bhagabat Behera	View Answer on Stackoverflow
Solution 12 - Python	Mahesh	View Answer on Stackoverflow
Solution 13 - Python	Abhranil Das	View Answer on Stackoverflow