Remove duplicates by columns A, keeping the row with the highest value in column B

PythonDuplicatesPandas

Python Problem Overview


I have a dataframe with repeat values in column A. I want to drop duplicates, keeping the row with the highest value in column B.

So this:

A B
1 10
1 20
2 30
2 40
3 10

Should turn into this:

A B
1 20
2 40
3 10

I'm guessing there's probably an easy way to do this—maybe as easy as sorting the DataFrame before dropping duplicates—but I don't know groupby's internal logic well enough to figure it out. Any suggestions?

Python Solutions


Solution 1 - Python

This takes the last. Not the maximum though:

In [10]: df.drop_duplicates(subset='A', keep="last")
Out[10]: 
   A   B
1  1  20
3  2  40
4  3  10

You can do also something like:

In [12]: df.groupby('A', group_keys=False).apply(lambda x: x.loc[x.B.idxmax()])
Out[12]: 
   A   B
A       
1  1  20
2  2  40
3  3  10

Solution 2 - Python

The top answer is doing too much work and looks to be very slow for larger data sets. apply is slow and should be avoided if possible. ix is deprecated and should be avoided as well.

df.sort_values('B', ascending=False).drop_duplicates('A').sort_index()

   A   B
1  1  20
3  2  40
4  3  10

Or simply group by all the other columns and take the max of the column you need. df.groupby('A', as_index=False).max()

Solution 3 - Python

Simplest solution:

To drop duplicates based on one column:

df = df.drop_duplicates('column_name', keep='last')

To drop duplicates based on multiple columns:

df = df.drop_duplicates(['col_name1','col_name2','col_name3'], keep='last')

Solution 4 - Python

I would sort the dataframe first with Column B descending, then drop duplicates for Column A and keep first

df = df.sort_values(by='B', ascending=False)
df = df.drop_duplicates(subset='A', keep="first")

without any groupby

Solution 5 - Python

Try this:

df.groupby(['A']).max()

Solution 6 - Python

I was brought here by a link from a duplicate question.

For just two columns, wouldn't it be simpler to do:

df.groupby('A')['B'].max().reset_index()

And to retain a full row (when there are more columns, which is what the "duplicate question" that brought me here was asking):

df.loc[df.groupby(...)[column].idxmax()]

For example, to retain the full row where 'C' takes its max, for each group of ['A', 'B'], we would do:

out = df.loc[df.groupby(['A', 'B')['C'].idxmax()]

When there are relatively few groups (i.e., lots of duplicates), this is faster than the drop_duplicates() solution (less sorting):

Setup:

n = 1_000_000
df = pd.DataFrame({
    'A': np.random.randint(0, 20, n),
    'B': np.random.randint(0, 20, n),
    'C': np.random.uniform(size=n),
    'D': np.random.choice(list('abcdefghijklmnopqrstuvwxyz'), size=n),
})

(Adding sort_index() to ensure equal solution):

%timeit df.loc[df.groupby(['A', 'B'])['C'].idxmax()].sort_index()
# 101 ms ± 98.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit df.sort_values(['C', 'A', 'B'], ascending=False).drop_duplicates(['A', 'B']).sort_index()
# 667 ms ± 784 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

Solution 7 - Python

I think in your case you don't really need a groupby. I would sort by descending order your B column, then drop duplicates at column A and if you want you can also have a new nice and clean index like that:

df.sort_values('B', ascending=False).drop_duplicates('A').sort_index().reset_index(drop=True)

Solution 8 - Python

You can try this as well

df.drop_duplicates(subset='A', keep='last')

I referred this from https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html

Solution 9 - Python

Here's a variation I had to solve that's worth sharing: for each unique string in columnA I wanted to find the most common associated string in columnB.

df.groupby('columnA').agg({'columnB': lambda x: x.mode().any()}).reset_index()

The .any() picks one if there's a tie for the mode. (Note that using .any() on a Series of ints returns a boolean rather than picking one of them.)

For the original question, the corresponding approach simplifies to

df.groupby('columnA').columnB.agg('max').reset_index().

Solution 10 - Python

Easiest way to do this:

# First you need to sort this DF as Column A as ascending and column B as descending 
# Then you can drop the duplicate values in A column 
# Optional - you can reset the index and get the nice data frame again
# I'm going to show you all in one step. 

d = {'A': [1,1,2,3,1,2,3,1], 'B': [30, 40,50,42,38,30,25,32]}
df = pd.DataFrame(data=d)
df

	A	B
0	1	30
1	1	40
2	2	50
3	3	42
4	1	38
5	2	30
6	3	25
7	1	32


df = df.sort_values(['A','B'], ascending =[True,False]).drop_duplicates(['A']).reset_index(drop=True)

df

	A	B
0	1	40
1	2	50
2	3	42

Solution 11 - Python

When already given posts answer the question, I made a small change by adding the column name on which the max() function is applied for better code readability.

df.groupby('A', as_index=False)['B'].max()

Solution 12 - Python

this also works:

a=pd.DataFrame({'A':a.groupby('A')['B'].max().index,'B':a.groupby('A')       ['B'].max().values})

Solution 13 - Python

I am not going to give you the whole answer (I don't think you're looking for the parsing and writing to file part anyway), but a pivotal hint should suffice: use python's set() function, and then sorted() or .sort() coupled with .reverse():

>>> a=sorted(set([10,60,30,10,50,20,60,50,60,10,30]))
>>> a
[10, 20, 30, 50, 60]
>>> a.reverse()
>>> a
[60, 50, 30, 20, 10]

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionAbeView Question on Stackoverflow
Solution 1 - PythonWes McKinneyView Answer on Stackoverflow
Solution 2 - PythonTed PetrouView Answer on Stackoverflow
Solution 3 - PythonGil BaggioView Answer on Stackoverflow
Solution 4 - PythonNobelView Answer on Stackoverflow
Solution 5 - PythoneumiroView Answer on Stackoverflow
Solution 6 - PythonPierre DView Answer on Stackoverflow
Solution 7 - PythonwhateverosView Answer on Stackoverflow
Solution 8 - PythonVenkatView Answer on Stackoverflow
Solution 9 - PythonmistabenView Answer on Stackoverflow
Solution 10 - PythonrraView Answer on Stackoverflow
Solution 11 - PythonBhagabat BeheraView Answer on Stackoverflow
Solution 12 - PythonMaheshView Answer on Stackoverflow
Solution 13 - PythonAbhranil DasView Answer on Stackoverflow