How to count duplicate rows in pandas dataframe?

PythonPandas

Python Problem Overview


I am trying to count the duplicates of each type of row in my dataframe. For example, say that I have a dataframe in pandas as follows:

df = pd.DataFrame({'one': pd.Series([1., 1, 1]),
                   'two': pd.Series([1., 2., 1])})

I get a df that looks like this:

	one	two
0	1	1
1	1	2
2	1	1

I imagine the first step is to find all the different unique rows, which I do by:

df.drop_duplicates()

This gives me the following df:

	one	two
0	1	1
1	1	2

Now I want to take each row from the above df ([1 1] and [1 2]) and get a count of how many times each is in the initial df. My result would look something like this:

Row     Count
[1 1]     2
[1 2]     1

How should I go about doing this last step?

Edit:

Here's a larger example to make it more clear:

df = pd.DataFrame({'one': pd.Series([True, True, True, False]),
                   'two': pd.Series([True, False, False, True]),
                   'three': pd.Series([True, False, False, False])})

gives me:

	one	three	two
0	True	True	True
1	True	False	False
2	True	False	False
3	False	False	True

I want a result that tells me:

       Row           Count
[True True True]       1
[True False	False]     2
[False False True]     1

Python Solutions


Solution 1 - Python

You can groupby on all the columns and call size the index indicates the duplicate values:

In [28]:
df.groupby(df.columns.tolist(),as_index=False).size()

Out[28]:
one    three  two  
False  False  True     1
True   False  False    2
       True   True     1
dtype: int64

Solution 2 - Python

df.groupby(df.columns.tolist()).size().reset_index().\
    rename(columns={0:'records'})

   one  two  records
0    1    1        2
1    1    2        1

Solution 3 - Python

Specific to your question, as the others mentioned fast and easy way would be:

df.groupby(df.columns.tolist(),as_index=False).size()

If you like to count duplicates on particular column(s):

len(df['one'])-len(df['one'].drop_duplicates())

If you want to count duplicates on entire dataframe:

len(df)-len(df.drop_duplicates())

Or simply you can use DataFrame.duplicated(subset=None, keep='first'):

df.duplicated(subset='one', keep='first').sum()

where

subset : column label or sequence of labels(by default use all of the columns)

keep : {‘first’, ‘last’, False}, default ‘first’

  • first : Mark duplicates as True except for the first occurrence.

  • last : Mark duplicates as True except for the last occurrence.

  • False : Mark all duplicates as True.

Solution 4 - Python

I use:

used_features =[    "one",    "two",    "three"]

df['is_duplicated'] = df.duplicated(used_features)
df['is_duplicated'].sum()

which gives count of duplicated rows, and then you can analyse them by a new column. I didn't see such solution here.

Solution 5 - Python

None of the existing answers quite offers a simple solution that returns "the number of rows that are just duplicates and should be cut out". This is a one-size-fits-all solution that does:

# generate a table of those culprit rows which are duplicated:
dups = df.groupby(df.columns.tolist()).size().reset_index().rename(columns={0:'count'})

# sum the final col of that table, and subtract the number of culprits:
dups['count'].sum() - dups.shape[0]

Solution 6 - Python

df = pd.DataFrame({'one' : pd.Series([1., 1, 1, 3]), 'two' : pd.Series([1., 2., 1, 3] ), 'three' : pd.Series([1., 2., 1, 2] )})
df['str_list'] = df.apply(lambda row: ' '.join([str(int(val)) for val in row]), axis=1)
df1 = pd.DataFrame(df['str_list'].value_counts().values, index=df['str_list'].value_counts().index, columns=['Count'])

Produces:

>>> df1
       Count
1 1 1      2
3 2 3      1
1 2 2      1

If the index values must be a list, you could take the above code a step further with:

df1.index = df1.index.str.split()

Produces:

           Count
[1, 1, 1]      2
[3, 2, 3]      1
[1, 2, 2]      1

Solution 7 - Python

If you find some counts missing or get error: ValueError: Length mismatch: Expected axis has nnnn elements, new values have mmmm elements, read here:

1. Count duplicate rows with NaN entries:

The accepted solution is great and believed to have been helpful to many members. In a recent task, I found it can be further fine-tuned to support complete counting of a dataframe with NaN entries. Pandas supports missing entries or null values as NaN values. Let's see what's the output for this use case when our dataframe contains NaN entries:

  Col1  Col2 Col3 Col4
0  ABC   123  XYZ  NaN       # group #1 of 3
1  ABC   123  XYZ  NaN       # group #1 of 3
2  ABC   678  PQR  def           # group #2 of 1
3  MNO   890  EFG  abc               # group #3 of 4 
4  MNO   890  EFG  abc               # group #3 of 4 
5  CDE   234  567  xyz                   # group #4 of 2 
6  ABC   123  XYZ  NaN       # group #1 of 3
7  CDE   234  567  xyz                   # group #4 of 2 
8  MNO   890  EFG  abc               # group #3 of 4 
9  MNO   890  EFG  abc               # group #3 of 4 

Applying the code:

df.groupby(df.columns.tolist(),as_index=False).size()

gives:

  Col1  Col2 Col3 Col4  size
0  ABC   678  PQR  def     1
1  CDE   234  567  xyz     2
2  MNO   890  EFG  abc     4

Oh, how come the count of Group #1 with 3 duplicate rows is missing?!

For some Pandas version, you may get an error instead: ValueError: Length mismatch: Expected axis has nnnn elements, new values have mmmm elements

Solution:

Use the parameter dropna= for the .groupby() function, as follows:

df.groupby(df.columns.tolist(), as_index=False, dropna=False).size()

gives:

  Col1  Col2 Col3 Col4  size
0  ABC   123  XYZ  NaN     3          # <===  count of rows with `NaN`
1  ABC   678  PQR  def     1
2  CDE   234  567  xyz     2
3  MNO   890  EFG  abc     4

The count of duplicate rows with NaN can be successfully output with dropna=False. This parameter has been supported since Pandas version 1.1.0


2. Alternative Solution

Another way to count duplicate rows with NaN entries is as follows:

df.value_counts(dropna=False).reset_index(name='count')

gives:

  Col1  Col2 Col3 Col4  count
0  MNO   890  EFG  abc      4
1  ABC   123  XYZ  NaN      3
2  CDE   234  567  xyz      2
3  ABC   678  PQR  def      1

Here, we use the .value_counts() function with also the parameter dropna=False. However, this parameter has been supported only recently since Pandas version 1.3.0 If your version is older than this, you'll need to use the .groupby() solution if you want to get complete counts for rows with NaN entries.

You will see that the output is in different sequence than the previous result. The counts are sorted in descending order. If you want to get unsorted result, you can specify sort=False:

df.value_counts(dropna=False, sort=False).reset_index(name='count')

it gives the same result as the df.groupby(df.columns.tolist(), as_index=False, dropna=False).size() solution:

  Col1  Col2 Col3 Col4  count
0  ABC   123  XYZ  NaN      3
1  ABC   678  PQR  def      1
2  CDE   234  567  xyz      2
3  MNO   890  EFG  abc      4

Note that this .value_counts() solution supports dataframes both with and without NaN entries and can be used as a general solution.

In fact, in the underlying implementation codes .value_counts() calls GroupBy.size to get the counts: click the link to see the underlying codes: counts = self.groupby(subset, dropna=dropna).grouper.size()

Hence, for this use case, .value_counts() and the .groupby() solution in the accepted solution are actually doing the same thing. We should be able to use the .value_counts() function to get the desired counts of duplicate rows equally well.

Use of .value_counts() function to get counts of duplicate rows has the additional benefit that its syntax is simpler. You can simply use df.value_counts() or df.value_counts(dropna=False) depending on whether your dataframe contains NaN or not. Chain with .reset_index() if you want the result as a dataframe instead of a Series.

Solution 8 - Python

ran into this problem today and wanted to include NaNs so I replace them temporarily with "" (empty string). Please comment if you do not understand something :). This solution assumes that "" is not a relevant value for you. It should also work with numerical data (I have tested it sucessfully but not extensively) since pandas will infer the data type again after replacing "" with np.nan.

import pandas as pd

# create test data
df = pd.DataFrame({'test':['foo','bar',None,None,'foo'],
                  'test2':['bar',None,None,None,'bar'],
                  'test3':[None, 'foo','bar',None,None]})

# fill null values with '' to not lose them during groupby
# groupby all columns and calculate the length of the resulting groups
# rename the series obtained with groupby to "group_count"
# reset the index to get a DataFrame
# replace '' with np.nan (this reverts our first operation)
# sort DataFrame by "group_count" descending
df = (df.fillna('')\
      .groupby(df.columns.tolist()).apply(len)\
      .rename('group_count')\
      .reset_index()\
      .replace('',np.nan)\
      .sort_values(by = ['group_count'], ascending = False))
df
  test test2 test3  group_count
3  foo   bar   NaN            2
0  NaN   NaN   NaN            1
1  NaN   NaN   bar            1
2  bar   NaN   foo            1

Solution 9 - Python

To count rows in DataFrame you can use the method value_counts (Pandas 1.1.0):

df = pd.DataFrame({'A': [1, 1, 2, 2, 3], 'B': [10, 10, 20, 20, 30]})

df.value_counts().reset_index(name='counts').query('counts > 1')

Output:

   A   B  counts
0  1  10       2
1  2  20       2

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
Questionjss367View Question on Stackoverflow
Solution 1 - PythonEdChumView Answer on Stackoverflow
Solution 2 - PythonDenisView Answer on Stackoverflow
Solution 3 - PythonArashView Answer on Stackoverflow
Solution 4 - PythonSergey ZaitsevView Answer on Stackoverflow
Solution 5 - PythonolisteadmanView Answer on Stackoverflow
Solution 6 - PythonJaradView Answer on Stackoverflow
Solution 7 - PythonSeaBeanView Answer on Stackoverflow
Solution 8 - PythonThibTripView Answer on Stackoverflow
Solution 9 - PythonMykola ZotkoView Answer on Stackoverflow