pandas get rows which are NOT in other dataframe

PythonPandasDataframe

Python Problem Overview


I've two pandas data frames that have some rows in common.

Suppose dataframe2 is a subset of dataframe1.

How can I get the rows of dataframe1 which are not in dataframe2?

df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5], 'col2' : [10, 11, 12, 13, 14]}) 
df2 = pandas.DataFrame(data = {'col1' : [1, 2, 3], 'col2' : [10, 11, 12]})

df1

   col1  col2
0     1    10
1     2    11
2     3    12
3     4    13
4     5    14

df2

   col1  col2
0     1    10
1     2    11
2     3    12

Expected result:

   col1  col2
3     4    13
4     5    14

Python Solutions


Solution 1 - Python

The currently selected solution produces incorrect results. To correctly solve this problem, we can perform a left-join from df1 to df2, making sure to first get just the unique rows for df2.

First, we need to modify the original DataFrame to add the row with data [3, 10].

df1 = pd.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 3], 
                           'col2' : [10, 11, 12, 13, 14, 10]}) 
df2 = pd.DataFrame(data = {'col1' : [1, 2, 3],
                           'col2' : [10, 11, 12]})

df1

   col1  col2
0     1    10
1     2    11
2     3    12
3     4    13
4     5    14
5     3    10

df2

   col1  col2
0     1    10
1     2    11
2     3    12

Perform a left-join, eliminating duplicates in df2 so that each row of df1 joins with exactly 1 row of df2. Use the parameter indicator to return an extra column indicating which table the row was from.

df_all = df1.merge(df2.drop_duplicates(), on=['col1','col2'], 
                   how='left', indicator=True)
df_all

   col1  col2     _merge
0     1    10       both
1     2    11       both
2     3    12       both
3     4    13  left_only
4     5    14  left_only
5     3    10  left_only

Create a boolean condition:

df_all['_merge'] == 'left_only'

0    False
1    False
2    False
3     True
4     True
5     True
Name: _merge, dtype: bool

Why other solutions are wrong

A few solutions make the same mistake - they only check that each value is independently in each column, not together in the same row. Adding the last row, which is unique but has the values from both columns from df2 exposes the mistake:

common = df1.merge(df2,on=['col1','col2'])
(~df1.col1.isin(common.col1))&(~df1.col2.isin(common.col2))
0    False
1    False
2    False
3     True
4     True
5    False
dtype: bool

This solution gets the same wrong result:

df1.isin(df2.to_dict('l')).all(1)

Solution 2 - Python

One method would be to store the result of an inner merge form both dfs, then we can simply select the rows when one column's values are not in this common:

In [119]:

common = df1.merge(df2,on=['col1','col2'])
print(common)
df1[(~df1.col1.isin(common.col1))&(~df1.col2.isin(common.col2))]
   col1  col2
0     1    10
1     2    11
2     3    12
Out[119]:
   col1  col2
3     4    13
4     5    14

EDIT

Another method as you've found is to use isin which will produce NaN rows which you can drop:

In [138]:

df1[~df1.isin(df2)].dropna()
Out[138]:
   col1  col2
3     4    13
4     5    14

However if df2 does not start rows in the same manner then this won't work:

df2 = pd.DataFrame(data = {'col1' : [2, 3,4], 'col2' : [11, 12,13]})

will produce the entire df:

In [140]:

df1[~df1.isin(df2)].dropna()
Out[140]:
   col1  col2
0     1    10
1     2    11
2     3    12
3     4    13
4     5    14

Solution 3 - Python

Assuming that the indexes are consistent in the dataframes (not taking into account the actual col values):

df1[~df1.index.isin(df2.index)]

Solution 4 - Python

As already hinted at, isin requires columns and indices to be the same for a match. If match should only be on row contents, one way to get the mask for filtering the rows present is to convert the rows to a (Multi)Index:

In [77]: df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 3], 'col2' : [10, 11, 12, 13, 14, 10]})
In [78]: df2 = pandas.DataFrame(data = {'col1' : [1, 3, 4], 'col2' : [10, 12, 13]})
In [79]: df1.loc[~df1.set_index(list(df1.columns)).index.isin(df2.set_index(list(df2.columns)).index)]
Out[79]:
   col1  col2
1     2    11
4     5    14
5     3    10

If index should be taken into account, set_index has keyword argument append to append columns to existing index. If columns do not line up, list(df.columns) can be replaced with column specifications to align the data.

pandas.MultiIndex.from_tuples(df<N>.to_records(index = False).tolist())

could alternatively be used to create the indices, though I doubt this is more efficient.

Solution 5 - Python

Suppose you have two dataframes, df_1 and df_2 having multiple fields(column_names) and you want to find the only those entries in df_1 that are not in df_2 on the basis of some fields(e.g. fields_x, fields_y), follow the following steps.

Step1.Add a column key1 and key2 to df_1 and df_2 respectively.

Step2.Merge the dataframes as shown below. field_x and field_y are our desired columns.

Step3.Select only those rows from df_1 where key1 is not equal to key2.

Step4.Drop key1 and key2.

This method will solve your problem and works fast even with big data sets. I have tried it for dataframes with more than 1,000,000 rows.

df_1['key1'] = 1
df_2['key2'] = 1
df_1 = pd.merge(df_1, df_2, on=['field_x', 'field_y'], how = 'left')
df_1 = df_1[~(df_1.key2 == df_1.key1)]
df_1 = df_1.drop(['key1','key2'], axis=1)

Solution 6 - Python

a bit late, but it might be worth checking the "indicator" parameter of pd.merge.

See this other question for an example: https://stackoverflow.com/questions/33349797/compare-pandas-dataframes-and-return-rows-that-are-missing-from-the-first-one/42004293#42004293

Solution 7 - Python

This is the best way to do it:

df = df1.drop_duplicates().merge(df2.drop_duplicates(), on=df2.columns.to_list(), 
                   how='left', indicator=True)
df.loc[df._merge=='left_only',df.columns!='_merge']

Note that drop duplicated is used to minimize the comparisons. It would work without them as well. The best way is to compare the row contents themselves and not the index or one/two columns and same code can be used for other filters like 'both' and 'right_only' as well to achieve similar results. For this syntax dataframes can have any number of columns and even different indices. Only the columns should occur in both the dataframes.

Why this is the best way?

  1. index.difference only works for unique index based comparisons

  2. pandas.concat() coupled with drop_duplicated() is not ideal because it will also get rid of the rows which may be only in the dataframe you want to keep and are duplicated for valid reasons.

Solution 8 - Python

I think those answers containing merging are extremely slow. Therefore I would suggest another way of getting those rows which are different between the two dataframes:

df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5], 'col2' : [10, 11, 12, 13, 14]}) 
df2 = pandas.DataFrame(data = {'col1' : [1, 2, 3], 'col2' : [10, 11, 12]})

DISCLAIMER: My solution works if you're interested in one specific column where the two dataframes differ. If you are interested only in those rows, where all columns are equal do not use this approach.

Let's say, col1 is a kind of ID, and you only want to get those rows, which are not contained in both dataframes:

ids_in_df2 = df2.col1.unique()
not_found_ids = df[~df['col1'].isin(ids_in_df2 )]

And that's it. You get a dataframe containing only those rows where col1 isn't appearent in both dataframes.

Solution 9 - Python

you can do it using isin(dict) method:

In [74]: df1[~df1.isin(df2.to_dict('l')).all(1)]
Out[74]:
   col1  col2
3     4    13
4     5    14

Explanation:

In [75]: df2.to_dict('l')
Out[75]: {'col1': [1, 2, 3], 'col2': [10, 11, 12]}

In [76]: df1.isin(df2.to_dict('l'))
Out[76]:
    col1   col2
0   True   True
1   True   True
2   True   True
3  False  False
4  False  False

In [77]: df1.isin(df2.to_dict('l')).all(1)
Out[77]:
0     True
1     True
2     True
3    False
4    False
dtype: bool

Solution 10 - Python

Here is another way of solving this:

df1[~df1.index.isin(df1.merge(df2, how='inner', on=['col1', 'col2']).index)]

Or:

df1.loc[df1.index.difference(df1.merge(df2, how='inner', on=['col1', 'col2']).index)]

Solution 11 - Python

I have an easier way in 2 simple steps: As the OP mentioned Suppose dataframe2 is a subset of dataframe1, columns in the 2 dataframes are the same,

df1 = pd.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 3], 
                           'col2' : [10, 11, 12, 13, 14, 10]}) 
df2 = pd.DataFrame(data = {'col1' : [1, 2, 3],
                           'col2' : [10, 11, 12]})

### Step 1: just append the 2nd df at the end of the 1st df 
df_both = df1.append(df2)

### Step 2: drop rows which contain duplicates, Drop all duplicates.
df_dif = df_both.drop_duplicates(keep=False)

## mission accompliched!
df_dif
Out[20]: 
   col1  col2
3     4    13
4     5    14
5     3    10

Solution 12 - Python

You can also concat df1, df2:

x = pd.concat([df1, df2])

and then remove all duplicates:

y = x.drop_duplicates(keep=False, inplace=False)

Solution 13 - Python

####extract the dissimilar rows using the merge function

df = df.merge(same.drop_duplicates(), on=['col1','col2'], 
               how='left', indicator=True)

####save the dissimilar rows in CSV

df[df['_merge'] == 'left_only'].to_csv('output.csv')

Solution 14 - Python

My way of doing this involves adding a new column that is unique to one dataframe and using this to choose whether to keep an entry

df2[col3] = 1
df1 = pd.merge(df_1, df_2, on=['field_x', 'field_y'], how = 'outer')
df1['Empt'].fillna(0, inplace=True)

This makes it so every entry in df1 has a code - 0 if it is unique to df1, 1 if it is in both dataFrames. You then use this to restrict to what you want

answer = nonuni[nonuni['Empt'] == 0]

Solution 15 - Python

How about this:

df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5], 
                               'col2' : [10, 11, 12, 13, 14]}) 
df2 = pandas.DataFrame(data = {'col1' : [1, 2, 3], 
                               'col2' : [10, 11, 12]})
records_df2 = set([tuple(row) for row in df2.values])
in_df2_mask = np.array([tuple(row) in records_df2 for row in df1.values])
result = df1[~in_df2_mask]

Solution 16 - Python

Easier, simpler and elegant

uncommon_indices = np.setdiff1d(df1.index.values, df2.index.values)
new_df = df1.loc[uncommon_indices,:]

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
Questionthink nice thingsView Question on Stackoverflow
Solution 1 - PythonTed PetrouView Answer on Stackoverflow
Solution 2 - PythonEdChumView Answer on Stackoverflow
Solution 3 - PythonDennis GolomazovView Answer on Stackoverflow
Solution 4 - PythonRune LyngsoeView Answer on Stackoverflow
Solution 5 - PythonPragalbh kulshresthaView Answer on Stackoverflow
Solution 6 - PythonjabellcuView Answer on Stackoverflow
Solution 7 - PythonHamzaView Answer on Stackoverflow
Solution 8 - Pythonlschmidt90View Answer on Stackoverflow
Solution 9 - PythonMaxU - stop genocide of UAView Answer on Stackoverflow
Solution 10 - PythonSergey ZakharovView Answer on Stackoverflow
Solution 11 - PythonneutralnameView Answer on Stackoverflow
Solution 12 - PythonSemeon BalagulaView Answer on Stackoverflow
Solution 13 - PythonGajanan KothawadeView Answer on Stackoverflow
Solution 14 - Pythonr.rzView Answer on Stackoverflow
Solution 15 - PythonadamwlevView Answer on Stackoverflow
Solution 16 - PythonMNKView Answer on Stackoverflow