How to remove a pandas dataframe from another dataframe
PythonPandasDataframeSubtractionPython Problem Overview
How to remove a pandas dataframe from another dataframe, just like the set subtraction:
a=[1,2,3,4,5]
b=[1,5]
a-b=[2,3,4]
And now we have two pandas dataframe, how to remove df2 from df1:
In [5]: df1=pd.DataFrame([[1,2],[3,4],[5,6]],columns=['a','b'])
In [6]: df1
Out[6]:
a b
0 1 2
1 3 4
2 5 6
In [9]: df2=pd.DataFrame([[1,2],[5,6]],columns=['a','b'])
In [10]: df2
Out[10]:
a b
0 1 2
1 5 6
Then we expect df1-df2 result will be:
In [14]: df
Out[14]:
a b
0 3 4
How to do it?
Thank you.
Python Solutions
Solution 1 - Python
Solution
Use pd.concat
followed by drop_duplicates(keep=False)
pd.concat([df1, df2, df2]).drop_duplicates(keep=False)
It looks like
a b
1 3 4
Explanation
pd.concat
adds the two DataFrame
s together by appending one right after the other. if there is any overlap, it will be captured by the drop_duplicates
method. However, drop_duplicates
by default leaves the first observation and removes every other observation. In this case, we want every duplicate removed. Hence, the keep=False
parameter which does exactly that.
A special note to the repeated df2
. With only one df2
any row in df2
not in df1
won't be considered a duplicate and will remain. This solution with only one df2
only works when df2
is a subset of df1
. However, if we concat df2
twice, it is guaranteed to be a duplicate and will subsequently be removed.
Solution 2 - Python
You can use .duplicated
, which has the benefit of being fairly expressive:
%%timeit
combined = df1.append(df2)
combined[~combined.index.duplicated(keep=False)]
1000 loops, best of 3: 875 µs per loop
For comparison:
%timeit df1.loc[pd.merge(df1, df2, on=['a','b'], how='left', indicator=True)['_merge'] == 'left_only']
100 loops, best of 3: 4.57 ms per loop
%timeit pd.concat([df1, df2, df2]).drop_duplicates(keep=False)
1000 loops, best of 3: 987 µs per loop
%timeit df2[df2.apply(lambda x: x.value not in df2.values, axis=1)]
1000 loops, best of 3: 546 µs per loop
In sum, using the np.array
comparison is fastest. Don't need the .tolist()
there.
Solution 3 - Python
A set logic approach. Turn the rows of df1
and df2
into sets. Then use set
subtraction to define new DataFrame
idx1 = set(df1.set_index(['a', 'b']).index)
idx2 = set(df2.set_index(['a', 'b']).index)
pd.DataFrame(list(idx1 - idx2), columns=df1.columns)
a b
0 3 4
Solution 4 - Python
To get dataframe with all records which are in DF1 but not in DF2
DF=DF1[~DF1.isin(DF2)].dropna(how = 'all')
Solution 5 - Python
My shot with merge df1 and df2 from the question.
Using 'indicator' parameter
In [74]: df1.loc[pd.merge(df1, df2, on=['a','b'], how='left', indicator=True)['_merge'] == 'left_only']
Out[74]:
a b
1 3 4
Solution 6 - Python
A masking approach
df1[df1.apply(lambda x: x.values.tolist() not in df2.values.tolist(), axis=1)]
a b
1 3 4
Solution 7 - Python
I think the first tolist()
needs to be removed, but keep the second one:
df1[df1.apply(lambda x: x.values() not in df2.values.tolist(), axis=1)]
Solution 8 - Python
An easiest option is to use indexes.
-
Append df1 and df2 and reset their indexes.
df = df1.concat(df2)
df.reset_index(inplace=True)
-
e.g:
This will give df2 indexesindexes_df2 = df.index[ (df["a"].isin(df2["a"]) ) & (df["b"].isin(df2["b"]) ) result_index = df.index[~index_df2] result_data = df.iloc[ result_index,:]
Hope it will help to new readers, although the question posted a little time ago :)
Solution 9 - Python
Solution if df1
contains duplicates + keeps the index.
A modified version of piRSquared's answer to keep the duplicates in df1
that do not appear in df2
, while maintaining the index.
df1[df1.apply(lambda x: (x == pd.concat([df1.drop_duplicates(), df2, df2]).drop_duplicates(keep=False)).all(1).any(), axis=1)]
If your dataframes are big, you may want to store the result of
pd.concat([df1.drop_duplicates(), df2, df2]).drop_duplicates(keep=False)
in a variable before the df1.apply
call.
Solution 10 - Python
This solution works when your df_to_drop is a subset of main data frame data.
data_clean = data.drop(df_to_drop.index)