How to filter Pandas dataframe using 'in' and 'not in' like in SQL

PythonPandasDataframeSql Function

Python Problem Overview


How can I achieve the equivalents of SQL's IN and NOT IN?

I have a list with the required values. Here's the scenario:

df = pd.DataFrame({'country': ['US', 'UK', 'Germany', 'China']})
countries_to_keep = ['UK', 'China']

# pseudo-code:
df[df['country'] not in countries_to_keep]

My current way of doing this is as follows:

df = pd.DataFrame({'country': ['US', 'UK', 'Germany', 'China']})
df2 = pd.DataFrame({'country': ['UK', 'China'], 'matched': True})

# IN
df.merge(df2, how='inner', on='country')

# NOT IN
not_in = df.merge(df2, how='left', on='country')
not_in = not_in[pd.isnull(not_in['matched'])]

But this seems like a horrible kludge. Can anyone improve on it?

Python Solutions


Solution 1 - Python

You can use pd.Series.isin.

For "IN" use: something.isin(somewhere)

Or for "NOT IN": ~something.isin(somewhere)

As a worked example:

import pandas as pd

>>> df
  country
0        US
1        UK
2   Germany
3     China
>>> countries_to_keep
['UK', 'China']
>>> df.country.isin(countries_to_keep)
0    False
1     True
2    False
3     True
Name: country, dtype: bool
>>> df[df.country.isin(countries_to_keep)]
  country
1        UK
3     China
>>> df[~df.country.isin(countries_to_keep)]
  country
0        US
2   Germany

Solution 2 - Python

Alternative solution that uses .query() method:

In [5]: df.query("countries in @countries_to_keep")
Out[5]:
  countries
1        UK
3     China

In [6]: df.query("countries not in @countries_to_keep")
Out[6]:
  countries
0        US
2   Germany

Solution 3 - Python

> ## How to implement 'in' and 'not in' for a pandas DataFrame?

Pandas offers two methods: Series.isin and DataFrame.isin for Series and DataFrames, respectively.


Filter DataFrame Based on ONE Column (also applies to Series)

The most common scenario is applying an isin condition on a specific column to filter rows in a DataFrame.

df = pd.DataFrame({'countries': ['US', 'UK', 'Germany', np.nan, 'China']})
df
  countries
0        US
1        UK
2   Germany
3     China

c1 = ['UK', 'China']             # list
c2 = {'Germany'}                 # set
c3 = pd.Series(['China', 'US'])  # Series
c4 = np.array(['US', 'UK'])      # array

Solution 4 - Python

I've been usually doing generic filtering over rows like this:

criterion = lambda row: row['countries'] not in countries
not_in = df[df.apply(criterion, axis=1)]

Solution 5 - Python

Collating possible solutions from the answers:

For IN: df[df['A'].isin([3, 6])]

For NOT IN:

  1. df[-df["A"].isin([3, 6])]

  2. df[~df["A"].isin([3, 6])]

  3. df[df["A"].isin([3, 6]) == False]

  4. df[np.logical_not(df["A"].isin([3, 6]))]

Solution 6 - Python

I wanted to filter out dfbc rows that had a BUSINESS_ID that was also in the BUSINESS_ID of dfProfilesBusIds

dfbc = dfbc[~dfbc['BUSINESS_ID'].isin(dfProfilesBusIds['BUSINESS_ID'])]

Solution 7 - Python

Why is no one talking about the performance of various filtering methods? In fact, this topic often pops up here (see the example). I did my own performance test for a large data set. It is very interesting and instructive.

df = pd.DataFrame({'animals': np.random.choice(['cat', 'dog', 'mouse', 'birds'], size=10**7), 
                   'number': np.random.randint(0,100, size=(10**7,))})

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000000 entries, 0 to 9999999
Data columns (total 2 columns):
 #   Column   Dtype 
---  ------   ----- 
 0   animals  object
 1   number   int64 
dtypes: int64(1), object(1)
memory usage: 152.6+ MB
%%timeit
# .isin() by one column
conditions = ['cat', 'dog']
df[df.animals.isin(conditions)]
367 ms ± 2.34 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
# .query() by one column
conditions = ['cat', 'dog']
df.query('animals in @conditions')
395 ms ± 3.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
# .loc[]
df.loc[(df.animals=='cat')|(df.animals=='dog')]
987 ms ± 5.17 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
df[df.apply(lambda x: x['animals'] in ['cat', 'dog'], axis=1)]
41.9 s ± 490 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
new_df = df.set_index('animals')
new_df.loc[['cat', 'dog'], :]
3.64 s ± 62.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
new_df = df.set_index('animals')
new_df[new_df.index.isin(['cat', 'dog'])]
469 ms ± 8.98 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
s = pd.Series(['cat', 'dog'], name='animals')
df.merge(s, on='animals', how='inner')
796 ms ± 30.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Thus, the isin method turned out to be the fastest and the method with apply() was the slowest, which is not surprising.

Solution 8 - Python

df = pd.DataFrame({'countries':['US','UK','Germany','China']})
countries = ['UK','China']

implement in:

df[df.countries.isin(countries)]

implement not in as in of rest countries:

df[df.countries.isin([x for x in np.unique(df.countries) if x not in countries])]

Solution 9 - Python

You can also use .isin() inside .query():

df.query('country.isin(@countries_to_keep).values')

# Or alternatively:
df.query('country.isin(["UK", "China"]).values')

To negate your query, use ~:

df.query('~country.isin(@countries_to_keep).values')

Solution 10 - Python

A trick if you want to keep the order of the list:

df = pd.DataFrame({'country': ['US', 'UK', 'Germany', 'China']})
countries_to_keep = ['Germany', 'US']


ind=[df.index[df['country']==i].tolist() for i in countries_to_keep]
flat_ind=[item for sublist in ind for item in sublist]

df.reindex(flat_ind)

   country
2  Germany
0       US

Solution 11 - Python

My 2c worth: I needed a combination of in and ifelse statements for a dataframe, and this worked for me.

sale_method = pd.DataFrame(model_data["Sale Method"].str.upper())
sale_method["sale_classification"] = np.where(
    sale_method["Sale Method"].isin(["PRIVATE"]),
    "private",
    np.where(
        sale_method["Sale Method"].str.contains("AUCTION"), "auction", "other"
    ),
)

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionLondonRobView Question on Stackoverflow
Solution 1 - PythonDSMView Answer on Stackoverflow
Solution 2 - PythonMaxU - stop genocide of UAView Answer on Stackoverflow
Solution 3 - Pythoncs95View Answer on Stackoverflow
Solution 4 - PythonKosView Answer on Stackoverflow
Solution 5 - PythonAbhishek GaurView Answer on Stackoverflow
Solution 6 - PythonSam HendersonView Answer on Stackoverflow
Solution 7 - PythonpaduView Answer on Stackoverflow
Solution 8 - PythonIoannis NasiosView Answer on Stackoverflow
Solution 9 - PythonrachwaView Answer on Stackoverflow
Solution 10 - PythonBilly BonarosView Answer on Stackoverflow
Solution 11 - PythonGenDemoView Answer on Stackoverflow