Search for "does-not-contain" on a DataFrame in pandas

Python, Pandas, Contains

Python Problem Overview


I've done some searching and can't figure out how to filter a dataframe by

df["col"].str.contains(word)

However, I'm wondering if there is a way to do the reverse: filter a DataFrame by that set's complement, e.g. something to the effect of

!(df["col"].str.contains(word))

Can this be done through a DataFrame method?

Python Solutions


Solution 1 - Python

You can use the invert (~) operator (which acts like a not for boolean data):

new_df = df[~df["col"].str.contains(word)]

where new_df is the copy returned by the right-hand side.

contains also accepts a regular expression...
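For instance, a minimal sketch (the column name and the pattern here are made up for illustration) combining ~ with a regex pattern:

import pandas as pd

# toy data; "col" and the pattern "bar|qux" are illustrative only
df = pd.DataFrame({"col": ["foo bar", "baz", "barbaz", "qux"]})
# keep only rows whose "col" does NOT match the regex
new_df = df[~df["col"].str.contains("bar|qux", regex=True)]
print(new_df)  # only the "baz" row survives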


If str.contains throws a ValueError, it's likely because you have mixed data types in the column, so use na=False:

new_df = df[~df["col"].str.contains(word, na=False)]

Or,

new_df = df[df["col"].str.contains(word) == False]
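As a quick illustration of the na=False variant (the data below is made up for the example):

import pandas as pd
import numpy as np

df = pd.DataFrame({"col": ["apple pie", np.nan, "banana"]})
word = "apple"
# na=False treats NaN as "does not contain", so the mask stays boolean
new_df = df[~df["col"].str.contains(word, na=False)]
print(new_df)  # keeps the NaN row and "banana"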

Solution 2 - Python

I was having trouble with the not (~) operator as well, so here's another way, from another Stack Overflow thread:

df[df["col"].str.contains('this|that')==False]

Solution 3 - Python

You can use apply and a lambda:

df[df["col"].apply(lambda x: word not in x)]

Or, if you want to define a more complex rule, you can use and:

df[df["col"].apply(lambda x: word_1 not in x and word_2 not in x)]

Solution 4 - Python

Adding to the answers already posted: here is a framework for finding multiple words and dropping the matching rows from the DataFrame.

Here 'word1', 'word2', 'word3', 'word4' is the list of patterns to search for, df is the DataFrame, and column_a is a column name from df:

values_to_remove = ['word1','word2','word3','word4'] 

pattern = '|'.join(values_to_remove)

result = df.loc[~df['column_a'].str.contains(pattern, case=False)]
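A self-contained sketch of the same idea (the data is illustrative); re.escape is added on the assumption that the search values might contain regex metacharacters:

import re
import pandas as pd

df = pd.DataFrame({"column_a": ["Word1 here", "nothing", "has word3", "clean row"]})
values_to_remove = ['word1', 'word2', 'word3', 'word4']
# escape each value so literal characters are not read as regex syntax
pattern = '|'.join(re.escape(v) for v in values_to_remove)
result = df.loc[~df['column_a'].str.contains(pattern, case=False)]
print(result)  # "nothing" and "clean row" remain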

Solution 5 - Python

I had to get rid of the null (NaN) values before using the command recommended by Andy above. An example:

import pandas as pd

df = pd.DataFrame(index=[0, 1, 2], columns=['first', 'second', 'third'])
df.loc[:, 'first'] = 'myword'
df.loc[0, 'second'] = 'myword'
df.loc[2, 'second'] = 'myword'
df.loc[1, 'third'] = 'myword'
df

    first  second   third
0  myword  myword     NaN
1  myword     NaN  myword
2  myword  myword     NaN

Now running the command:

~df["second"].str.contains(word)

I get the following error:

TypeError: bad operand type for unary ~: 'float'

I got rid of the NULL values using dropna() or fillna() first and retried the command with no problem.
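For example, a minimal sketch (data made up) of the fillna() route, which keeps the rows with missing values:

import pandas as pd
import numpy as np

word = "myword"
df = pd.DataFrame({"second": ["myword", np.nan, "other"]})
# replacing NaN with "" first keeps the mask purely boolean
mask = ~df["second"].fillna("").str.contains(word)
print(df[mask])  # the NaN row and "other" are kept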

Solution 6 - Python

In addition to nanselm2's answer, you can use 0 instead of False:

df["col"].str.contains(word)==0

Solution 7 - Python

To complement the above question: if someone wants to remove all the rows with strings, one could do:

df_new=df[~df['col_name'].apply(lambda x: isinstance(x, str))]
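A quick usage sketch (the mixed-type column is made up for illustration):

import pandas as pd

df = pd.DataFrame({"col_name": ["text", 42, 3.14, "more text"]})
# drop every row whose col_name value is a string
df_new = df[~df["col_name"].apply(lambda x: isinstance(x, str))]
print(df_new)  # only the numeric rows (42 and 3.14) remain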

Solution 8 - Python

To negate your query, use ~. Using query has the advantage of returning the matching rows of df directly:

df.query('~col.str.contains("word").values')
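A minimal usage sketch (the column name and word are placeholders); note that in some pandas versions, string methods inside query only work with engine="python":

import pandas as pd

df = pd.DataFrame({"col": ["word salad", "something else", "another word"]})
# engine="python" is needed for .str accessors in the query string
result = df.query('~col.str.contains("word")', engine="python")
print(result)  # only "something else" remains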

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type         | Original Author | Original Content on Stackoverflow
Question             | stites          | View Question on Stackoverflow
Solution 1 - Python  | Andy Hayden     | View Answer on Stackoverflow
Solution 2 - Python  | nanselm2        | View Answer on Stackoverflow
Solution 3 - Python  | Arash           | View Answer on Stackoverflow
Solution 4 - Python  | Nursnaaz        | View Answer on Stackoverflow
Solution 5 - Python  | Shoresh         | View Answer on Stackoverflow
Solution 6 - Python  | U12-Forward     | View Answer on Stackoverflow
Solution 7 - Python  | vasanth         | View Answer on Stackoverflow
Solution 8 - Python  | rachwa          | View Answer on Stackoverflow