How to test if a string contains one of the substrings in a list, in pandas?

Python Problem Overview

Is there any function that would be the equivalent of a combination of df.isin() and df[col].str.contains()?

For example, say I have the series s = pd.Series(['cat','hat','dog','fog','pet']) and I want to find all places where s contains any of ['og', 'at']. I would want to get everything but 'pet'.

I have a solution, but it's rather inelegant:

searchfor = ['og', 'at']
found = [s.str.contains(x) for x in searchfor]
result = pd.DataFrame(found)
result.any()

Is there a better way to do this?

Python Solutions

Solution 1 - Python

One option is to use the regex | (OR) character to match any of the substrings against the words in your Series s (still using str.contains).

You can construct the regex by joining the words in searchfor with |:

>>> searchfor = ['og', 'at']
>>> s[s.str.contains('|'.join(searchfor))]
0    cat
1    hat
2    dog
3    fog
dtype: object

As @AndyHayden noted in the comments below, take care if your substrings have special characters such as $ and ^ which you want to match literally. These characters have specific meanings in the context of regular expressions and will affect the matching.

You can make your list of substrings safer by escaping non-alphanumeric characters with re.escape:

>>> import re
>>> matches = ['$money', 'x^y']
>>> safe_matches = [re.escape(m) for m in matches]
>>> safe_matches
['\\$money', 'x\\^y']

The strings within this new list will match each character literally when used with str.contains.
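As a minimal sketch of how the escaped list might then be used, the escaped strings can be joined with | and passed to str.contains. The sample Series s2 and its values are made up for illustration and are not part of the original answer:

```python
import re

import pandas as pd

# Hypothetical sample data, chosen only to illustrate literal matching.
s2 = pd.Series(['$money', 'x^y', 'money', 'xy'])
matches = ['$money', 'x^y']

# Escape each substring so that $ and ^ are matched literally,
# then join the escaped substrings with the regex OR operator.
pattern = '|'.join(re.escape(m) for m in matches)

# Keep only the entries that contain one of the literal substrings.
result = s2[s2.str.contains(pattern)]
print(result)
```

Without the escaping step, $ and ^ would be interpreted as regex anchors rather than literal characters, so 'money' and 'xy' would be treated very differently from what the list of substrings suggests.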

Solution 2 - Python

You can use str.contains alone with a regex pattern using OR (|):

s[s.str.contains('og|at')]

Or you could add the series to a dataframe then use str.contains:

df = pd.DataFrame(s)
df[s.str.contains('og|at')]

Output:

0    cat
1    hat
2    dog
3    fog
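One related caveat, sketched here under assumptions not stated in the answer: if the Series may contain missing values, str.contains can return a missing value for those entries (depending on the pandas version and dtype), and such a result may not be usable directly as a boolean mask. Assuming a hypothetical Series s3 with a None entry, passing na=False treats missing entries as non-matches:

```python
import pandas as pd

# Hypothetical Series with a missing value, for illustration only.
s3 = pd.Series(['cat', None, 'dog'])

# na=False makes missing entries count as "no match",
# so the result is a clean boolean mask.
mask = s3.str.contains('og|at', na=False)

print(s3[mask])
```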

Solution 3 - Python

Here is a one-line lambda that also works:

df["TrueFalse"] = df['col1'].apply(lambda x: 1 if any(i in x for i in searchfor) else 0)

Input:

searchfor = ['og', 'at']

df = pd.DataFrame([('cat', 1000.0), ('hat', 2000000.0), ('dog', 1000.0), ('fog', 330000.0),('pet', 330000.0)], columns=['col1', 'col2'])

  col1       col2
0  cat     1000.0
1  hat  2000000.0
2  dog     1000.0
3  fog   330000.0
4  pet   330000.0

Apply Lambda:

df["TrueFalse"] = df['col1'].apply(lambda x: 1 if any(i in x for i in searchfor) else 0)

Output:

  col1       col2  TrueFalse
0  cat     1000.0          1
1  hat  2000000.0          1
2  dog     1000.0          1
3  fog   330000.0          1
4  pet   330000.0          0
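If the goal is to filter rows rather than add a 1/0 flag column, the same idea can be sketched with a boolean mask. This is a variation on the answer above, not the author's code; the names mask and filtered are introduced here for illustration:

```python
import pandas as pd

searchfor = ['og', 'at']
df = pd.DataFrame({'col1': ['cat', 'hat', 'dog', 'fog', 'pet']})

# Build a boolean mask: True where any substring occurs in the value.
mask = df['col1'].apply(lambda x: any(sub in x for sub in searchfor))

# Use the mask to keep only the matching rows.
filtered = df[mask]
print(filtered)
```

Because this uses Python's in operator rather than a regex, special characters such as $ and ^ in searchfor are matched literally without any escaping.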

Content Type	Original Author	Original Content on Stackoverflow
Question	ari	View Question on Stackoverflow
Solution 1 - Python	Alex Riley	View Answer on Stackoverflow
Solution 2 - Python	l'L'l	View Answer on Stackoverflow
Solution 3 - Python	Grant Shannon	View Answer on Stackoverflow
