Search for String in all Pandas DataFrame columns and filter

PythonPandas

Python Problem Overview


Thought this would be straight forward but had some trouble tracking down an elegant way to search all columns in a dataframe at same time for a partial string match. Basically how would I apply df['col1'].str.contains('^') to an entire dataframe at once and filter down to any rows that have records containing the match?

Python Solutions


Solution 1 - Python

The Series.str.contains method expects a regex pattern (by default), not a literal string. Therefore str.contains("^") matches the beginning of any string. Since every string has a beginning, everything matches. Instead use str.contains("\^") to match the literal ^ character.

To check every column, you could use for col in df to iterate through the column names, and then call str.contains on each column:

mask = np.column_stack([df[col].str.contains(r"\^", na=False) for col in df])
df.loc[mask.any(axis=1)]

Alternatively, you could pass regex=False to str.contains to make the test use the Python in operator; but (in general) using regex is faster.

Solution 2 - Python

Try with :

df.apply(lambda row: row.astype(str).str.contains('TEST').any(), axis=1)

Solution 3 - Python

Here's a function to solve the problem of doing text search in all column of a dataframe df:

def search(regex: str, df, case=False):
    """Search all the text columns of `df`, return rows with any matches."""
    textlikes = df.select_dtypes(include=[object, "string"])
    return df[
        textlikes.apply(
            lambda column: column.str.contains(regex, regex=True, case=case, na=False)
        ).any(axis=1)
    ]

It differs from the existing answers by both staying in the pandas API and embracing that pandas is more efficient in column processing than row processing. Also, this is packed as a pure function :-)

Relevant docs:

Solution 4 - Python

posting my findings in case someone would need.

i had a Dataframe (360 000 rows), needed to search across the whole dataframe to find the rows (just a few) that contained word 'TOTAL' (any variation eg 'TOTAL PRICE', 'TOTAL STEMS' etc) and delete those rows.

i finally processed the dataframe in two-steps:

FIND COLUMNS THAT CONTAIN THE WORD:

for i in df.columns:
df[i].astype('str').apply(lambda x: print(df[i].name) if x.startswith('TOTAL') else 'pass')

DELETE THE ROWS:

df[df['LENGTH/ CMS'].str.contains('TOTAL') != True]

Solution 5 - Python

Here is an example using applymap. I found other answers didn't work for me since they assumed that all data in a column would be strings causing Attribute Errors. Also it is surprisingly fast.

def search(dataFrame, item):
  mask = (dataFrame.applymap(lambda x: isinstance(x, str) and item in x)).any(1)
  return dataFrame[mask]

You can easily change the lambda to use regex if needed.

Solution 6 - Python

Yet another solution. This selects for columns of type object, which is Panda's type for strings. Other solutions that coerce to str with .astype(str) could give false positives if you're searching for a number (and want to exclude numeric columns and only search in strings -- but if you want to include searching numeric columns it may be the better approach).

As an added benefit, filtering the columns in this way seems to have a performance benefit; on my dataframe of shape (15807, 35), with only 17 of those 35 being strings, I see 4.74 s ± 108 ms per loop as compared to 5.72 s ± 155 ms.

df[
    df.select_dtypes(object)
    .apply(lambda row: row.str.contains("with"), axis=1)
    .any(axis=1)
]

Solution 7 - Python

Building on top of @unutbu's answer https://stackoverflow.com/a/26641085/2839786

I use something like this:

>>> import pandas as pd
>>> import numpy as np
>>>
>>> def search(df: pd.DataFrame, substring: str, case: bool = False) -> pd.DataFrame:
...     mask = np.column_stack([df[col].astype(str).str.contains(substring.lower(), case=case, na=False) for col in df])
...     return df.loc[mask.any(axis=1)]
>>>
>>> # test
>>> df = pd.DataFrame({'col1':['hello', 'world', 'Sun'], 'col2': ['today', 'sunny', 'foo'], 'col3': ['WORLD', 'NEWS', 'bar']})
>>> df
    col1   col2   col3
0  hello  today  WORLD
1  world  sunny   NEWS
2    Sun    foo    bar
>>>
>>> search(df, 'sun')
    col1   col2  col3
1  world  sunny  NEWS
2    Sun    foo   bar

Solution 8 - Python

Alternatively you can use eq and any:

df[df.eq('^').any(axis=1)]

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
Questionhoratio1701dView Question on Stackoverflow
Solution 1 - PythonunutbuView Answer on Stackoverflow
Solution 2 - PythonPuneet SinhaView Answer on Stackoverflow
Solution 3 - PythonthorbjornwolfView Answer on Stackoverflow
Solution 4 - PythonCiroView Answer on Stackoverflow
Solution 5 - PythonGage MillerView Answer on Stackoverflow
Solution 6 - Pythonn8henrieView Answer on Stackoverflow
Solution 7 - PythonAziz AltoView Answer on Stackoverflow
Solution 8 - PythonrachwaView Answer on Stackoverflow