How to filter rows in pandas by regex
Problem Overview
I would like to cleanly filter a dataframe using regex on one of the columns.
For a contrived example:
In [210]: foo = pd.DataFrame({'a' : [1,2,3,4], 'b' : ['hi', 'foo', 'fat', 'cat']})
In [211]: foo
Out[211]:
a b
0 1 hi
1 2 foo
2 3 fat
3 4 cat
I want to filter the rows to those that start with f
using a regex. First go:
In [213]: foo.b.str.match('f.*')
Out[213]:
0 []
1 ()
2 ()
3 []
That's not too terribly useful. However this will get me my boolean index:
In [226]: foo.b.str.match('(f.*)').str.len() > 0
Out[226]:
0 False
1 True
2 True
3 False
Name: b
So I could then do my restriction by:
In [229]: foo[foo.b.str.match('(f.*)').str.len() > 0]
Out[229]:
a b
1 2 foo
2 3 fat
That makes me artificially put a group into the regex though, and seems like maybe not the clean way to go. Is there a better way to do this?
Python Solutions
Solution 1 - Python
Use contains instead:
In [10]: df.b.str.contains('^f')
Out[10]:
0 False
1 True
2 True
3 False
Name: b, dtype: bool
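Since contains returns a plain boolean Series, it can be used to index the frame directly; a minimal sketch on the question's toy frame:

```python
import pandas as pd

# the toy frame from the question
foo = pd.DataFrame({'a': [1, 2, 3, 4], 'b': ['hi', 'foo', 'fat', 'cat']})

# contains('^f') yields a boolean mask: True where b starts with 'f'
mask = foo.b.str.contains('^f')
result = foo[mask]  # keeps only the 'foo' and 'fat' rows
```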
Solution 2 - Python
There is already a string handling function Series.str.startswith()
.
You should try foo[foo.b.str.startswith('f')]
.
Result:
a b
1 2 foo
2 3 fat
I think that is what you expect.
Alternatively, you can use contains with the regex option. For example:
foo[foo.b.str.contains('oo', regex=True, na=False)]
Result:
a b
1 2 foo
na=False
prevents errors in case there are NaN, null, etc. values.
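A small sketch of what na=False buys you, using a hypothetical frame with a missing value:

```python
import pandas as pd

# hypothetical frame: one entry in b is missing
df = pd.DataFrame({'b': ['foo', None, 'fat', 'cat']})

# without na=False the mask would contain a missing value and could not
# be used for boolean indexing; na=False counts missing values as no-match
mask = df.b.str.contains('^f', na=False)
result = df[mask]
```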
Solution 3 - Python
It may be a bit late, but this is now easier to do in Pandas by calling Series.str.match
. The docs explain the difference between match
, fullmatch
and contains
.
Note that in order to use the results for indexing, set the na=False
argument (or True
if you want to include NaNs in the results).
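For example (a sketch on a variant of the question's frame with a missing value; note that match anchors at the start of the string, so no ^ is needed):

```python
import pandas as pd

foo = pd.DataFrame({'a': [1, 2, 3, 4], 'b': ['hi', 'foo', 'fat', None]})

# match tests the regex from the start of each string;
# na=False makes the missing value count as a non-match
result = foo[foo.b.str.match('f', na=False)]
```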
Solution 4 - Python
Multiple column search with dataframe:
frame[frame.filename.str.match('.*' + MetaData + '.*') & frame.file_path.str.match(r'C:\\test\\test\.txt')]
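A runnable sketch of the same idea; the frame, its column values, and the keyword stand-in for MetaData below are invented for illustration:

```python
import pandas as pd

frame = pd.DataFrame({
    'filename': ['report_final.csv', 'draft.txt', 'final_notes.txt'],
    'file_path': [r'C:\test\test.txt', r'C:\other\x.txt', r'C:\test\test.txt'],
})

keyword = 'final'  # hypothetical stand-in for MetaData

# both masks are boolean Series, so they can be combined with &
result = frame[
    frame.filename.str.match('.*' + keyword + '.*')
    & frame.file_path.str.match(r'C:\\test\\test\.txt')  # backslashes escaped for the regex
]
```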
Solution 5 - Python
Building off of the great answer by user3136169, here is an example of how that might be done also removing NoneType values.
import re

def regex_filter(val):
    if val:
        mo = re.search(regex, val)
        if mo:
            return True
        else:
            return False
    else:
        return False

df_filtered = df[df['col'].apply(regex_filter)]
You can also add regex as an arg:
def regex_filter(val, regex):
    ...

df_filtered = df[df['col'].apply(regex_filter, regex=myregex)]
Solution 6 - Python
Write a Boolean function that checks the regex and use apply on the column
foo[foo['b'].apply(regex_function)]
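For instance, with regex_function (the placeholder name above) implemented via re.search:

```python
import re
import pandas as pd

foo = pd.DataFrame({'a': [1, 2, 3, 4], 'b': ['hi', 'foo', 'fat', 'cat']})

def regex_function(val):
    # True when the value starts with 'f'
    return bool(re.search('^f', val))

result = foo[foo['b'].apply(regex_function)]
```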
Solution 7 - Python
Using Python's built-in ability to write lambda expressions, we could filter by an arbitrary regex operation as follows:
import re
# with foo being our pd dataframe
foo[foo['b'].apply(lambda x: True if re.search('^f', x) else False)]
By using re.search you can filter by complex regex style queries, which is more powerful in my opinion (as str.contains
is rather limited).
Also important to mention: you want your string to start with a lowercase 'f'. With the regex f.*
you match an 'f' at an arbitrary location within your text. By using the ^
symbol you explicitly state that you want the match at the beginning of the string. So ^f
would probably be a better idea :)
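A quick sketch of the difference, on the question's frame plus one extra made-up row containing an 'f' in the middle:

```python
import re
import pandas as pd

foo = pd.DataFrame({'b': ['hi', 'foo', 'fat', 'cafe']})

# 'f.*' finds an 'f' anywhere in the string, so 'cafe' matches too
anywhere = foo[foo['b'].apply(lambda x: bool(re.search('f.*', x)))]

# '^f' only matches when the string starts with 'f'
start = foo[foo['b'].apply(lambda x: bool(re.search('^f', x)))]
```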
Solution 8 - Python
Using str
slice
foo[foo.b.str[0]=='f']
Out[18]:
a b
1 2 foo
2 3 fat
Solution 9 - Python
You can use query
in combination with contains
:
foo.query('b.str.contains("^f").values')
Alternatively you can also use startswith
:
foo.query('b.str.startswith("f").values')
However I prefer the first alternative since it allows you to search for multiple patterns using the |
operator.
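A sketch of the multiple-pattern case; here engine='python' is passed explicitly so the string method works regardless of whether numexpr is installed (a cautious variation on the .values trick above):

```python
import pandas as pd

foo = pd.DataFrame({'a': [1, 2, 3, 4], 'b': ['hi', 'foo', 'fat', 'cat']})

# one regex matches rows starting with either 'f' or 'c'
result = foo.query('b.str.contains("^f|^c")', engine='python')
```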