Filter df when values match part of a string in PySpark

Python, Apache Spark, Pyspark, Apache Spark-Sql

Python Problem Overview


I have a large pyspark.sql.dataframe.DataFrame and I want to keep (i.e. filter for) all rows where the URL stored in the location column contains a pre-determined string, e.g. 'google.com'.

I have tried:

import pyspark.sql.functions as sf
df.filter(sf.col('location').contains('google.com')).show(5)

but this throws a

TypeError: 'Column' object is not callable

How do I go about filtering my df properly? Many thanks in advance!

Python Solutions


Solution 1 - Python

Spark 2.2 onwards

> df.filter(df.location.contains('google.com'))

> See the Spark 2.2 documentation.


Spark 2.1 and before

> You can use plain SQL in filter:
> df.filter("location like '%google.com%'")

> or with DataFrame column methods:
> df.filter(df.location.like('%google.com%'))

> See the Spark 2.1 documentation.
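
As a quick end-to-end check, here is a minimal sketch (assuming a local SparkSession and a small made-up DataFrame with an id and a location column) showing that the contains and LIKE approaches keep the same rows:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Hypothetical sample data: only the first URL contains 'google.com'
df = spark.createDataFrame(
    [(1, "https://www.google.com/search?q=spark"),
     (2, "https://www.example.org/index.html")],
    ["id", "location"],
)

# Spark 2.2+: Column.contains
df.filter(df.location.contains('google.com')).show(truncate=False)

# Spark 2.1 and earlier: SQL LIKE as a string predicate...
df.filter("location like '%google.com%'").show(truncate=False)

# ...or via the Column.like method
df.filter(df.location.like('%google.com%')).show(truncate=False)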

Solution 2 - Python

pyspark.sql.Column.contains() is only available in PySpark 2.2 and above. (On earlier versions the attribute lookup still succeeds, because an unknown attribute on a Column is treated as a nested-field reference and returns another Column; calling that Column is what raises the TypeError from the question.)

df.where(df.location.contains('google.com'))
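
Note that where is an alias for filter, so the two calls are interchangeable; a minimal sketch, assuming the same hypothetical df with a location column as in the question:

# where() and filter() are aliases; both keep rows whose
# location contains the substring
df.where(df.location.contains('google.com')).show()
df.filter(df.location.contains('google.com')).show()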

Solution 3 - Python

When filtering a DataFrame on string values, I find that the lower and upper functions from pyspark.sql.functions come in handy if your data could have column entries like "foo" and "Foo":

import pyspark.sql.functions as sql_fun
result = source_df.filter(sql_fun.lower(source_df.col_name).contains("foo"))
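
For instance, with hypothetical mixed-case entries, lowercasing the column before the substring check keeps both spellings; a minimal sketch assuming a local SparkSession:

import pyspark.sql.functions as sql_fun
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Hypothetical sample data with inconsistent casing
source_df = spark.createDataFrame(
    [("Foo fighters",), ("foo bar",), ("baz",)],
    ["col_name"],
)

# lower() normalizes the column before the substring check,
# so "Foo fighters" and "foo bar" both match
source_df.filter(sql_fun.lower(source_df.col_name).contains("foo")).show()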

Attributions

All content for this solution is sourced from the original question on Stack Overflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type        | Original Author | Original Content on Stack Overflow
Question            | gaatjeniksaan   | View Question on Stack Overflow
Solution 1 - Python | mrsrinivas      | View Answer on Stack Overflow
Solution 2 - Python | joaofbsm        | View Answer on Stack Overflow
Solution 3 - Python | caffreyd        | View Answer on Stack Overflow