Pyspark dataframe operator "IS NOT IN"

Pyspark Problem Overview


I would like to rewrite this from R to PySpark. Any nice-looking suggestions?

array <- c(1,2,3)
dataset <- filter(dataset, !(column %in% array))

Pyspark Solutions


Solution 1 - Pyspark

In PySpark you can do it like this:

array = [1, 2, 3]
dataframe.filter(dataframe.column.isin(array) == False)

Or using the ~ operator, which negates the condition:

dataframe.filter(~dataframe.column.isin(array))

Solution 2 - Pyspark

Use the ~ operator, which negates a condition:

df_filtered = df.filter(~df["column_name"].isin([1, 2, 3]))

Solution 3 - Pyspark

df_result = df[df.column_name.isin([1, 2, 3]) == False]

Solution 4 - Pyspark

Slightly different syntax, applied to a date column:

toGetDates = {'2017-11-09', '2017-11-11', '2017-11-12'}
df = df.filter(df['DATE'].isin(toGetDates) == False)

Solution 5 - Pyspark

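Unpacking the list with * is not needed, because isin accepts a list directly. So:

values = [1, 2, 3]  # avoid naming this "list", which shadows the Python built-in
dataframe.filter(~dataframe.column.isin(values))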

Solution 6 - Pyspark

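You can also use the .subtract() method.

Example:

from pyspark.sql.functions import col

# df1 holds the rows you want to exclude; it must have the same schema as df
df1 = df.filter(col("column").isin([1, 2, 3]))
df2 = df.subtract(df1)

This way, df2 contains every row of df that is not in df1. Note that subtract behaves like SQL EXCEPT DISTINCT, so duplicate rows are also removed from the result.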

Solution 7 - Pyspark

You can also loop over the array and filter out one value at a time:

array = [1, 2, 3]
for i in array:
    df = df.filter(df["column"] != i)

Attributions

All content for this solution is sourced from the original question on Stack Overflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stack Overflow
Question | Babu | View Question on Stack Overflow
Solution 1 - Pyspark | Ryan Widmaier | View Answer on Stack Overflow
Solution 2 - Pyspark | LaSul | View Answer on Stack Overflow
Solution 3 - Pyspark | user7438406 | View Answer on Stack Overflow
Solution 4 - Pyspark | Grant Shannon | View Answer on Stack Overflow
Solution 5 - Pyspark | Johnny M | View Answer on Stack Overflow
Solution 6 - Pyspark | raphael dayan | View Answer on Stack Overflow
Solution 7 - Pyspark | Shadowtrooper | View Answer on Stack Overflow