Pyspark dataframe operator "IS NOT IN"
Pyspark Problem Overview
I would like to rewrite this from R to Pyspark. Any nice-looking suggestions?

array <- c(1, 2, 3)
dataset <- filter(dataset, !(column %in% array))
Pyspark Solutions
Solution 1 - Pyspark
In pyspark you can do it like this:
array = [1, 2, 3]
dataframe.filter(dataframe.column.isin(array) == False)
Or using the ~ operator (bitwise NOT, which PySpark overloads as logical negation on column expressions):
dataframe.filter(~dataframe.column.isin(array))
Solution 2 - Pyspark
Use the ~ operator, which negates the condition:
df_filtered = df.filter(~df["column_name"].isin([1, 2, 3]))
Solution 3 - Pyspark
df_result = df[df.column_name.isin([1, 2, 3]) == False]
Solution 4 - Pyspark
Slightly different syntax, with a "date" data set:

toGetDates = {'2017-11-09', '2017-11-11', '2017-11-12'}
df = df.filter(df['DATE'].isin(toGetDates) == False)
Solution 5 - Pyspark
The * unpacking operator is not needed, because isin() accepts a list directly (values is used below instead of list to avoid shadowing the built-in). So:

values = [1, 2, 3]
dataframe.filter(~dataframe.column.isin(values))
Solution 6 - Pyspark
You can use .subtract(), buddy.

Example:

df1 = df.filter(df.column.isin([1, 2, 3]))
df2 = df.subtract(df1)

This way, df2 is defined as everything in df that is not in df1.
Solution 7 - Pyspark
You can also loop the array and filter:
array = [1, 2, 3]
for i in array:
    df = df.filter(df["column"] != i)