Spark: subtract two DataFrames

Tags: Apache Spark, Dataframe, Rdd

Apache Spark Problem Overview


In Spark version 1.2.0 one could use subtract with two SchemaRDDs to end up with only the content from the first one that is not in the second:

val onlyNewData = todaySchemaRDD.subtract(yesterdaySchemaRDD)

onlyNewData contains the rows in todaySchemaRDD that do not exist in yesterdaySchemaRDD.

How can this be achieved with DataFrames in Spark version 1.3.0?

Apache Spark Solutions


Solution 1 - Apache Spark

According to the Scala API docs, doing:

dataFrame1.except(dataFrame2)

will return a new DataFrame containing the rows in dataFrame1 but not in dataFrame2.
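
A minimal sketch of this, runnable in a modern spark-shell (the data and column names are made up for illustration; in Spark 1.3 you would build the DataFrames from a SQLContext rather than the spark session used here):

import spark.implicits._

// Hypothetical data: yesterday's rows plus one new row today
val yesterdayDF = Seq((1, "a"), (2, "b")).toDF("id", "value")
val todayDF = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "value")

// Rows present in todayDF but not in yesterdayDF
val onlyNewData = todayDF.except(yesterdayDF)
onlyNewData.show()
// +---+-----+
// | id|value|
// +---+-----+
// |  3|    c|
// +---+-----+

Note that except compares entire rows and, like SQL EXCEPT, returns distinct rows only.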

Solution 2 - Apache Spark

In PySpark it would be subtract:

df1.subtract(df2)

or exceptAll if duplicates need to be preserved:

df1.exceptAll(df2)

Solution 3 - Apache Spark

I tried subtract, but the result was not consistent: if I run df1.subtract(df2), not all rows of df1 show up in the result DataFrame, probably because of the distinct mentioned in the docs.

exceptAll solved my problem: df1.exceptAll(df2)
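
As a rough sketch of the difference the two answers above describe (shown in Scala for consistency with the original question; exceptAll requires Spark 2.4+, and the data is made up):

import spark.implicits._

val df1 = Seq(1, 1, 2, 3).toDF("id")
val df2 = Seq(1).toDF("id")

// except is a set difference: duplicates in df1 are collapsed and matching rows removed
df1.except(df2).show()    // 2, 3 (order may vary)

// exceptAll is a multiset difference: each match in df2 cancels only one occurrence in df1
df1.exceptAll(df2).show() // 1, 2, 3

PySpark's subtract is documented as equivalent to SQL's EXCEPT DISTINCT, which is why duplicates disappear; exceptAll is the duplicate-preserving variant.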

Solution 4 - Apache Spark

In more recent Spark versions (2.0 and later), you can also use a join with the 'left_anti' option:

df1.join(df2, on='key_column', how='left_anti')

These are PySpark APIs, but the same join type is available in Scala as well; see the sketch below.
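
A minimal Scala sketch with made-up data (key_column is the hypothetical column name from the answer above), using the same join type string with Dataset.join:

import spark.implicits._

val df1 = Seq((1, "a"), (2, "b"), (3, "c")).toDF("key_column", "value")
val df2 = Seq((1, "x")).toDF("key_column", "other")

// Keep only the rows of df1 whose key_column has no match in df2
df1.join(df2, Seq("key_column"), "left_anti").show()
// rows with key_column 2 and 3 remain

Unlike except/exceptAll, an anti join compares only the join key(s), not whole rows, so a df1 row is dropped whenever its key appears in df2 even if its other columns differ.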

Solution 5 - Apache Spark

For me, df1.subtract(df2) was inconsistent: it worked correctly on one DataFrame but not on another, and the reason was duplicates. df1.exceptAll(df2) returns a new DataFrame with the records from df1 that do not exist in df2, including any duplicates.

Solution 6 - Apache Spark

From Spark 2.4.0 you can use exceptAll:

data_cl = reg_data.exceptAll(data_fr)

Attributions

All content on this page is sourced from the original question and answers on Stack Overflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Interfector | View Question on Stackoverflow
Solution 1 - Apache Spark | Eric Eijkelenboom | View Answer on Stackoverflow
Solution 2 - Apache Spark | Tej | View Answer on Stackoverflow
Solution 3 - Apache Spark | Arthur Julião | View Answer on Stackoverflow
Solution 4 - Apache Spark | Ric S | View Answer on Stackoverflow
Solution 5 - Apache Spark | Vidya | View Answer on Stackoverflow
Solution 6 - Apache Spark | der Fotik | View Answer on Stackoverflow