Spark: subtract two DataFrames
Apache Spark Problem Overview
In Spark version 1.2.0 one could use subtract with two SchemaRDDs to end up with only the rows of the first one that are missing from the second:
val onlyNewData = todaySchemaRDD.subtract(yesterdaySchemaRDD)
onlyNewData contains the rows in todaySchemaRDD that do not exist in yesterdaySchemaRDD.
How can this be achieved with DataFrames in Spark version 1.3.0?
Apache Spark Solutions
Solution 1 - Apache Spark
According to the Scala API docs, doing:
dataFrame1.except(dataFrame2)
will return a new DataFrame containing the rows in dataFrame1 but not in dataFrame2.
Solution 2 - Apache Spark
In PySpark it would be subtract:
df1.subtract(df2)
or exceptAll, if duplicates need to be preserved:
df1.exceptAll(df2)
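The difference between the two calls is easiest to see on a small input with duplicates. The sketch below is a plain-Python model of the semantics only (EXCEPT DISTINCT for subtract, EXCEPT ALL for exceptAll); it does not call Spark. The helper names subtract and except_all and the sample rows are made up for illustration. Spark itself does not guarantee row order, whereas this model keeps the order of the left input.

```python
from collections import Counter


def subtract(left, right):
    """Model of EXCEPT DISTINCT: distinct rows of `left` that are absent from `right`."""
    right_set = set(right)
    seen = set()
    result = []
    for row in left:
        # Drop rows present in `right`, and drop repeats of rows already emitted.
        if row not in right_set and row not in seen:
            seen.add(row)
            result.append(row)
    return result


def except_all(left, right):
    """Model of EXCEPT ALL: each row in `right` cancels one matching row in `left`."""
    remaining = Counter(right)
    result = []
    for row in left:
        if remaining[row] > 0:
            remaining[row] -= 1  # this occurrence is cancelled by a row in `right`
        else:
            result.append(row)
    return result


# Hypothetical sample data: rows are tuples, df1 contains duplicates.
df1_rows = [("a", 1), ("a", 1), ("b", 2), ("b", 2), ("c", 3)]
df2_rows = [("b", 2)]

distinct_diff = subtract(df1_rows, df2_rows)
multiset_diff = except_all(df1_rows, df2_rows)

print(distinct_diff)  # [('a', 1), ('c', 3)]
print(multiset_diff)  # [('a', 1), ('a', 1), ('b', 2), ('c', 3)]
```

Note that the distinct variant collapses the duplicated ("a", 1) row even though it never appears in df2, and it removes both copies of ("b", 2), while the multiset variant removes only one.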
Solution 3 - Apache Spark
I tried subtract, but the result was not consistent.
If I run df1.subtract(df2), not all of the expected rows of df1 appear in the resulting DataFrame, probably because of the distinct behavior mentioned in the docs.
exceptAll solved my problem:
df1.exceptAll(df2)
Solution 4 - Apache Spark
From Spark 1.3.0, you can use join with the 'left_anti' option:
df1.join(df2, on='key_column', how='left_anti')
These are PySpark APIs, but I guess there is a corresponding function in Scala too.
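A left anti join behaves differently from subtract and exceptAll: it compares only the join key, not the whole row, and it keeps duplicates from the left side. The following is a minimal plain-Python model of that behavior, not Spark code. The helper name left_anti_join and the sample rows are hypothetical, and the null handling assumes the usual equi-join rule that a null key never matches.

```python
def left_anti_join(left, right, key):
    """Model of a left anti join: rows of `left` whose `key` has no match in `right`."""
    # Null (None) keys never match in an equi-join, so they are left out of the lookup set.
    right_keys = {row[key] for row in right if row[key] is not None}
    return [row for row in left if row[key] not in right_keys]


# Hypothetical sample data: rows are dicts with a 'key_column' and a payload.
df1_rows = [
    {"key_column": 1, "value": "x"},
    {"key_column": 1, "value": "x"},
    {"key_column": 2, "value": "y"},
    {"key_column": 3, "value": "z"},
]
df2_rows = [{"key_column": 2, "value": "changed"}]

anti = left_anti_join(df1_rows, df2_rows, "key_column")
for row in anti:
    print(row)
```

Here the row with key 2 is dropped even though its value differs between the two inputs, and both copies of the key 1 row survive. If the goal is to compare whole rows, the key would have to cover every column.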
Solution 5 - Apache Spark
For me, df1.subtract(df2) was inconsistent: it worked correctly on one DataFrame, but not on the other. That was because of duplicates. df1.exceptAll(df2) returns a new DataFrame with the records from df1 that do not exist in df2, including any duplicates.
Solution 6 - Apache Spark
From Spark 2.4.0, use exceptAll:
data_cl = reg_data.exceptAll(data_fr)