What's the difference between join and cogroup in Apache Spark

Scala, Apache Spark

Scala Problem Overview


What's the difference between join and cogroup in Apache Spark? What's the use case for each method?

Scala Solutions


Solution 1 - Scala

Let me clarify the two; both are commonly used and important.

def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]

This is the signature of join; look at it carefully. For example:

val rdd1 = sc.makeRDD(Array(("A","1"),("B","2"),("C","3")),2)
val rdd2 = sc.makeRDD(Array(("A","a"),("C","c"),("D","d")),2)
 
scala> rdd1.join(rdd2).collect
res0: Array[(String, (String, String))] = Array((A,(1,a)), (C,(3,c)))

Every key that appears in the final result is common to rdd1 and rdd2. This is similar to the relational database operation INNER JOIN.
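One detail the example above does not show is what happens when a key repeats. As a minimal sketch, assuming a spark-shell session where `sc` is already defined (the names `left` and `right` are just illustrative), join emits one row per matching pair of values, while cogroup emits one row per key:

```scala
// Assumes spark-shell, where `sc` (the SparkContext) already exists.
val left  = sc.makeRDD(Seq(("A", "1"), ("A", "2"), ("B", "3")))
val right = sc.makeRDD(Seq(("A", "a"), ("A", "b"), ("C", "c")))

// join: key A has 2 values on each side, so it yields 2 x 2 = 4 rows.
// Keys B and C exist on only one side and are dropped.
// Sorted after collect because partition order is not guaranteed.
val joined = left.join(right).collect().sorted
// (A,(1,a)), (A,(1,b)), (A,(2,a)), (A,(2,b))

// cogroup: exactly one row per key, with all values of each side grouped.
// The Iterables are converted to sorted Lists only to make the result stable.
val grouped = left.cogroup(right)
  .mapValues { case (vs, ws) => (vs.toList.sorted, ws.toList.sorted) }
  .collect()
  .sortBy(_._1)
// (A,(List(1, 2),List(a, b))), (B,(List(3),List())), (C,(List(),List(c)))
```

So if you need all values for a key in one place, cogroup avoids the row multiplication that join produces.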

But cogroup is different:

def cogroup[W](other: RDD[(K, W)]): RDD[(K, (Iterable[V], Iterable[W]))]

If a key appears in at least one of the two RDDs, it will appear in the final result. Let me illustrate:

val rdd1 = sc.makeRDD(Array(("A","1"),("B","2"),("C","3")),2)
val rdd2 = sc.makeRDD(Array(("A","a"),("C","c"),("D","d")),2)

scala> rdd1.cogroup(rdd2).collect
res0: Array[(String, (Iterable[String], Iterable[String]))] = Array(
(B,(CompactBuffer(2),CompactBuffer())), 
(D,(CompactBuffer(),CompactBuffer(d))), 
(A,(CompactBuffer(1),CompactBuffer(a))), 
(C,(CompactBuffer(3),CompactBuffer(c)))
)

This is very similar to the relational database operation FULL OUTER JOIN, but instead of flattening the result into one row per record, it gives you an Iterable of the values from each side, so the follow-up processing is up to you.
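To make the relationship concrete, here is a rough sketch, again assuming a spark-shell session with `sc` defined. It rebuilds the inner join from the cogroup result and compares cogroup with `fullOuterJoin`, which is the flattened variant in PairRDDFunctions. The name `innerFromCogroup` is only illustrative:

```scala
// Assumes spark-shell, where `sc` (the SparkContext) already exists.
val rdd1 = sc.makeRDD(Array(("A","1"),("B","2"),("C","3")), 2)
val rdd2 = sc.makeRDD(Array(("A","a"),("C","c"),("D","d")), 2)

val grouped = rdd1.cogroup(rdd2)

// Inner join rebuilt from cogroup: pair every left value with every
// right value of the same key. If either side is empty, the
// for-comprehension yields nothing, so the key disappears.
val innerFromCogroup = grouped.flatMapValues { case (vs, ws) =>
  for (v <- vs; w <- ws) yield (v, w)
}
innerFromCogroup.collect().sorted
// (A,(1,a)), (C,(3,c))

// fullOuterJoin keeps every key too, but flattens to one row per pair
// and marks the missing side with None instead of an empty Iterable.
val outer = rdd1.fullOuterJoin(rdd2)
outer.collect().sortBy(_._1)
// (A,(Some(1),Some(a))), (B,(Some(2),None)), (C,(Some(3),Some(c))), (D,(None,Some(d)))
```

In other words, cogroup is the more general building block: once you have the grouped values, you decide how to combine them.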

Good Luck!

The Spark docs are here: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | miaoiao | View Question on Stackoverflow
Solution 1 - Scala | ashburshui | View Answer on Stackoverflow