Which operations preserve RDD order?

Tags: apache-spark, rdd

Apache Spark Problem Overview


An RDD has a meaningful order (as opposed to some random order imposed by the storage model) if it was processed by sortBy(), as explained in this reply.

Now, which operations preserve that order?

E.g., is it guaranteed that (after a.sortBy())

a.map(f).zip(a) === 
a.map(x => (f(x),x))

How about

a.filter(f).map(g) ===
a.map(x => (x, g(x))).filter(t => f(t._1)).map(_._2)

what about

a.filter(f).flatMap(g) ===
a.flatMap(x => g(x).map((x, _))).filter(t => f(t._1)).map(_._2)

Here "equality" === is understood as "functional equivalence", i.e., there is no way to distinguish the outcome using user-level operations (i.e., without reading logs &c).

Apache Spark Solutions


Solution 1 - Apache Spark

All operations preserve the order, except those that explicitly do not. Ordering is always "meaningful", not just after a sortBy. For example, if you read a file (sc.textFile) the lines of the RDD will be in the order that they were in the file.

Without trying to give a complete list, map, filter and flatMap do preserve the order. sortBy, partitionBy, join do not preserve the order.

The reason is that most RDD operations work on Iterators inside the partitions. So map or filter just has no way to mess up the order. You can take a look at the code to see for yourself.
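To make that concrete, the same pipeline can be spelled out with mapPartitions, which exposes the per-partition Iterator directly (a sketch for spark-shell, where sc is predefined). Iterator.filter and Iterator.map are lazy, front-to-back transformations, so there is nothing that could reorder elements:

val nums = sc.parallelize(1 to 20, 4)

// This is what filter/map do under the hood: transform each partition's
// Iterator in place. An Iterator is consumed front to back, so the order
// within each partition cannot change.
val viaIterators = nums.mapPartitions(iter => iter.filter(_ % 2 == 0).map(_ * 10))
val viaOps       = nums.filter(_ % 2 == 0).map(_ * 10)
assert(viaIterators.collect().sameElements(viaOps.collect()))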

You may now ask: what if I have an RDD with a HashPartitioner and I use map to change the keys? Well, the elements will stay in place, but the RDD is no longer partitioned by key. You can use partitionBy to restore the partitioning with a shuffle.
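A short sketch of that behavior (again for spark-shell):

import org.apache.spark.HashPartitioner

val byKey = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c"))).partitionBy(new HashPartitioner(4))
println(byKey.partitioner)    // Some(org.apache.spark.HashPartitioner@...)

// map may change the keys, so Spark drops the partitioner:
val remapped = byKey.map { case (k, v) => (k + 1, v) }
println(remapped.partitioner) // None

// partitionBy restores key-based partitioning, at the cost of a shuffle:
val restored = remapped.partitionBy(new HashPartitioner(4))
println(restored.partitioner) // Some(org.apache.spark.HashPartitioner@...)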

Solution 2 - Apache Spark

In Spark 2.0.0+, coalesce doesn't guarantee partition order during a merge. DefaultPartitionCoalescer has an optimization algorithm based on partition locality. When a partition carries information about its locality, DefaultPartitionCoalescer tries to merge partitions on the same host. Only when there is no locality information does it fall back to grouping partitions by their index, which preserves partition order.
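You can inspect how coalesce grouped the input partitions with mapPartitionsWithIndex (a spark-shell sketch; note that on a single local host there is usually no locality information, so you will see the order-preserving index-based path):

val sorted = sc.parallelize(1 to 100, 8).sortBy(identity)
val merged = sorted.coalesce(2)

// Print the first few elements of each merged partition. On a cluster
// where partitions carry locality information, the grouping may not
// follow partition indices and the overall order can be broken.
merged.mapPartitionsWithIndex { (i, iter) =>
  Iterator((i, iter.take(5).toList))
}.collect().foreach(println)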

UPDATE:

If you load a DataFrame from files, such as Parquet, Spark breaks the order when it plans file splits. You can see this in DataSourceScanExec.scala#L629, or in FileScan#L152 for Spark 3.x. It sorts the splits by size, and splits smaller than spark.sql.files.maxPartitionBytes end up in the last partitions.

So, if you need to load a sorted dataset from files, you need to implement your own reader.
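Before writing a custom reader, you can at least observe how Spark arranged the splits by tagging each row with input_file_name() and printing the first source file of every planned partition (a sketch: the path is hypothetical, and spark is the SparkSession that spark-shell predefines):

import org.apache.spark.sql.functions.input_file_name

// Hypothetical path to a dataset written as several sorted files.
val df = spark.read.parquet("/data/sorted")

// One (partitionIndex, firstSourceFile) pair per planned partition. If
// splits were sorted by size rather than by file order, the file names
// will not follow the partition indices.
df.select(input_file_name().as("src"))
  .rdd
  .mapPartitionsWithIndex { (i, iter) => iter.take(1).map(row => (i, row.getString(0))) }
  .collect()
  .foreach(println)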

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type              | Original Author    | Original Content on Stackoverflow
Question                  | sds                | View Question on Stackoverflow
Solution 1 - Apache Spark | Daniel Darabos     | View Answer on Stackoverflow
Solution 2 - Apache Spark | Avseiytsev Dmitriy | View Answer on Stackoverflow