How to print the contents of an RDD?

Scala, Apache Spark

Scala Problem Overview


I'm attempting to print the contents of a collection to the Spark console.

I have a type:

linesWithSessionId: org.apache.spark.rdd.RDD[String] = FilteredRDD[3]

And I use the command:

scala> linesWithSessionId.map(line => println(line))

But this is printed:

> res1: org.apache.spark.rdd.RDD[Unit] = MappedRDD[4] at map at :19

How can I write the RDD to console or save it to disk so I can view its contents?

Scala Solutions


Solution 1 - Scala

If you want to view the contents of an RDD, one way is to use collect():

myRDD.collect().foreach(println)

That's not a good idea, though, when the RDD has billions of lines, because collect() brings every element back to the driver. Use take() to print just a few:

myRDD.take(n).foreach(println)
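
As a rough illustration of the take() approach, here is a minimal sketch that runs in local mode. The object name, the local[2] master and the generated sample data are assumptions made for the example; in spark-shell, sc already exists and only the two lines using myRDD are needed.

```scala
import org.apache.spark.sql.SparkSession

object TakeInsteadOfCollect {
  def main(args: Array[String]): Unit = {
    // Local-mode session for illustration only (assumption).
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("take-instead-of-collect")
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical stand-in for myRDD.
    val myRDD = sc.parallelize(1 to 1000).map(i => s"line $i")

    // take(5) returns an Array holding the first five elements on the driver.
    myRDD.take(5).foreach(println)

    // count() is an action that returns only a number, not the data itself.
    println(s"total lines: ${myRDD.count()}")

    spark.stop()
  }
}
```

Printing the count next to a small sample gives a sense of the RDD's size without pulling all of it into the driver.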

Solution 2 - Scala

The map function is a transformation, which means that Spark will not actually evaluate your RDD until you run an action on it.

To print it, you can use foreach (which is an action):

linesWithSessionId.foreach(println)

To write it to disk, you can use one of the saveAs... functions (such as saveAsTextFile, which are also actions) from the RDD API.
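
The difference between a transformation and an action can be made visible with a counter. The sketch below assumes a local-mode session and made-up sample lines; the accumulator named "touched" is introduced purely for illustration and is not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession

object LazyVsAction {
  def main(args: Array[String]): Unit = {
    // Local-mode session for illustration only (assumption).
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("lazy-vs-action")
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical stand-in for linesWithSessionId.
    val linesWithSessionId = sc.parallelize(Seq("session=1 a", "session=2 b"))

    // Illustrative counter that records how many elements were processed.
    val touched = sc.longAccumulator("touched")

    // Transformation: only describes the work, nothing is evaluated yet.
    val mapped = linesWithSessionId.map { line => touched.add(1); line }
    println(s"elements processed after map: ${touched.value}") // still 0

    // Action: triggers evaluation of the map above and prints each line.
    mapped.foreach(println)
    println(s"elements processed after foreach: ${touched.value}")

    spark.stop()
  }
}
```

In local mode the lines appear in the same console; on a cluster, the println calls inside foreach run on the executors.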

Solution 3 - Scala

You can convert your RDD to a DataFrame then show() it.

// For implicit conversion from RDD to DataFrame
import spark.implicits._

val fruits = sc.parallelize(Seq(("apple", 1), ("banana", 2), ("orange", 17)))

// convert to DF then show it
fruits.toDF().show()

By default, show() prints only the first 20 rows of your data, so the size of your data should not be an issue.

+------+---+
|    _1| _2|
+------+---+
| apple|  1|
|banana|  2|
|orange| 17|
+------+---+
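
If the default column names and row limit are not what you want, toDF and show both accept arguments. The following is a sketch under assumptions: the local-mode session, the object name and the column names "name" and "count" are chosen only for this example.

```scala
import org.apache.spark.sql.SparkSession

object ShowFruits {
  def main(args: Array[String]): Unit = {
    // Local-mode session for illustration only (assumption).
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("show-fruits")
      .getOrCreate()
    import spark.implicits._ // enables rdd.toDF(...)

    val fruits = spark.sparkContext
      .parallelize(Seq(("apple", 1), ("banana", 2), ("orange", 17)))

    // Illustrative column names replacing the default _1 and _2.
    val df = fruits.toDF("name", "count")

    // Print at most 2 rows and do not truncate long cell values.
    df.show(2, false)

    spark.stop()
  }
}
```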

Solution 4 - Scala

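If you're running this on a cluster, println runs on the executors, so its output will not be printed back to your session. You need to bring the RDD data to your session first. To do this, you can collect it into a local array and then print it out (early Spark releases also offered toArray() for this, but it was deprecated in favor of collect() and later removed):

linesWithSessionId.collect().foreach(line => println(line))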

Solution 5 - Scala

There are probably many architectural differences between myRDD.foreach(println) and myRDD.collect().foreach(println) (and between foreach and other actions, not only collect). One of the differences I saw is that with myRDD.foreach(println) the output comes out in an arbitrary order. For example, if my RDD comes from a text file where each line has a number, the printed order differs from the file. With myRDD.collect().foreach(println), the order stays the same as in the text file.
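
The ordering difference described above can be reproduced with a small RDD split over several partitions. This is a sketch assuming a local-mode session; the object name and the choice of four partitions are illustrative. It also shows toLocalIterator, which brings back one partition at a time and, like collect(), walks the partitions in order.

```scala
import org.apache.spark.sql.SparkSession

object PrintOrder {
  def main(args: Array[String]): Unit = {
    // Local-mode session for illustration only (assumption).
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("print-order")
      .getOrCreate()
    val sc = spark.sparkContext

    // Ten numbers spread over four partitions.
    val numbers = sc.parallelize(1 to 10, numSlices = 4)

    // Partitions are processed in parallel, so the order is not guaranteed.
    println("foreach:")
    numbers.foreach(println)

    // collect() returns the elements in partition order.
    println("collect:")
    numbers.collect().foreach(println)

    // toLocalIterator keeps the order but fetches one partition at a time.
    println("toLocalIterator:")
    numbers.toLocalIterator.foreach(println)

    spark.stop()
  }
}
```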

Solution 6 - Scala

In Python (PySpark shell):

linesWithSessionIdCollect = linesWithSessionId.collect()
linesWithSessionIdCollect

This will print out all the contents of the RDD.

Solution 7 - Scala

Instead of typing this each time, you can:

[1] Create a generic print method inside Spark Shell.

def p(rdd: org.apache.spark.rdd.RDD[_]) = rdd.foreach(println)

[2] Or even better, using implicits, you can add a method to the RDD class that prints its contents.

implicit class Printer(rdd: org.apache.spark.rdd.RDD[_]) {
    def print = rdd.foreach(println)
}

Example usage:

val rdd = sc.parallelize(List(1,2,3,4)).map(_*2)

p(rdd) // 1
rdd.print // 2

Output (the order can vary between runs):

2
6
4
8

Important

This only makes sense if you are working in local mode and with a small data set. Otherwise, you will either not be able to see the results on the client or run out of memory because of the size of the result.
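
One way to soften that limitation is a variant of the helper that fetches a bounded number of elements to the driver before printing. The sketch below is not part of the original answer: the names RddPreview and preview and the default of 20 elements are hypothetical choices, and the local-mode session is only there to make the example runnable.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object RddPreview {
  // Hypothetical helper: prints at most n elements, fetched with take(n),
  // so the printing happens on the driver and the result size is bounded.
  implicit class Preview[T](rdd: RDD[T]) {
    def preview(n: Int = 20): Unit = rdd.take(n).foreach(println)
  }

  def main(args: Array[String]): Unit = {
    // Local-mode session for illustration only (assumption).
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("rdd-preview")
      .getOrCreate()

    val rdd = spark.sparkContext.parallelize(List(1, 2, 3, 4)).map(_ * 2)

    // Prints the first three elements of the RDD.
    rdd.preview(3)

    spark.stop()
  }
}
```

In spark-shell, pasting only the implicit class is enough to make preview available on any RDD.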

Solution 8 - Scala

c.take(10)

Newer versions of Spark will display the result nicely as a table.

Solution 9 - Scala

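You can also save the RDD as a text file. Note that Spark creates a directory with this name and writes one part file per partition into it:

rdd.saveAsTextFile("alicia.txt")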
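
To view what was written, the saved output can be read back with textFile. This is a minimal sketch assuming a local-mode session; the temporary directory is a hypothetical location picked so that the target path does not already exist, since saveAsTextFile fails when it does.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object SaveAndReadBack {
  def main(args: Array[String]): Unit = {
    // Local-mode session for illustration only (assumption).
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("save-and-read-back")
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical sample data in two partitions.
    val rdd = sc.parallelize(Seq("a", "b", "c"), 2)

    // Fresh path under a temp directory; the target itself must not exist yet.
    val out = Files.createTempDirectory("rdd-demo").resolve("alicia.txt").toString

    // Writes a directory containing one part file per partition.
    rdd.saveAsTextFile(out)

    // Read the directory back as an RDD[String] and print it on the driver.
    sc.textFile(out).collect().foreach(println)

    spark.stop()
  }
}
```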

Solution 10 - Scala

In Java syntax:

rdd.collect().forEach(line -> System.out.println(line));

Attributions

All content for this solution is sourced from the original question on Stack Overflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stack Overflow
Question | blue-sky | View Question on Stack Overflow
Solution 1 - Scala | Oussama | View Answer on Stack Overflow
Solution 2 - Scala | fedragon | View Answer on Stack Overflow
Solution 3 - Scala | Sam | View Answer on Stack Overflow
Solution 4 - Scala | Noah | View Answer on Stack Overflow
Solution 5 - Scala | Karan Gupta | View Answer on Stack Overflow
Solution 6 - Scala | Niranjan Molkeri | View Answer on Stack Overflow
Solution 7 - Scala | eaorak | View Answer on Stack Overflow
Solution 8 - Scala | Hrvoje | View Answer on Stack Overflow
Solution 9 - Scala | Thomas Decaux | View Answer on Stack Overflow
Solution 10 - Scala | ForeverLearner | View Answer on Stack Overflow