How to convert Row of a Scala DataFrame into case class most efficiently?

Scala, Apache Spark, Apache Spark-Sql

Scala Problem Overview


Once I have obtained some Row in Spark, either from a DataFrame or from Catalyst, I want to convert it to a case class in my code. This can be done by matching:

someRow match {case Row(a:Long,b:String,c:Double) => myCaseClass(a,b,c)}

But it becomes ugly when the row has a huge number of columns, say a dozen Doubles, some Booleans and even the occasional null.

I would like just to be able to -sorry- cast Row to myCaseClass. Is it possible, or have I already got the most economical syntax?

Scala Solutions


Solution 1 - Scala

A DataFrame is simply a type alias of Dataset[Row]. Operations on it are also referred to as "untyped transformations", in contrast to the "typed transformations" that come with strongly typed Scala/Java Datasets.

The conversion from Dataset[Row] to Dataset[Person] is very simple in Spark:

val DFtoProcess = spark.sql("SELECT * FROM peoples WHERE name='test'")

At this point, Spark gives you your data as a DataFrame = Dataset[Row], a collection of generic Row objects, since it does not know the exact type.

import org.apache.spark.sql.Encoders

// Create an Encoder for the Person class (in this example Person is a Java bean class).
// For a Scala case class you can instead rely on the implicit Encoders from spark.implicits._
// (or use Encoders.product[Person]) and skip the explicit encoder argument.
val personEncoder = Encoders.bean(classOf[Person])

val DStoProcess = DFtoProcess.as[Person](personEncoder)

Now, Spark converts the Dataset[Row] into a Dataset[Person] of type-specific Scala/Java JVM objects, as dictated by the class Person.
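For a Scala case class you don't need the explicit bean Encoder at all; importing spark.implicits._ brings the needed Encoder into scope. A minimal sketch, assuming a peoples table with name and age columns (the column names and types here are assumptions for illustration):

import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Long)

val spark = SparkSession.builder().appName("row-to-case-class").getOrCreate()
import spark.implicits._   // implicit Encoders for case classes

// Dataset[Row] (i.e. DataFrame) -> Dataset[Person]
val dfToProcess = spark.sql("SELECT name, age FROM peoples WHERE name = 'test'")
val dsToProcess = dfToProcess.as[Person]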

Please refer to the link below, provided by Databricks, for further details:

https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html

Solution 2 - Scala

As far as I know you cannot cast a Row to a case class, but I sometimes choose to access the row fields directly, like:

map(row => myCaseClass(row.getLong(0), row.getString(1), row.getDouble(2)))

I find this to be easier, especially if the case class constructor only needs some of the fields from the row.
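For example, a minimal sketch of this approach (the column order, names, and types here are assumptions for illustration, and spark.implicits._ must be in scope so that map can encode the case class):

case class MyCaseClass(id: Long, name: String, score: Double)

// access fields by position...
val ds = df.map(row => MyCaseClass(row.getLong(0), row.getString(1), row.getDouble(2)))

// ...or, a bit more robustly, by column name
val ds2 = df.map(row => MyCaseClass(
  row.getAs[Long]("id"),
  row.getAs[String]("name"),
  row.getAs[Double]("score")))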

Solution 3 - Scala

scala> import spark.implicits._    
scala> val df = Seq((1, "james"), (2, "tony")).toDF("id", "name")
df: org.apache.spark.sql.DataFrame = [id: int, name: string]

scala> case class Student(id: Int, name: String)
defined class Student

scala> df.as[Student].collectAsList
res6: java.util.List[Student] = [Student(1,james), Student(2,tony)]

Here the spark in spark.implicits._ is your SparkSession. If you are inside the REPL the session is already defined as spark; otherwise you need to adjust the name to match your SparkSession.
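Outside the REPL the same pattern looks roughly like this; the app name and master URL are placeholders, and the case class must be defined at the top level so its Encoder can be derived:

import org.apache.spark.sql.SparkSession

case class Student(id: Int, name: String)

object RowToCaseClass {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("row-to-case-class")
      .master("local[*]")            // placeholder; point this at your cluster
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, "james"), (2, "tony")).toDF("id", "name")
    val students = df.as[Student].collect()   // Array(Student(1,james), Student(2,tony))
    students.foreach(println)

    spark.stop()
  }
}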

Solution 4 - Scala

Of course you can match a Row object against a case class. Let's suppose your schema has many fields and you want to match only a few of them into your case class. If you don't have null fields you can simply do:

case class MyClass(a: Long, b: String, c: Int, d: String, e: String)

dataframe.map {
  case Row(a: java.math.BigDecimal,
    b: String,
    c: Int,
    d: String,                 // bind d here so it can be used in the result
    _: java.sql.Date,
    e: java.sql.Date,
    _: java.sql.Timestamp,
    _: java.sql.Timestamp,
    _: java.math.BigDecimal,
    _: String) => MyClass(a = a.longValue(), b = b, c = c, d = d, e = e.toString)
}

This approach will fail in case of null values and also requires you to explicitly define the type of every single field. If you have to handle null values, you can either discard all the rows containing null values by doing

dataframe.na.drop()

That will drop records even if the null fields are not the ones used in your pattern matching for your case class. Or, if you want to handle nulls yourself, you could turn the Row object into a List and then use the Option pattern:

case class MyClass(a: Long, b: String, c: Option[Int], d: String, e: String)

dataframe.map(_.toSeq.toList match {
  case List(a: java.math.BigDecimal,
    b: String,
    c,                          // no type annotation so a null value here still matches
    d: String,
    _: java.sql.Date,
    e: java.sql.Date,
    _: java.sql.Timestamp,
    _: java.sql.Timestamp,
    _: java.math.BigDecimal,
    _: String) => MyClass(
      a = a.longValue(), b = b,
      c = Option(c).map(_.asInstanceOf[Int]),   // Option(null) becomes None
      d = d, e = e.toString)
})
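As an alternative to positional pattern matching, here is a minimal null-safe sketch using getAs and Option; the column names ("a", "b", "c", "d", "e") and their types are assumptions chosen to mirror the example above, and it requires spark.implicits._ in scope with MyClass defined at the top level:

case class MyClass(a: Long, b: String, c: Option[Int], d: String, e: String)

val ds = dataframe.map { row =>
  val cIdx = row.fieldIndex("c")                 // look up the column position once
  MyClass(
    a = row.getAs[java.math.BigDecimal]("a").longValue(),
    b = row.getAs[String]("b"),
    c = if (row.isNullAt(cIdx)) None else Some(row.getInt(cIdx)),   // null-safe read
    d = row.getAs[String]("d"),
    e = row.getAs[java.sql.Date]("e").toString)
}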

Check out the GitHub project Sparkz (), which will soon introduce a lot of libraries for simplifying the Spark and DataFrame APIs and making them more functional-programming oriented.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type       | Original Author          | Original Content on Stackoverflow
Question           | arivero                  | View Question on Stackoverflow
Solution 1 - Scala | Rahul                    | View Answer on Stackoverflow
Solution 2 - Scala | Glennie Helles Sindholt  | View Answer on Stackoverflow
Solution 3 - Scala | secfree                  | View Answer on Stackoverflow
Solution 4 - Scala | Gianmario Spacagna       | View Answer on Stackoverflow