Unpacking a list to select multiple columns from a Spark data frame

Tags: Apache Spark, Apache Spark-Sql, Spark Dataframe

Apache Spark Problem Overview


I have a Spark data frame df. Is there a way of sub-selecting a few columns using a list of these columns?

scala> df.columns
res0: Array[String] = Array("a", "b", "c", "d")

I know I can do something like df.select("b", "c"). But suppose I have a list containing a few column names, val cols = List("b", "c"): is there a way to pass this to df.select? df.select(cols) throws an error. Is there something like df.select(*cols), as in Python?

Apache Spark Solutions


Solution 1 - Apache Spark

Use df.select(cols.head, cols.tail: _*)

Let me know if it works :)

Explanation from @Ben:

The key is the method signature of select:

select(col: String, cols: String*)

The cols: String* entry takes a variable number of arguments, and : _* unpacks the list so its elements can be passed to that varargs parameter. This is very similar to unpacking in Python with *args. See here and here for other examples.
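As a minimal sketch of the pattern, assuming a DataFrame df with the columns from the question:

val cols = List("b", "c")
// cols.head ("b") fills the required first argument;
// cols.tail (List("c")) is expanded by : _* into the String* varargs
val selected = df.select(cols.head, cols.tail: _*)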

Solution 2 - Apache Spark

You can map each String to a Spark Column like this:

import org.apache.spark.sql.functions._
// map each name to a Column, then unpack the List[Column] into select's varargs
df.select(cols.map(col): _*)
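Note that this uses the select(cols: Column*) overload rather than select(col: String, cols: String*), so there is no need to split the list into head and tail as in Solution 1.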

Solution 3 - Apache Spark

Another option that I've just learnt:

import org.apache.spark.sql.functions.col
val columns = Seq[String]("col1", "col2", "col3")
val colNames = columns.map(name => col(name))   // Seq[Column]
val selectedDf = df.select(colNames: _*)

Solution 4 - Apache Spark

First convert the String array to a List of Spark Column objects, as below:

import org.apache.spark.sql.Column;
import java.util.ArrayList;
import java.util.List;

String[] strColNameArray = new String[]{"a", "b", "c", "d"};

List<Column> colNames = new ArrayList<>();
for (String strColName : strColNameArray) {
    colNames.add(new Column(strColName));
}

Then convert the List to a Scala Seq within the select statement, as below. You need the following import. (Note that scala.collection.JavaConversions is deprecated in recent Scala versions; scala.collection.JavaConverters is the usual replacement.)

import scala.collection.JavaConversions;

Dataset<Row> selectedDF = df.select(JavaConversions.asScalaBuffer(colNames));

Solution 5 - Apache Spark

You can pass arguments of type Column* to select:

import org.apache.spark.sql.Column

val df = spark.read.json("example.json")
val cols: List[String] = List("a", "b")
// convert each String name to a Column via the DataFrame's apply method
val colList: List[Column] = cols.map(df(_))
df.select(colList: _*)
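As a design note, df(_) resolves each name against that specific DataFrame right away, while functions.col("name") stays unresolved until the query is analysed; for a simple select on the same DataFrame, either works.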

Solution 6 - Apache Spark

You can do it like this:

String[] originCols = ds.columns();
ds.selectExpr(originCols);

> Spark selectExpr source code:

  /**
   * Selects a set of SQL expressions. This is a variant of `select` that accepts
   * SQL expressions.
   *
   * {{{
   *   // The following are equivalent:
   *   ds.selectExpr("colA", "colB as newName", "abs(colC)")
   *   ds.select(expr("colA"), expr("colB as newName"), expr("abs(colC)"))
   * }}}
   *
   * @group untypedrel
   * @since 2.0.0
   */
  @scala.annotation.varargs
  def selectExpr(exprs: String*): DataFrame = {
    select(exprs.map { expr =>
      Column(sparkSession.sessionState.sqlParser.parseExpression(expr))
    }: _*)
  }
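Since selectExpr is annotated with @scala.annotation.varargs, the same list-unpacking trick from the other solutions works here too. A minimal sketch in Scala, assuming a DataFrame df with the columns from the question:

val cols = List("b", "c")
df.selectExpr(cols: _*)   // each name is parsed as a SQL expression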

Solution 7 - Apache Spark

Yes, you can make use of .select in Scala.

Use .head and .tail to pass all of the values in the List:

Example

val cols = List("b", "c")
df.select(cols.head, cols.tail: _*)

Explanation

cols.head supplies the required first String argument, and cols.tail: _* expands the rest of the list into the String* varargs parameter of select(col: String, cols: String*).

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Ben | View Question on Stackoverflow
Solution 1 - Apache Spark | Shagun Sodhani | View Answer on Stackoverflow
Solution 2 - Apache Spark | Kshitij Kulshrestha | View Answer on Stackoverflow
Solution 3 - Apache Spark | vEdwardpc | View Answer on Stackoverflow
Solution 4 - Apache Spark | Eranga Atugoda | View Answer on Stackoverflow
Solution 5 - Apache Spark | raam86 | View Answer on Stackoverflow
Solution 6 - Apache Spark | geosmart | View Answer on Stackoverflow
Solution 7 - Apache Spark | Unmesha Sreeveni U.B | View Answer on Stackoverflow