Is there a way to take the first 1000 rows of a Spark DataFrame?

Scala · Apache Spark

Scala Problem Overview


I am using the randomSplit function to get a small sample of a DataFrame for dev purposes, and I end up just taking the first df returned by this function.

val df_subset = data.randomSplit(Array(0.00000001, 0.01), seed = 12345)(0)

If I use df.take(1000) then I end up with an array of rows, not a DataFrame, so that won't work for me.
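Converting the array back into a DataFrame is possible but clunky. A minimal sketch of that workaround, assuming a SparkSession named spark is in scope:

// take returns Array[Row]; rebuilding a DataFrame requires re-attaching the schema
val rows = df.take(1000)
val df_from_rows = spark.createDataFrame(spark.sparkContext.parallelize(rows), df.schema)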

Is there a better, simpler way to take, say, the first 1000 rows of the df and store it as another df?

Scala Solutions


Solution 1 - Scala

The method you are looking for is .limit.

> Returns a new Dataset by taking the first n rows. The difference between this function and head is that head returns an array while limit returns a new Dataset.

Example usage:

df.limit(1000)
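A more complete, runnable sketch, with the input DataFrame generated via spark.range purely for illustration (in the question, data would come from a real source):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("limit-example")
  .master("local[*]")
  .getOrCreate()

// Stand-in for the question's full DataFrame
val data = spark.range(0, 1000000).toDF("id")

// limit returns a new DataFrame, so it can be stored and reused like any other
val df_subset = data.limit(1000)

println(df_subset.count()) // 1000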

Solution 2 - Scala

limit is very simple to use. For example, to take the first 50 rows:

val df_subset = data.limit(50)
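Note that limit, like other transformations, is lazy: the subset is recomputed on each action. If the subset will be reused repeatedly (as in dev work), caching it is a common pattern. A sketch:

val df_subset = data.limit(50).cache() // cache so repeated actions reuse the same 50 rows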

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type       | Original Author  | Original Content on Stackoverflow
Question           | Michael Discenza | View Question on Stackoverflow
Solution 1 - Scala | Markon           | View Answer on Stackoverflow
Solution 2 - Scala | Paulo Victor     | View Answer on Stackoverflow