Convert a Spark DataFrame to a pandas DataFrame

Pandas, Apache Spark, Apache Spark-Sql

Pandas Problem Overview


Is there a way to convert a Spark DataFrame (not an RDD) to a pandas DataFrame?

I tried the following:

var some_df = Seq(
  ("A", "no"),
  ("B", "yes"),
  ("B", "yes"),
  ("B", "no")
).toDF("user_id", "phone_number")

Code:

%pyspark
pandas_df = some_df.toPandas()

Error:

 NameError: name 'some_df' is not defined

Any suggestions?

Pandas Solutions


Solution 1 - Pandas

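The following should work. The some_df in the question is a Scala variable, so that name is not defined in the %pyspark interpreter; creating the DataFrame in Python avoids the NameError: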

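# Build the DataFrame in the %pyspark interpreter so the name exists in Python.
some_df = sc.parallelize([
    ("A", "no"),
    ("B", "yes"),
    ("B", "yes"),
    ("B", "no"),
]).toDF(["user_id", "phone_number"])

# Collect all rows to the driver as a pandas DataFrame.
pandas_df = some_df.toPandas()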
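
If the DataFrame has to stay in the Scala paragraph, one possible alternative is to register it as a temporary view there and look it up by name from Python. The sketch below assumes the Scala and %pyspark interpreters share one SparkSession (commonly the case in a Zeppelin Spark interpreter group, but worth checking in your setup) and that the Scala side has already run some_df.createOrReplaceTempView("some_df"). The helper name shared_view_to_pandas is hypothetical.

```python
def shared_view_to_pandas(spark, view_name):
    """Look up a temp view by name and convert it to a pandas DataFrame.

    Assumption: `spark` is the SparkSession shared with the interpreter
    that registered the view (hypothetical helper, not a Spark API).
    """
    # SparkSession.table returns the DataFrame behind a registered view.
    return spark.table(view_name).toPandas()


# Usage in a %pyspark paragraph, after the Scala paragraph has run
#   some_df.createOrReplaceTempView("some_df")
# pandas_df = shared_view_to_pandas(spark, "some_df")
```

The view name is only a string, so a typo shows up as an analysis error from Spark rather than a Python NameError.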

Solution 2 - Pandas

In my case, the following conversion from a Spark DataFrame to a pandas DataFrame worked:

pandas_df = spark_df.select("*").toPandas()

Solution 3 - Pandas

Converting a Spark DataFrame to pandas can take a long time if the DataFrame is large. Enabling Apache Arrow for the conversion can speed it up:

spark.conf.set("spark.sql.execution.arrow.enabled", "true")

pd_df = df_spark.toPandas()

I have tried this in Databricks.
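
A minimal sketch of the same idea that also works across Spark versions: from Spark 3.0 the setting is named spark.sql.execution.arrow.pyspark.enabled, and the older name is deprecated. The helper names arrow_conf_key and to_pandas_with_arrow and the max_rows parameter are hypothetical, added for illustration. Because toPandas() collects every row to the driver, the optional cap uses DataFrame.limit to bound how much is pulled back.

```python
def arrow_conf_key(spark_version):
    """Return the Arrow toggle name for a Spark version string such as '3.4.1'."""
    major = int(spark_version.split(".")[0])
    if major >= 3:
        return "spark.sql.execution.arrow.pyspark.enabled"
    return "spark.sql.execution.arrow.enabled"


def to_pandas_with_arrow(spark, spark_df, max_rows=None):
    """Enable Arrow-based conversion, optionally cap the rows, then convert.

    Assumption: `spark` is an active SparkSession and `spark_df` a Spark
    DataFrame; both helpers are illustrative, not part of PySpark.
    """
    spark.conf.set(arrow_conf_key(spark.version), "true")
    if max_rows is not None:
        # Bound the amount of data collected to the driver.
        spark_df = spark_df.limit(max_rows)
    return spark_df.toPandas()


# Usage, assuming `spark` and `df_spark` already exist in the session:
# pd_df = to_pandas_with_arrow(spark, df_spark, max_rows=100000)
```

Arrow also requires the pyarrow package to be installed on the driver; whether it helps in practice depends on the column types in the DataFrame.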

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | data_person | View Question on Stackoverflow
Solution 1 - Pandas | Gaurang Shah | View Answer on Stackoverflow
Solution 2 - Pandas | Inna | View Answer on Stackoverflow
Solution 3 - Pandas | Shikha | View Answer on Stackoverflow