Convert a Spark DataFrame to a pandas DataFrame
Pandas Problem Overview
Is there a way to convert a Spark DataFrame (not an RDD) to a pandas DataFrame?
I tried the following. The DataFrame is created in Scala:
var some_df = Seq(
  ("A", "no"),
  ("B", "yes"),
  ("B", "yes"),
  ("B", "no")
).toDF("user_id", "phone_number")
Code:
%pyspark
pandas_df = some_df.toPandas()
Error:
NameError: name 'some_df' is not defined
Any suggestions?
Pandas Solutions
Solution 1 - Pandas
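The DataFrame in the question is created in Scala, so the name some_df does not exist in the %pyspark interpreter, which is why Python raises the NameError. Creating the DataFrame in Python should work: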
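# sc is the SparkContext that the notebook predefines
some_df = sc.parallelize([
    ("A", "no"),
    ("B", "yes"),
    ("B", "yes"),
    ("B", "no"),
]).toDF(["user_id", "phone_number"])
# Collect all rows to the driver as a pandas DataFrame
pandas_df = some_df.toPandas()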
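If the DataFrame has to stay defined on the Scala side, one possible route is to register it as a temporary view and read that view from Python. The sketch below assumes Spark 2.0 or later and a notebook (Zeppelin, for example) where the Scala and PySpark interpreters share one SparkSession, available as spark. The helper name pandas_from_view and the view name some_df_view are made up for illustration.

```python
# Assumed to have run first in a Scala cell (this line is Scala, not Python):
#   some_df.createOrReplaceTempView("some_df_view")

def pandas_from_view(spark, view_name):
    """Look up a registered view by name and collect it as a pandas DataFrame."""
    # spark.table() returns the view as a Spark DataFrame;
    # toPandas() then pulls every row into driver memory.
    return spark.table(view_name).toPandas()

# Hypothetical usage in a %pyspark cell:
# pandas_df = pandas_from_view(spark, "some_df_view")
```

Whether the view is visible from Python depends on how the notebook wires its interpreters, so treat this as something to try rather than a guaranteed fix.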
Solution 2 - Pandas
In my case, the following conversion from a Spark DataFrame to a pandas DataFrame worked:
pandas_df = spark_df.select("*").toPandas()
Solution 3 - Pandas
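Converting a Spark DataFrame to pandas can take a long time if the DataFrame is large, because toPandas() collects every row onto the driver. Enabling Apache Arrow for the conversion can speed it up (on Spark 3.0 and later the key is spark.sql.execution.arrow.pyspark.enabled; the older name below is deprecated):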
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
pd_df = df_spark.toPandas()
I have tried this on Databricks.
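The Arrow setting can also be wrapped in a small helper. This is a minimal sketch, assuming spark is an existing SparkSession and that pyarrow is installed on the cluster. The function name to_pandas_with_arrow and the max_rows parameter are hypothetical. It picks the configuration key from the Spark major version and can optionally cap the number of rows collected.

```python
def to_pandas_with_arrow(spark, spark_df, max_rows=None):
    """Enable Arrow-based conversion, then collect spark_df as a pandas DataFrame."""
    # spark.version is a string such as "3.4.1"; only the major part matters here.
    major = int(spark.version.split(".")[0])
    key = ("spark.sql.execution.arrow.pyspark.enabled" if major >= 3
           else "spark.sql.execution.arrow.enabled")
    spark.conf.set(key, "true")

    # Optional safety cap: limit() keeps at most max_rows rows before collecting.
    if max_rows is not None:
        spark_df = spark_df.limit(max_rows)
    return spark_df.toPandas()

# Hypothetical usage:
# pd_df = to_pandas_with_arrow(spark, df_spark, max_rows=100000)
```

Arrow speeds up the transfer but does not remove the memory limit: the result still has to fit on the driver, which is why the row cap is offered.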