How to find the size or shape of a DataFrame in PySpark?

Python, DataFrame, PySpark

Python Problem Overview


I am trying to find out the size/shape of a DataFrame in PySpark. I do not see a single function that can do this.

In Python (with pandas), I can do this:

data.shape

Is there a similar function in PySpark? This is my current solution, but I am looking for a more elegant one:

row_number = data.count()
column_number = len(data.dtypes)

The computation of the number of columns is not ideal...

Python Solutions


Solution 1 - Python

You can get its shape with:

print((df.count(), len(df.columns)))
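
As a self-contained sketch of the one-liner above: the helper name `spark_shape` and the `demo` function are illustrative and not part of PySpark. The helper relies only on `count()` and `columns`, which a Spark DataFrame provides; the `demo` function assumes a local PySpark installation.

```python
def spark_shape(df):
    """Return (rows, columns) for a Spark DataFrame, similar to pandas' .shape."""
    # count() is an action and runs a Spark job over the data;
    # columns is read from the schema and needs no job.
    return (df.count(), len(df.columns))


def demo():
    """Illustrative usage; needs a working PySpark installation."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").appName("shape-demo").getOrCreate()
    # Three rows, two columns (hypothetical sample data).
    df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "label"])
    print(spark_shape(df))  # (3, 2)
    spark.stop()
```

Calling `demo()` builds a small DataFrame and prints its shape as a tuple.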

Solution 2 - Python

Use df.count() to get the number of rows.

Solution 3 - Python

Add this to your code:

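from pyspark.sql import DataFrame

def spark_shape(self):
    # count() runs a Spark job; columns is read from the schema
    return (self.count(), len(self.columns))

# Attach shape() as a method on every Spark DataFrame
DataFrame.shape = spark_shape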

Then you can do

>>> df.shape()
(10000, 10)

Keep in mind that .count() can be very slow on a very large table that has not been persisted.
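
If the DataFrame will be reused after its shape is computed, one option is to cache it first, so that the job triggered by `count()` also populates the cache for later actions. This is a minimal sketch, assuming the data fits in the cluster's storage; `cached_shape` is a hypothetical helper name.

```python
def cached_shape(df):
    """Mark df for caching, then return (rows, columns)."""
    # cache() is lazy: it marks the DataFrame for caching and returns it.
    cached = df.cache()
    # count() is the action that runs the job; as the first action
    # after cache(), it is also what populates the cache.
    n_rows = cached.count()
    return (n_rows, len(cached.columns))
```

Call `df.unpersist()` once the cached data is no longer needed.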

Solution 4 - Python

print((df.count(), len(df.columns)))

is easier for smaller datasets.

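However, for a larger dataset, an alternative approach is to enable Apache Arrow, convert the Spark DataFrame to a pandas DataFrame, and read its shape. Note that toPandas() collects every row to the driver, so the data must still fit in driver memory.

# Enable Arrow-based conversion (in Spark 3.0+ this key is deprecated in favor of spark.sql.execution.arrow.pyspark.enabled)
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
spark.conf.set("spark.sql.crossJoin.enabled", "true")
# shape is an attribute of the pandas DataFrame, not a method
print(df.toPandas().shape)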

Solution 5 - Python

I don't think Spark has a function similar to data.shape. However, I would use len(data.columns) rather than len(data.dtypes).

Solution 6 - Python

I have solved this problem using this code block. Please try it, it works.

import pyspark
def sparkShape(dataFrame):
    return (dataFrame.count(), len(dataFrame.columns))
pyspark.sql.dataframe.DataFrame.shape = sparkShape

print(df.shape())  # replace df with the name of your DataFrame
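
The patches in Solutions 3 and 6 add `shape` as a method, so it has to be called with parentheses. To mirror pandas, where `shape` is an attribute, a property can be attached instead. The sketch below is an assumption-laden variant, not something the answers propose: `add_shape_property` and `patch_spark_dataframe` are hypothetical names, and the second function needs PySpark installed.

```python
def add_shape_property(df_class):
    """Attach a read-only `shape` property to the given DataFrame class."""
    def _shape(self):
        # Same computation as the method version:
        # rows via count(), columns via the schema.
        return (self.count(), len(self.columns))

    df_class.shape = property(_shape)
    return df_class


def patch_spark_dataframe():
    """Apply the property to PySpark's DataFrame; needs PySpark installed."""
    from pyspark.sql import DataFrame

    return add_shape_property(DataFrame)
```

After calling `patch_spark_dataframe()`, `df.shape` (without parentheses) returns the tuple. Each access still runs `count()`, so the cost is the same as the method version.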

Attributions

All content for this solution is sourced from the original question on Stack Overflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stack Overflow
Question | Xi Liang | View Question on Stack Overflow
Solution 1 - Python | George Fisher | View Answer on Stack Overflow
Solution 2 - Python | VMEscoli | View Answer on Stack Overflow
Solution 3 - Python | Louis Yang | View Answer on Stack Overflow
Solution 4 - Python | Venzu251720 | View Answer on Stack Overflow
Solution 5 - Python | YungChun | View Answer on Stack Overflow
Solution 6 - Python | Sahaj Raj Malla | View Answer on Stack Overflow