How to find count of Null and Nan values for each column in a PySpark dataframe efficiently?

Apache Spark, Pyspark, Apache Spark-Sql

Apache Spark Problem Overview


import numpy as np

data = [
    (1, 1, None), 
    (1, 2, float(5)), 
    (1, 3, np.nan), 
    (1, 4, None), 
    (1, 5, float(10)), 
    (1, 6, float("nan")), 
    (1, 6, float("nan")),
]
df = spark.createDataFrame(data, ("session", "timestamp1", "id2"))

Expected output

dataframe with count of nan/null for each column

Note: The previous questions I found on Stack Overflow only check for null, not NaN. That's why I have created a new question.

I know I can use the isnull() function in Spark to find the number of null values in a Spark column, but how do I find NaN values in a Spark dataframe?

Apache Spark Solutions


Solution 1 - Apache Spark

You can use the method shown here and replace isNull with isnan:

from pyspark.sql.functions import isnan, when, count, col

df.select([count(when(isnan(c), c)).alias(c) for c in df.columns]).show()
+-------+----------+---+
|session|timestamp1|id2|
+-------+----------+---+
|      0|         0|  3|
+-------+----------+---+

or

df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).show()
+-------+----------+---+
|session|timestamp1|id2|
+-------+----------+---+
|      0|         0|  5|
+-------+----------+---+

Solution 2 - Apache Spark

For null values in a PySpark dataframe:

Dict_Null = {col:df.filter(df[col].isNull()).count() for col in df.columns}
Dict_Null

# The output is a dict where the key is the column name and the value is the number of null values in that column

{'#': 0,
 'Name': 0,
 'Type 1': 0,
 'Type 2': 386,
 'Total': 0,
 'HP': 0,
 'Attack': 0,
 'Defense': 0,
 'Sp_Atk': 0,
 'Sp_Def': 0,
 'Speed': 0,
 'Generation': 0,
 'Legendary': 0}
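
If you also need NaN counts with this dict approach, a minimal sketch restricted to float/double columns (isnan is only defined for those types):

from pyspark.sql.functions import isnan

# Per-column counts of rows that are NULL or NaN; limited to float/double
# columns so isnan() does not fail on strings, dates, or timestamps.
Dict_Null_Nan = {
    c: df.filter(df[c].isNull() | isnan(df[c])).count()
    for c, t in df.dtypes if t in ('float', 'double')
}
Dict_Null_Nan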

Solution 3 - Apache Spark

To make sure it does not fail for string, date and timestamp columns:

import pyspark.sql.functions as F
def count_missings(spark_df,sort=True):
    """
    Counts number of nulls and nans in each column
    """
    df = spark_df.select([F.count(F.when(F.isnan(c) | F.isnull(c), c)).alias(c) for (c,c_type) in spark_df.dtypes if c_type not in ('timestamp', 'string', 'date')]).toPandas()
    
    if len(df) == 0:
        print("There are no any missing values!")
        return None
    
    if sort:
        return df.rename(index={0: 'count'}).T.sort_values("count",ascending=False)
    
    return df

If you want to see the columns sorted based on the number of nans and nulls in descending order:

count_missings(spark_df)

# | Col_A | 10 |
# | Col_C | 2  |
# | Col_B | 1  | 

If you don't want ordering and want to see them in a single row:

count_missings(spark_df, False)
# | Col_A | Col_B | Col_C |
# |  10   |   1   |   2   |

Solution 4 - Apache Spark

Here is my one-liner, where 'c' is the name of the column:

from pyspark.sql.functions import col

df.select('c').withColumn('isNull_c', col('c').isNull()).where('isNull_c = True').count()
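
The same pattern works for NaN, a minimal sketch assuming 'c' is a float/double column (isnan is not defined for strings, dates, or timestamps):

from pyspark.sql.functions import isnan

# Count the NaN rows for a single numeric column 'c' (placeholder name).
df.select('c').withColumn('isnan_c', isnan('c')).where('isnan_c = True').count()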

Solution 5 - Apache Spark

An alternative to the already provided ways is to simply filter on the column, like so:

import pyspark.sql.functions as F
df = df.where(F.col('columnNameHere').isNull())

This has the added benefit that you don't have to add another column to do the filtering and it's quick on larger data sets.
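
Since the question asks for counts, a minimal sketch building on this filter (columnNameHere is a placeholder; the isnan check assumes a float/double column):

import pyspark.sql.functions as F

# Number of rows in which the column is NULL or NaN.
missing = df.where(F.col('columnNameHere').isNull() | F.isnan('columnNameHere')).count()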

Solution 6 - Apache Spark

I prefer this solution:

from pyspark.sql.functions import count

df = spark.table(selected_table).filter(condition)

counter = df.count()

df = df.select([(counter - count(c)).alias(c) for c in df.columns])
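
Note that count(c) counts only non-null values (NaN is still counted), so counter - count(c) reports nulls only. A sketch extending the same trick to also treat NaN as missing on float/double columns, applied to the question's example dataframe df:

from pyspark.sql.functions import col, count, isnan, when

counter = df.count()
# Columns where isnan() is defined (float/double); for everything else,
# count() already excludes just the NULLs.
numeric_cols = {c for c, t in df.dtypes if t in ('float', 'double')}

df.select([
    (counter - count(when(~isnan(col(c)), col(c)))).alias(c) if c in numeric_cols
    else (counter - count(col(c))).alias(c)
    for c in df.columns
]).show()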

Solution 7 - Apache Spark

from pyspark.sql import DataFrame
import pyspark.sql.functions as fn

# compatible with fn.isnan. Sourced from
# https://github.com/apache/spark/blob/13fd272cd3/python/pyspark/sql/functions.py#L4818-L4836
NUMERIC_DTYPES = (
    'decimal',
    'double',
    'float',
    'int',
    'bigint',
    'smallint',
    'tinyint',
)


def count_nulls(df: DataFrame) -> DataFrame:
    isnan_compat_cols = {c for (c, t) in df.dtypes if any(t.startswith(num_dtype) for num_dtype in NUMERIC_DTYPES)}

    return df.select(
        [fn.count(fn.when(fn.isnan(c) | fn.isnull(c), c)).alias(c) for c in isnan_compat_cols]
        + [fn.count(fn.when(fn.isnull(c), c)).alias(c) for c in set(df.columns) - isnan_compat_cols]
    )

This builds on gench's and user8183279's answers, but checks only isnull for columns where isnan is not possible, rather than ignoring them.

The source code of pyspark.sql.functions seemed to be the only documentation I could find enumerating these type names; if anyone knows of public docs, I'd be delighted.
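
A usage sketch with the example dataframe from the question:

# Per-column counts: NULL or NaN for isnan-compatible columns, NULL otherwise.
count_nulls(df).show()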

Solution 8 - Apache Spark

If you are writing Spark SQL, the following will also work to find null values, which you can then count:

spark.sql('select * from table where isNULL(column_value)')
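
To get the count directly, and to also catch NaN on numeric columns, a hedged sketch (table and column_value are placeholders from the query above):

# Count rows where the column is NULL or NaN; isnan() assumes a numeric column.
spark.sql(
    'select count(*) as null_or_nan_count from table '
    'where isnull(column_value) or isnan(column_value)'
).show()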

Solution 9 - Apache Spark

Yet another alternative (improving on Vamsi Krishna's solution above):

from pyspark.sql.functions import isnan, isnull

def check_for_null_or_nan(df):
    null_or_nan = lambda x: isnan(x) | isnull(x)
    func = lambda x: df.filter(null_or_nan(x)).count()
    print(*[f'{i} has {func(i)} nans/nulls' for i in df.columns if func(i) != 0], sep='\n')

check_for_null_or_nan(df)


id2 has 5 nans/nulls

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | GeorgeOfTheRF | View Question on Stackoverflow
Solution 1 - Apache Spark | user8183279 | View Answer on Stackoverflow
Solution 2 - Apache Spark | Vamsi Krishna | View Answer on Stackoverflow
Solution 3 - Apache Spark | gench | View Answer on Stackoverflow
Solution 4 - Apache Spark | Gabriel | View Answer on Stackoverflow
Solution 5 - Apache Spark | Patrik Iselind | View Answer on Stackoverflow
Solution 6 - Apache Spark | Eric Bellet | View Answer on Stackoverflow
Solution 7 - Apache Spark | ijoseph | View Answer on Stackoverflow
Solution 8 - Apache Spark | DivineCoder | View Answer on Stackoverflow
Solution 9 - Apache Spark | Rajesh Ramachander | View Answer on Stackoverflow