How to replace all Null values of a dataframe in Pyspark

DataframeNullPyspark

Dataframe Problem Overview


I have a data frame in pyspark with more than 300 columns. In these columns there are some columns with values null.

For example:

Column_1 column_2
null     null
null     null
234      null
125      124
365      187
and so on

When I want to do a sum of column_1 I am getting a Null as a result, instead of 724.

Now I want to replace the null in all columns of the data frame with empty space. So when I try to do a sum of these columns I don't get a null value but I will get a numerical value.

How can we achieve that in pyspark

Dataframe Solutions


Solution 1 - Dataframe

You can use df.na.fill to replace nulls with zeros, for example:

>>> df = spark.createDataFrame([(1,), (2,), (3,), (None,)], ['col'])
>>> df.show()
+----+
| col|
+----+
|   1|
|   2|
|   3|
|null|
+----+

>>> df.na.fill(0).show()
+---+
|col|
+---+
|  1|
|  2|
|  3|
|  0|
+---+

Solution 2 - Dataframe

You can use fillna() func.

>>> df = spark.createDataFrame([(1,), (2,), (3,), (None,)], ['col'])
>>> df.show()
+----+
| col|
+----+
|   1|
|   2|
|   3|
|null|
+----+

>>> df = df.fillna({'col':'4'})
>>> df.show()

or df.fillna({'col':'4'}).show()

+---+
|col|
+---+
|  1|
|  2|
|  3|
|  4|
+---+

Solution 3 - Dataframe

Using fillna there are 3 options...

Documentation:

> def fillna(self, value, subset=None): > """Replace null values, alias for na.fill(). > :func:DataFrame.fillna and :func:DataFrameNaFunctions.fill are aliases of each other. > > :param value: int, long, float, string, bool or dict. > Value to replace null values with. > If the value is a dict, then subset is ignored and value must be a mapping > from column name (string) to replacement value. The replacement value must be > an int, long, float, boolean, or string. > :param subset: optional list of column names to consider. > Columns specified in subset that do not have matching data type are ignored. > For example, if value is a string, and subset contains a non-string column, > then the non-string column is simply ignored.

So you can:

  1. fill all columns with the same value: df.fillna(value)
  2. pass a dictionary of column --> value: df.fillna(dict_of_col_to_value)
  3. pass a list of columns to fill with the same value: df.fillna(value, subset=list_of_cols)

fillna() is an alias for na.fill() so they are the same.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
Questionuser7543621View Question on Stackoverflow
Solution 1 - DataframeMariuszView Answer on Stackoverflow
Solution 2 - DataframeDugini VijayView Answer on Stackoverflow
Solution 3 - DataframeDanny VarodView Answer on Stackoverflow