How to replace all Null values of a dataframe in Pyspark
DataframeNullPysparkDataframe Problem Overview
I have a data frame in pyspark with more than 300 columns. In these columns there are some columns with values null.
For example:
Column_1 column_2
null null
null null
234 null
125 124
365 187
and so on
When I want to do a sum of column_1 I am getting a Null as a result, instead of 724.
Now I want to replace the null in all columns of the data frame with empty space. So when I try to do a sum of these columns I don't get a null value but I will get a numerical value.
How can we achieve that in pyspark
Dataframe Solutions
Solution 1 - Dataframe
You can use df.na.fill
to replace nulls with zeros, for example:
>>> df = spark.createDataFrame([(1,), (2,), (3,), (None,)], ['col'])
>>> df.show()
+----+
| col|
+----+
| 1|
| 2|
| 3|
|null|
+----+
>>> df.na.fill(0).show()
+---+
|col|
+---+
| 1|
| 2|
| 3|
| 0|
+---+
Solution 2 - Dataframe
You can use fillna() func.
>>> df = spark.createDataFrame([(1,), (2,), (3,), (None,)], ['col'])
>>> df.show()
+----+
| col|
+----+
| 1|
| 2|
| 3|
|null|
+----+
>>> df = df.fillna({'col':'4'})
>>> df.show()
or df.fillna({'col':'4'}).show()
+---+
|col|
+---+
| 1|
| 2|
| 3|
| 4|
+---+
Solution 3 - Dataframe
Using fillna
there are 3 options...
Documentation:
> def fillna(self, value, subset=None):
> """Replace null values, alias for na.fill()
.
> :func:DataFrame.fillna
and :func:DataFrameNaFunctions.fill
are aliases of each other.
>
> :param value: int, long, float, string, bool or dict.
> Value to replace null values with.
> If the value is a dict, then subset
is ignored and value
must be a mapping
> from column name (string) to replacement value. The replacement value must be
> an int, long, float, boolean, or string.
> :param subset: optional list of column names to consider.
> Columns specified in subset that do not have matching data type are ignored.
> For example, if value
is a string, and subset contains a non-string column,
> then the non-string column is simply ignored.
So you can:
- fill all columns with the same value:
df.fillna(value)
- pass a dictionary of column --> value:
df.fillna(dict_of_col_to_value)
- pass a list of columns to fill with the same value:
df.fillna(value, subset=list_of_cols)
fillna()
is an alias for na.fill()
so they are the same.