Replace invalid values with None in Pandas DataFrame

Python Problem Overview

Is there any method to replace values with None in Pandas in Python?

You can use df.replace('pre', 'post') to replace one value with another, but this does not work when the replacement is None; if you try it, you get a strange result.

So here's an example:

import pandas as pd
df = pd.DataFrame(['-',3,2,5,1,-5,-1,'-',9])
df.replace('-', 0)

which returns a successful result.

But,

df.replace('-', None)

which returns the following result:

0
0   - // this isn't replaced
1   3
2   2
3   5
4   1
5  -5
6  -1
7  -1 // this is changed to `-1`...
8   9

Why is such a strange result returned?

Since I want to load this data frame into a MySQL database, I can't have NaN values in any element of the data frame and want None instead. Of course, you can first change '-' to NaN and then convert NaN to None, but I want to know why the DataFrame behaves in such a terrible way.

> Tested on pandas 0.12.0 dev on Python 2.7 and OS X 10.8. Python is the pre-installed version on OS X, and I installed pandas using the SciPy Superpack script, for your information.

Python Solutions

Solution 1 - Python

Actually in later versions of pandas this will give a TypeError:

df.replace('-', None)
TypeError: If "to_replace" and "value" are both None then regex must be a mapping

You can do it by passing either a list or a dictionary:

In [11]: df.replace(['-'], [None]) # or .replace('-', {0: None})
Out[11]:
      0
0  None
1     3
2     2
3     5
4     1
5    -5
6    -1
7  None
8     9

But I recommend using NaNs rather than None:

In [12]: df.replace('-', np.nan)
Out[12]:
     0
0  NaN
1    3
2    2
3    5
4    1
5   -5
6   -1
7  NaN
8    9

Solution 2 - Python

I prefer the solution using replace with a dict because of its simplicity and elegance:

df.replace({'-': None})

You can also have more replacements:

df.replace({'-': None, 'None': None})

And even for larger replacements, it is always obvious and clear what is replaced by what - which is way harder for long lists, in my opinion.
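As a minimal sketch of the dict form with more than one key, the snippet below uses a made-up column `x` (not from the question) that contains both '-' and the string 'None'. The check relies only on `isna()`, since how the missing cell is displayed can vary between pandas versions.

```python
import pandas as pd

# Hypothetical sample: '-' and the literal string 'None' both mark invalid data.
df = pd.DataFrame({'x': ['-', 'None', 3, 5]})

# Each key is a value to look for; each dict value is its replacement.
cleaned = df.replace({'-': None, 'None': None})

# Flags which rows are now treated as missing.
missing_flags = cleaned['x'].isna().tolist()
print(missing_flags)
```

The original `df` is left unchanged because `replace` returns a new object unless `inplace=True` is passed.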

Solution 3 - Python

where is probably what you're looking for. Note that it keeps the values where the condition is True, so the condition has to select the values you want to keep:

data = data.where(data != '-', None)

From the [pandas docs][1]:

> where [returns] an object of same shape as self and whose corresponding entries are from self where cond is True and otherwise are from other.

[1]: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.where.html "docs"
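To make the direction of the condition concrete, here is a small sketch on the question's sample column. The names `kept` and `replaced_rows` are illustrative. It assumes the column has object dtype, as it does here because it mixes strings and integers.

```python
import pandas as pd

df = pd.DataFrame(['-', 3, 2, 5, 1, -5, -1, '-', 9])

# where() keeps a cell when the condition is True and substitutes `other`
# when it is False, so the condition describes the values to KEEP.
kept = df.where(df != '-', None)

# Rows whose '-' was swapped out are now reported as missing.
replaced_rows = kept.index[kept[0].isna()].tolist()
print(replaced_rows)
```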

Solution 4 - Python

Before proceeding with this post, it is important to understand the difference between NaN and None. One is a float type, the other is an object type. Pandas is better suited to working with scalar types as many methods on these types can be vectorised. Pandas does try to handle None and NaN consistently, but NumPy cannot.

My suggestion (and Andy's) is to stick with NaN.
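The difference between the two can be seen directly. This is only an illustrative sketch; the variable names are made up for the example.

```python
import numpy as np
import pandas as pd

# NaN is an ordinary float value; None is Python's own null object.
nan_type = type(np.nan).__name__
none_type = type(None).__name__
print(nan_type, none_type)

# NumPy can hold NaN in a float array, but None forces a generic object array.
nan_array = np.array([1.0, np.nan])
none_array = np.array([1.0, None])
print(nan_array.dtype, none_array.dtype)

# pandas reports both as missing, even inside an object column.
s = pd.Series([1.0, None, np.nan], dtype=object)
missing_flags = s.isna().tolist()
print(missing_flags)
```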

But to answer your question...

pandas >= 0.18: Use `na_values=['-']` argument with `read_csv`

If you loaded this data from CSV/Excel, I have good news for you. You can quash this at the root during data loading instead of having to write a fix with code as a subsequent step.

Most of the pd.read_* functions (such as read_csv and read_excel) accept a na_values argument.

file.csv

A,B
-,1
3,-
2,-
5,3
1,-2
-5,4
-1,-1
-,0
9,0

Now, to convert the - characters into NaNs, do,

import pandas as pd
df = pd.read_csv('file.csv', na_values=['-'])
df

     A    B
0  NaN  1.0
1  3.0  NaN
2  2.0  NaN
3  5.0  3.0
4  1.0 -2.0
5 -5.0  4.0
6 -1.0 -1.0
7  NaN  0.0
8  9.0  0.0

And similar for other functions/file formats.
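If you want to try this without creating a file, the same call works on an in-memory buffer. The sketch below uses `io.StringIO` and a shortened version of the sample data; the variable names are illustrative.

```python
import io
import pandas as pd

# A shortened copy of file.csv, held in memory instead of on disk.
csv_text = "A,B\n-,1\n3,-\n2,-\n5,3\n1,-2\n"

# Only cells that are exactly '-' become NaN; '-2' is still parsed as a number.
df = pd.read_csv(io.StringIO(csv_text), na_values=['-'])

print(df)
print(df.dtypes)
```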

P.S.: On v0.24+, you can preserve the integer type even if your column has NaNs (yes, talk about having your cake and eating it too). You can specify dtype='Int32':

df = pd.read_csv('file.csv', na_values=['-'], dtype='Int32')
df

     A    B
0  NaN    1
1    3  NaN
2    2  NaN
3    5    3
4    1   -2
5   -5    4
6   -1   -1
7  NaN    0
8    9    0

df.dtypes

A    Int32
B    Int32
dtype: object

The dtype is not a conventional int type... but rather, a Nullable Integer Type. There are other options.

Handling Numeric Data: `pd.to_numeric` with `errors='coerce'`

If you're dealing with numeric data, a faster solution is to use pd.to_numeric with the errors='coerce' argument, which coerces invalid values (values that cannot be cast to numeric) to NaN.

pd.to_numeric(df['A'], errors='coerce')

0    NaN
1    3.0
2    2.0
3    5.0
4    1.0
5   -5.0
6   -1.0
7    NaN
8    9.0
Name: A, dtype: float64

To retain (nullable) integer dtype, use

pd.to_numeric(df['A'], errors='coerce').astype('Int32')

0    NaN
1      3
2      2
3      5
4      1
5     -5
6     -1
7    NaN
8      9
Name: A, dtype: Int32

To coerce multiple columns, use apply:

df[['A', 'B']].apply(pd.to_numeric, errors='coerce').astype('Int32')

     A    B
0  NaN    1
1    3  NaN
2    2  NaN
3    5    3
4    1   -2
5   -5    4
6   -1   -1
7  NaN    0
8    9    0

...and assign the result back after.
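Assigning the result back could look like the sketch below. The small DataFrame of strings is made up for illustration (it is not the CSV from above), and `cols` is just a helper name.

```python
import pandas as pd

# Hypothetical raw data: every cell is a string, some of them not numeric.
df = pd.DataFrame({'A': ['-', '3', '2'], 'B': ['1', '-', 'x']})

cols = ['A', 'B']
# Convert each listed column; anything that is not a number becomes NaN.
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')

print(df)
```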

More information can be found in this answer.

Solution 5 - Python

With Pandas version ≥1.0.0, I would use DataFrame.replace or Series.replace:

df.replace(old_val, pd.NA, inplace=True)

This is better for two reasons:

It uses pd.NA instead of None or np.nan.
It optionally works in-place which could be more memory efficient depending upon the internal implementation.
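A short sketch of both forms on the question's sample data, assuming pandas 1.0 or later so that `pd.NA` exists. Here `'-'` plays the role of `old_val`, and `df_inplace` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame(['-', 3, 2, 5, 1, -5, -1, '-', 9])

# Default form: a new DataFrame is returned and df itself is not modified.
result = df.replace('-', pd.NA)

# In-place form: the object it is called on is modified directly.
df_inplace = df.copy()
df_inplace.replace('-', pd.NA, inplace=True)

na_rows = result.index[result[0].isna()].tolist()
print(na_rows)
```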

Solution 6 - Python

df = pd.DataFrame(['-',3,2,5,1,-5,-1,'-',9])
df = df.where(df!='-', None)

Solution 7 - Python

Setting null values can be done with np.nan:

import numpy as np
df.replace('-', np.nan)

The advantage is that df.last_valid_index() recognizes these as invalid.
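For instance, with a made-up Series that ends in two '-' entries, the last valid index moves once they are turned into NaN. The names `before` and `after` are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([3, 2, '-', '-'])

# The string '-' is an ordinary value, so the last row counts as valid.
before = s.last_valid_index()

# After the replacement the trailing rows are missing and are skipped.
after = s.replace('-', np.nan).last_valid_index()

print(before, after)
```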

Solution 8 - Python

Using replace and assigning a new df:

import pandas as pd
df = pd.DataFrame(['-',3,2,5,1,-5,-1,'-',9])
dfnew = df.replace('-', 0)
print(dfnew)


(venv) D:\assets>py teste2.py
   0
0  0
1  3
2  2
3  5
4  1
5 -5
6 -1
7  0
8  9

Solution 9 - Python

df.replace('-', np.nan).astype("object")

This ensures that you can use isnull() on your DataFrame later.

Solution 10 - Python

Alternatively you can also use mask:

df.mask(df=='-', None)
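mask is the mirror image of where: it replaces the cells where the condition is True rather than keeping them. A brief sketch comparing the two on the question's sample data (variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame(['-', 3, 2, 5, 1, -5, -1, '-', 9])

# mask() replaces cells where the condition holds...
masked = df.mask(df == '-', None)
# ...while where() replaces cells where it does not.
kept = df.where(df != '-', None)

# Both calls mark the same rows as missing.
same_missing = masked[0].isna().equals(kept[0].isna())
masked_rows = masked.index[masked[0].isna()].tolist()
print(same_missing, masked_rows)
```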

Content Type	Original Author	Original Content on Stackoverflow
Question	Blaszard	View Question on Stackoverflow
Solution 1 - Python	Andy Hayden	View Answer on Stackoverflow
Solution 2 - Python	Michael Dorner	View Answer on Stackoverflow
Solution 3 - Python	user2966041	View Answer on Stackoverflow
Solution 4 - Python	cs95	View Answer on Stackoverflow
Solution 5 - Python	Asclepius	View Answer on Stackoverflow
Solution 6 - Python	Shravan kp	View Answer on Stackoverflow
Solution 7 - Python	Freek Wiekmeijer	View Answer on Stackoverflow
Solution 8 - Python	daniel rocha	View Answer on Stackoverflow
Solution 9 - Python	Keng Chan	View Answer on Stackoverflow
Solution 10 - Python	rachwa	View Answer on Stackoverflow
