NumPy or Pandas: Keeping array type as integer while having a NaN value


Python Problem Overview


Is there a preferred way to keep the data type of a numpy array fixed as int (or int64 or whatever), while still having an element inside listed as numpy.NaN?

In particular, I am converting an in-house data structure to a Pandas DataFrame. In our structure, we have integer-type columns that still have NaN's (but the dtype of the column is int). It seems to recast everything as a float if we make this a DataFrame, but we'd really like to be int.

Thoughts?

Things tried:

I tried using the from_records() function under pandas.DataFrame, with coerce_float=False and this did not help. I also tried using NumPy masked arrays, with NaN fill_value, which also did not work. All of these caused the column data type to become a float.

Python Solutions


Solution 1 - Python

NaN can't be stored in an integer array. This is a known limitation of pandas at the moment; I have been waiting for progress to be made with NA values in NumPy (similar to NAs in R), but it will be at least 6 months to a year before NumPy gets these features, it seems:

http://pandas.pydata.org/pandas-docs/stable/gotchas.html#support-for-integer-na

(This feature has been added beginning with version 0.24 of pandas, but note it requires the use of extension dtype Int64 (capitalized), rather than the default dtype int64 (lower case): https://pandas.pydata.org/pandas-docs/version/0.24/whatsnew/v0.24.0.html#optional-integer-na-support )
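The limitation described at the start of this answer can be reproduced with NumPy alone. This is only a minimal sketch; the variable names are illustrative and not from the original post.

```python
import numpy as np

# An integer array has no value reserved to mean "missing".
ints = np.array([1, 2, 3], dtype=np.int64)

try:
    ints[0] = np.nan  # NaN is a float and cannot be converted to an integer
    assigned = True
except ValueError:
    assigned = False

# Mixing integers and NaN at construction time silently yields a float array.
mixed = np.array([1, 2, np.nan])

print(assigned)
print(mixed.dtype)
```

The assignment fails and the original integer array is left unchanged, while the mixed array comes out as float64.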

Solution 2 - Python

This capability has been added to pandas (beginning with version 0.24): https://pandas.pydata.org/pandas-docs/version/0.24/whatsnew/v0.24.0.html#optional-integer-na-support

At this point, it requires the use of extension dtype Int64 (capitalized), rather than the default dtype int64 (lowercase).
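As a rough illustration of the difference between the two spellings, here is a small sketch assuming pandas 0.24 or later. The DataFrame and the column name "count" are made up for the example.

```python
import pandas as pd

# Hypothetical data: an integer column with one missing entry.
df = pd.DataFrame({"count": [1, None, 3]})
float_dtype = str(df["count"].dtype)   # the default path upcasts to float64

# Opt in to the nullable extension dtype (note the capital "I").
df["count"] = df["count"].astype("Int64")
nullable_dtype = str(df["count"].dtype)

print(float_dtype, nullable_dtype)
print(df["count"].isna().tolist())
print(df["count"].sum())   # missing values are skipped by default
```

The cast from float64 succeeds here because every non-missing value is already a whole number.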

Solution 3 - Python

If performance is not the main issue, you can store strings instead.

df['col'] = df['col'].dropna().apply(lambda x: str(int(x)))

Then you can mix them with NaN as much as you want. If you really want to have integers, depending on your application, you can use -1, or 0, or 1234567890, or some other dedicated value to represent NaN.

You can also temporarily duplicate the columns: one as you have it, with floats; the other one experimental, with ints or strings. Then insert asserts in every reasonable place, checking that the two are in sync. After enough testing you can let go of the floats.
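The sentinel and parallel-column ideas could look roughly like the sketch below. The column names (col, col_str, col_int) and the sentinel value -1 are assumptions chosen for illustration; pick a sentinel that can never occur in your real data.

```python
import numpy as np
import pandas as pd

SENTINEL = -1  # hypothetical dedicated value standing in for NaN

df = pd.DataFrame({"col": [1.0, np.nan, 3.0]})

# Experimental string column: rows dropped by dropna() come back as NaN
# because the assignment aligns on the index.
df["col_str"] = df["col"].dropna().apply(lambda x: str(int(x)))

# Experimental integer column using the sentinel.
df["col_int"] = df["col"].fillna(SENTINEL).astype("int64")

# Asserts that keep the experimental columns in sync with the float column.
present = df["col"].notna()
assert df["col"].isna().equals(df["col_str"].isna())
assert (df.loc[present, "col_str"].astype(int) == df.loc[present, "col"]).all()
assert (df.loc[present, "col_int"] == df.loc[present, "col"]).all()
assert (df.loc[~present, "col_int"] == SENTINEL).all()
```

Once the asserts have held through enough real usage, the float column can be dropped.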

Solution 4 - Python

This is not a solution for all cases, but in mine (genomic coordinates) I've resorted to using 0 in place of NaN:

a3['MapInfo'] = a3['MapInfo'].fillna(0).astype(int)

This at least allows the proper 'native' column type to be used, so operations like subtraction, comparison, etc. work as expected.

Solution 5 - Python

In case you are trying to convert a float (1.143) vector to integer (1), and that vector has NAs, converting it to the new 'Int64' dtype will give you an error. In order to solve this you have to round the numbers and then do ".astype('Int64')"

import numpy as np
import pandas as pd

s1 = pd.Series([1.434, 2.343, np.nan])
# Without round(), the cast raises an error:
# s1.astype('Int64')
# TypeError: cannot safely cast non-equivalent float64 to int64
# With round() it works:
s1.round().astype('Int64')
0      1
1      2
2    NaN
dtype: Int64

My use case is that I have a float series that I want to round to integers. After .round() the values are still floats and still display decimals, so you need to convert to an integer dtype to remove them.

Solution 6 - Python

Pandas v0.24+

Support for missing values in integer series is available from v0.24 onwards. There's information on this in the v0.24 "What's New" section, and more details under Nullable Integer Data Type.

Pandas v0.23 and earlier

In general, it's best to work with float series where possible, even when the series is upcast from int to float due to inclusion of NaN values. This enables vectorised NumPy-based calculations where, otherwise, Python-level loops would be processed.

The docs do suggest: "One possibility is to use dtype=object arrays instead." For example:

s = pd.Series([1, 2, 3, np.nan])

print(s.astype(object))

0      1
1      2
2      3
3    NaN
dtype: object

For cosmetic reasons, e.g. output to a file, this may be preferable.

Pandas v0.23 and earlier: background

NaN is considered a float. The docs currently (as of v0.23) specify the reason why integer series are upcast to float:

> In the absence of high performance NA support being built into NumPy from the ground up, the primary casualty is the ability to represent NAs in integer arrays.
>
> This trade-off is made largely for memory and performance reasons, and also so that the resulting Series continues to be “numeric”.

The docs also provide rules for upcasting due to NaN inclusion:

Typeclass    Promotion dtype for storing NAs
floating     no change
object       no change
integer      cast to float64
boolean      cast to object
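The promotion rules in the table can be checked with a small sketch. It assumes the default NumPy-backed dtypes, and the helper dtype_after_na is a hypothetical name introduced only for this example. It uses reindex to introduce one missing row and reports the resulting dtype.

```python
import pandas as pd

def dtype_after_na(values, dtype):
    """Return the dtype name after reindexing introduces one missing row."""
    s = pd.Series(values, dtype=dtype)
    # The extra index label has no data, so pandas must store an NA there.
    return str(s.reindex(range(len(values) + 1)).dtype)

promotions = {
    "floating": dtype_after_na([1.5, 2.5], "float64"),
    "object": dtype_after_na(["a", "b"], object),
    "integer": dtype_after_na([1, 2], "int64"),
    "boolean": dtype_after_na([True, False], "bool"),
}

for typeclass, result in promotions.items():
    print(typeclass, "->", result)
```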

Solution 7 - Python

New in pandas v1.0.0+

The nullable integer dtype no longer uses numpy.nan as its missing value. It now uses pandas.NA.

Please read: https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html

> IntegerArray is currently experimental. Its API or implementation may change without warning.
>
> Changed in version 1.0.0: Now uses pandas.NA as the missing value rather than numpy.nan.
>
> In Working with missing data, we saw that pandas primarily uses NaN to represent missing data. Because NaN is a float, this forces an array of integers with any missing values to become floating point. In some cases, this may not matter much. But if your integer column is, say, an identifier, casting to float can be problematic. Some integers cannot even be represented as floating point numbers.
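A short sketch of what the quoted passage describes, assuming pandas 1.0 or later. The variable names are illustrative. The last part uses plain Python floats to show why a float column is a poor fit for large identifiers.

```python
import numpy as np
import pandas as pd

# Missing entries in a nullable integer array are stored as pandas.NA.
arr = pd.array([1, 2, None], dtype="Int64")
print(arr[2] is pd.NA)

# numpy.nan is still accepted as input, but it is converted to pandas.NA.
from_nan = pd.array([1, np.nan], dtype="Int64")
print(from_nan[1] is pd.NA)

# Some integers have no exact float64 representation:
# 2**53 + 1 collapses onto 2**53 when converted to float.
big_id = 2**53 + 1
collides = float(big_id) == float(2**53)
print(collides)
```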

Solution 8 - Python

This is now possible, since pandas v0.24.0.

Quoting the pandas 0.24.x release notes: "Pandas has gained the ability to hold integer dtypes with missing values."

Solution 9 - Python

If there are blanks in the text data, columns that would normally be integers are read as float64, because the int64 dtype cannot hold nulls. This can cause an inconsistent schema if you are loading multiple files, some with blanks (which end up as float64) and others without (which end up as int64).

This code attempts to convert every numeric column to Int64 (as opposed to int64), since Int64 can hold nulls:

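import pandas as pd
import numpy as np

# mydf is assumed to be an existing DataFrame, e.g. loaded from a file
# show dtypes before the transformation
print(mydf.dtypes)

for c in mydf.select_dtypes(np.number).columns:
    try:
        mydf[c] = mydf[c].astype('Int64')
        print('cast {} as Int64'.format(c))
    except (TypeError, ValueError):
        # e.g. float columns holding non-integer values cannot be cast safely
        print('could not cast {} to Int64'.format(c))

# show dtypes after the transformation
print(mydf.dtypes)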
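If the integer columns are known in advance, another option may be to request the nullable dtype while reading, so that every file ends up with the same schema. This is only a sketch: the CSV contents, the column names (id, qty) and the schema dict are made up for illustration, and it assumes pandas 0.24 or later.

```python
import io
import pandas as pd

# Two hypothetical files: one with a blank in an integer column, one without.
csv_with_blank = "id,qty\n1,10\n2,\n"
csv_without_blank = "id,qty\n3,30\n4,40\n"

# Ask for the nullable integer dtype up front for the known integer columns.
schema = {"id": "Int64", "qty": "Int64"}

frames = [
    pd.read_csv(io.StringIO(text), dtype=schema)
    for text in (csv_with_blank, csv_without_blank)
]
combined = pd.concat(frames, ignore_index=True)

print(combined.dtypes)
print(combined["qty"].isna().tolist())
```

Both frames share the same dtypes, so the concatenated result keeps Int64 rather than falling back to float64.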
        

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type          Original Author               Original Content on Stackoverflow
Question              ely                           View Question on Stackoverflow
Solution 1 - Python   Wes McKinney                  View Answer on Stackoverflow
Solution 2 - Python   techvslife                    View Answer on Stackoverflow
Solution 3 - Python   Sergey Orshanskiy             View Answer on Stackoverflow
Solution 4 - Python   pufferfish                    View Answer on Stackoverflow
Solution 5 - Python   Pedro Moisés Camacho Ureña    View Answer on Stackoverflow
Solution 6 - Python   jpp                           View Answer on Stackoverflow
Solution 7 - Python   Chananel P                    View Answer on Stackoverflow
Solution 8 - Python   mork                          View Answer on Stackoverflow
Solution 9 - Python   Kynrek                        View Answer on Stackoverflow