Convert floats to ints in Pandas?
Python Problem Overview
I've been working with data imported from a CSV. Pandas changed some columns to float, so the numbers in these columns are now displayed as floating-point values. However, I need them displayed as integers, or at least without the decimal separator. Is there a way to convert them to integers, or to hide the decimal part?
Python Solutions
Solution 1 - Python
To change how floats are displayed, set the float format:
import pandas as pd
df = pd.DataFrame(range(5), columns=['a'])
df.a = df.a.astype(float)
df
Out[33]:
a
0 0.0000000
1 1.0000000
2 2.0000000
3 3.0000000
4 4.0000000
pd.options.display.float_format = '{:,.0f}'.format
df
Out[35]:
a
0 0
1 1
2 2
3 3
4 4
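Note that this option only changes how floats are printed; the column itself stays float64. A minimal sketch, assuming a reasonably recent pandas, that checks this and uses pd.option_context so the format is applied only locally (the small frame is made up for illustration):

```python
import pandas as pd

# Hypothetical example frame with a float column.
df = pd.DataFrame({"a": [0.0, 1.0, 2.0]})

# Apply the float format only inside this block instead of globally.
with pd.option_context("display.float_format", "{:,.0f}".format):
    rendered = df.to_string()

print(rendered)

# The underlying data is untouched: the column is still float64.
print(df["a"].dtype)
```

If the values are needed as real integers (for merging, indexing, or export), a dtype conversion is still required.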
Solution 2 - Python
Use the pandas.DataFrame.astype(<type>) function to manipulate column dtypes.
>>> df = pd.DataFrame(np.random.rand(3,4), columns=list("ABCD"))
>>> df
A B C D
0 0.542447 0.949988 0.669239 0.879887
1 0.068542 0.757775 0.891903 0.384542
2 0.021274 0.587504 0.180426 0.574300
>>> df[list("ABCD")] = df[list("ABCD")].astype(int)
>>> df
A B C D
0 0 0 0 0
1 0 0 0 0
2 0 0 0 0
EDIT:
To handle missing values:
>>> df
A B C D
0 0.475103 0.355453 0.66 0.869336
1 0.260395 0.200287 NaN 0.617024
2 0.517692 0.735613 0.18 0.657106
>>> df[list("ABCD")] = df[list("ABCD")].fillna(0.0).astype(int)
>>> df
A B C D
0 0 0 0 0
1 0 0 0 0
2 0 0 0 0
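Keep in mind that astype(int) drops the fractional part rather than rounding, which is why every value above becomes 0. A small sketch with made-up values showing the difference between truncating and rounding first:

```python
import numpy as np
import pandas as pd

# Hypothetical values chosen to show truncation vs. rounding.
s = pd.Series([1.9, -1.9, np.nan, 2.5])

# astype(int) drops the fractional part (truncation toward zero).
truncated = s.fillna(0.0).astype(int)

# Rounding first gives the nearest integer (ties go to the even neighbour).
rounded = s.fillna(0.0).round().astype(int)

print(truncated.tolist())
print(rounded.tolist())
```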
Solution 3 - Python
Considering the following data frame:
>>> df = pd.DataFrame(10*np.random.rand(3, 4), columns=list("ABCD"))
>>> print(df)
... A B C D
... 0 8.362940 0.354027 1.916283 6.226750
... 1 1.988232 9.003545 9.277504 8.522808
... 2 1.141432 4.935593 2.700118 7.739108
Using a list of column names, change the type for multiple columns with applymap():
>>> cols = ['A', 'B']
>>> df[cols] = df[cols].applymap(np.int64)
>>> print(df)
... A B C D
... 0 8 0 1.916283 6.226750
... 1 1 9 9.277504 8.522808
... 2 1 4 2.700118 7.739108
Or for a single column with apply():
>>> df['C'] = df['C'].apply(np.int64)
>>> print(df)
... A B C D
... 0 8 0 1 6.226750
... 1 1 9 9 8.522808
... 2 1 4 2 7.739108
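In recent pandas releases (2.1 and later, as far as I know) applymap() is deprecated in favour of DataFrame.map(), and an element-wise call is not strictly needed for a plain cast. A sketch of the same conversion with astype, using a small made-up frame:

```python
import pandas as pd

# Hypothetical frame standing in for the random data above.
df = pd.DataFrame({"A": [8.36, 1.98], "B": [0.35, 9.00], "C": [1.91, 9.27]})

# Cast only the selected columns; the other columns keep their dtype.
cols = ["A", "B"]
df[cols] = df[cols].astype("int64")

print(df.dtypes)
```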
Solution 4 - Python
To convert all float columns to int:
>>> df = pd.DataFrame(np.random.rand(5, 4) * 10, columns=list('PQRS'))
>>> print(df)
... P Q R S
... 0 4.395994 0.844292 8.543430 1.933934
... 1 0.311974 9.519054 6.171577 3.859993
... 2 2.056797 0.836150 5.270513 3.224497
... 3 3.919300 8.562298 6.852941 1.415992
... 4 9.958550 9.013425 8.703142 3.588733
>>> float_col = df.select_dtypes(include=['float64'])  # select the float columns only
>>> # list(float_col.columns.values)
>>> for col in float_col.columns.values:
...     df[col] = df[col].astype('int64')
>>> print(df)
... P Q R S
... 0 4 0 8 1
... 1 0 9 6 3
... 2 2 0 5 3
... 3 3 8 6 1
... 4 9 9 8 3
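The loop can also be written as a single astype call with a column-to-dtype mapping. A sketch under the assumption that none of the float columns contain NaN (the frame and its "label" column are made up for illustration):

```python
import pandas as pd

# Hypothetical frame with two float columns and one string column.
df = pd.DataFrame({"P": [4.39, 0.31], "Q": [0.84, 9.51], "label": ["x", "y"]})

# Build a {column: dtype} mapping for the float columns only.
float_cols = df.select_dtypes(include=["float64"]).columns
df = df.astype({col: "int64" for col in float_cols})

print(df.dtypes)
```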
Solution 5 - Python
This is a quick solution if you want to convert several columns of your pandas.DataFrame from float to integer while allowing for NaN values.
cols = ['col_1', 'col_2', 'col_3', 'col_4']
for col in cols:
    df[col] = df[col].apply(lambda x: int(x) if x == x else "")  # x == x is False only for NaN
I tried else x and else None, but the result still contained floats, so I used else "".
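One side effect worth knowing: mixing integers with "" turns the column into the generic object dtype, so numeric operations no longer work on it directly. A small sketch comparing this with pandas' nullable 'Int64' dtype (the column name and values are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value.
df = pd.DataFrame({"col_1": [1.0, np.nan, 3.0]})

# Mixing ints and "" forces the generic object dtype.
as_object = df["col_1"].apply(lambda x: int(x) if x == x else "")
print(as_object.dtype)

# The nullable extension dtype keeps integers and marks the gap as <NA>.
as_nullable = df["col_1"].astype("Int64")
print(as_nullable.dtype)
```

The else None variant presumably fell back to floats because pandas re-infers a float64 column when the result holds only numbers and missing values.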
Solution 6 - Python
Expanding on the pandas.DataFrame.astype(<type>) method mentioned by @Ryan G, one can use the errors='ignore' argument to convert only those columns that do not produce an error, which notably simplifies the syntax. Obviously, caution should be applied when ignoring errors, but for this task it comes in very handy.
>>> df = pd.DataFrame(np.random.rand(3, 4), columns=list('ABCD'))
>>> df *= 10
>>> print(df)
... A B C D
... 0 2.16861 8.34139 1.83434 6.91706
... 1 5.85938 9.71712 5.53371 4.26542
... 2 0.50112 4.06725 1.99795 4.75698
>>> df['E'] = list('XYZ')
>>> df = df.astype(int, errors='ignore')  # astype returns a new frame, so assign it back
>>> print(df)
... A B C D E
... 0 2 8 1 6 X
... 1 5 9 5 4 Y
... 2 0 4 1 4 Z
From the pandas.DataFrame.astype docs:
> errors : {'raise', 'ignore'}, default 'raise'
>
> Control raising of exceptions on invalid data for provided dtype.
>
> - raise : allow exceptions to be raised
> - ignore : suppress exceptions. On error return original object
>
> New in version 0.20.0.
Solution 7 - Python
The columns that need to be converted to int can also be specified in a dictionary, as below:
df = df.astype({'col1': 'int', 'col2': 'int', 'col3': 'int'})
Solution 8 - Python
>>> import pandas as pd
>>> right = pd.DataFrame({'C': [1.002, 2.003], 'D': [1.009, 4.55], 'key': ['K0', 'K1']})
>>> print(right)
C D key
0 1.002 1.009 K0
1 2.003 4.550 K1
>>> right['C'] = right.C.astype(int)
>>> print(right)
C D key
0 1 1.009 K0
1 2 4.550 K1
Solution 9 - Python
Use 'Int64' for NaN support
- astype(int) and astype('int64') cannot handle missing values (numpy int)
- astype('Int64') can handle missing values (pandas int)
df['A'] = df['A'].astype('Int64') # capital I
This assumes you want to keep missing values as missing (they are displayed as <NA>). If you plan to impute them, you could fillna first as Ryan suggested.
Examples of 'Int64' (capital I)
- If the floats are already rounded, just use astype:
df = pd.DataFrame({'A': [99.0, np.nan, 42.0]})
df['A'] = df['A'].astype('Int64')
#       A
# 0    99
# 1  <NA>
# 2    42
- If the floats are not rounded yet, round before astype:
df = pd.DataFrame({'A': [3.14159, np.nan, 1.61803]})
df['A'] = df['A'].round().astype('Int64')
#       A
# 0     3
# 1  <NA>
# 2     2
- To read int+NaN data from a file, use dtype='Int64' to avoid the need for converting at all:
csv = io.StringIO('''
id,rating
foo,5
bar,
baz,2
''')
df = pd.read_csv(csv, dtype={'rating': 'Int64'})
#     id  rating
# 0  foo       5
# 1  bar    <NA>
# 2  baz       2
Notes
- 'Int64' is an alias for Int64Dtype:
df['A'] = df['A'].astype(pd.Int64Dtype()) # same as astype('Int64')
- Sized/signed aliases are available:
alias      lower bound                  upper bound
'Int8'     -128                         127
'Int16'    -32,768                      32,767
'Int32'    -2,147,483,648               2,147,483,647
'Int64'    -9,223,372,036,854,775,808   9,223,372,036,854,775,807
'UInt8'    0                            255
'UInt16'   0                            65,535
'UInt32'   0                            4,294,967,295
'UInt64'   0                            18,446,744,073,709,551,615
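As a quick check of the bounds listed above, the limits can be read from the corresponding NumPy integer types, and a smaller nullable alias can be used the same way as 'Int64'. A minimal sketch with made-up values that fit in 8 bits:

```python
import numpy as np
import pandas as pd

# Bounds of the underlying NumPy integer types.
int8_info = np.iinfo(np.int8)
uint16_info = np.iinfo(np.uint16)
print(int8_info.min, int8_info.max)
print(uint16_info.min, uint16_info.max)

# A smaller nullable dtype works like 'Int64', as long as the values fit.
s = pd.Series([1.0, np.nan, 127.0]).astype("Int8")
print(s.dtype)
```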
Solution 10 - Python
The question explains that the data comes from a CSV, so I think options for making the conversion when the data is read, rather than afterwards, are relevant to the topic.
When importing spreadsheets or CSV files into a dataframe, integer-only columns are commonly converted to float, because Excel stores all numerical values as floats and because of how the underlying libraries work.
When the file is read with read_excel or read_csv, there are a couple of options to avoid converting after the import:
- The dtype parameter accepts a dictionary of column names and target types, like dtype = {"my_column": "Int64"}
- The converters parameter can be used to pass a function that makes the conversion, for example replacing NaNs with 0: converters = {"my_column": lambda x: int(x) if x else 0}
- The convert_float parameter will convert "integral floats to int (i.e., 1.0 -> 1)", but take care with corner cases like NaNs. This parameter is only available in read_excel (it was deprecated in pandas 1.3 and, as far as I know, removed in 2.0).
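The first two options can be sketched with an in-memory CSV. The column names and data below are made up for illustration; with converters, the function receives the raw text of each field, so an empty field arrives as an empty string:

```python
import io

import pandas as pd

# Hypothetical CSV content with one missing value in "my_column".
raw = "my_column,other\n1,a\n,b\n3,c\n"

# dtype: read the column as nullable integers; the empty field becomes <NA>.
with_dtype = pd.read_csv(io.StringIO(raw), dtype={"my_column": "Int64"})

# converters: map each raw field to an int, using 0 for empty fields.
with_conv = pd.read_csv(
    io.StringIO(raw),
    converters={"my_column": lambda x: int(x) if x else 0},
)

print(with_dtype["my_column"].tolist())
print(with_conv["my_column"].tolist())
```

The dtype route keeps the information that a value was missing, while the converters route replaces it, so the choice depends on whether 0 is a meaningful value in the data.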
To make the conversion in an existing dataframe, several alternatives have been given in other answers, but since v1.0.0 pandas has an interesting function for these cases: convert_dtypes, which will "Convert columns to best possible dtypes using dtypes supporting pd.NA."
As an example:
In [3]: import numpy as np
In [4]: import pandas as pd
In [5]: df = pd.DataFrame(
...: {
...: "a": pd.Series([1, 2, 3], dtype=np.dtype("int64")),
...: "b": pd.Series([1.0, 2.0, 3.0], dtype=np.dtype("float")),
...: "c": pd.Series([1.0, np.nan, 3.0]),
...: "d": pd.Series([1, np.nan, 3]),
...: }
...: )
In [6]: df
Out[6]:
a b c d
0 1 1.0 1.0 1.0
1 2 2.0 NaN NaN
2 3 3.0 3.0 3.0
In [7]: df.dtypes
Out[7]:
a int64
b float64
c float64
d float64
dtype: object
In [8]: converted = df.convert_dtypes()
In [9]: converted.dtypes
Out[9]:
a Int64
b Int64
c Int64
d Int64
dtype: object
In [10]: converted
Out[10]:
a b c d
0 1 1 1 1
1 2 2 <NA> <NA>
2 3 3 3 3
Solution 11 - Python
Although there are many options here, you can also convert the dtypes of specific columns using a dictionary:
Data = pd.read_csv('Your_Data.csv')
Data_2 = Data.astype({"Column a":"int32", "Column_b": "float64", "Column_c": "int32"})
print(Data_2.dtypes)  # Check the dtypes of the columns
This is a useful and very fast way to change the dtypes of specific columns for quick data analysis.