Convert an entire pandas DataFrame to integers (pandas 0.17.0)

Python, Pandas

Python Problem Overview


My question is very similar to this one, but I need to convert my entire dataframe instead of just a series. The to_numeric function only works on one series at a time and is not a good replacement for the deprecated convert_objects command. Is there a way to get similar results to the convert_objects(convert_numeric=True) command in the new pandas release?

Thank you Mike Müller for your example. df.apply(pd.to_numeric) works very well if the values can all be converted to integers. What if in my dataframe I had strings that could not be converted into integers? Example:

df = pd.DataFrame({'ints': ['3', '5'], 'Words': ['Kobe', 'Bryant']})
df.dtypes
Out[59]: 
Words    object
ints     object
dtype: object

Then I could run the deprecated function and get:

df = df.convert_objects(convert_numeric=True)
df.dtypes
Out[60]: 
Words    object
ints      int64
dtype: object

Running the apply command gives me errors, even with try and except handling.

Python Solutions


Solution 1 - Python

All columns convertible

You can apply the function to all columns:

df.apply(pd.to_numeric)

Example:

>>> df = pd.DataFrame({'a': ['1', '2'], 
                       'b': ['45.8', '73.9'],
                       'c': [10.5, 3.7]})

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 3 columns):
a    2 non-null object
b    2 non-null object
c    2 non-null float64
dtypes: float64(1), object(2)
memory usage: 64.0+ bytes

>>> df.apply(pd.to_numeric).info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 3 columns):
a    2 non-null int64
b    2 non-null float64
c    2 non-null float64
dtypes: float64(2), int64(1)
memory usage: 64.0 bytes

Not all columns convertible

pd.to_numeric has the keyword argument errors:

> Signature: pd.to_numeric(arg, errors='raise')
> Docstring: Convert argument to a numeric type.
>
> Parameters
> ----------
> arg : list, tuple or array of objects, or Series
> errors : {'ignore', 'raise', 'coerce'}, default 'raise'
>     - If 'raise', then invalid parsing will raise an exception
>     - If 'coerce', then invalid parsing will be set as NaN
>     - If 'ignore', then invalid parsing will return the input

Setting it to ignore will return the column unchanged if it cannot be converted into a numeric type.
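
To make the difference between the other two settings concrete, here is a minimal sketch contrasting errors='raise' with errors='coerce' on a single Series. The sample values are made up for illustration and are not taken from the question.

```python
import pandas as pd

# Illustrative data: two numeric strings and one word
s = pd.Series(['3', '5', 'Kobe'])

# errors='raise' (the default) fails as soon as a value cannot be parsed
try:
    pd.to_numeric(s)
    raised = False
except ValueError:
    raised = True

# errors='coerce' converts what it can and replaces the rest with NaN,
# which turns the whole result into a float Series
coerced = pd.to_numeric(s, errors='coerce')

print(raised)
print(coerced)
```

Note that coercing changes the data: the word is lost and the integers become floats, so this is usually only appropriate when unparseable values really should be treated as missing.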

As pointed out by Anton Protopopov, the most elegant way is to pass errors='ignore' as a keyword argument to apply(), which forwards it to pd.to_numeric:

>>> df = pd.DataFrame({'ints': ['3', '5'], 'Words': ['Kobe', 'Bryant']})
>>> df.apply(pd.to_numeric, errors='ignore').info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 2 columns):
Words    2 non-null object
ints     2 non-null int64
dtypes: int64(1), object(1)
memory usage: 48.0+ bytes

My previously suggested way, using partial from the functools module, is more verbose:

>>> from functools import partial
>>> df = pd.DataFrame({'ints': ['3', '5'], 
                       'Words': ['Kobe', 'Bryant']})
>>> df.apply(partial(pd.to_numeric, errors='ignore')).info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 2 columns):
Words    2 non-null object
ints     2 non-null int64
dtypes: int64(1), object(1)
memory usage: 48.0+ bytes
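
Recent pandas releases (2.2 and later) deprecate errors='ignore', so the following is a sketch of an equivalent that does not rely on it. The helper name to_numeric_where_possible is hypothetical. The key point is that the try/except sits inside the function applied to each column, so one unconvertible column does not abort the conversion of the others.

```python
import pandas as pd

def to_numeric_where_possible(column):
    """Return the column as a numeric dtype, or unchanged if it cannot be parsed."""
    try:
        return pd.to_numeric(column)
    except (ValueError, TypeError):
        return column

df = pd.DataFrame({'ints': ['3', '5'], 'Words': ['Kobe', 'Bryant']})

# apply() calls the helper once per column and reassembles the results
converted = df.apply(to_numeric_where_possible)
print(converted.dtypes)
```

Wrapping the whole df.apply(pd.to_numeric) call in a single try/except would not give this result, because the first failing column would raise before any converted frame is returned.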

Solution 2 - Python

The accepted answer's pd.to_numeric() converts a column to float whenever its values require it. Read closely, however, the question asks to convert every numeric column to integer, so the accepted answer would still need a loop over all columns at the end to cast the numbers to int.

For completeness, the conversion is also possible without pd.to_numeric(), although this is not recommended:

df = pd.DataFrame({'a': ['1', '2'], 
                   'b': ['45.8', '73.9'],
                   'c': [10.5, 3.7]})

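# Cast to float first so strings like '45.8' parse, then truncate to int
for col in df.columns:
    try:
        df[col] = df[col].astype(float).astype(int)
    except (ValueError, TypeError):
        # Leave columns that cannot be parsed as numbers unchanged
        pass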

print(df.dtypes)

Out:

a    int32
b    int32
c    int32
dtype: object

EDITED: Note that this non-recommended solution is unnecessarily complicated. As a comment pointed out (thank you), pd.to_numeric() accepts the keyword argument downcast='integer', which returns the smallest integer dtype that can hold the values. Values with a fractional part stay float, so those still need an explicit cast to int. The accepted answer does not cover this.
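
As a rough illustration of the downcast keyword, the sketch below reuses the example frame from this answer. It shows that downcast='integer' only produces an integer dtype when no information is lost, and that forcing integers everywhere takes an explicit astype(), which truncates the fractional part. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': ['1', '2'],
                   'b': ['45.8', '73.9'],
                   'c': [10.5, 3.7]})

# 'a' holds whole numbers and is downcast to a small integer dtype;
# 'b' and 'c' have fractional parts and therefore stay float
downcast = df.apply(pd.to_numeric, downcast='integer')
print(downcast.dtypes)

# Forcing integers: convert to numeric first, then cast (this truncates)
as_int = df.apply(pd.to_numeric).astype('int64')
print(as_int)
```

Whether truncation is acceptable depends on the data; rounding first with round() would be an alternative if nearest-integer values are wanted.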

Solution 3 - Python

Use apply() with pd.to_numeric and errors='ignore', then assign the result back to the DataFrame:

df = pd.DataFrame({'ints': ['3', '5'], 'Words': ['Kobe', 'Bryant']})
print("Orig: \n", df.dtypes)

# apply() returns a new DataFrame; without assignment, df is unchanged
df.apply(pd.to_numeric, errors='ignore')
print("\nto_numeric: \n", df.dtypes)

# Assign the result back to keep the converted dtypes
df = df.apply(pd.to_numeric, errors='ignore')
print("\nto_numeric with assign: \n", df.dtypes)

Output:

Orig: 
 ints     object
Words    object
dtype: object

to_numeric: 
 ints     object
Words    object
dtype: object

to_numeric with assign: 
 ints      int64
Words    object
dtype: object

Solution 4 - Python

You can use astype() to convert a Series (a single column) to the desired data type.

For example: my_str_df = pd.DataFrame({'column_name': ['20', '30', '40']})

then: my_int_series = my_str_df['column_name'].astype(int) # this Series now has an integer dtype
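
The same idea extends to a whole DataFrame. As a minimal sketch, DataFrame.astype() also accepts a mapping from column name to dtype, so only the listed columns are cast. The frame below is the one from the question; the choice of 'int64' is an assumption made for illustration.

```python
import pandas as pd

df = pd.DataFrame({'ints': ['3', '5'], 'Words': ['Kobe', 'Bryant']})

# Cast only the listed columns; columns missing from the mapping are left as they are
typed = df.astype({'ints': 'int64'})
print(typed.dtypes)
```

Unlike pd.to_numeric, this requires knowing in advance which columns are numeric, and it raises if a listed column contains a value that cannot be cast.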

Attributions

All content for this solution is sourced from the original question on Stack Overflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type         Original Author                  Original Content on Stack Overflow
Question             Bobe Kryant                      View Question on Stack Overflow
Solution 1 - Python  Mike Müller                      View Answer on Stack Overflow
Solution 2 - Python  questionto42standswithUkraine    View Answer on Stack Overflow
Solution 3 - Python  Alon Lavian                      View Answer on Stack Overflow
Solution 4 - Python  P.R.                             View Answer on Stack Overflow