Convert columns to string in Pandas

Python Problem Overview

I have the following DataFrame from a SQL query:

(Pdb) pp total_rows
     ColumnID  RespondentCount
0          -1                2
1  3030096843                1
2  3030096845                1

and I want to pivot it like this:

total_data = total_rows.pivot_table(cols=['ColumnID'])

(Pdb) pp total_data
ColumnID         -1            3030096843   3030096845
RespondentCount            2            1            1

[1 rows x 3 columns]


total_rows.pivot_table(cols=['ColumnID']).to_dict('records')[0]

{3030096843: 1, 3030096845: 1, -1: 2}

but I want to make sure the 303 columns are casted as strings instead of integers so that I get this:

{'3030096843': 1, '3030096845': 1, -1: 2}

Python Solutions

Solution 1 - Python

One way to convert to string is to use astype:

total_rows['ColumnID'] = total_rows['ColumnID'].astype(str)

However, perhaps you are looking for the to_json function, which will convert keys to valid json (and therefore your keys to strings):

In [11]: df = pd.DataFrame([['A', 2], ['A', 4], ['B', 6]])

In [12]: df.to_json()
Out[12]: '{"0":{"0":"A","1":"A","2":"B"},"1":{"0":2,"1":4,"2":6}}'

In [13]: df[0].to_json()
Out[13]: '{"0":"A","1":"A","2":"B"}'

Note: you can pass in a buffer/file to save this to, along with some other options...

Solution 2 - Python

If you need to convert ALL columns to strings, you can simply use:

df = df.astype(str)

This is useful if you need everything except a few columns to be strings/objects, then go back and convert the other ones to whatever you need (integer in this case):

 df[["D", "E"]] = df[["D", "E"]].astype(int)

Solution 3 - Python

pandas >= 1.0: It's time to stop using `astype(str)`!

Prior to pandas 1.0 (well, 0.25 actually) this was the defacto way of declaring a Series/column as as string:

# pandas <= 0.25
# Note to pedants: specifying the type is unnecessary since pandas will 
# automagically infer the type as object
s = pd.Series(['a', 'b', 'c'], dtype=str)
s.dtype
# dtype('O')

From pandas 1.0 onwards, consider using "string" type instead.

# pandas >= 1.0
s = pd.Series(['a', 'b', 'c'], dtype="string")
s.dtype
# StringDtype

Here's why, as quoted by the docs:

> 1. You can accidentally store a mixture of strings and non-strings in an object dtype array. It’s better to have a dedicated dtype. > > 2. object dtype breaks dtype-specific operations like DataFrame.select_dtypes(). There isn’t a clear way to select just text > while excluding non-text but still object-dtype columns. > > 3. When reading code, the contents of an object dtype array is less clear than 'string'.

See also the section on Behavioral Differences between "string" and object.

Extension types (introduced in 0.24 and formalized in 1.0) are closer to pandas than numpy, which is good because numpy types are not powerful enough. For example NumPy does not have any way of representing missing data in integer data (since type(NaN) == float). But pandas can using Nullable Integer columns.

Why should I stop using it?

Accidentally mixing dtypes
The first reason, as outlined in the docs is that you can accidentally store non-text data in object columns.

# pandas <= 0.25
pd.Series(['a', 'b', 1.23])   # whoops, this should have been "1.23"

0       a
1       b
2    1.23
dtype: object

pd.Series(['a', 'b', 1.23]).tolist()
# ['a', 'b', 1.23]   # oops, pandas was storing this as float all the time.

# pandas >= 1.0
pd.Series(['a', 'b', 1.23], dtype="string")

0       a
1       b
2    1.23
dtype: string

pd.Series(['a', 'b', 1.23], dtype="string").tolist()
# ['a', 'b', '1.23']   # it's a string and we just averted some potentially nasty bugs.

Challenging to differentiate strings and other python objects
Another obvious example example is that it's harder to distinguish between "strings" and "objects". Objects are essentially the blanket type for any type that does not support vectorizable operations.

Consider,

# Setup
df = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': [{}, [1, 2, 3], 123]})
df
 
   A          B
0  a         {}
1  b  [1, 2, 3]
2  c        123

Upto pandas 0.25, there was virtually no way to distinguish that "A" and "B" do not have the same type of data.

# pandas <= 0.25  
df.dtypes

A    object
B    object
dtype: object

df.select_dtypes(object)

   A          B
0  a         {}
1  b  [1, 2, 3]
2  c        123

From pandas 1.0, this becomes a lot simpler:

# pandas >= 1.0
# Convenience function I call to help illustrate my point.
df = df.convert_dtypes()
df.dtypes

A    string
B    object
dtype: object

df.select_dtypes("string")

   A
0  a
1  b
2  c

Readability
This is self-explanatory ;-)

OK, so should I stop using it right now?

...No. As of writing this answer (version 1.1), there are no performance benefits but the docs expect future enhancements to significantly improve performance and reduce memory usage for "string" columns as opposed to objects. With that said, however, it's never too early to form good habits!

Solution 4 - Python

Here's the other one, particularly useful to convert the multiple columns to string instead of just single column:

In [76]: import numpy as np
In [77]: import pandas as pd
In [78]: df = pd.DataFrame({
    ...:     'A': [20, 30.0, np.nan],
    ...:     'B': ["a45a", "a3", "b1"],
    ...:     'C': [10, 5, np.nan]})
    ...: 

In [79]: df.dtypes ## Current datatype
Out[79]: 
A    float64
B     object
C    float64
dtype: object

## Multiple columns string conversion
In [80]: df[["A", "C"]] = df[["A", "C"]].astype(str) 

In [81]: df.dtypes ## Updated datatype after string conversion
Out[81]: 
A    object
B    object
C    object
dtype: object

Solution 5 - Python

There are four ways to convert columns to string

1. astype(str)
df['column_name'] = df['column_name'].astype(str)

2. values.astype(str)
df['column_name'] = df['column_name'].values.astype(str)

3. map(str)
df['column_name'] = df['column_name'].map(str)

4. apply(str)
df['column_name'] = df['column_name'].apply(str)

Lets see the performance of each type

#importing libraries
import numpy as np
import pandas as pd
import time

#creating four sample dataframes using dummy data
df1 = pd.DataFrame(np.random.randint(1, 1000, size =(10000000, 1)), columns =['A'])
df2 = pd.DataFrame(np.random.randint(1, 1000, size =(10000000, 1)), columns =['A'])
df3 = pd.DataFrame(np.random.randint(1, 1000, size =(10000000, 1)), columns =['A'])
df4 = pd.DataFrame(np.random.randint(1, 1000, size =(10000000, 1)), columns =['A'])

#applying astype(str)
time1 = time.time()
df1['A'] = df1['A'].astype(str)
print('time taken for astype(str) : ' + str(time.time()-time1) + ' seconds')

#applying values.astype(str)
time2 = time.time()
df2['A'] = df2['A'].values.astype(str)
print('time taken for values.astype(str) : ' + str(time.time()-time2) + ' seconds')

#applying map(str)
time3 = time.time()
df3['A'] = df3['A'].map(str)
print('time taken for map(str) : ' + str(time.time()-time3) + ' seconds')

#applying apply(str)
time4 = time.time()
df4['A'] = df4['A'].apply(str)
print('time taken for apply(str) : ' + str(time.time()-time4) + ' seconds')

Output

time taken for astype(str): 5.472359895706177 seconds
time taken for values.astype(str): 6.5844292640686035 seconds
time taken for map(str): 2.3686647415161133 seconds
time taken for apply(str): 2.39758563041687 seconds

map(str) and apply(str) are takes less time compare with remaining two techniques

Solution 6 - Python

I usually use this one:

pd['Column'].map(str)

Solution 7 - Python

Using .apply() with a lambda conversion function also works in this case:

total_rows['ColumnID'] = total_rows['ColumnID'].apply(lambda x: str(x))

For entire dataframes you can use .applymap(). (but in any case probably .astype() is faster)

Content Type	Original Author	Original Content on Stackoverflow
Question	sontek	View Question on Stackoverflow
Solution 1 - Python	Andy Hayden	View Answer on Stackoverflow
Solution 2 - Python	Mike	View Answer on Stackoverflow
Solution 3 - Python	cs95	View Answer on Stackoverflow
Solution 4 - Python	Surya	View Answer on Stackoverflow
Solution 5 - Python	Govinda	View Answer on Stackoverflow
Solution 6 - Python	Feng Bao	View Answer on Stackoverflow
Solution 7 - Python	dbouz	View Answer on Stackoverflow

Convert columns to string in Pandas

Python Problem Overview

Python Solutions

Solution 1 - Python

Solution 2 - Python

Solution 3 - Python

pandas >= 1.0: It's time to stop using `astype(str)`!

Why should I stop using it?

OK, so should I stop using it right now?

Solution 4 - Python

Solution 5 - Python

Solution 6 - Python

Solution 7 - Python

Replace uppercase with lowercase letters

Why do we need to install gulp globally and locally?

Attributions

Python Problem Overview

Python Solutions

Solution 1 - Python

Solution 2 - Python

Solution 3 - Python

pandas >= 1.0: It's time to stop using astype(str)!

Why should I stop using it?

OK, so should I stop using it right now?

Solution 4 - Python

Solution 5 - Python

Solution 6 - Python

Solution 7 - Python

Replace uppercase with lowercase letters

Why do we need to install gulp globally and locally?

Attributions

pandas >= 1.0: It's time to stop using `astype(str)`!