How to lowercase a pandas dataframe string column if it has missing values?
PythonStringPandasMissing DataPython Problem Overview
The following code does not work.
import pandas as pd
import numpy as np
df=pd.DataFrame(['ONE','Two', np.nan],columns=['x'])
xLower = df["x"].map(lambda x: x.lower())
How should I tweak it to get xLower = ['one','two',np.nan] ? Efficiency is important since the real data frame is huge.
Python Solutions
Solution 1 - Python
use pandas vectorized string methods; as in the documentation:
> these methods exclude missing/NA values automatically
.str.lower()
is the very first example there;
>>> df['x'].str.lower()
0 one
1 two
2 NaN
Name: x, dtype: object
Solution 2 - Python
Another possible solution, in case the column has not only strings but numbers too, is to use astype(str).str.lower()
or to_string(na_rep='')
because otherwise, given that a number is not a string, when lowered it will return NaN
, therefore:
import pandas as pd
import numpy as np
df=pd.DataFrame(['ONE','Two', np.nan,2],columns=['x'])
xSecureLower = df['x'].to_string(na_rep='').lower()
xLower = df['x'].str.lower()
then we have:
>>> xSecureLower
0 one
1 two
2
3 2
Name: x, dtype: object
and not
>>> xLower
0 one
1 two
2 NaN
3 NaN
Name: x, dtype: object
edit:
if you don't want to lose the NaNs, then using map will be better, (from @wojciech-walczak, and @cs95 comment) it will look something like this
xSecureLower = df['x'].map(lambda x: x.lower() if isinstance(x,str) else x)
Solution 3 - Python
you can try this one also,
df= df.applymap(lambda s:s.lower() if type(s) == str else s)
Solution 4 - Python
str.casefold
Pandas >= 0.25: Remove Case Distinctions with Starting from v0.25, I recommend using the "vectorized" string method str.casefold
if you're dealing with unicode data (it works regardless of string or unicodes):
s = pd.Series(['lower', 'CAPITALS', np.nan, 'SwApCaSe'])
s.str.casefold()
0 lower
1 capitals
2 NaN
3 swapcase
dtype: object
Also see related GitHub issue GH25405.
casefold
lends itself to more aggressive case-folding comparison. It also handles NaNs gracefully (just as str.lower
does).
But why is this better?
The difference is seen with unicodes. Taking the example in the python str.casefold
docs,
> Casefolding is similar to lowercasing but more aggressive because it
> is intended to remove all case distinctions in a string. For example,
> the German lowercase letter 'ß'
is equivalent to "ss"
. Since it is
> already lowercase, lower()
would do nothing to 'ß'
; casefold()
> converts it to "ss"
.
Compare the output of lower
for,
s = pd.Series(["der Fluß"])
s.str.lower()
0 der fluß
dtype: object
Versus casefold
,
s.str.casefold()
0 der fluss
dtype: object
Solution 5 - Python
Apply lambda function
df['original_category'] = df['original_category'].apply(lambda x:x.lower())
Solution 6 - Python
A possible solution:
import pandas as pd
import numpy as np
df=pd.DataFrame(['ONE','Two', np.nan],columns=['x'])
xLower = df["x"].map(lambda x: x if type(x)!=str else x.lower())
print (xLower)
And a result:
0 one
1 two
2 NaN
Name: x, dtype: object
Not sure about the efficiency though.
Solution 7 - Python
May be using List comprehension
import pandas as pd
import numpy as np
df=pd.DataFrame(['ONE','Two', np.nan],columns=['Name']})
df['Name'] = [str(i).lower() for i in df['Name']]
print(df)
Solution 8 - Python
copy your Dataframe column and simply apply
df=data['x']
newdf=df.str.lower()
Solution 9 - Python
Use apply function,
Xlower = df['x'].apply(lambda x: x.upper()).head(10)