Pythonic/efficient way to strip whitespace from every Pandas DataFrame cell that has a string-like object in it

Python, Pandas, Dataframe

Python Problem Overview


I'm reading a CSV file into a DataFrame. I need to strip whitespace from all the string-like cells, leaving the other cells unchanged, in Python 2.7.

Here is what I'm doing:

def remove_whitespace( x ):
    if isinstance( x, basestring ):
        return x.strip()
    else:
        return x

my_data = my_data.applymap( remove_whitespace )

Is there a better or more Pandas-idiomatic way to do this?

Is there a more efficient way (perhaps by doing things column-wise)?

I've tried searching for a definitive answer, but most questions on this topic seem to be about how to strip whitespace from the column names themselves, or they presume that all the cells are strings.

Python Solutions


Solution 1 - Python

Stumbled onto this question while looking for a quick and minimalistic snippet I could use. Had to assemble one myself from posts above. Maybe someone will find it useful:

data_frame_trimmed = data_frame.apply(lambda x: x.str.strip() if x.dtype == "object" else x)
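
A minimal usage sketch (the DataFrame and column names here are made up for illustration): object-dtype columns are stripped, numeric columns pass through unchanged.

import pandas as pd

df = pd.DataFrame({'name': ['  alice ', 'bob  ', '  carol'],
                   'score': [1, 2, 3]})

trimmed = df.apply(lambda x: x.str.strip() if x.dtype == "object" else x)
# 'name' values are stripped; the integer 'score' column is left untouched

Note that if an object column mixes strings with non-strings, x.str.strip() turns the non-string values into NaN; Solution 3 below handles that case.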

Solution 2 - Python

You could use pandas' Series.str.strip() method to do this quickly for each string-like column:

>>> data = pd.DataFrame({'values': ['   ABC   ', '   DEF', '  GHI  ']})
>>> data
      values
0     ABC   
1        DEF
2      GHI  

>>> data['values'].str.strip()
0    ABC
1    DEF
2    GHI
Name: values, dtype: object
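
Note that str.strip() returns a new Series rather than modifying the DataFrame in place; assign the result back to keep it:

>>> data['values'] = data['values'].str.strip()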

Solution 3 - Python

We want to:

  1. Apply our function to each element in our dataframe - use applymap.

  2. Use type(x)==str (versus x.dtype == 'object') because Pandas will label columns as object for columns of mixed datatypes (an object column may contain int and/or str).

  3. Maintain the datatype of each element (we don't want to convert everything to a str and then strip whitespace).

Therefore, I've found the following to be the easiest:

df.applymap(lambda x: x.strip() if type(x)==str else x)
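
For example, a small sketch with a hypothetical mixed-type column (ints, floats, and strings in one object column): only the strings are stripped, and the other values keep their original types.

import pandas as pd

df = pd.DataFrame({'mixed': ['  abc  ', 42, '  def ', 3.5]})

cleaned = df.applymap(lambda x: x.strip() if type(x) == str else x)
# 'abc' and 'def' are stripped; 42 stays an int and 3.5 stays a float
# (on pandas >= 2.1, DataFrame.map is the non-deprecated spelling of applymap)
print(cleaned)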

Solution 4 - Python

When you call pandas.read_csv, you can use a regular expression that matches zero or more spaces followed by a comma followed by zero or more spaces as the delimiter.

For example, here's "data.csv":

In [19]: !cat data.csv
1.5, aaa,  bbb ,  ddd     , 10 ,  XXX   
2.5, eee, fff  ,       ggg, 20 ,     YYY

(The first line ends with three spaces after XXX, while the second line ends at the last Y.)

The following uses pandas.read_csv() to read the file, with the regular expression ' *, *' as the delimiter. (Using a regular expression as the delimiter is only available in the "python" engine of read_csv().)

In [20]: import pandas as pd

In [21]: df = pd.read_csv('data.csv', header=None, delimiter=' *, *', engine='python')

In [22]: df
Out[22]: 
     0    1    2    3   4    5
0  1.5  aaa  bbb  ddd  10  XXX
1  2.5  eee  fff  ggg  20  YYY
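
As a quick sanity check (a sketch against the df parsed above), the numeric fields come back as real numeric dtypes rather than space-padded strings; the output should look something like this:

In [23]: df.dtypes
Out[23]: 
0    float64
1     object
2     object
3     object
4      int64
5     object
dtype: object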

Solution 5 - Python

The "data['values'].str.strip()" answer above did not work for me, but I found a simple work around. I am sure there is a better way to do this. The str.strip() function works on Series. Thus, I converted the dataframe column into a Series, stripped the whitespace, replaced the converted column back into the dataframe. Below is the example code.

import pandas as pd

data = pd.DataFrame({'values': ['   ABC   ', '   DEF', '  GHI  ']})
print('-----')
print(data)

# Calling str.strip() alone returns a new Series and leaves `data` unchanged
data['values'].str.strip()
print('-----')
print(data)

# Assign the stripped Series back into the DataFrame column
new = data['values'].str.strip()
data['values'] = new
print('-----')
print(new)

Solution 6 - Python

Here is a column-wise solution with pandas apply:

import numpy as np

def strip_obj(col):
    # only object-dtype (string-like) columns are touched
    if col.dtype == object:
        return (col.astype(str)
                   .str.strip()
                   .replace({'nan': np.nan}))
    return col

df = df.apply(strip_obj, axis=0)

This will convert the values in object-type columns to strings, so take care with mixed-type columns: for example, if a zip-code column contains the integer 20001 and the string ' 21110 ', both end up as the strings '20001' and '21110'.
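
A quick sketch of that caveat, reusing strip_obj from above on made-up data:

import pandas as pd

zips = pd.DataFrame({'zip': [20001, ' 21110 ']})  # mixed int and str -> object dtype
print(zips.apply(strip_obj, axis=0))
# both values are now strings: '20001' and '21110'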

Solution 7 - Python

This worked for me - it applies the strip to the whole DataFrame:

import pandas as pd

def panda_strip(x):
    r = []
    for y in x:
        if isinstance(y, str):
            y = y.strip()
        r.append(y)
    # keep the original index so the result stays aligned with the DataFrame
    return pd.Series(r, index=x.index)

df = df.apply(panda_strip)

Solution 8 - Python

I found the following code useful and something that would likely help others. This snippet removes all whitespace, including spaces inside the string, and can be applied to a single column or to the entire DataFrame, depending on your use case.

import pandas as pd

def remove_whitespace(x):
    try:
        # remove spaces inside and outside of the string
        x = "".join(x.split())
    except AttributeError:
        # non-string values (ints, floats, NaN) are left unchanged
        pass
    return x

# Apply remove_whitespace to column only
df.orderId = df.orderId.apply(remove_whitespace)
print(df)


# Apply remove_whitespace to the entire DataFrame
df = df.applymap(remove_whitespace)
print(df)
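
Be aware that "".join(x.split()) removes internal whitespace as well as leading and trailing whitespace, which is stronger than str.strip(). A quick illustrative sketch:

print(remove_whitespace('  New  York  '))  # 'NewYork', not 'New York'
print(remove_whitespace(42))               # non-strings pass through unchanged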

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | deadcode | View Question on Stackoverflow
Solution 1 - Python | Adam Owczarczyk | View Answer on Stackoverflow
Solution 2 - Python | jakevdp | View Answer on Stackoverflow
Solution 3 - Python | Michael Silverstein | View Answer on Stackoverflow
Solution 4 - Python | Warren Weckesser | View Answer on Stackoverflow
Solution 5 - Python | S. Herron | View Answer on Stackoverflow
Solution 6 - Python | Blake | View Answer on Stackoverflow
Solution 7 - Python | Saul Frank | View Answer on Stackoverflow
Solution 8 - Python | FunnyChef | View Answer on Stackoverflow