Strip / trim all strings of a dataframe

PythonRegexPandasDataframeTrim

Python Problem Overview


Cleaning the values of a multitype data frame in python/pandas, I want to trim the strings. I am currently doing it in two instructions :

import pandas as pd

df = pd.DataFrame([['  a  ', 10], ['  c  ', 5]])

df.replace('^\s+', '', regex=True, inplace=True) #front
df.replace('\s+$', '', regex=True, inplace=True) #end

df.values

This is quite slow, what could I improve ?

Python Solutions


Solution 1 - Python

You can use DataFrame.select_dtypes to select string columns and then apply function str.strip.

Notice: Values cannot be types like dicts or lists, because their dtypes is object.

df_obj = df.select_dtypes(['object'])
print (df_obj)
0    a  
1    c  

df[df_obj.columns] = df_obj.apply(lambda x: x.str.strip())
print (df)

   0   1
0  a  10
1  c   5

But if there are only a few columns use str.strip:

df[0] = df[0].str.strip()

Solution 2 - Python

Money Shot

Here's a compact version of using applymap with a straightforward lambda expression to call strip only when the value is of a string type:

df.applymap(lambda x: x.strip() if isinstance(x, str) else x)

Full Example

A more complete example:

import pandas as pd


def trim_all_columns(df):
    """
    Trim whitespace from ends of each value across all series in dataframe
    """
    trim_strings = lambda x: x.strip() if isinstance(x, str) else x
    return df.applymap(trim_strings)


# simple example of trimming whitespace from data elements
df = pd.DataFrame([['  a  ', 10], ['  c  ', 5]])
df = trim_all_columns(df)
print(df)


>>>
   0   1
0  a  10
1  c   5

Working Example

Here's a working example hosted by trinket: https://trinket.io/python3/e6ab7fb4ab

Solution 3 - Python

If you really want to use regex, then

>>> df.replace('(^\s+|\s+$)', '', regex=True, inplace=True)
>>> df
   0   1
0  a  10
1  c   5

But it should be faster to do it like this:

>>> df[0] = df[0].str.strip()

Solution 4 - Python

You can try:

df[0] = df[0].str.strip()

or more specifically for all string columns

non_numeric_columns = list(set(df.columns)-set(df._get_numeric_data().columns))
df[non_numeric_columns] = df[non_numeric_columns].apply(lambda x : str(x).strip())

Solution 5 - Python

You can use the apply function of the Series object:

>>> df = pd.DataFrame([['  a  ', 10], ['  c  ', 5]])
>>> df[0][0]
'  a  '
>>> df[0] = df[0].apply(lambda x: x.strip())
>>> df[0][0]
'a'

> Note the usage of strip and not the regex which is much faster

Another option - use the apply function of the DataFrame object:

>>> df = pd.DataFrame([['  a  ', 10], ['  c  ', 5]])
>>> df.apply(lambda x: x.apply(lambda y: y.strip() if type(y) == type('') else y), axis=0)

   0   1
0  a  10
1  c   5

Solution 6 - Python

Strip alone does not remove the inner extra spaces in a string. The workaround to this is to first replace one or more spaces with a single space. This ensures that we remove extra inner spaces and outer spaces.

# Import packages
import re 

# First inspect the dtypes of the dataframe
df.dtypes

# First replace one or more spaces with a single space. This ensures that we remove extra inner spaces and outer spaces.
df = df.applymap(lambda x: re.sub('\s+', ' ', x) if isinstance(x, str) else x)


# Then strip leading and trailing white spaces
df = df.apply(lambda x: x.str.strip() if isinstance(x, object) else x)

Solution 7 - Python

how about (for string columns)

df[col] = df[col].str.replace(" ","")

never fails

Solution 8 - Python

@jezrael answer is looking good. But if you want to get back the other (numeric/integer etc) columns as well in the final result set then you suppose need to merge back with original DataFrame.

If it is the case then you may use this approach,

df = df.apply(lambda x: x.str.strip() if x.dtype.name == 'object' else x, axis=0)

Thanks!

Solution 9 - Python

def trim(x):
    if x.dtype == object:
        x = x.str.split(' ').str[0]
    return(x)

df = df.apply(trim)

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionmxdbldView Question on Stackoverflow
Solution 1 - PythonjezraelView Answer on Stackoverflow
Solution 2 - PythonJonathan B.View Answer on Stackoverflow
Solution 3 - PythonRoman PekarView Answer on Stackoverflow
Solution 4 - PythonAakash MakwanaView Answer on Stackoverflow
Solution 5 - PythonDekelView Answer on Stackoverflow
Solution 6 - PythonJane KathambiView Answer on Stackoverflow
Solution 7 - PythonTalisView Answer on Stackoverflow
Solution 8 - PythonJai KView Answer on Stackoverflow
Solution 9 - Pythonhyunwoo jeongView Answer on Stackoverflow