How can I remove extra whitespace from strings when parsing a csv file in Pandas?

Python, Parsing, Pandas

Python Problem Overview


I have the following file named 'data.csv':

    1997,Ford,E350
    1997, Ford , E350
    1997,Ford,E350,"Super, luxurious truck"
    1997,Ford,E350,"Super ""luxurious"" truck"
    1997,Ford,E350," Super luxurious truck "
    "1997",Ford,E350
    1997,Ford,E350
    2000,Mercury,Cougar

And I would like to parse it into a pandas DataFrame so that the DataFrame looks as follows:

       Year     Make   Model              Description
    0  1997     Ford    E350                     None
    1  1997     Ford    E350                     None
    2  1997     Ford    E350   Super, luxurious truck
    3  1997     Ford    E350  Super "luxurious" truck
    4  1997     Ford    E350    Super luxurious truck
    5  1997     Ford    E350                     None
    6  1997     Ford    E350                     None
    7  2000  Mercury  Cougar                     None

The best I could do was:

    pd.read_table("data.csv", sep=r',', names=["Year", "Make", "Model", "Description"])

Which gets me:

       Year     Make   Model              Description
    0  1997     Ford    E350                     None
    1  1997    Ford     E350                     None
    2  1997     Ford    E350   Super, luxurious truck
    3  1997     Ford    E350  Super "luxurious" truck
    4  1997     Ford    E350   Super luxurious truck 
    5  1997     Ford    E350                     None
    6  1997     Ford    E350                     None
    7  2000  Mercury  Cougar                     None

How can I get the DataFrame without those whitespaces?

Python Solutions


Solution 1 - Python

You could use converters:

import pandas as pd

def strip(text):
    try:
        return text.strip()
    except AttributeError:
        return text

def make_int(text):
    return int(text.strip('" '))
    
table = pd.read_table("data.csv", sep=r',',
                      names=["Year", "Make", "Model", "Description"],
                      converters = {'Description' : strip,
                                    'Model' : strip,
                                    'Make' : strip,
                                    'Year' : make_int})
print(table)

yields

   Year     Make   Model              Description
0  1997     Ford    E350                     None
1  1997     Ford    E350                     None
2  1997     Ford    E350   Super, luxurious truck
3  1997     Ford    E350  Super "luxurious" truck
4  1997     Ford    E350    Super luxurious truck
5  1997     Ford    E350                     None
6  1997     Ford    E350                     None
7  2000  Mercury  Cougar                     None

Solution 2 - Python

Adding the parameter skipinitialspace=True to read_table worked for me.

So try:

pd.read_table("data.csv", 
              sep=r',', 
              names=["Year", "Make", "Model", "Description"], 
              skipinitialspace=True)

Same thing works in pd.read_csv().
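
One caveat worth checking: skipinitialspace only drops whitespace that directly follows a delimiter, so trailing spaces (as in "Ford ") remain. A minimal sketch, using a small in-memory sample modeled on the question's file (the sample string is an assumption for illustration) and following up with Series.str.strip():

```python
import io
import pandas as pd

# Two rows modeled on the question's file, held in memory for illustration.
sample = "1997, Ford , E350\n2000,Mercury,Cougar\n"

df = pd.read_csv(io.StringIO(sample),
                 names=["Year", "Make", "Model"],
                 skipinitialspace=True)

# Spaces right after each comma are skipped; trailing spaces are not.
makes_after_read = df["Make"].tolist()

# A follow-up strip removes whatever whitespace is left.
df["Make"] = df["Make"].str.strip()
print(df)
```

If the file also has spaces inside quoted fields, those are part of the data and need the same follow-up strip.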

Solution 3 - Python

Well, the whitespace is in your data, so you can't read in the data without reading in the whitespace. However, after you've read it in, you could strip out the whitespace by doing, e.g., df["Make"] = df["Make"].map(str.strip) (where df is your dataframe).

Solution 4 - Python

I don't have enough reputation to leave a comment, but the answer above that suggests using map along with str.strip won't work if the column contains NaN values: str.strip only accepts strings, and NaN is a float.

There is a pandas function to do this, which I used: pd.core.strings.str_strip(df['Description']), where df is your dataframe. Note that it lives under pandas.core, which is an internal namespace rather than public API. In my case I used it on a dataframe with ~1.2 million rows and it was very fast.
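
The public way to get the same effect is the .str accessor, which leaves missing values alone instead of raising. A short sketch with made-up data (the column contents are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# A column mixing padded strings with a missing value.
df = pd.DataFrame({"Description": [" Super luxurious truck ", "plain", np.nan]})

# Series.str.strip() strips each string and keeps missing values missing.
df["Description"] = df["Description"].str.strip()
print(df)
```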

Solution 5 - Python

I don't believe pandas supported this at the time this question was posted, but the most straightforward way to do this is to use a regex in the sep parameter of read_csv. Something like the following should work for this issue:

table = pd.read_table("data.csv", sep=' *, *')

Solution 6 - Python

read_table was marked as deprecated in pandas 0.24.0 (the deprecation was reversed in later releases). Here is the message as it appeared in the documentation:

> Deprecated since version 0.24.0.
>
> Use pandas.read_csv() instead, passing sep='\t' if necessary.

So using read_csv you can pass in a regex for the sep argument, where you can specify the separator as

sep=r"\s*,\s*"

This matches any number of whitespace characters, followed by the separator, followed by any number of whitespace characters again. Because the surrounding whitespace becomes part of the delimiter, it is removed from the values on either side.

Regex details:

\s -> white-space
* -> any number (zero or many)
, -> no special meaning; matches a literal comma

So the regular expression \s*,\s* matches any amount of whitespace, then a comma, then any amount of whitespace.

If your delimiter is anything other than a comma, replace the , in the above expression with your delimiter, e.g. \s*;\s* if ; is your delimiter.
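
A small sketch of the regex separator on unquoted sample data (the sample string is an assumption for illustration). Passing engine="python" explicitly avoids the fallback warning pandas emits for regex separators. The read_csv documentation also notes that regex delimiters are prone to ignoring quoted data, so fields such as "Super, luxurious truck" may not survive this approach.

```python
import io
import pandas as pd

# Unquoted rows with stray spaces around the commas.
sample = "1997, Ford , E350\n2000 ,Mercury,  Cougar\n"

df = pd.read_csv(io.StringIO(sample),
                 sep=r"\s*,\s*",      # whitespace around the comma is part of the delimiter
                 engine="python",     # regex separators need the Python parsing engine
                 names=["Year", "Make", "Model"])
print(df)
```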

Solution 7 - Python

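Here's a function that iterates through each column and strips it with the .str accessor (the public counterpart of pd.core.strings.str_strip), and also strips the column names:

import pandas as pd

def df_strip(df):
  # Work on a copy so the caller's DataFrame is left unchanged
  df = df.copy()
  for c in df.columns:
    # Strings are stored with the object dtype; np.object was removed in NumPy 1.24
    if df[c].dtype == object:
      # .str.strip() leaves NaN values as NaN
      df[c] = df[c].str.strip()
  # Strip whitespace from string column labels as well
  return df.rename(columns=lambda c: c.strip() if isinstance(c, str) else c)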

Solution 8 - Python

The str.strip() method works really well on a Series. So I take the dataframe column that contains the whitespace, strip it using str.strip(), and assign the result back to the dataframe. Below is the example code.

import pandas as pd
data = pd.DataFrame({'values': ['   ABC   ', '   DEF', '  GHI  ']})
# data['values'] is already a Series, so .str.strip() can be applied directly
data['values'] = data['values'].str.strip()

Solution 9 - Python

For me the best way was

import numpy as np
import pandas as pd

def read_csv_regex(data, date_columns=()):
    # A tuple default avoids sharing one mutable list between calls
    df = pd.read_csv(data, quotechar='"', parse_dates=list(date_columns))

    # remove leading and trailing whitespace
    df = df.replace({r"^\s*|\s*$": ""}, regex=True)

    # if only an empty string "" remains, change it to NaN
    df = df.replace({"": np.nan})
    return df

You don't need to write a converter function and assign it to every column. It handles leading and trailing spaces and has no problem with quotes, unlike a regex sep.

See https://towardsdatascience.com/dealing-with-extra-white-spaces-while-reading-csv-in-pandas-67b0c2b71e6a#9281

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | mpjan | View Question on Stackoverflow
Solution 1 - Python | unutbu | View Answer on Stackoverflow
Solution 2 - Python | TheGrimmScientist | View Answer on Stackoverflow
Solution 3 - Python | BrenBarn | View Answer on Stackoverflow
Solution 4 - Python | RKD314 | View Answer on Stackoverflow
Solution 5 - Python | Hunter Jackson | View Answer on Stackoverflow
Solution 6 - Python | Rajshekar Reddy | View Answer on Stackoverflow
Solution 7 - Python | J Wang | View Answer on Stackoverflow
Solution 8 - Python | S. Herron | View Answer on Stackoverflow
Solution 9 - Python | ninabel | View Answer on Stackoverflow