Pandas dataframe read_csv on bad data

Python · Csv · Pandas

Python Problem Overview


I want to read in a very large CSV (it cannot be opened in Excel and edited easily), but somewhere around the 100,000th row there is a row with one extra column, causing the program to crash. That row is malformed, so I need a way to ignore the fact that it has an extra column. There are around 50 columns, so hardcoding the headers and using names or usecols isn't preferable. I'll also possibly encounter this issue in other CSVs and want a generic solution. I couldn't find anything in read_csv, unfortunately. The code is as simple as this:

import pandas as pd

def loadCSV(filePath):
    dataframe = pd.read_csv(filePath, index_col=False, encoding='iso-8859-1', nrows=1000)
    datakeys = dataframe.keys()
    return dataframe, datakeys
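
For reference, here is a minimal sketch of what the crash looks like, using a tiny inline CSV (not the actual file) to mimic the extra-column row; the exact message text varies between pandas versions:

import io
import pandas as pd

# Row 3 has one extra field, mimicking the bad row in the large file
bad_csv = io.StringIO("a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n")

try:
    pd.read_csv(bad_csv, index_col=False)
except pd.errors.ParserError as exc:
    # e.g. "Error tokenizing data. C error: Expected 3 fields in line 3, saw 4"
    print(exc)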

Python Solutions


Solution 1 - Python

Pass error_bad_lines=False to skip erroneous rows. From the pandas docs:

> error_bad_lines : boolean, default True
> Lines with too many fields (e.g. a csv line with too many commas) will by default cause an exception to be raised, and no DataFrame will be returned. If False, then these "bad lines" will be dropped from the DataFrame that is returned. (Only valid with C parser)
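
As a minimal sketch, reusing the read arguments from the question's loadCSV (note this is the pre-1.3 pandas API; see Solution 3 for the modern replacement):

# Bad rows are dropped instead of raising an exception
dataframe = pd.read_csv(filePath, index_col=False, encoding='iso-8859-1',
                        nrows=1000, error_bad_lines=False)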

Solution 2 - Python

To get information about the rows causing errors, try a combination of error_bad_lines=False and warn_bad_lines=True:

dataframe = pd.read_csv(filePath, index_col=False, encoding='iso-8859-1', nrows=1000,
                        warn_bad_lines=True, error_bad_lines=False)

error_bad_lines=False skips the error-causing rows, and warn_bad_lines=True prints error details and the row number, like this:

'Skipping line 3: expected 4 fields, saw 3401\nSkipping line 4: expected 4 fields, saw 30...'

If you want to save the warning messages (e.g. for some further processing), you can redirect them to a file with contextlib:

import contextlib

# pandas prints the "Skipping line ..." messages to stderr,
# so redirecting stderr captures them in the log file
with open(r'D:\Temp\log.txt', 'w') as log:
    with contextlib.redirect_stderr(log):
        dataframe = pd.read_csv(filePath, index_col=False, encoding='iso-8859-1',
                                warn_bad_lines=True, error_bad_lines=False)
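
If you later need the skipped row numbers programmatically, one possible follow-up (assuming the "Skipping line N: ..." format shown above) is to parse them back out of the log:

import re

# Collect the line numbers pandas reported as skipped
with open(r'D:\Temp\log.txt') as log:
    skipped = [int(n) for n in re.findall(r'Skipping line (\d+)', log.read())]

print(skipped)  # e.g. [3, 4]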

Solution 3 - Python

For anyone who, like me, ran across this years after the original question was posted: the other answers suggest using error_bad_lines=False and warn_bad_lines=True, but both parameters were deprecated in pandas 1.3 and have since been removed in 2.0.

Instead, use on_bad_lines='warn' to achieve the same effect and skip over bad data lines:

dataframe = pd.read_csv(filePath, index_col=False, encoding='iso-8859-1', nrows=1000, on_bad_lines='warn')

on_bad_lines='warn' will emit a warning when a bad line is encountered and skip that line.


Other acceptable values for on_bad_lines are:

  • 'error' (the default), which raises an exception on a bad line

  • 'skip', which silently skips any bad lines (sketched below)
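
For instance, a minimal sketch of the silent variant, reusing the question's read arguments:

# Bad lines are dropped with no warning output
dataframe = pd.read_csv(filePath, index_col=False, encoding='iso-8859-1',
                        nrows=1000, on_bad_lines='skip')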

Solution 4 - Python

New with 1.4.0

Starting with pandas 1.4.0, read_csv() lets you handle these situations more gracefully and intelligently by accepting a callable for on_bad_lines=.

For example, assume a CSV that would cause the bad-data error Expected 4 fields in line 3, saw 5:

C1,C2,C3,C4
10,11,12,13
25,26,27,28,garbage
80,81,82,83

This lambda simply ignores the last field of the bad line (as was desired in the original problem statement above):

df = pd.read_csv('your.csv', on_bad_lines=lambda x: x[:-1], engine='python')
df

   C1  C2  C3  C4
0  10  11  12  13
1  25  26  27  28
2  80  81  82  83

The on_bad_lines callable is invoked once per bad line and has the signature (bad_line: list[str]) -> list[str] | None. If the function returns None, the bad line is ignored. As you can see, engine='python' is required to use a callable.

The great thing about this is that it opens the door for whatever fine-grained logic you want to write to fix the problem.

For example, say you'd like to remove bad data from the start or the end of a line, and simply ignore the line if there is bad data at both the start and the end. You could do:

CSV:

C1,C2,C3,C4
10,11,12,13
20,21,22,23,garbage
60,61,62,63
trash,80,81,82,83
trash,90,91,82,garbage

Function:

def line_fixer(x):
    # Bad field at the start only: drop the first field
    if not x[0].isnumeric() and x[-1].isnumeric():
        return x[1:]

    # Bad field at the end only: drop the last field
    if not x[-1].isnumeric() and x[0].isnumeric():
        return x[:-1]

    # Bad data at both ends: returning None drops the whole line
    return None

Result:

df = pd.read_csv('your.csv', on_bad_lines=line_fixer, engine='python')
df

   C1  C2  C3  C4
0  10  11  12  13
1  20  21  22  23
2  60  61  62  63
3  80  81  82  83

Solution 5 - Python

Here is my way to solve this problem. It is slow but works well: simply read the CSV file as a text file and go through it line by line. If a line has fewer commas than it should, skip that row; finally, save the correct lines to a new file.

import itertools

def bad_lines(path):
    num_columns = []
    lines = ""

    # Sample the comma count from the first few lines, skipping any
    # line containing quotes (quoted fields may hold commas)
    with open(path) as f:
        for content in itertools.islice(f, 10):
            if (content.count("'") == 0) and (content.count('"') == 0):
                num_columns.append(content.count(","))

    # Only proceed if the sampled lines agree on the column count
    if len(set(num_columns)) == 1:
        with open(path) as f:
            for line in f:
                # Keep lines with at least the expected number of commas
                if line.count(",") >= num_columns[0]:
                    lines = lines + line

    with open("temp.txt", "w") as text_file:
        text_file.write(lines)

    return "temp.txt"

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Fonti | View Question on Stackoverflow
Solution 1 - Python | EdChum | View Answer on Stackoverflow
Solution 2 - Python | Lukas | View Answer on Stackoverflow
Solution 3 - Python | Torsten Knabe | View Answer on Stackoverflow
Solution 4 - Python | jch | View Answer on Stackoverflow
Solution 5 - Python | Sway Wu | View Answer on Stackoverflow