Python Pandas Error tokenizing data

Python, Csv, Pandas

Python Problem Overview


I'm trying to use pandas to manipulate a .csv file but I get this error:

>pandas.parser.CParserError: Error tokenizing data. C error: Expected 2 fields in line 3, saw 12

I have tried to read the pandas docs, but found nothing.

My code is simple:

path = 'GOOG Key Ratios.csv'
#print(open(path).read())
data = pd.read_csv(path)

How can I resolve this? Should I use the csv module or another language?

The file is from Morningstar.

Python Solutions


Solution 1 - Python

You could also try:

data = pd.read_csv('file1.csv', on_bad_lines='skip')

Do note that this will cause the offending lines to be skipped.

Edit

For pandas < 1.3.0, try

data = pd.read_csv("file1.csv", error_bad_lines=False)

as per pandas API reference.
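If the pandas version isn't known in advance, a small sketch (an assumption on my part, not from the original answer) is to try the newer keyword first and fall back to the old one:

try:
    # pandas >= 1.3
    data = pd.read_csv('file1.csv', on_bad_lines='skip')
except TypeError:
    # older pandas rejects the unknown on_bad_lines keyword
    data = pd.read_csv('file1.csv', error_bad_lines=False)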

Solution 2 - Python

It might be an issue with

  • the delimiters in your data
  • the first row, as @TomAugspurger noted

To solve it, try specifying the sep and/or header arguments when calling read_csv. For instance,

df = pandas.read_csv(filepath, sep='delimiter', header=None)

In the code above, sep defines your delimiter and header=None tells pandas that your source data has no row for headers / column titles. Thus saith the docs: "If file contains no header row, then you should explicitly pass header=None". In this instance, pandas automatically creates whole-number indices for each field {0,1,2,...}.

According to the docs, the delimiter should not be an issue. The docs say that "if sep is None [not specified], will try to automatically determine this." I, however, have not had good luck with this, including in instances with obvious delimiters.

Another option is to auto-detect the delimiter with csv.Sniffer:

import csv
import pandas as pd

with open(filepath, newline='') as csv_file:
    # use the first 2 lines of the file to detect the separator
    temp_lines = csv_file.readline() + '\n' + csv_file.readline()
    dialect = csv.Sniffer().sniff(temp_lines, delimiters=';,')

    # remember to go back to the start of the file before reading it again
    csv_file.seek(0)
    df = pd.read_csv(csv_file, sep=dialect.delimiter)

Solution 3 - Python

The parser is getting confused by the header of the file. It reads the first row and infers the number of columns from that row. But the first two rows aren't representative of the actual data in the file.

Try it with data = pd.read_csv(path, skiprows=2)
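If you're not sure how many leading rows to skip, a quick hedged check (not from the original answer) is to print the first few raw lines and see where the real data starts:

with open(path) as f:
    for _ in range(5):
        print(repr(f.readline()))  # repr makes delimiters and field counts visible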

Solution 4 - Python

This is most likely a delimiter issue. Many CSV files are created with a tab separator, so try calling read_csv with the tab character (\t) as the separator:

data = pd.read_csv("File_path", sep='\t')

Solution 5 - Python

I had this problem, where I was trying to read in a CSV without passing in column names.

df = pd.read_csv(filename, header=None)

I specified the column names in a list beforehand and then passed them into names, and that solved it immediately. If you don't have set column names, you could just create as many placeholder names as the maximum number of columns that might be in your data.

col_names = ["col1", "col2", "col3", ...]
df = pd.read_csv(filename, names=col_names)

Solution 6 - Python

Your CSV file might have a variable number of columns, and read_csv inferred the number of columns from the first few rows. Two ways to solve it in this case:

  1. Change the CSV file to have a dummy first line with max number of columns (and specify header=[0])

  2. Or use names = list(range(0,N)) where N is the max number of columns; see the sketch below.
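For instance, a minimal sketch of the second option, assuming the widest row in the file has 10 fields (take N from the "saw N" part of your error message):

import pandas as pd

N = 10  # assumed maximum number of columns
df = pd.read_csv('file1.csv', header=None, names=list(range(0, N)))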

Solution 7 - Python

I had this problem as well but perhaps for a different reason. I had some trailing commas in my CSV that were adding an additional column that pandas was attempting to read. Using the following works but it simply ignores the bad lines:

data = pd.read_csv('file1.csv', error_bad_lines=False)  # on_bad_lines='skip' in pandas >= 1.3

If you want to keep the lines, an ugly kind of hack for handling the errors is to do something like the following:

line     = []
expected = []
saw      = []
cont     = True

while cont == True:
    try:
        data = pd.read_csv('file1.csv', skiprows=line)
        cont = False
    except Exception as e:
        errortype = str(e).split('.')[0].strip()
        if errortype == 'Error tokenizing data':
            # e.g. "Expected 2 fields in line 3, saw 12" -> nums == ['2', '3', '12']
            cerror = str(e).split(':')[1].strip().replace(',', '')
            nums   = [n for n in cerror.split(' ') if n.isdigit()]
            expected.append(int(nums[0]))
            saw.append(int(nums[2]))
            line.append(int(nums[1]) - 1)
        else:
            cerror = 'Unknown'
            print('Unknown Error - 222')

if line != []:
    # Handle the errors however you want
    pass

I proceeded to write a script to reinsert the lines into the DataFrame since the bad lines will be given by the variable 'line' in the above code. This can all be avoided by simply using the csv reader. Hopefully the pandas developers can make it easier to deal with this situation in the future.

Solution 8 - Python

The following worked for me (I posted this answer because I specifically had this problem in a Google Colaboratory notebook):

df = pd.read_csv("/path/foo.csv", delimiter=';', skiprows=0, low_memory=False)

Solution 9 - Python

You can try:

data = pd.read_csv('file1.csv', sep='\t')

Solution 10 - Python

I came across the same issue. Using pd.read_table() on the same source file seemed to work. I could not trace the reason for this but it was a useful workaround for my case. Perhaps someone more knowledgeable can shed more light on why it worked.

Edit: I found that this error creeps up when you have some text in your file that does not have the same format as the actual data. This is usually header or footer information (more than one line, so skip_header doesn't work) which will not be separated by the same number of commas as your actual data (when using read_csv). Using read_table uses a tab as the delimiter, which could circumvent the user's current error but introduce others.

I usually get around this by stripping the extra data out of the file first and then using the read_csv() method.

The exact solution might differ depending on your actual file, but this approach has worked for me in several cases.
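A rough sketch of one way to get the same effect without editing the file, assuming you know how many non-data lines sit at the top and bottom (the counts here are made up):

import pandas as pd

# skiprows drops leading junk lines; skipfooter drops trailing ones
# (skipfooter requires the python engine)
df = pd.read_csv('file1.csv', skiprows=3, skipfooter=2, engine='python')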

Solution 11 - Python

I've had this problem a few times myself. Almost every time, the reason is that the file I was attempting to open was not a properly saved CSV to begin with. And by "properly", I mean each row had the same number of separators or columns.

Typically it happened because I had opened the CSV in Excel then improperly saved it. Even though the file extension was still .csv, the pure CSV format had been altered.

Any file saved with pandas to_csv will be properly formatted and shouldn't have that issue. But if you open it with another program, it may change the structure.

Hope that helps.

Solution 12 - Python

I've had a similar problem while trying to read a tab-delimited table with spaces, commas and quotes:

1115794	4218	"k__Bacteria", "p__Firmicutes", "c__Bacilli", "o__Bacillales", "f__Bacillaceae", ""
1144102	3180	"k__Bacteria", "p__Firmicutes", "c__Bacilli", "o__Bacillales", "f__Bacillaceae", "g__Bacillus", ""
368444	2328	"k__Bacteria", "p__Bacteroidetes", "c__Bacteroidia", "o__Bacteroidales", "f__Bacteroidaceae", "g__Bacteroides", ""



import pandas as pd
# Same error for read_table
counts = pd.read_csv(path_counts, sep='\t', index_col=2, header=None, engine = 'c')

pandas.io.common.CParserError: Error tokenizing data. C error: out of memory

This says it has something to do with the C parsing engine (which is the default one). Maybe changing to the python one will change something:

counts = pd.read_table(path_counts, sep='\t', index_col=2, header=None, engine='python')

Segmentation fault (core dumped)

Now that is a different error. If we go ahead and remove the spaces from the table, the error from the python engine changes once again:

1115794	4218	"k__Bacteria","p__Firmicutes","c__Bacilli","o__Bacillales","f__Bacillaceae",""
1144102	3180	"k__Bacteria","p__Firmicutes","c__Bacilli","o__Bacillales","f__Bacillaceae","g__Bacillus",""
368444	2328	"k__Bacteria","p__Bacteroidetes","c__Bacteroidia","o__Bacteroidales","f__Bacteroidaceae","g__Bacteroides",""


_csv.Error: '   ' expected after '"'

It becomes clear that pandas was having problems parsing our rows. To parse a table with the python engine, I needed to remove all spaces and quotes from the table beforehand. Meanwhile, the C engine kept crashing even with commas in rows.

To avoid creating a new file with the replacements, I did this, as my tables are small:

from io import StringIO

with open(path_counts) as f:
    # do the replacements in memory instead of writing a cleaned copy to disk
    cleaned = StringIO(f.read().replace('", ""', '').replace('"', '').replace(', ', ',').replace('\0', ''))
    counts = pd.read_table(cleaned, sep='\t', index_col=2, header=None, engine='python')

tl;dr
Change the parsing engine, and try to avoid any non-delimiting quotes/commas/spaces in your data.

Solution 13 - Python

The dataset that I used had a lot of quote marks (") that were not part of the formatting. I was able to fix the error by passing this parameter to read_csv():

import csv
df = pd.read_csv(filename, quoting=csv.QUOTE_NONE)  # csv.QUOTE_NONE == 3

Solution 14 - Python

Use the delimiter parameter:

pd.read_csv(filename, delimiter=",", encoding='utf-8')

It will then read the file.

Solution 15 - Python

For those who are having a similar issue with Python 3 on a Linux OS:

pandas.errors.ParserError: Error tokenizing data. C error: Calling
read(nbytes) on source failed. Try engine='python'.

Try:

df = pd.read_csv('file.csv', encoding='utf8', engine='python')

Solution 16 - Python

Although not the case for this question, this error may also appear with compressed data. Explicitly setting the value for kwarg compression resolved my problem.

result = pandas.read_csv(data_source, compression='gzip')

Solution 17 - Python

In my case the separator was not the default "," but a tab.

pd.read_csv('file_name.csv', sep='\\t', lineterminator='\\r', engine='python', header='infer')

Note: "\t" did not work as suggested by some sources. "\\t" was required.

Solution 18 - Python

I came across multiple solutions for this issue, and many folks have also given good explanations in their answers. For beginners, though, I think the two methods below will be enough:

import pandas as pd

#Method 1

data = pd.read_csv('file1.csv', error_bad_lines=False)
#Note that this will cause the offending lines to be skipped.

#Method 2 using sep

data = pd.read_csv('file1.csv', sep='\t')

Solution 19 - Python

Sometimes the problem is not how you use Python, but the raw data itself.
I got this error message:

Error tokenizing data. C error: Expected 18 fields in line 72, saw 19.

It turned out that the description column sometimes contained commas. This means that the CSV file needs to be cleaned up or another separator used; a quick illustration follows.
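For reference, embedded commas only parse cleanly when the field is quoted; a small sketch with made-up data:

from io import StringIO
import pandas as pd

raw = 'id,description\n1,"bolts, nuts, and washers"\n'
# the quoted comma is not treated as a delimiter, so this parses into 2 columns
df = pd.read_csv(StringIO(raw))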

Solution 20 - Python

An alternative that I have found to be useful in dealing with similar parsing errors uses the CSV module to re-route data into a pandas df. For example:

import csv
import pandas as pd

path = 'C:/FileLocation/'
file = 'filename.csv'

# read the raw rows with the csv module first and collect them in a list
with open(path + file, 'rt') as f:
    csv_list = [row for row in csv.reader(f)]

# now pandas has no problem turning the rows into a df
df = pd.DataFrame(csv_list)

I find the CSV module to be a bit more robust to poorly formatted comma separated files and so have had success with this route to address issues like these.

Solution 21 - Python

The following sequence of commands works (I lose the first line of the data, since no header=None is present, but at least it loads):

df = pd.read_csv(filename, usecols=range(0, 42))
df.columns = ['YR', 'MO', 'DAY', 'HR', 'MIN', 'SEC', 'HUND', 'ERROR', 'RECTYPE', 'LANE', 'SPEED', 'CLASS', 'LENGTH', 'GVW', 'ESAL', 'W1', 'S1', 'W2', 'S2', 'W3', 'S3', 'W4', 'S4', 'W5', 'S5', 'W6', 'S6', 'W7', 'S7', 'W8', 'S8', 'W9', 'S9', 'W10', 'S10', 'W11', 'S11', 'W12', 'S12', 'W13', 'S13', 'W14']

The following does NOT work:

df = pd.read_csv(filename, names=['YR', 'MO', 'DAY', 'HR', 'MIN', 'SEC', 'HUND', 'ERROR', 'RECTYPE', 'LANE', 'SPEED', 'CLASS', 'LENGTH', 'GVW', 'ESAL', 'W1', 'S1', 'W2', 'S2', 'W3', 'S3', 'W4', 'S4', 'W5', 'S5', 'W6', 'S6', 'W7', 'S7', 'W8', 'S8', 'W9', 'S9', 'W10', 'S10', 'W11', 'S11', 'W12', 'S12', 'W13', 'S13', 'W14'], usecols=range(0, 42))

CParserError: Error tokenizing data. C error: Expected 53 fields in line 1605634, saw 54

The following does NOT work either:

df = pd.read_csv(filename, header=None)

CParserError: Error tokenizing data. C error: Expected 53 fields in line 1605634, saw 54

Hence, in your problem you would pass usecols=range(0, 2).

Solution 22 - Python

As far as I can tell, and after taking a look at your file, the problem is that the csv file you're trying to load contains multiple tables. There are empty lines, or lines that contain table titles. Try to have a look at this Stackoverflow answer. It shows how to achieve that programmatically.

Another dynamic approach would be to use the csv module, read every single row at a time and make sanity checks/regular expressions to infer whether the row is a title/header/values/blank. You have one more advantage with this approach: you can split/append/collect your data in Python objects as desired. A rough sketch follows.
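A rough sketch of that approach, assuming blank lines separate the tables (that grouping rule is an assumption, not something read from your file):

import csv
import pandas as pd

tables, current = [], []
with open(path, newline='') as f:
    for row in csv.reader(f):
        if not any(cell.strip() for cell in row):  # blank row marks a table boundary
            if current:
                tables.append(current)
                current = []
        else:
            current.append(row)
if current:
    tables.append(current)

# turn each chunk into its own DataFrame, using its first row as the header
frames = [pd.DataFrame(t[1:], columns=t[0]) for t in tables]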

The easiest of all would be to use the pandas function pd.read_clipboard() after manually selecting and copying the table to the clipboard, in case you can open the csv in Excel or something.

Irrelevant:

Additionally, and irrelevant to your problem, but because no one has mentioned this: I had this same issue when loading some datasets such as seeds_dataset.txt from UCI. In my case, the error was occurring because some separators had more whitespace than a true tab \t. See line 3 in the following, for instance:

14.38	14.21	0.8951	5.386	3.312	2.462	4.956	1
14.69	14.49	0.8799	5.563	3.259	3.586	5.219	1
14.11	14.1	0.8911	5.42	3.302	2.7		5		1

Therefore, use \t+ in the separator pattern instead of \t.

data = pd.read_csv(path, sep='\t+', header=None)

Solution 23 - Python

The solutions

engine='python'
error_bad_lines=False

are good if the extra columns are dummies and you want to delete them. In my case, the second row really had more columns, and I wanted those columns to be integrated so that the number of columns would equal MAX(columns).

Please refer to the solution below, which I could not find anywhere else:

try:
    df_data = pd.read_csv(PATH, header=bl_header, sep=str_sep)
except pd.errors.ParserError as err:
    # pull the column count out of the "saw N" part of the error message
    str_find = 'saw '
    int_position = str(err).find(str_find) + len(str_find)
    str_nbCol = str(err)[int_position:]
    # re-read, supplying enough column names for the widest row
    l_col = range(int(str_nbCol))
    df_data = pd.read_csv(PATH, header=bl_header, sep=str_sep, names=l_col)

Solution 24 - Python

Use

pandas.read_csv('CSVFILENAME', header=None, sep=', ')

when trying to read csv data from the link

http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data

I copied the data from the site into my csv file. It had extra spaces, so I used sep=', ' and it worked :) (Note that a multi-character separator like ', ' is treated as a regular expression and forces pandas to fall back to the python parsing engine.)

Solution 25 - Python

I had a similar case to this, and setting

train = pd.read_csv('input.csv', encoding='latin1', engine='python')

worked.

Solution 26 - Python

Simple resolution: open the csv file in Excel and save it under a different file name, still in csv format. Try importing it again in Spyder; your problem will be resolved!

Solution 27 - Python

Error tokenizing data. C error: Expected 2 fields in line 3, saw 12

The error gives a clue to solving the problem: "Expected 2 fields in line 3, saw 12" means line 3 has 12 fields, while the earlier rows led the parser to expect only 2.

When you have data like the one shown below, skipping rows would throw away most of the data:

data = """1,2,3
1,2,3,4
1,2,3,4,5
1,2
1,2,3,4"""

If you don't want to skip any rows, do the following:

import pandas as pd

# First let's find the maximum number of columns across all the rows
with open("file_name.csv", 'r') as temp_f:
    # get the number of columns in each line
    col_count = [len(l.split(",")) for l in temp_f.readlines()]

# Generate column names (names will be 0, 1, 2, ..., maximum columns - 1)
column_names = [i for i in range(max(col_count))]

# supplying one name per column lets pandas accept the widest row
data = pd.read_csv("file_name.csv", header=None, names=column_names)

Use range instead of manually setting names, as it would be cumbersome when you have many columns.

Additionally, you can fill the NaN values with 0 if you need the rows to have equal length, e.g. for clustering (k-means):

new_data = data.fillna(0)

Solution 28 - Python

I had a dataset with preexisting row numbers, so I used index_col:

pd.read_csv('train.csv', index_col=0)

Solution 29 - Python

This is what I did.

sep='::' solved my issue:

data=pd.read_csv('C:\\Users\\HP\\Downloads\\NPL ASSINGMENT 2 imdb_labelled\\imdb_labelled.txt',engine='python',header=None,sep='::')

Solution 30 - Python

I had the same problem when calling read_csv: ParserError: Error tokenizing data. I just saved the old csv file as a new csv file, and the problem was solved!

Solution 31 - Python

The issue for me was that a new column was appended to my CSV intraday. The accepted answer solution would not work as every future row would be discarded if I used error_bad_lines=False.

The solution in this case was to use the usecols parameter in pd.read_csv(). This way I can specify only the columns that I need to read from the CSV, and my Python code will remain resilient to future CSV changes so long as a header column exists (and the column names do not change).

> usecols : list-like or callable, optional
>
> Return a subset of the columns. If list-like, all elements must either be positional (i.e. integer indices into the document columns) or strings that correspond to column names provided either by the user in names or inferred from the document header row(s). For example, a valid list-like usecols parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. Element order is ignored, so usecols=[0, 1] is the same as [1, 0]. To instantiate a DataFrame from data with element order preserved use pd.read_csv(data, usecols=['foo', 'bar'])[['foo', 'bar']] for columns in ['foo', 'bar'] order or pd.read_csv(data, usecols=['foo', 'bar'])[['bar', 'foo']] for ['bar', 'foo'] order.

Example
my_columns = ['foo', 'bar', 'bob']
df = pd.read_csv(file_path, usecols=my_columns)

Another benefit of this is that I can load way less data into memory if I am only using 3-4 columns of a CSV that has 18-20 columns.

Solution 32 - Python

I have encountered this error with a stray quotation mark. I use mapping software which will put quotation marks around text items when exporting comma-delimited files. Text which uses quote marks (e.g. ' = feet and " = inches) can be problematic when they induce delimiter collisions. Consider this example which notes that a 5-inch well log print is poor:

UWI_key,Latitude,Longitude,Remark
US42051316890000,30.4386484,-96.4330734,"poor 5""

Using 5" as shorthand for 5 inch ends up throwing a wrench in the works. Excel will simply strip off the extra quote mark, but Pandas breaks down without the error_bad_lines=False argument mentioned above.

Solution 33 - Python

Most of the useful answers have already been mentioned; however, I suggest saving your pandas DataFrames as parquet files. Parquet files don't have this problem, and they are memory-efficient at the same time.
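A minimal sketch of the round trip, assuming a parquet engine such as pyarrow is installed:

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})
df.to_parquet('data.parquet')         # column types are stored with the data
df = pd.read_parquet('data.parquet')  # no tokenizing step, so no tokenizing errors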

Solution 34 - Python

In my case, the error occurred because the format of the first and last two lines of the csv file differed from the middle content of the file.

So what I do is open the csv file as a string, parse the content of the string, and then use read_csv to get a dataframe.

from io import StringIO
import pandas as pd

with open(f'{file_path}/{file_name}', 'r') as file:
    content = file.read()

# normalize the newline characters from '\r\n' to '\n'
lines = content.replace('\r', '').split('\n')

# Remove the first and last 2 lines of the file.
# StringIO can be considered a file stored in memory.
df = pd.read_csv(StringIO("\n".join(lines[2:-2])), header=None)

Solution 35 - Python

Sometimes a cell contains a comma ",". Because of that, pandas can't read it. Try a delimiter of ";":

df = pd.read_csv(r'yourpath', delimiter=";")

Solution 36 - Python

The issue is with the delimiter. Find what kind of delimiter is used in your data and specify it like below:

data = pd.read_csv('some_data.csv', sep='\t')

Solution 37 - Python

I had a similar error, and the issue was that I had some escaped quotes in my csv file and needed to set the escapechar parameter appropriately.
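A minimal sketch, assuming the file escapes embedded quotes with a backslash:

import pandas as pd

# tell the parser that a backslash escapes the character that follows it
df = pd.read_csv('file1.csv', escapechar='\\')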

Solution 38 - Python

I had received a .csv from a coworker, and when I tried to read it using pd.read_csv(), I received a similar error. It was apparently attempting to use the first row to generate the columns for the dataframe, but there were many rows containing more columns than the first row implied. I ended up fixing the problem by simply opening and re-saving the file as .csv and using pd.read_csv() again.

Solution 39 - Python

You can do this step to avoid the problem -

train = pd.read_csv('/home/Project/output.csv' , header=None)

just add - header=None

Hope this helps!!

Solution 40 - Python

The issue could be with the file itself. In my case, the problem was solved after renaming the file; I have yet to figure out the reason.

Solution 41 - Python

This is the same stray-quotation-mark issue described in Solution 32: mapping software that quotes text items on export collides with quote marks used as symbols in the data (e.g. ' = feet and " = inches), and Pandas breaks down without the error_bad_lines=False argument mentioned above.

Once you know the nature of your error, it may be easiest to do a Find-Replace from a text editor (e.g., Sublime Text 3 or Notepad++) prior to import.

Solution 42 - Python

This looks ugly, but you will have your dataframe:

import re
import pandas as pd

path = 'GOOG Key Ratios.csv'

try:
    data = pd.read_csv(path)
except Exception as e:
    # pull the expected field count out of the error message
    val = re.findall(r'tokenizing.{1,100}\s*Expected\s*(\d{1,2})\s*', str(e), re.I)
    data = pd.read_csv(path, skiprows=int(val[0]) - 1)

Solution 43 - Python

Try:

pandas.read_csv(path, sep=',', header=None)

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | abuteau | View Question on Stackoverflow
Solution 1 - Python | richie | View Answer on Stackoverflow
Solution 2 - Python | william_grisaitis | View Answer on Stackoverflow
Solution 3 - Python | TomAugspurger | View Answer on Stackoverflow
Solution 4 - Python | Piyush S. Wanare | View Answer on Stackoverflow
Solution 5 - Python | Steven Rouk | View Answer on Stackoverflow
Solution 6 - Python | computerist | View Answer on Stackoverflow
Solution 7 - Python | Robert Geiger | View Answer on Stackoverflow
Solution 8 - Python | d_- | View Answer on Stackoverflow
Solution 9 - Python | Manodhya Opallage | View Answer on Stackoverflow
Solution 10 - Python | Legend_Ari | View Answer on Stackoverflow
Solution 11 - Python | elPastor | View Answer on Stackoverflow
Solution 12 - Python | lotrus28 | View Answer on Stackoverflow
Solution 13 - Python | user3426943 | View Answer on Stackoverflow
Solution 14 - Python | Bhavesh Kumar | View Answer on Stackoverflow
Solution 15 - Python | Zstack | View Answer on Stackoverflow
Solution 16 - Python | RegularlyScheduledProgramming | View Answer on Stackoverflow
Solution 17 - Python | Mihai.Mehe | View Answer on Stackoverflow
Solution 18 - Python | Sachin | View Answer on Stackoverflow
Solution 19 - Python | Kims Sifers | View Answer on Stackoverflow
Solution 20 - Python | bcoz | View Answer on Stackoverflow
Solution 21 - Python | kepy97 | View Answer on Stackoverflow
Solution 22 - Python | Kareem Jeiroudi | View Answer on Stackoverflow
Solution 23 - Python | Laurent T | View Answer on Stackoverflow
Solution 24 - Python | Abhishek Tripathi | View Answer on Stackoverflow
Solution 25 - Python | Adewole Adesola | View Answer on Stackoverflow
Solution 26 - Python | Naseer | View Answer on Stackoverflow
Solution 27 - Python | amran hossen | View Answer on Stackoverflow
Solution 28 - Python | gogasca | View Answer on Stackoverflow
Solution 29 - Python | Saurabh Tripathi | View Answer on Stackoverflow
Solution 30 - Python | Simin Zuo | View Answer on Stackoverflow
Solution 31 - Python | Scott Skiles | View Answer on Stackoverflow
Solution 32 - Python | fact_finder | View Answer on Stackoverflow
Solution 33 - Python | Bikash Joshi | View Answer on Stackoverflow
Solution 34 - Python | Brian | View Answer on Stackoverflow
Solution 35 - Python | piseynir | View Answer on Stackoverflow
Solution 36 - Python | Abu Bakar Siddik | View Answer on Stackoverflow
Solution 37 - Python | jvvw | View Answer on Stackoverflow
Solution 38 - Python | Victor Burnett | View Answer on Stackoverflow
Solution 39 - Python | rahul ranjan | View Answer on Stackoverflow
Solution 40 - Python | SQA_LEARN | View Answer on Stackoverflow
Solution 41 - Python | fact_finder | View Answer on Stackoverflow
Solution 42 - Python | Shubham Chauhan | View Answer on Stackoverflow
Solution 43 - Python | THE2ndMOUSE | View Answer on Stackoverflow