"for line in..." results in UnicodeDecodeError: 'utf-8' codec can't decode byte
Tags: Python, Python 3.x, Character Encoding
Python Problem Overview
Here is my code:
for line in open('u.item'):
    # Read each line
Whenever I run this code it gives the following error:
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 2892: invalid continuation byte
I tried to fix this by adding an extra parameter to open(). The code now looks like this:
for line in open('u.item', encoding='utf-8'):
    # Read each line
But again it gives the same error. What should I do then?
Python Solutions
Solution 1 - Python
As suggested by Mark Ransom, I found the right encoding for that problem. The encoding was "ISO-8859-1", so replacing open("u.item", encoding="utf-8") with open('u.item', encoding="ISO-8859-1") will solve the problem.
Solution 2 - Python
The following also worked for me. ISO 8859-1 avoids a lot of trouble, especially when working with speech recognition APIs.
Example:
file = open('../Resources/' + filename, 'r', encoding="ISO-8859-1")
Solution 3 - Python
Your file doesn't actually contain UTF-8 encoded data; it contains some other encoding. Figure out what that encoding is and use it in the open call.
In Windows-1252 encoding, for example, the 0xe9 would be the character é.
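To illustrate the point about 0xe9, here is a minimal sketch that decodes the same bytes three ways. The sample bytes and variable names are made up for the illustration; they are not taken from u.item.

```python
# "été" as a Windows-1252 or ISO-8859-1 writer would store it (sample data).
raw = b"\xe9t\xe9"

# UTF-8 treats 0xe9 as the first byte of a three-byte sequence, so the
# ASCII byte that follows it is not a valid continuation byte.
try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc.reason

# Both single-byte encodings map 0xe9 to the character é.
as_cp1252 = raw.decode("cp1252")
as_latin1 = raw.decode("iso-8859-1")

print(utf8_error)
print(as_cp1252, as_latin1)
```

The reason reported by the exception is the same one shown in the question's error message.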
Solution 4 - Python
Try this to read the file using pandas (m_cols is your list of column names):
import pandas as pd
pd.read_csv('u.item', sep='|', names=m_cols, encoding='latin-1')
Solution 5 - Python
This works:
open('filename', encoding='latin-1')
Or, equivalently:
open('filename', encoding="ISO-8859-1")
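As a small check of why both spellings behave the same, the sketch below looks the two names up in Python's codec registry. It also shows that this codec accepts every byte value, which means a successful read does not by itself prove that ISO-8859-1 is the file's real encoding.

```python
import codecs

# Both spellings are aliases that resolve to the same codec.
latin1 = codecs.lookup("latin-1")
iso = codecs.lookup("ISO-8859-1")
same_codec = latin1.name == iso.name

# Every byte value 0-255 maps to a code point, so decoding never raises.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin-1")

print(same_codec, len(decoded))
```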
Solution 6 - Python
If you are using Python 2, the following will be the solution:
import io
for line in io.open("u.item", encoding="ISO-8859-1"):
    # Do something
This is needed because the built-in open() in Python 2 does not accept an encoding parameter. If you pass one, you get the following error:
> TypeError: 'encoding' is an invalid keyword argument for this function
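If the same script has to run on both Python 2 and Python 3, one option is to call io.open() everywhere. The sketch below assumes Python 3 when it checks that io.open and the built-in open are the same object; the temporary file is created only so the snippet is self-contained.

```python
import io
import os
import tempfile

# On Python 3 the built-in open() is the same object as io.open().
same_function = io.open is open

# Hypothetical sample file, written here so the snippet runs on its own.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with io.open(path, "w", encoding="ISO-8859-1") as handle:
    handle.write(u"caf\u00e9\n")

# io.open() accepts the encoding parameter on both Python 2 and Python 3.
with io.open(path, encoding="ISO-8859-1") as handle:
    lines = [line.rstrip(u"\n") for line in handle]

print(same_function, lines)
```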
Solution 7 - Python
You could resolve the problem with:
for line in open(your_file_path, 'rb'):
    # Each line is a bytes object
'rb' opens the file in binary mode, so no decoding takes place and each line is returned as bytes rather than str.
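Binary mode avoids the error only because nothing is decoded; to get text you still have to decode each line yourself. A minimal sketch, assuming the data is ISO-8859-1 and using a made-up temporary file in place of your_file_path:

```python
import os
import tempfile

# Sample file with one non-ASCII byte (0xe9), created for the illustration.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as handle:
    handle.write(b"Toy Story\nCaf\xe9 Society\n")

decoded_lines = []
with open(path, "rb") as handle:
    for raw_line in handle:  # each raw_line is a bytes object
        # Decode explicitly; the encoding here is an assumption about the data.
        decoded_lines.append(raw_line.decode("ISO-8859-1").rstrip("\n"))

print(decoded_lines)
```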
Solution 8 - Python
You can try this way:
open('u.item', encoding='utf8', errors='ignore')
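Note that errors='ignore' silently drops the bytes that cannot be decoded, so some characters disappear from the text. The sketch below compares it with errors='replace' and with decoding using the encoding the sample bytes were actually written in. The sample bytes are made up for the illustration.

```python
# Sample bytes written as ISO-8859-1; 0xe9 is the character é.
raw = b"caf\xe9 au lait"

# 'ignore' drops the undecodable byte entirely.
ignored = raw.decode("utf-8", errors="ignore")

# 'replace' substitutes U+FFFD so the loss stays visible.
replaced = raw.decode("utf-8", errors="replace")

# Decoding with the right encoding keeps the character.
correct = raw.decode("ISO-8859-1")

print(repr(ignored), repr(replaced), repr(correct))
```

The same errors argument is accepted by open(), so the comparison carries over to reading a file.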
Solution 9 - Python
Based on another question on Stack Overflow and previous answers in this post, I would like to add a way to find the right encoding.
If your script runs on Linux, you can get the encoding with the file command:
file --mime-encoding <filename>
Here is a Python script to do that for you:
import sys
import subprocess
if len(sys.argv) < 2:
    print("Usage: {} <filename>".format(sys.argv[0]))
    sys.exit(1)
def find_encoding(fname):
    """Find the encoding of a file using the file command."""
    # Find the full path of the file command
    which_run = subprocess.run(['which', 'file'], stdout=subprocess.PIPE)
    if which_run.returncode != 0:
        print("Unable to find 'file' command ({})".format(which_run.returncode))
        return None
    file_cmd = which_run.stdout.decode().replace('\n', '')
    # Run the file command to get the MIME encoding
    file_run = subprocess.run([file_cmd, '--mime-encoding', fname],
                              stdout=subprocess.PIPE,
                              stderr=subprocess.PIPE)
    if file_run.returncode != 0:
        print(file_run.stderr.decode(), file=sys.stderr)
        return None
    # Return the encoding name only; the output looks like "<filename>: <encoding>",
    # so take the last field in case the filename contains spaces
    return file_run.stdout.decode().split()[-1]
# Test
print("Encoding of {}: {}".format(sys.argv[1], find_encoding(sys.argv[1])))
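Where the file command is not available (for example on Windows), a rough alternative is to try a short list of candidate encodings in pure Python. This is only a sketch: guess_encoding and its candidate list are hypothetical names chosen for the illustration, and the result is a guess, not a detection. latin-1 must come last because it accepts any byte sequence.

```python
import os
import tempfile


def guess_encoding(path, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate that decodes the whole file, else None."""
    with open(path, "rb") as handle:
        data = handle.read()
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # this candidate cannot be right, try the next one
        return name
    return None


# Sample file containing the byte 0xe9, created for the illustration.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as handle:
    handle.write(b"caf\xe9\n")

detected = guess_encoding(path)
print(detected)
```

A successful decode only means the bytes are valid in that encoding, so check the decoded text by eye before trusting the answer.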
Solution 10 - Python
This is an example of reading a CSV file in Python 3:
import csv
from sys import argv
try:
    inputReader = csv.reader(open(argv[1], encoding='ISO-8859-1'), delimiter=',', quotechar='"')
except IOError:
    # The file could not be opened
    pass
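A variation on the snippet above, as a sketch: opening the file in a with block closes it when reading is done, and passing newline='' is what the csv module documentation recommends for file objects. The sample file and its contents are made up so the snippet runs on its own.

```python
import csv
import os
import tempfile

# Write a small ISO-8859-1 encoded CSV file (sample data for the illustration).
path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(path, "w", encoding="ISO-8859-1", newline="") as handle:
    csv.writer(handle).writerows([["title", "year"], ["Am\u00e9lie", "2001"]])

# Read it back with the same encoding; the file is closed when the block ends.
with open(path, encoding="ISO-8859-1", newline="") as handle:
    rows = list(csv.reader(handle, delimiter=",", quotechar='"'))

print(rows)
```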
Solution 11 - Python
Sometimes calling open(filepath) when filepath is not actually a file produces the same error, so first make sure the file you're trying to open exists:
import os
assert os.path.isfile(filepath)
Solution 12 - Python
Open your file with Notepad++ and use the "Encoding" (or "Encodage") menu to identify the encoding, or to convert the file from ANSI to UTF-8 or to the ISO 8859-1 code page.
Solution 13 - Python
I was using a dataset downloaded from Kaggle. While reading this dataset, it threw this error:
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 183: invalid continuation byte
So this is how I fixed it.
import pandas as pd
pd.read_csv('top50.csv', encoding='ISO-8859-1')
Solution 14 - Python
I am leaving my solution here so that others searching for a similar UTF-8 error can find it faster.
I had a problem opening a .csv file, with this error:
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 150: invalid continuation byte
I opened the file with Notepad and counted to position 150: it was a Cyrillic character. I re-saved the file using 'Save as...' with the encoding set to 'UTF-8', and my program started to work.
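The same re-save can be scripted. The sketch below assumes you already know the file's current encoding; convert_to_utf8 is a hypothetical helper name, and ISO-8859-1 is used only as an example source encoding.

```python
import os
import tempfile


def convert_to_utf8(src_path, dst_path, src_encoding):
    """Rewrite src_path as UTF-8 at dst_path, keeping line endings as they are."""
    # newline="" turns off newline translation on both read and write.
    with open(src_path, encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)


# Sample ISO-8859-1 file containing the byte 0xf1 (ñ), made up for the illustration.
folder = tempfile.mkdtemp()
src_path = os.path.join(folder, "legacy.csv")
dst_path = os.path.join(folder, "utf8.csv")
with open(src_path, "wb") as handle:
    handle.write(b"ni\xf1o\r\n")

convert_to_utf8(src_path, dst_path, "ISO-8859-1")

with open(dst_path, encoding="utf-8", newline="") as handle:
    converted = handle.read()
print(repr(converted))
```

After the conversion, the file can be opened with encoding='utf-8' without the error.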
Solution 15 - Python
Replace the encoding with encoding='ISO-8859-1':
for line in open('u.item', encoding='ISO-8859-1'):
    print(line)
Solution 16 - Python
Use this if you are loading data directly from GitHub or Kaggle:
DF = pd.read_csv(file, encoding='ISO-8859-1')