How to handle response encoding from urllib.request.urlopen() to avoid TypeError: can't use a string pattern on a bytes-like object

Python, Regex, Encoding, Urllib

Python Problem Overview


I'm trying to open a webpage using urllib.request.urlopen() then search it with regular expressions, but that gives the following error:

> TypeError: can't use a string pattern on a bytes-like object

I understand why: urllib.request.urlopen() returns a byte stream, so re doesn't know which encoding to use. What am I supposed to do in this situation? Is there a way to specify the encoding in the request, or will I need to decode the bytes myself? If so, I assume I should read the encoding from the response headers, or from the encoding declared in the HTML if there is one, and then decode with that?
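For reference, the error can be reproduced without any network access. The sketch below uses a hard-coded bytes literal (an assumption, standing in for whatever urlopen(...).read() would return) and shows the two usual ways out: decode to str first, or search the bytes with a bytes pattern.

```python
import re

# Stand-in for the bytes that urlopen(...).read() would return (illustrative data).
raw = b"<html><title>Example</title></html>"

error_message = None
try:
    re.search(r"<title>(.*?)</title>", raw)  # str pattern applied to bytes
except TypeError as exc:
    error_message = str(exc)

# Fix 1: decode the bytes to str, then use a str pattern.
text = raw.decode("utf-8")
title_from_text = re.search(r"<title>(.*?)</title>", text).group(1)

# Fix 2: keep the bytes and use a bytes pattern (note the rb prefix).
title_from_bytes = re.search(rb"<title>(.*?)</title>", raw).group(1)
```

The second fix avoids guessing an encoding at all, but every match it returns is bytes rather than str.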

Python Solutions


Solution 1 - Python

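For me, the solution is as follows (Python 3):

resource = urllib.request.urlopen(an_url)
# get_content_charset() returns None when the header names no charset, so fall back to UTF-8
content = resource.read().decode(resource.headers.get_content_charset() or 'utf-8')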
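The header can also name a codec Python does not know, or one that does not match the body. A minimal sketch of a more defensive decode step follows; the helper name decode_body and the UTF-8 default are assumptions for illustration, and the Message object is only a hand-built stand-in for resource.headers so the example runs offline.

```python
from email.message import Message  # only needed for the offline demo below


def decode_body(raw, headers, default="utf-8"):
    """Decode raw bytes using the header charset, falling back to `default`."""
    charset = headers.get_content_charset() or default
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name, or bytes that do not match the declared charset.
        return raw.decode(default, errors="replace")


# Offline demo: a hand-built header object standing in for resource.headers.
demo_headers = Message()
demo_headers["Content-Type"] = "text/html; charset=iso-8859-1"
demo_text = decode_body("café".encode("iso-8859-1"), demo_headers)
```

With a real response this would be called as decode_body(resource.read(), resource.headers).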

Solution 2 - Python

You just need to decode the response using the charset given in the Content-Type header (typically its last parameter). There is an example in the tutorial too. Here, response is the bytes object returned by read():

output = response.decode('utf-8')
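Hard-coding 'utf-8' fails when the page is in another encoding. The short sketch below, using an illustrative bytes literal rather than a real response, shows what happens in that case and how the errors argument of bytes.decode() changes the outcome.

```python
raw = b"caf\xe9"  # "café" encoded as ISO-8859-1; this is not valid UTF-8

try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors="replace" swaps each undecodable byte for U+FFFD instead of raising.
lenient = raw.decode("utf-8", errors="replace")

# Decoding with the charset the bytes were actually written in recovers the text.
correct = raw.decode("iso-8859-1")
```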

Solution 3 - Python

I had the same issues for the last two days. I finally have a solution. I'm using the info() method of the object returned by urlopen():

req = urllib.request.urlopen(URL)
charset = req.info().get_content_charset()
content = req.read().decode(charset)
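If the header carries no charset, get_content_charset() returns None and decode(None) raises a TypeError. The question's other idea, reading the encoding declared in the HTML itself, could be sketched roughly as below. The name sniff_charset, the regular expression, and the 2048-byte window are all assumptions for illustration; this is not a full implementation of the HTML encoding-detection rules.

```python
import re

# Matches both <meta charset="..."> and the older http-equiv form.
META_CHARSET = re.compile(
    rb"<meta[^>]+charset=[\"']?([A-Za-z0-9_\-]+)", re.IGNORECASE
)


def sniff_charset(raw, default="utf-8"):
    """Return the charset named in an HTML <meta> tag, or `default` if none."""
    match = META_CHARSET.search(raw[:2048])  # only look near the top of the page
    if match:
        return match.group(1).decode("ascii").lower()
    return default
```

It could then be combined with the header lookup, for example charset = req.info().get_content_charset() or sniff_charset(body), where body holds the bytes from req.read().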

Solution 4 - Python

With the third-party requests library, the .text attribute is already decoded to a str:

import requests

response = requests.get(URL).text

Solution 5 - Python

Here is a simple example HTTP request (which I tested, and it works):

import urllib.request
address = "http://stackoverflow.com"
urllib.request.urlopen(address).read().decode('utf-8')

Make sure to read the documentation.

> https://docs.python.org/3/library/urllib.request.html

If you want to do something more detailed, such as a GET/POST request with custom headers:

import urllib.request
# HTTP REQUEST of some address
def REQUEST(address):
    req = urllib.request.Request(address)
    req.add_header('User-Agent', 'NAME (Linux/MacOS; FROM, USA)')
    response = urllib.request.urlopen(req)
    html = response.read().decode('utf-8')  # decode the bytes into a str
    print("REQUEST (ONLINE): " + address)
    return html
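A variant of the function above, as a sketch only: it closes the connection with a with block and takes the charset from the response headers instead of assuming UTF-8. The name fetch_text, the User-Agent string, and the default charset are illustrative assumptions.

```python
import urllib.request


def fetch_text(address, user_agent="example-client/0.1", default_charset="utf-8"):
    """Fetch `address` and return the body decoded to str."""
    req = urllib.request.Request(address, headers={"User-Agent": user_agent})
    # The with block closes the response even if read() or decode() raises.
    with urllib.request.urlopen(req) as response:
        charset = response.headers.get_content_charset() or default_charset
        return response.read().decode(charset)
```

Usage would look like html = fetch_text("http://stackoverflow.com"), after which html can be searched with an ordinary str pattern.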

Solution 6 - Python

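In Python 2:

urllib.urlopen(url).headers.getheader('Content-Type')

In Python 3, the equivalent is:

urllib.request.urlopen(url).headers.get('Content-Type')

Either will output something like this: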

text/html; charset=utf-8
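To pull just the charset out of a header value like that, one option is to let the standard library's email.message.Message do the parsing rather than splitting the string by hand. The helper name charset_from_content_type is an assumption for illustration.

```python
from email.message import Message


def charset_from_content_type(value):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()


charset = charset_from_content_type("text/html; charset=utf-8")
```

The function returns None when the value carries no charset parameter, so callers still need a fallback.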

Solution 7 - Python

After you make a request with req = urllib.request.urlopen(...), you have to read the response by calling html_bytes = req.read(). In Python 3 this returns a bytes object, not a string, so decode it (for example html_bytes.decode('utf-8')) before parsing it with string patterns.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | kryptobs2000 | View Question on Stackoverflow
Solution 1 - Python | Ivan Klass | View Answer on Stackoverflow
Solution 2 - Python | Senthil Kumaran | View Answer on Stackoverflow
Solution 3 - Python | pytohs | View Answer on Stackoverflow
Solution 4 - Python | xged | View Answer on Stackoverflow
Solution 5 - Python | Asher | View Answer on Stackoverflow
Solution 6 - Python | wynemo | View Answer on Stackoverflow
Solution 7 - Python | Jesse Cohen | View Answer on Stackoverflow