How to handle response encoding from urllib.request.urlopen() to avoid TypeError: can't use a string pattern on a bytes-like object

Python Problem Overview

I'm trying to open a webpage using urllib.request.urlopen() and then search it with regular expressions, but that gives the following error:

> TypeError: can't use a string pattern on a bytes-like object

I understand why: urllib.request.urlopen() returns a byte stream, so re doesn't know which encoding to use. What am I supposed to do in this situation? Is there a way to specify the encoding in the request, or will I need to decode the bytes myself? If so, I assume I should read the encoding from the response headers, or from the encoding declared in the HTML, and then decode using that?
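
A minimal sketch of the problem, using a hard-coded bytes literal as a stand-in for what read() returns (the sample HTML and the variable names are made up for illustration). It shows the error and two ways around it: decode the bytes to str, or search with a bytes pattern.

```python
import re

# Stand-in for the bytes that urlopen(...).read() would return (made-up sample).
raw = b"<html><title>Example page</title></html>"

# A str pattern applied to bytes raises the TypeError from the question.
try:
    re.search(r"<title>(.*?)</title>", raw)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Option 1: decode the bytes to str first, then use a str pattern.
text = raw.decode("utf-8")
title_from_text = re.search(r"<title>(.*?)</title>", text).group(1)

# Option 2: keep the bytes and use a bytes pattern; the match is bytes too.
title_from_bytes = re.search(rb"<title>(.*?)</title>", raw).group(1)

print(error_message)
print(title_from_text, title_from_bytes)
```

Option 1 is usually what you want when the page is text; option 2 avoids guessing an encoding but leaves you with bytes results.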

Python Solutions

Solution 1 - Python

For me, the solution is as follows (Python 3):

resource = urllib.request.urlopen(an_url)
content = resource.read().decode(resource.headers.get_content_charset())
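
One caveat: get_content_charset() returns None when the server sends no charset parameter, and decode(None) then raises a TypeError. A hedged sketch of a fallback follows; the helper name read_text and the default of 'utf-8' are my own assumptions, not part of the answer above.

```python
import urllib.request


def read_text(resource, default_charset="utf-8"):
    """Decode a urlopen() response using its declared charset.

    read_text and default_charset are hypothetical names; falling back to
    UTF-8 is an assumption, not something the server guarantees.
    """
    charset = resource.headers.get_content_charset() or default_charset
    return resource.read().decode(charset)
```

It would be called as content = read_text(urllib.request.urlopen(an_url)).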

Solution 2 - Python

You just need to decode the response using the charset given in the Content-Type header, which is typically the header's last parameter. There is an example in the tutorial too.

output = response.decode('utf-8')  # response here is the bytes returned by urlopen(...).read()

Solution 3 - Python

I had the same issue for the last two days and finally have a solution. I'm using the info() method of the object returned by urlopen():

req = urllib.request.urlopen(URL)
charset = req.info().get_content_charset()
content = req.read().decode(charset)
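
If the header carries no charset, get_content_charset() returns None. The question also mentions reading the encoding declared in the HTML itself; a rough sketch of that idea follows. The helper name sniff_meta_charset, the 1024-byte window, and the simplified regular expression are all assumptions for illustration, and a real HTML parser would be more robust.

```python
import re

# Bytes pattern: matches charset=... inside a <meta> tag, e.g.
# <meta charset="utf-8"> or <meta http-equiv=... content="text/html; charset=utf-8">.
META_CHARSET = re.compile(rb"<meta[^>]*?charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)


def sniff_meta_charset(raw, window=1024):
    """Return the charset named in an HTML <meta> tag, or None if absent.

    Only the first `window` bytes are searched (an arbitrary choice here).
    """
    match = META_CHARSET.search(raw[:window])
    return match.group(1).decode("ascii") if match else None
```

A caller could then try the header first and the markup second, for example charset = req.info().get_content_charset() or sniff_meta_charset(raw) or 'utf-8', where the final 'utf-8' is again only a guess.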

Solution 4 - Python

With the third-party requests library, whose .text attribute returns the body already decoded to str:

import requests

response = requests.get(URL).text

Solution 5 - Python

Here is a simple example HTTP request (which I tested, and it works):

import urllib.request
address = "http://stackoverflow.com"
html = urllib.request.urlopen(address).read().decode('utf-8')

Make sure to read the documentation.

> https://docs.python.org/3/library/urllib.request.html

If you want to make a more detailed GET/POST request:

import urllib.request
# HTTP REQUEST of some address
def REQUEST(address):
    req = urllib.request.Request(address)
    req.add_header('User-Agent', 'NAME (Linux/MacOS; FROM, USA)')
    response = urllib.request.urlopen(req)
    html = response.read().decode('utf-8')  # decode the raw bytes into text (assumes the page is UTF-8)
    print("REQUEST (ONLINE): " + address)
    return html
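
Tying this back to the original goal of searching the page with a regular expression, here is a hedged sketch. It closes the connection with a with block, passes a timeout, and decodes with errors='replace' so that a stray invalid byte does not abort the search. The function name find_in_page, the 10-second timeout, and the UTF-8 fallback are assumptions.

```python
import re
import urllib.request


def find_in_page(address, pattern, timeout=10):
    """Fetch `address`, decode it to str, and return all regex matches.

    find_in_page is a hypothetical helper; the timeout value and the UTF-8
    fallback are assumptions, not requirements of urllib.
    """
    req = urllib.request.Request(address)
    req.add_header("User-Agent", "NAME (Linux/MacOS; FROM, USA)")
    # The with block closes the connection even if decoding fails.
    with urllib.request.urlopen(req, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    # html is a str now, so re accepts a str pattern.
    return re.findall(pattern, html)
```

For example, find_in_page(address, r"<title>(.*?)</title>") would return a list of the title strings found in the page.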

Solution 6 - Python

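# Python 3 form (the original answer used Python 2: urllib.urlopen(url).headers.getheader('Content-Type'))
urllib.request.urlopen(url).headers.get('Content-Type')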

This will output something like:

text/html; charset=utf-8
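
To pull the charset out of a header value like the one above without a hand-written parser, the standard library's email.message.Message can do the parsing. A small sketch follows; the helper name charset_from_content_type is hypothetical.

```python
from email.message import Message


def charset_from_content_type(value):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = value
    # get_content_charset() lowercases the name and returns None if missing.
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=utf-8"))
```

In Python 3 the headers attribute of a urlopen() response is already a Message subclass, so this helper is mainly useful when all you have is the header string.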

Solution 7 - Python

After you make a request with req = urllib.request.urlopen(...), you have to read the response by calling html_bytes = req.read(). In Python 3 this returns bytes, not a str, so decode it (for example html_string = html_bytes.decode('utf-8')) before parsing it the way you want.

Content Type	Original Author	Original Content on Stackoverflow
Question	kryptobs2000	View Question on Stackoverflow
Solution 1 - Python	Ivan Klass	View Answer on Stackoverflow
Solution 2 - Python	Senthil Kumaran	View Answer on Stackoverflow
Solution 3 - Python	pytohs	View Answer on Stackoverflow
Solution 4 - Python	xged	View Answer on Stackoverflow
Solution 5 - Python	Asher	View Answer on Stackoverflow
Solution 6 - Python	wynemo	View Answer on Stackoverflow
Solution 7 - Python	Jesse Cohen	View Answer on Stackoverflow