How can I normalize a URL in Python?


Python Problem Overview


I'd like to know how to normalize a URL in Python.

For example, suppose I have a URL string like "http://www.example.com/foo goo/bar.html".

I need a library in Python that will transform the extra space (or any other non-normalized character) into its properly encoded form, so that the result is a valid URL.

Python Solutions


Solution 1 - Python

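Have a look at the werkzeug.urls module (the function originally lived in werkzeug.utils). Note that url_fix was deprecated in Werkzeug 2.3 and removed in 3.0, so this applies to older releases only.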

The function you are looking for is called "url_fix" and works like this:

>>> from werkzeug.urls import url_fix
>>> url_fix(u'http://de.wikipedia.org/wiki/Elf (Begriffsklärung)')
'http://de.wikipedia.org/wiki/Elf%20%28Begriffskl%C3%A4rung%29'

It was implemented in Werkzeug as follows (Python 2 code):

import urllib
import urlparse

def url_fix(s, charset='utf-8'):
    """Sometimes you get an URL by a user that just isn't a real
    URL because it contains unsafe characters like ' ' and so on.  This
    function can fix some of the problems in a similar way browsers
    handle data entered by the user:

    >>> url_fix(u'http://de.wikipedia.org/wiki/Elf (Begriffsklärung)')
    'http://de.wikipedia.org/wiki/Elf%20%28Begriffskl%C3%A4rung%29'

    :param charset: The target charset for the URL if the url was
                    given as unicode string.
    """
    if isinstance(s, unicode):
        s = s.encode(charset, 'ignore')
    scheme, netloc, path, qs, anchor = urlparse.urlsplit(s)
    path = urllib.quote(path, '/%')
    qs = urllib.quote_plus(qs, ':&=')
    return urlparse.urlunsplit((scheme, netloc, path, qs, anchor))
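
For readers on Python 3, where the urllib and urlparse modules were merged into urllib.parse and there is no separate unicode type, the same idea might be sketched as below. This is a port written for illustration, not Werkzeug's current code; it assumes the input is already a str.

```python
from urllib.parse import quote, quote_plus, urlsplit, urlunsplit


def url_fix(s, charset="utf-8"):
    """Percent-encode unsafe characters in the path and query of a URL.

    Illustrative Python 3 port of the Python 2 function above.
    """
    scheme, netloc, path, qs, anchor = urlsplit(s)
    # Keep '/' and '%' so separators and existing escapes survive.
    path = quote(path, safe="/%", encoding=charset)
    # Keep the characters that structure a query string.
    qs = quote_plus(qs, safe=":&=", encoding=charset)
    return urlunsplit((scheme, netloc, path, qs, anchor))


print(url_fix("http://de.wikipedia.org/wiki/Elf (Begriffsklärung)"))
# http://de.wikipedia.org/wiki/Elf%20%28Begriffskl%C3%A4rung%29
```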

Solution 2 - Python

This problem was actually fixed in Python 2.7 itself.

The fix was:

 # percent encode url, fixing lame server errors for e.g, like space
 # within url paths.
 fullurl = quote(fullurl, safe="%/:=&?~#+!$,;'@()*[]")

For more information see Issue918368: "urllib doesn't correct server returned urls"
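
The same one-liner can be tried on Python 3, where quote lives in urllib.parse. The sketch below is only an illustration; the names fix_url and URL_SAFE are made up here, and the safe string is copied from the snippet above.

```python
from urllib.parse import quote

# Characters that already carry meaning in a URL are left untouched.
# '%' is included so existing escapes are not encoded a second time.
URL_SAFE = "%/:=&?~#+!$,;'@()*[]"


def fix_url(fullurl):
    """Percent-encode everything outside URL_SAFE (hypothetical helper)."""
    return quote(fullurl, safe=URL_SAFE)


print(fix_url("http://www.example.com/foo goo/bar.html"))
# http://www.example.com/foo%20goo/bar.html
```

Because the scheme, query, and fragment delimiters are all in the safe set, the whole URL can be passed in without splitting it first.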

Solution 3 - Python

Use urllib.quote or urllib.quote_plus (urllib.parse.quote and urllib.parse.quote_plus in Python 3).

From the urllib documentation:

> quote(string[, safe])
>
> Replace special characters in string using the "%xx" escape. Letters, digits, and the characters "_.-" are never quoted. The optional safe parameter specifies additional characters that should not be quoted -- its default value is '/'.
>
> Example: quote('/~connolly/') yields '/%7econnolly/'.
>
> quote_plus(string[, safe])
>
> Like quote(), but also replaces spaces by plus signs, as required for quoting HTML form values. Plus signs in the original string are escaped unless they are included in safe. It also does not have safe default to '/'.
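
The difference between the two functions is easiest to see on a single path fragment. A small sketch using the Python 3 names (the sample string is just for illustration):

```python
from urllib.parse import quote, quote_plus

segment = "foo goo/bar.html"

# quote: space becomes %20, '/' is kept because it is safe by default.
print(quote(segment))       # foo%20goo/bar.html

# quote_plus: space becomes '+', and '/' is escaped because safe defaults to ''.
print(quote_plus(segment))  # foo+goo%2Fbar.html
```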

EDIT: Using urllib.quote or urllib.quote_plus on the whole URL will mangle it, as @ΤΖΩΤΖΙΟΥ points out:

>>> quoted_url = urllib.quote('http://www.example.com/foo goo/bar.html')
>>> quoted_url
'http%3A//www.example.com/foo%20goo/bar.html'
>>> urllib2.urlopen(quoted_url)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\python25\lib\urllib2.py", line 124, in urlopen
    return _opener.open(url, data)
  File "c:\python25\lib\urllib2.py", line 373, in open
    protocol = req.get_type()
  File "c:\python25\lib\urllib2.py", line 244, in get_type
    raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type: http%3A//www.example.com/foo%20goo/bar.html

@ΤΖΩΤΖΙΟΥ provides a function that uses urlparse.urlparse and urlparse.urlunparse to parse the url and only encode the path. This may be more useful for you, although if you're building the URL from a known protocol and host but with a suspect path, you could probably do just as well to avoid urlparse and just quote the suspect part of the URL, concatenating with known safe parts.
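
The "quote only the suspect part" approach could look roughly like this on Python 3. The helper name build_url and the base URL are assumptions made for the example.

```python
from urllib.parse import quote


def build_url(base, path):
    """Join a trusted base URL with an untrusted path (hypothetical helper).

    Only the path is percent-encoded; the base is assumed to be valid already.
    """
    return base.rstrip("/") + "/" + quote(path.lstrip("/"))


print(build_url("http://www.example.com", "/foo goo/bar.html"))
# http://www.example.com/foo%20goo/bar.html
```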

Solution 4 - Python

Because this page is a top result for Google searches on the topic, I think it's worth mentioning some work that has been done on URL normalization with Python that goes beyond urlencoding space characters. For example, dealing with default ports, character case, lack of trailing slashes, etc.

When the Atom syndication format was being developed, there was some discussion on how to normalize URLs into canonical format; this is documented in the article PaceCanonicalIds on the Atom/Pie wiki. That article provides some good test cases.

I believe that one result of this discussion was Mark Nottingham's urlnorm.py library, which I've used with good results on a couple projects. That script doesn't work with the URL given in this question, however. So a better choice might be Sam Ruby's version of urlnorm.py, which handles that URL, and all of the aforementioned test cases from the Atom wiki.
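
To give a feel for the kind of normalization discussed here (character case, default ports, a missing trailing slash), below is a rough standard-library sketch for Python 3. It is not the urlnorm.py code and covers far fewer cases; the function name normalize and the DEFAULT_PORTS table are assumptions for the example, and it deliberately ignores user:password@ prefixes and IPv6 hosts.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Assumed table of default ports; extend as needed.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize(url):
    """Lowercase scheme and host, drop default ports, encode the path."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = parts.hostname or ""  # .hostname is already lowercased
    netloc = host
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc = f"{host}:{parts.port}"
    # Encode unsafe characters; an empty path becomes "/".
    path = quote(parts.path, safe="/%") or "/"
    return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))


print(normalize("HTTP://WWW.Example.COM:80/foo goo/bar.html"))
# http://www.example.com/foo%20goo/bar.html
```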

Solution 5 - Python

Python 3:

from urllib.parse import urlparse, urlunparse, quote
def myquote(url):
    parts = urlparse(url)
    return urlunparse(parts._replace(path=quote(parts.path)))

>>> myquote('https://www.example.com/~user/with space/index.html?a=1&b=2')
'https://www.example.com/~user/with%20space/index.html?a=1&b=2'

Python 2:

import urlparse, urllib
def myquote(url):
    parts = urlparse.urlparse(url)
    return urlparse.urlunparse(parts[:2] + (urllib.quote(parts[2]),) + parts[3:])

>>> myquote('https://www.example.com/~user/with space/index.html?a=1&b=2')
'https://www.example.com/%7Euser/with%20space/index.html?a=1&b=2'

This quotes only the path component.
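
One thing to watch for: if the path is already percent-encoded, quoting it again turns %20 into %2520. A possible variation for Python 3, under the assumption that any '%' in the input belongs to an existing escape, is to add '%' to the safe characters. The name myquote_once is made up for this sketch.

```python
from urllib.parse import quote, urlparse, urlunparse


def myquote_once(url):
    """Quote only the path, leaving existing %XX escapes alone."""
    parts = urlparse(url)
    # '%' is treated as safe, so calling this twice gives the same result.
    return urlunparse(parts._replace(path=quote(parts.path, safe="/%")))


once = myquote_once("https://www.example.com/with space/index.html?a=1&b=2")
print(once)                 # https://www.example.com/with%20space/index.html?a=1&b=2
print(myquote_once(once))   # unchanged
```

The trade-off is that a literal '%' in the path that is not part of an escape stays unencoded.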

Solution 6 - Python

Just FYI, urlnorm has moved to GitHub: http://gist.github.com/246089

Solution 7 - Python

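Valid for Python 3.5:

import urllib.parse

# '.', '_' and '-' are never quoted, so only '/' and ':' need to be listed as safe.
urllib.parse.quote(your_url, safe="/:")

Example:

import urllib.parse

print(urllib.parse.quote("http://www.example.com/foo goo/bar.html", safe="/:"))

The output will be http://www.example.com/foo%20goo/bar.html

Note that characters such as '?', '=' and '&' are not in the safe list, so a URL with a query string would have those escaped as well.

Source: https://docs.python.org/3.5/library/urllib.parse.html?highlight=quote#urllib.parse.quote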

Solution 8 - Python

I ran into the same problem, but I only needed to quote spaces.

fullurl = quote(fullurl, safe="%/:=&?~#+!$,;'@()*[]") does help, but it's too complicated for that.

So I used a simple approach: url = url.replace(' ', '%20'). It's not perfect, but it's the simplest way and it works for this situation.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Tom Feiner | View Question on Stackoverflow
Solution 1 - Python | Armin Ronacher | View Answer on Stackoverflow
Solution 2 - Python | Oleg Sakharov | View Answer on Stackoverflow
Solution 3 - Python | Blair Conrad | View Answer on Stackoverflow
Solution 4 - Python | cobra libre | View Answer on Stackoverflow
Solution 5 - Python | tzot | View Answer on Stackoverflow
Solution 6 - Python | Mark Nottingham | View Answer on Stackoverflow
Solution 7 - Python | Hélder Lima | View Answer on Stackoverflow
Solution 8 - Python | WKPlus | View Answer on Stackoverflow