How to validate a url in Python? (Malformed or not)

Python Problem Overview

I have a URL from the user and I have to reply with the fetched HTML.

How can I check whether the URL is malformed?

For example:

url = 'google' # Malformed
url = 'google.com' # Malformed
url = 'http://google.com' # Valid
url = 'http://google' # Malformed

Python Solutions

Solution 1 - Python

Use the validators package:

>>> import validators
>>> validators.url("http://google.com")
True
>>> validators.url("http://google")
ValidationFailure(func=url, args={'value': 'http://google', 'require_tld': True})
>>> if not validators.url("http://google"):
...     print("not valid")
... 
not valid
>>>

Install it from PyPI with pip (pip install validators).

Solution 2 - Python

Actually, I think this is the best way.

from django.core.validators import URLValidator
from django.core.exceptions import ValidationError

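val = URLValidator()
try:
    val('http://www.google.com')
except ValidationError as e:
    print(e)

URLValidator only checks that the URL is formed correctly. Older Django versions also accepted a verify_exists argument; when set to True, the validator would actually verify that the URL exists. That argument was removed in Django 1.5, so it should not be passed on current versions.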

edit: ah yeah, this question is a duplicate of this: https://stackoverflow.com/questions/3170231/django-url-validation-i-just-want-to-see-if-the-url-exists

Solution 3 - Python

django url validation regex (source):

import re
regex = re.compile(
        r'^(?:http|ftp)s?://' # http:// or https://
        r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|' #domain...
        r'localhost|' #localhost...
        r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})' # ...or ip
        r'(?::\d+)?' # optional port
        r'(?:/?|[/?]\S+)$', re.IGNORECASE)

print(re.match(regex, "http://www.example.com") is not None) # True
print(re.match(regex, "example.com") is not None)            # False
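As a quick check, the pattern above can be wrapped in a small helper and run against the four examples from the question. This is only a sketch: the helper name is_wellformed_url is made up here, and the pattern is copied from the snippet above rather than from a current Django release.

```python
import re

# Pattern copied from the snippet above (an older Django URL regex).
URL_REGEX = re.compile(
    r'^(?:http|ftp)s?://'  # http://, https://, ftp:// or ftps://
    r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|'  # domain...
    r'localhost|'  # localhost...
    r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'  # ...or ip
    r'(?::\d+)?'  # optional port
    r'(?:/?|[/?]\S+)$', re.IGNORECASE)


def is_wellformed_url(url):
    """Hypothetical helper: True if url matches URL_REGEX from the start."""
    return URL_REGEX.match(url) is not None


# The four examples from the question.
for candidate in ('google', 'google.com', 'http://google.com', 'http://google'):
    print(candidate, is_wellformed_url(candidate))
```

With this pattern only 'http://google.com' is accepted, which agrees with the labels in the question.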

Solution 4 - Python

A True or False version, based on @DMfll answer:

try:
    # python2
    from urlparse import urlparse
except ImportError:
    # python3
    from urllib.parse import urlparse

a = 'http://www.cwi.nl:80/%7Eguido/Python.html'
b = '/data/Python.html'
c = 532
d = u'dkakasdkjdjakdjadjfalskdjfalk'
e = 'https://stackoverflow.com'

def uri_validator(x):
    try:
        result = urlparse(x)
        return all([result.scheme, result.netloc])
    except Exception:
        return False

print(uri_validator(a))
print(uri_validator(b))
print(uri_validator(c))
print(uri_validator(d))
print(uri_validator(e))

Gives:

True
False
False
False
True
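Note that a scheme-plus-netloc check accepts 'http://google', which the question treats as malformed, because urlparse only splits the string and does not judge the hostname. A minimal sketch of one way to tighten it, assuming you want to require a dot inside the hostname (the name has_dotted_host is made up for this example, and the rule also rejects bare hosts such as localhost):

```python
from urllib.parse import urlparse


def has_dotted_host(url):
    """Hypothetical helper: scheme present and hostname contains a dot."""
    try:
        parts = urlparse(url)
        host = parts.hostname or ""
    except (ValueError, AttributeError, TypeError):
        # Raised for some malformed inputs and for non-string arguments.
        return False
    # Ignore leading/trailing dots so 'google.' does not count as dotted.
    return bool(parts.scheme) and "." in host.strip(".")


for candidate in ('google', 'google.com', 'http://google.com', 'http://google'):
    print(candidate, has_dotted_host(candidate))
```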

Solution 5 - Python

Nowadays, I use the following, based on Padam's answer:

$ python --version
Python 3.6.5

And this is how it looks:

from urllib.parse import urlparse

def is_url(url):
  try:
    result = urlparse(url)
    return all([result.scheme, result.netloc])
  except ValueError:
    return False

Just use is_url("http://www.asdf.com").
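Because any non-empty scheme passes this check, strings such as "ftp://www.asdf.com" are also reported as URLs. If only web addresses are wanted, a variant along these lines may help. It is a sketch, and both the ALLOWED_SCHEMES set and the name is_http_url are assumptions made for the example:

```python
from urllib.parse import urlparse

# Assumption for this sketch: only http and https URLs are acceptable.
ALLOWED_SCHEMES = {"http", "https"}


def is_http_url(url):
    """Hypothetical variant of is_url that also restricts the scheme."""
    try:
        result = urlparse(url)
    except ValueError:
        return False
    # urlparse lower-cases the scheme, so 'HTTPS://...' is accepted too.
    return result.scheme in ALLOWED_SCHEMES and bool(result.netloc)


print(is_http_url("http://www.asdf.com"))
print(is_http_url("ftp://www.asdf.com"))
print(is_http_url("mailto:someone@example.com"))
```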

Hope it helps!

Solution 6 - Python

I landed on this page trying to figure out a sane way to validate strings as "valid" urls. I share here my solution using python3. No extra libraries required.

See https://docs.python.org/2/library/urlparse.html if you are using python2.

See https://docs.python.org/3.0/library/urllib.parse.html if you are using python3 as I am.

import urllib.parse  # a plain "import urllib" does not reliably load the parse submodule
from pprint import pprint

invalid_url = 'dkakasdkjdjakdjadjfalskdjfalk'
valid_url = 'https://stackoverflow.com'
tokens = [urllib.parse.urlparse(url) for url in (invalid_url, valid_url)]

for token in tokens:
    pprint(token)
    
min_attributes = ('scheme', 'netloc')  # add attrs to your liking
for token in tokens:
    if not all([getattr(token, attr) for attr in min_attributes]):
        error = "'{url}' string has no scheme or netloc.".format(url=token.geturl())
        print(error)
    else:
        print("'{url}' is probably a valid url.".format(url=token.geturl()))

Output:

ParseResult(scheme='', netloc='', path='dkakasdkjdjakdjadjfalskdjfalk', params='', query='', fragment='')
ParseResult(scheme='https', netloc='stackoverflow.com', path='', params='', query='', fragment='')
'dkakasdkjdjakdjadjfalskdjfalk' string has no scheme or netloc.
'https://stackoverflow.com' is probably a valid url.

Here is a more concise function:

from urllib.parse import urlparse

min_attributes = ('scheme', 'netloc')


def is_valid(url, qualifying=min_attributes):
    tokens = urlparse(url)
    return all([getattr(tokens, qualifying_attr)
                for qualifying_attr in qualifying])
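The qualifying parameter is what makes this version adjustable. A short usage sketch, repeating the function so that it runs on its own; the stricter tuple that also demands a path is just an illustration, not a recommendation:

```python
from urllib.parse import urlparse

min_attributes = ('scheme', 'netloc')


def is_valid(url, qualifying=min_attributes):
    tokens = urlparse(url)
    return all(getattr(tokens, attr) for attr in qualifying)


# Illustrative stricter rule: the URL must also carry a non-empty path.
with_path = ('scheme', 'netloc', 'path')

print(is_valid('https://stackoverflow.com'))                       # default rule
print(is_valid('https://stackoverflow.com', with_path))            # empty path
print(is_valid('https://stackoverflow.com/questions', with_path))  # has a path
```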

Solution 7 - Python

note - lepl is no longer supported, sorry (you're welcome to use it, and i think the code below works, but it's not going to get updates).

rfc 3696 http://www.faqs.org/rfcs/rfc3696.html defines how to do this (for http urls and email). i implemented its recommendations in python using lepl (a parser library). see http://acooke.org/lepl/rfc3696.html

to use:

> easy_install lepl
...
> python
...
>>> from lepl.apps.rfc3696 import HttpUrl
>>> validator = HttpUrl()
>>> validator('google')
False
>>> validator('http://google')
False
>>> validator('http://google.com')
True

Solution 8 - Python

EDIT

> As pointed out by @Kwame, the code below accepts a URL even if the .com, .co, etc. part is missing.

> As also pointed out by @Blaise, a URL like https://www.google is syntactically valid; checking whether it actually resolves requires a separate DNS lookup.

This is simple and works:

min_attr contains the names of the parts that must be present for a URL to be considered valid, i.e. the http part and the google.com part.

result.scheme stores the scheme (http, without the ://) and

result.netloc stores the network location, e.g. the domain name google.com

from urllib.parse import urlparse  # Python 2: from urlparse import urlparse

def url_check(url):
    min_attr = ('scheme', 'netloc')
    try:
        result = urlparse(url)
        # True only if every required part is non-empty
        return all(getattr(result, attr) for attr in min_attr)
    except Exception:
        return False

all() returns True if every element it is given is truthy. So if both result.scheme and result.netloc are present, i.e. have some value, the URL is considered valid and the function returns True.
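The separate DNS check mentioned in the EDIT note could look roughly like the following. This is a sketch under assumptions: the name host_resolves is invented here, it uses socket.getaddrinfo from the standard library, and the result depends on the network and resolver of the machine it runs on.

```python
import socket
from urllib.parse import urlparse


def host_resolves(url):
    """Hypothetical helper: True if the URL's hostname resolves via DNS."""
    try:
        hostname = urlparse(url).hostname
    except ValueError:
        return False
    if not hostname:
        # Nothing to look up, e.g. 'google' has no netloc at all.
        return False
    try:
        socket.getaddrinfo(hostname, None)
    except (socket.gaierror, UnicodeError):
        # Lookup failed, or the name could not be encoded for the resolver.
        return False
    return True


# Needs network access to give a meaningful answer.
print(host_resolves("https://www.google"))
```

A successful lookup only says that the name resolves; it does not say that a web server answers at that address.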

Solution 9 - Python

Validate URL with `urllib` and Django-like regex

The Django URL validation regex was actually pretty good but I needed to tweak it a little bit for my use case. Feel free to adapt it to yours!

Python 3.7

import re
import urllib.parse

# Check https://regex101.com/r/A326u1/5 for reference
DOMAIN_FORMAT = re.compile(
    r"(?:^(\w{1,255}):(.{1,255})@|^)" # http basic authentication [optional]
    r"(?:(?:(?=\S{0,253}(?:$|:))" # check full domain length to be less than or equal to 253 (starting after http basic auth, stopping before port)
    r"((?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+" # check for at least one subdomain (maximum length per subdomain: 63 characters), dashes in between allowed
    r"(?:[a-z0-9]{1,63})))" # check for top level domain, no dashes allowed
    r"|localhost)" # accept also "localhost" only
    r"(:\d{1,5})?", # port [optional]
    re.IGNORECASE
)
SCHEME_FORMAT = re.compile(
    r"^(http|hxxp|ftp|fxp)s?$", # scheme: http(s) or ftp(s); the hxxp(s) and fxp(s) spellings are accepted too
    re.IGNORECASE
)

def validate_url(url: str):
    url = url.strip()

    if not url:
        raise Exception("No URL specified")

    if len(url) > 2048:
        raise Exception("URL exceeds its maximum length of 2048 characters (given length={})".format(len(url)))

    result = urllib.parse.urlparse(url)
    scheme = result.scheme
    domain = result.netloc

    if not scheme:
        raise Exception("No URL scheme specified")

    if not re.fullmatch(SCHEME_FORMAT, scheme):
        raise Exception("URL scheme must either be http(s) or ftp(s) (given scheme={})".format(scheme))

    if not domain:
        raise Exception("No URL domain specified")

    if not re.fullmatch(DOMAIN_FORMAT, domain):
        raise Exception("URL domain malformed (domain={})".format(domain))
    
    return url

Explanation

The code only validates the scheme and netloc parts of a given URL. (To do this properly, I split the URL with urllib.parse.urlparse() into its parts and then match the scheme and the netloc against their respective regexes.)

The netloc part stops before the first occurrence of a slash /, so port numbers are still part of the netloc, e.g.:

 https://www.google.com:80/search?q=python
 ^^^^^   ^^^^^^^^^^^^^^^^^
   |             |      
   |             +-- netloc (aka "domain" in my code)
   +-- scheme

IPv4 addresses are also accepted, because all-digit labels match the domain pattern; the octet ranges themselves are not checked.

IPv6 Support

If you want the URL validator to also work with IPv6 addresses, do the following:

Add is_valid_ipv6(ip) from Markus Jarderot's answer, which has a really good IPv6 validator regex
Add and not is_valid_ipv6(domain) to the last if
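If you would rather not maintain an IPv6 regex, the standard-library ipaddress module can do the same job. Note that this swaps in a different technique from the regex in Markus Jarderot's answer. A minimal sketch, where the name host_is_ipv6 is made up and the check is applied to the bracket-free hostname rather than to the raw netloc:

```python
import ipaddress
from urllib.parse import urlparse


def host_is_ipv6(url):
    """Hypothetical helper: True if the URL's host is an IPv6 literal."""
    try:
        # .hostname strips the square brackets and the port.
        host = urlparse(url).hostname
    except ValueError:
        return False
    if not host:
        return False
    try:
        ipaddress.IPv6Address(host)
    except ValueError:
        return False
    return True


print(host_is_ipv6("http://[2001:db8::1]:8080/index.html"))
print(host_is_ipv6("https://www.google.com:80/search?q=python"))
```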

Examples

Here are some examples of the regex for the netloc (aka domain) part in action:

IPv4 and alphanumeric: https://regex101.com/r/A326u1/5
IPv6: https://regex101.com/r/lKIIgq/1 (with the regex from Markus Jarderot's answer)

Solution 10 - Python

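The urlparse-based solutions above recognize a string like "http://www.google.com/path,www.yahoo.com/path" as valid. The regex below is a stricter alternative that also distinguishes private from public IP addresses. Test it against your own inputs, though: its resource-path part (/\S*) still accepts any run of non-whitespace characters.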

import re

# URL-link validation
# Raw strings avoid invalid-escape warnings on Python 3; the re module
# interprets \d, \S and \uXXXX itself.
ip_middle_octet = r"(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5]))"
ip_last_octet = r"(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))"

URL_PATTERN = re.compile(
    r"^"
    # protocol identifier
    r"(?:(?:https?|ftp|rtsp|rtp|mmp)://)"
    # user:pass authentication
    r"(?:\S+(?::\S*)?@)?"
    r"(?:"
    r"(?P<private_ip>"
    # IP address exclusion
    # private & local networks
    r"(?:localhost)|"
    r"(?:(?:10|127)" + ip_middle_octet + r"{2}" + ip_last_octet + r")|"
    r"(?:(?:169\.254|192\.168)" + ip_middle_octet + ip_last_octet + r")|"
    r"(?:172\.(?:1[6-9]|2\d|3[0-1])" + ip_middle_octet + ip_last_octet + r"))"
    r"|"
    # IP address dotted notation octets
    # excludes loopback network 0.0.0.0
    # excludes reserved space >= 224.0.0.0
    # excludes network & broadcast addresses
    # (first & last IP address of each class)
    r"(?P<public_ip>"
    r"(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])"
    r"" + ip_middle_octet + r"{2}"
    r"" + ip_last_octet + r")"
    r"|"
    # host name
    r"(?:(?:[a-z\u00a1-\uffff0-9_-]-?)*[a-z\u00a1-\uffff0-9_-]+)"
    # domain name
    r"(?:\.(?:[a-z\u00a1-\uffff0-9_-]-?)*[a-z\u00a1-\uffff0-9_-]+)*"
    # TLD identifier
    r"(?:\.(?:[a-z\u00a1-\uffff]{2,}))"
    r")"
    # port number
    r"(?::\d{2,5})?"
    # resource path
    r"(?:/\S*)?"
    # query string
    r"(?:\?\S*)?"
    r"$",
    re.UNICODE | re.IGNORECASE
)

def url_validate(url):
    """Return a match object if url matches URL_PATTERN, else None."""
    # URL_PATTERN is already compiled, so it can be used directly.
    return URL_PATTERN.match(url)
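The private_ip and public_ip groups in the pattern separate private or local hosts from public ones. For comparison, here is a sketch that makes a similar distinction with the standard-library ipaddress module instead of a regex. The function name classify_host and its return labels are invented for the example, and ipaddress's definition of "private" does not match the regex ranges exactly.

```python
import ipaddress
from urllib.parse import urlparse


def classify_host(url):
    """Hypothetical helper: 'private_ip', 'public_ip', 'hostname' or None."""
    try:
        host = urlparse(url).hostname
    except ValueError:
        return None
    if not host:
        return None  # no scheme://host part could be found
    if host == "localhost":
        return "private_ip"  # mirrors the regex, which groups localhost here
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        return "hostname"  # not an IP literal, so treat it as a name
    return "private_ip" if address.is_private else "public_ip"


for candidate in ("http://192.168.1.10/", "http://8.8.8.8/", "http://example.com/"):
    print(candidate, classify_host(candidate))
```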

Solution 11 - Python

Here's a regex solution, since the top-voted regex doesn't handle cases such as multi-part top-level domains (e.g. .co.uk). Some test cases are below.

import re

regex = re.compile(
    r"(\w+://)?"                # protocol                      (optional)
    r"(\w+\.)?"                 # host                          (optional)
    r"((\w+)\.(\w+))"           # domain
    r"(\.\w+)*"                 # top-level domain              (optional, can have > 1)
    r"([\w\-\._\~/]*)*(?<!\.)"  # path, params, anchors, etc.   (optional)
)

cases = [
    "http://www.google.com",
    "https://www.google.com",
    "http://google.com",
    "https://google.com",
    "www.google.com",
    "google.com",
    "http://www.google.com/~as_db3.2123/134-1a",
    "https://www.google.com/~as_db3.2123/134-1a",
    "http://google.com/~as_db3.2123/134-1a",
    "https://google.com/~as_db3.2123/134-1a",
    "www.google.com/~as_db3.2123/134-1a",
    "google.com/~as_db3.2123/134-1a",
    # .co.uk top level
    "http://www.google.co.uk",
    "https://www.google.co.uk",
    "http://google.co.uk",
    "https://google.co.uk",
    "www.google.co.uk",
    "google.co.uk",
    "http://www.google.co.uk/~as_db3.2123/134-1a",
    "https://www.google.co.uk/~as_db3.2123/134-1a",
    "http://google.co.uk/~as_db3.2123/134-1a",
    "https://google.co.uk/~as_db3.2123/134-1a",
    "www.google.co.uk/~as_db3.2123/134-1a",
    "google.co.uk/~as_db3.2123/134-1a",
    "https://...",
    "https://..",
    "https://.",
    "https://.google.com",
    "https://..google.com",
    "https://...google.com",
    "https://.google..com",
    "https://.google...com",
    "https://...google..com",
    "https://...google...com",
    ".google.com",
    ".google.co.",
    "https://google.co.",
]
for c in cases:
    m = regex.match(c)
    # A case passes only if the match exists and spans the whole string.
    print(c, m is not None and m.end() - m.start() == len(c))

Solution 12 - Python

Not directly relevant, but often it's required to identify whether some token CAN be a URL, without it being 100% correctly formed (i.e. with the https part omitted and so on). I read this post and did not find a solution, so I am posting my own here for the sake of completeness.

def get_domain_suffixes():
    import requests
    res=requests.get('https://publicsuffix.org/list/public_suffix_list.dat')
    lst=set()
    for line in res.text.split('\n'):
        if not line.startswith('//'):
            domains=line.split('.')
            cand=domains[-1]
            if cand:
                lst.add('.'+cand)
    return tuple(sorted(lst))

domain_suffixes=get_domain_suffixes()

def reminds_url(txt:str):
    """
    >>> reminds_url('yandex.ru.com/somepath')
    True
    
    """
    ltext=txt.lower().split('/')[0]
    return ltext.startswith(('http','www','ftp')) or ltext.endswith(domain_suffixes)
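To try the heuristic without downloading the public suffix list, the suffix tuple can be passed in. The sketch below uses a tiny hard-coded SAMPLE_SUFFIXES tuple as a stand-in for get_domain_suffixes(); that tuple and the extra suffixes parameter are assumptions for the example, not part of the code above.

```python
# Hypothetical stand-in for get_domain_suffixes(): a small offline sample.
SAMPLE_SUFFIXES = ('.com', '.net', '.org', '.ru')


def reminds_url(txt, suffixes=SAMPLE_SUFFIXES):
    """Loose check: could txt be a URL, even without a scheme?"""
    # Only look at the part before the first slash.
    ltext = txt.lower().split('/')[0]
    return ltext.startswith(('http', 'www', 'ftp')) or ltext.endswith(suffixes)


for token in ('yandex.ru.com/somepath', 'www.example', 'hello', 'readme.md'):
    print(token, reminds_url(token))
```

Being a loose prefix and suffix test, it also accepts tokens such as "httpd", so it suits candidate detection rather than validation.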

Solution 13 - Python

Function based on Dominic Tarro's answer:

import re
def is_url(x):
    return bool(re.match(
        r"(https?|ftp)://" # protocol
        r"(\w+(\-\w+)*\.)?" # host (optional)
        r"((\w+(\-\w+)*)\.(\w+))" # domain
        r"(\.\w+)*" # top-level domain (optional, can have > 1)
        r"([\w\-\._\~/]*)*(?<!\.)" # path, params, anchors, etc. (optional)
    , x))

Content Type	Original Author	Original Content on Stackoverflow
Question	Yugal Jindle	View Question on Stackoverflow
Solution 1 - Python	Jabba	View Answer on Stackoverflow
Solution 2 - Python	Drekembe	View Answer on Stackoverflow
Solution 3 - Python	cetver	View Answer on Stackoverflow
Solution 4 - Python	alemol	View Answer on Stackoverflow
Solution 5 - Python	Jonathan Prieto-Cubides	View Answer on Stackoverflow
Solution 6 - Python	dmmfll	View Answer on Stackoverflow
Solution 7 - Python	andrew cooke	View Answer on Stackoverflow
Solution 8 - Python	Padam Sethia	View Answer on Stackoverflow
Solution 9 - Python	winklerrr	View Answer on Stackoverflow
Solution 10 - Python	Сергей Дорофий	View Answer on Stackoverflow
Solution 11 - Python	Dominic Tarro	View Answer on Stackoverflow
Solution 12 - Python	Anatoly Alekseev	View Answer on Stackoverflow
Solution 13 - Python	pmiguelpinto	View Answer on Stackoverflow
