Regex to extract URLs from href attribute in HTML with Python

PythonRegexUrl

Python Problem Overview


> Possible Duplicate:
> What is the best regular expression to check if a string is a valid URL?

Considering a string as follows:

string = "<p>Hello World</p><a href="http://example.com">More Examples</a><a href="http://example2.com">Even More Examples</a>"

How could I, with Python, extract the urls, inside the anchor tag's href? Something like:

>>> url = getURLs(string)
>>> url
['http://example.com', 'http://example2.com']

Thanks!

Python Solutions


Solution 1 - Python

import re

url = '<p>Hello World</p><a href="http://example.com">More Examples</a><a href="http://example2.com">Even More Examples</a>'
                
urls = re.findall('https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', url)
            
>>> print urls
['http://example.com', 'http://example2.com']

Solution 2 - Python

The best answer is...

Don't use a regex

The expression in the accepted answer misses many cases. Among other things, URLs can have unicode characters in them. The regex you want is here, and after looking at it, you may conclude that you don't really want it after all. The most correct version is ten-thousand characters long.

Admittedly, if you were starting with plain, unstructured text with a bunch of URLs in it, then you might need that ten-thousand-character-long regex. But if your input is structured, use the structure. Your stated aim is to "extract the url, inside the anchor tag's href." Why use a ten-thousand-character-long regex when you can do something much simpler?

Parse the HTML instead

For many tasks, using Beautiful Soup will be far faster and easier to use:

>>> from bs4 import BeautifulSoup as Soup
>>> html = Soup(s, 'html.parser')           # Soup(s, 'lxml') if lxml is installed
>>> [a['href'] for a in html.find_all('a')]
['http://example.com', 'http://example2.com']

If you prefer not to use external tools, you can also directly use Python's own built-in HTML parsing library. Here's a really simple subclass of HTMLParser that does exactly what you want:

from html.parser import HTMLParser

class MyParser(HTMLParser):
    def __init__(self, output_list=None):
        HTMLParser.__init__(self)
        if output_list is None:
            self.output_list = []
        else:
            self.output_list = output_list
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.output_list.append(dict(attrs).get('href'))

Test:

>>> p = MyParser()
>>> p.feed(s)
>>> p.output_list
['http://example.com', 'http://example2.com']

You could even create a new method that accepts a string, calls feed, and returns output_list. This is a vastly more powerful and extensible way than regular expressions to extract information from html.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
Questionuser825286View Question on Stackoverflow
Solution 1 - PythonJohnJohnGaView Answer on Stackoverflow
Solution 2 - PythonsenderleView Answer on Stackoverflow