Extract domain from URL in Python

Python, Url

Python Problem Overview


I have a URL like:
http://abc.hostname.com/somethings/anything/

I want to get:
hostname.com

What module can I use to accomplish this?
I want to use the same module and method in Python 2.

Python Solutions


Solution 1 - Python

For parsing the domain of a URL in Python 3, you can use:

from urllib.parse import urlparse

domain = urlparse('http://www.example.test/foo/bar').netloc
print(domain) # --> www.example.test

However, to reliably extract the registered domain (example.test in this example), you need to install a specialized library (e.g., tldextract).
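
One caveat worth noting: netloc also contains any port number or credentials present in the URL. A minimal sketch, assuming you only want the host name, is to read the hostname attribute of the parse result instead. The URL below is made up for illustration.

```python
from urllib.parse import urlparse

# Hypothetical URL with credentials and a port, for illustration only.
parsed = urlparse('http://user:secret@www.example.test:8080/foo/bar')

netloc = parsed.netloc      # includes credentials and port
hostname = parsed.hostname  # host only, lower-cased
port = parsed.port          # port as an int, or None if absent

print(netloc)    # --> user:secret@www.example.test:8080
print(hostname)  # --> www.example.test
print(port)      # --> 8080
```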

Solution 2 - Python

Instead of regex or hand-written solutions, you can use Python's urlparse:

from urllib.parse import urlparse

print(urlparse('http://abc.hostname.com/somethings/anything/'))
>> ParseResult(scheme='http', netloc='abc.hostname.com', path='/somethings/anything/', params='', query='', fragment='')

print(urlparse('http://abc.hostname.com/somethings/anything/').netloc)
>> abc.hostname.com

To get the domain without the subdomain:

t = urlparse('http://abc.hostname.com/somethings/anything/').netloc
print('.'.join(t.split('.')[-2:]))
>> hostname.com
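
Keeping the last two labels is only a heuristic: it gives the wrong answer for suffixes made of several labels, such as co.uk. The sketch below shows one possible workaround, assuming a small hand-maintained set of multi-label suffixes. The names MULTI_LABEL_SUFFIXES and naive_domain are hypothetical, and the set is deliberately incomplete; a real solution would rely on the full Public Suffix List.

```python
from urllib.parse import urlparse

# Hypothetical, deliberately incomplete set of two-label suffixes.
MULTI_LABEL_SUFFIXES = {'co.uk', 'com.au', 'co.jp'}

def naive_domain(url):
    """Return the last two labels of the host, or the last three when the
    final two labels form a known multi-label suffix."""
    host = urlparse(url).hostname or ''
    labels = host.split('.')
    if '.'.join(labels[-2:]) in MULTI_LABEL_SUFFIXES:
        return '.'.join(labels[-3:])
    return '.'.join(labels[-2:])

print(naive_domain('http://abc.hostname.com/somethings/anything/'))  # --> hostname.com
print(naive_domain('http://www.example.co.uk/page'))                 # --> example.co.uk
```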

Solution 3 - Python

You can use tldextract.

Example code:

from tldextract import extract
result = extract("http://abc.hostname.com/somethings/anything/")  # subdomain 'abc', domain 'hostname', suffix 'com'
url = result.domain + '.' + result.suffix  # hostname.com
print(url)

Solution 4 - Python

Assuming you have the URL in an accessible string (my_string below), and assuming we want to handle any number of subdomain levels, you could do:

my_string = 'http://abc.hostname.com/somethings/anything/'
token = my_string.split('http://')[1].split('/')[0]
top_level = token.split('.')[-2] + '.' + token.split('.')[-1]

We first split on http:// to remove the scheme from the string. Then we split on / to drop all directory and sub-directory parts, and the [-2] takes the second-to-last token after splitting on ., which we join with the last token to get the domain without its subdomains.

There are probably more graceful and robust ways to do this; for example, it breaks if the URL does not start with http:// or if the host contains no dot, but it's a start :)
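
One way to make the string-splitting approach a little less fragile, sketched here without any imports, is to use str.partition, which never raises when the separator is missing. The function name last_two_labels is hypothetical, and the sketch still assumes that the last two labels are the ones you want.

```python
def last_two_labels(url):
    """Strip the scheme and path with plain string operations, then keep
    the last two dot-separated labels of the host."""
    # partition() returns the whole string as the first item when the
    # separator is absent, so URLs without a scheme still work.
    before, sep, after = url.partition('://')
    rest = after if sep else before
    host = rest.partition('/')[0]
    return '.'.join(host.split('.')[-2:])

print(last_two_labels('http://abc.hostname.com/somethings/anything/'))  # --> hostname.com
print(last_two_labels('https://abc.hostname.com'))                      # --> hostname.com
print(last_two_labels('abc.hostname.com/path'))                         # --> hostname.com
```

This version does not strip ports or credentials from the host; for those cases urlparse is the safer choice.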

Solution 5 - Python

The best way I found is:

from six.moves.urllib.parse import urlparse

t = urlparse('http://asas.abc.hostname.com/somethings/anything/').netloc

print('.'.join(t.split('.')[-2:]))
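
If adding six as a dependency is not an option, a common alternative is a try/except import, sketched below. It assumes only the standard library: urllib.parse on Python 3 and the urlparse module on Python 2.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2

t = urlparse('http://asas.abc.hostname.com/somethings/anything/').netloc
domain = '.'.join(t.split('.')[-2:])
print(domain)  # --> hostname.com
```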

Solution 6 - Python

Try:

from urlparse import urlparse  # Python 2 module; on Python 3 use urllib.parse

parsed = urlparse('http://abc.hostname.com/somethings/anything/')
domain = parsed.netloc.split(".")[-2:]
host = ".".join(domain)
print(host)  # prints hostname.com

Attributions

All content for this solution is sourced from the original question on Stack Overflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stack Overflow
Question | Amit | View Question on Stack Overflow
Solution 1 - Python | Philipp Claßen | View Answer on Stack Overflow
Solution 2 - Python | philshem | View Answer on Stack Overflow
Solution 3 - Python | Deivanai Subramanian | View Answer on Stack Overflow
Solution 4 - Python | Henry | View Answer on Stack Overflow
Solution 5 - Python | Maor Kavod | View Answer on Stack Overflow
Solution 6 - Python | Sathish Kumar VG | View Answer on Stack Overflow