How can I retrieve the page title of a webpage using Python?

Python Problem Overview

How can I retrieve the page title of a webpage (title html tag) using Python?

Python Solutions

Solution 1 - Python

Here's a simplified version of @Vinko Vrsalovic's answer:

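import urllib.request
from bs4 import BeautifulSoup

# Python 3 with Beautiful Soup 4; urlopen() returns a file-like object that BeautifulSoup can read
soup = BeautifulSoup(urllib.request.urlopen("https://www.google.com"), "html.parser")
print(soup.title.string)

NOTE:

soup.title finds the first title element anywhere in the HTML document.
title.string assumes the element has only one child node, and that the child node is a string; if the element has several children, .string is None.

On Python 2 with the legacy BeautifulSoup 3.x package, use different imports (and the print statement instead of print()):

import urllib2
from BeautifulSoup import BeautifulSoup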

Solution 2 - Python

I always use lxml for such tasks. You could use BeautifulSoup as well.

import lxml.html
t = lxml.html.parse(url)
print(t.find(".//title").text)

EDIT based on comment:

from urllib.request import urlopen  # Python 3; on Python 2 use: from urllib2 import urlopen
from lxml.html import parse

url = "https://www.google.com"
page = urlopen(url)
p = parse(page)
print(p.find(".//title").text)

Solution 3 - Python

No HTML parsing library is needed for this. Fetch the page with Requests and slice the title out of the response text:

>>> import requests
>>> headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:51.0) Gecko/20100101 Firefox/51.0'}
>>> n = requests.get('http://www.imdb.com/title/tt0108778/', headers=headers)
>>> al = n.text
>>> al[al.find('<title>') + 7 : al.find('</title>')]
u'Friends (TV Series 1994\u20132004) - IMDb'
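
The slicing above relies on the page containing a literal lowercase <title> tag with no attributes; if either find() call returns -1, the slice quietly yields the wrong text. A minimal sketch of a more defensive variant follows. The helper name title_from_text is hypothetical, and it assumes the page has already been downloaded as text (for example n.text above).

```python
import string

# ASCII-only lowercasing keeps the string the same length, so indexes found
# in the lowered copy are still valid in the original text.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def title_from_text(html_text):
    """Return the text between the first <title ...> and </title>, or None."""
    lowered = html_text.translate(_ASCII_LOWER)
    start = lowered.find('<title')
    if start == -1:
        return None
    open_end = lowered.find('>', start)       # skip any attributes on the tag
    if open_end == -1:
        return None
    close = lowered.find('</title', open_end)
    if close == -1:
        return None
    return html_text[open_end + 1:close].strip()


print(title_from_text('<html><head><TITLE lang="en"> Friends </TITLE></head></html>'))
```

This is still plain string searching, so it can be fooled by a "<title" that appears inside a comment or script; a real parser avoids that.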

Solution 4 - Python

The mechanize Browser object has a title() method. So the code from this post can be rewritten as:

from mechanize import Browser
br = Browser()
br.open("http://www.google.com/")
print(br.title())

Solution 5 - Python

This is probably overkill for such a simple task, but if you plan to do more than that, it's saner to start from these tools (mechanize, BeautifulSoup) because they are much easier to use than the alternatives (urllib to get the content, and regexes or some other parser to parse the HTML).


#!/usr/bin/env python
#coding:utf-8

from bs4 import BeautifulSoup
from mechanize import Browser

# This retrieves the webpage content
br = Browser()
res = br.open("https://www.google.com/")
data = res.get_data()

# This parses the content (the parser is named explicitly to avoid a bs4 warning)
soup = BeautifulSoup(data, "html.parser")
title = soup.find('title')

# This outputs the content :)
print(title.get_text())

Solution 6 - Python

Using HTMLParser:

from urllib.request import urlopen
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.match = False
        self.title = ''

    def handle_starttag(self, tag, attributes):
        self.match = tag == 'title'

    def handle_data(self, data):
        if self.match:
            self.title = data
            self.match = False

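url = "http://example.com/"
# Decode the response bytes; str() on bytes would give the "b'...'" repr instead of the markup
html_string = urlopen(url).read().decode('utf-8', errors='replace')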

parser = TitleParser()
parser.feed(html_string)
print(parser.title)  # prints: Example Domain
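
The parser above keeps a single chunk of text per title tag. Depending on the Python version and on how the markup is fed, the text of one element can arrive in more than one handle_data() call, and a later title element (for example inside an inline SVG) would overwrite the result. The following is a sketch of a variant that joins the chunks and stops after the first title; the class name CollectingTitleParser and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class CollectingTitleParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False   # currently between <title> and </title>
        self.done = False       # first title already captured
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == 'title' and not self.done:
            self.in_title = True

    def handle_data(self, data):
        if self.in_title:
            self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag == 'title' and self.in_title:
            self.in_title = False
            self.done = True

    @property
    def title(self):
        return ''.join(self.chunks).strip() if self.done else None


parser = CollectingTitleParser()
# Feed the document in pieces, as when reading a response in blocks.
for piece in ('<html><head><title>Example ',
              'Domain</title></head>',
              '<body><svg><title>icon</title></svg></body></html>'):
    parser.feed(piece)
parser.close()
print(parser.title)  # prints: Example Domain
```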

Solution 7 - Python

Use soup.select_one to target the title tag:

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('url')  # replace 'url' with the address of the page
soup = bs(r.content, 'lxml')
print(soup.select_one('title').text)

Solution 8 - Python

Using regular expressions:

import re

# raw_html is the page source as a str, e.g. requests.get(url).text
match = re.search('<title>(.*?)</title>', raw_html)
title = match.group(1) if match else 'No title'
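
The pattern above only matches a lowercase <title> tag without attributes whose text sits on one line, and it returns entities such as &amp; undecoded. A hedged sketch of a slightly more tolerant version follows; extract_title and TITLE_RE are names chosen here for illustration, and the sample markup is invented.

```python
import html
import re

# IGNORECASE accepts <TITLE>; DOTALL lets the title span several lines;
# [^>]* skips attributes on the opening tag.
TITLE_RE = re.compile(r'<title\b[^>]*>(.*?)</title\s*>', re.IGNORECASE | re.DOTALL)


def extract_title(raw_html, default='No title'):
    match = TITLE_RE.search(raw_html)
    if not match:
        return default
    text = html.unescape(match.group(1))   # turn &amp; into &, etc.
    return ' '.join(text.split())          # collapse newlines and runs of spaces


print(extract_title('<TITLE id="t">\n  Tom &amp; Jerry\n</TITLE>'))  # prints: Tom & Jerry
```

Regular expressions remain a shortcut here: markup inside comments or scripts can still produce false matches, which the parser-based solutions avoid.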

Solution 9 - Python

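In Python 2, soup.title.string returns a unicode string. To convert that into a normal (byte) string, you need to do string = string.encode('ascii', 'ignore'), which drops any non-ASCII characters. In Python 3 the value is already a str, so no conversion is needed.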
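
Note that encode('ascii', 'ignore') is lossy, and in Python 3 it returns a bytes object rather than a str. The sketch below shows the effect on the IMDb title from Solution 3, plus one possible alternative; to_ascii_text is a hypothetical helper that first decomposes accented letters with unicodedata.normalize so the base letter survives.

```python
import unicodedata

title = 'Friends (TV Series 1994\u20132004) - IMDb'

# 'ignore' silently drops the en dash between the two years.
print(title.encode('ascii', 'ignore'))  # b'Friends (TV Series 19942004) - IMDb'


def to_ascii_text(text):
    """Return an ASCII-only str, keeping the base letter of accented characters."""
    decomposed = unicodedata.normalize('NFKD', text)   # 'é' becomes 'e' + combining accent
    return decomposed.encode('ascii', 'ignore').decode('ascii')


print(to_ascii_text('Am\u00e9lie'))  # Amelie
```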

Solution 10 - Python

Here is a fault-tolerant HTMLParser implementation.
You can throw pretty much anything at get_title() without it breaking. If anything unexpected happens, get_title() will return None.
When Parser() downloads the page, it converts it to ASCII regardless of the charset used in the page, ignoring any errors.
It would be trivial to change to_ascii() to convert the data into UTF-8 or any other encoding: just add an encoding argument and rename the function to something like to_encoding().
By default, HTMLParser() will break on broken HTML; it will even break on trivial things like mismatched tags. To prevent this behavior, I replaced HTMLParser()'s error method with a function that ignores the errors.

#-*-coding:utf8;-*-
#qpy:3
#qpy:console

''' 
Extract the title from a web page using
the standard lib.
'''

from html.parser import HTMLParser
from urllib.request import urlopen
import urllib.error

def error_callback(*_, **__):
    pass

def is_string(data):
    return isinstance(data, str)
    
def is_bytes(data):
    return isinstance(data, bytes)
    
def to_ascii(data):
    if is_string(data):
        data = data.encode('ascii', errors='ignore')
    elif is_bytes(data):
        data = data.decode('ascii', errors='ignore')
    else:
        data = str(data).encode('ascii', errors='ignore')
    return data
    
    
class Parser(HTMLParser):
    def __init__(self, url):
        self.title = None
        self.rec = False
        HTMLParser.__init__(self)
        # Install the no-op error handler before feeding, so it is in place while parsing
        self.error = error_callback
        try:
            self.feed(to_ascii(urlopen(url).read()))
        except urllib.error.HTTPError:
            return
        except urllib.error.URLError:
            return
        except ValueError:
            return

        self.rec = False
    
    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.rec = True
                
    def handle_data(self, data):
        if self.rec:
            self.title = data
            
    def handle_endtag(self, tag):
        if tag == 'title':
            self.rec = False
            
            
def get_title(url):
    return Parser(url).title
    
print(get_title('http://www.google.com'))
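
The to_encoding() change described above could look roughly like the sketch below. It is an illustration, not the author's code: unlike to_ascii(), it always returns a str (which is what HTMLParser.feed() expects), and the default of 'utf-8' is an assumption.

```python
def to_encoding(data, encoding='utf-8'):
    """Return data as a str limited to what `encoding` can represent."""
    if isinstance(data, bytes):
        # Downloaded pages arrive as bytes; undecodable bytes are dropped.
        return data.decode(encoding, errors='ignore')
    if isinstance(data, str):
        # Round-trip to drop characters the target encoding cannot represent.
        return data.encode(encoding, errors='ignore').decode(encoding)
    return str(data)


print(to_encoding(b'caf\xc3\xa9'))           # prints: café
print(to_encoding(b'caf\xc3\xa9', 'ascii'))  # prints: caf
```

Inside Parser.__init__ the call would then read self.feed(to_encoding(urlopen(url).read())), optionally passing the charset reported by the server.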

Solution 11 - Python

In Python 3, we can use urlopen from urllib.request and BeautifulSoup from the bs4 library to fetch the page title.

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("https://www.google.com")
soup = BeautifulSoup(html, 'lxml')
print(soup.title.string)

Here we are using the 'lxml' parser, which the Beautiful Soup documentation describes as very fast. It requires the lxml package to be installed.

Solution 12 - Python

Using lxml...

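Getting it from a page meta-tagged according to the Facebook Open Graph protocol:

import lxml.html  # parse is a function in lxml.html, not a module that can be imported
html_doc = lxml.html.parse(some_url)  # some_url is the address of the page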

t = html_doc.xpath('//meta[@property="og:title"]/@content')[0]

or using .xpath with lxml:

t = html_doc.xpath(".//title")[0].text
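
Indexing the xpath result with [0] raises IndexError when a page has no og:title meta tag. For comparison, here is a sketch of the same Open Graph lookup done with the standard library's HTMLParser instead of lxml, returning None when the tag is absent. The names OgTitleParser and og_title and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class OgTitleParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.og_title = None

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        if tag == 'meta' and self.og_title is None:
            attributes = dict(attrs)
            if attributes.get('property') == 'og:title':
                self.og_title = attributes.get('content')


def og_title(html_text):
    """Return the content of the first og:title meta tag, or None."""
    parser = OgTitleParser()
    parser.feed(html_text)
    parser.close()
    return parser.og_title


print(og_title('<head><meta property="og:title" content="The Rock" /></head>'))  # prints: The Rock
```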

Content Type	Original Author	Original Content on Stackoverflow
Question	cschol	View Question on Stackoverflow
Solution 1 - Python	jfs	View Answer on Stackoverflow
Solution 2 - Python	Peter Hoffmann	View Answer on Stackoverflow
Solution 3 - Python	Rahul Chawla	View Answer on Stackoverflow
Solution 4 - Python	codeape	View Answer on Stackoverflow
Solution 5 - Python	Vinko Vrsalovic	View Answer on Stackoverflow
Solution 6 - Python	Finn	View Answer on Stackoverflow
Solution 7 - Python	QHarr	View Answer on Stackoverflow
Solution 8 - Python	Finn	View Answer on Stackoverflow
Solution 9 - Python	Sai Kiriti Badam	View Answer on Stackoverflow
Solution 10 - Python	Ricky Wilson	View Answer on Stackoverflow
Solution 11 - Python	S Habeeb Ullah	View Answer on Stackoverflow
Solution 12 - Python	markling	View Answer on Stackoverflow