Python code to remove HTML tags from a string

PythonHtmlXmlStringParsing

Python Problem Overview


I have a text like this:

text = """<div>
<h1>Title</h1>
<p>A long text........ </p>
<a href=""> a link </a>
</div>"""

using pure Python, with no external module I want to have this:

>>> print remove_tags(text)
Title A long text..... a link

I know I can do it using lxml.html.fromstring(text).text_content() but I need to achieve the same in pure Python using builtin or std library for 2.6+

How can I do that?

Python Solutions


Solution 1 - Python

Using a regex

Using a regex, you can clean everything inside <> :

import re
# as per recommendation from @freylis, compile once only
CLEANR = re.compile('<.*?>') 

def cleanhtml(raw_html):
  cleantext = re.sub(CLEANR, '', raw_html)
  return cleantext

Some HTML texts can also contain entities that are not enclosed in brackets, such as '&nsbm'. If that is the case, then you might want to write the regex as

CLEANR = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')

This link contains more details on this.

Using BeautifulSoup

You could also use BeautifulSoup additional package to find out all the raw text.

You will need to explicitly set a parser when calling BeautifulSoup I recommend "lxml" as mentioned in alternative answers (much more robust than the default one (html.parser) (i.e. available without additional install).

from bs4 import BeautifulSoup
cleantext = BeautifulSoup(raw_html, "lxml").text

But it doesn't prevent you from using external libraries, so I recommend the first solution.

EDIT: To use lxml you need to pip install lxml.

Solution 2 - Python

Python has several XML modules built in. The simplest one for the case that you already have a string with the full HTML is xml.etree, which works (somewhat) similarly to the lxml example you mention:

def remove_tags(text):
    return ''.join(xml.etree.ElementTree.fromstring(text).itertext())

Solution 3 - Python

Note that this isn't perfect, since if you had something like, say, <a title=">"> it would break. However, it's about the closest you'd get in non-library Python without a really complex function:

import re

TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    return TAG_RE.sub('', text)

However, as lvc mentions xml.etree is available in the Python Standard Library, so you could probably just adapt it to serve like your existing lxml version:

def remove_tags(text):
    return ''.join(xml.etree.ElementTree.fromstring(text).itertext())

Solution 4 - Python

There's a simple way to this in any C-like language. The style is not Pythonic but works with pure Python:

def remove_html_markup(s):
    tag = False
    quote = False
    out = ""

    for c in s:
            if c == '<' and not quote:
                tag = True
            elif c == '>' and not quote:
                tag = False
            elif (c == '"' or c == "'") and tag:
                quote = not quote
            elif not tag:
                out = out + c

    return out

The idea based in a simple finite-state machine and is detailed explained here: http://youtu.be/2tu9LTDujbw

You can see it working here: http://youtu.be/HPkNPcYed9M?t=35s

PS - If you're interested in the class(about smart debugging with python) I give you a link: https://www.udacity.com/course/software-debugging--cs259. It's free!

Solution 5 - Python

global temp

temp =''

s = ' '

def remove_strings(text):

    global temp 

    if text == '':

        return temp

    start = text.find('<')

    end = text.find('>')

    if start == -1 and end == -1 :

        temp = temp + text

    return temp

newstring = text[end+1:]

fresh_start = newstring.find('<')

if newstring[:fresh_start] != '':
    
    temp += s+newstring[:fresh_start]

remove_strings(newstring[fresh_start:])

return temp

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionBruno Rocha - rochacbrunoView Question on Stackoverflow
Solution 1 - Pythonc24bView Answer on Stackoverflow
Solution 2 - PythonlvcView Answer on Stackoverflow
Solution 3 - PythonAmberView Answer on Stackoverflow
Solution 4 - PythonMedeirosView Answer on Stackoverflow
Solution 5 - Pythonuser1899895View Answer on Stackoverflow