Remove HTML tags not on an allowed list from a Python string

Tags: python, html

Python Problem Overview


I have a string containing text and HTML. I want to remove or otherwise disable some HTML tags, such as <script>, while allowing others, so that I can render the string safely on a web page. Given a list of allowed tags, how can I process the string to remove all other tags?

Python Solutions


Solution 1 - Python

Use lxml.html.clean! It's VERY easy!

from lxml.html.clean import clean_html
print(clean_html(html))

(Note: in lxml 5.2 and later, the clean module lives in the separate lxml-html-clean package; installing lxml[html_clean] keeps the import above working.)

Suppose you have the following HTML:

html = '''\
<html>
 <head>
   <script type="text/javascript" src="evil-site"></script>
   <link rel="alternate" type="text/rss" src="evil-rss">
   <style>
     body {background-image: url(javascript:do_evil)};
     div {color: expression(evil)};
   </style>
 </head>
 <body onload="evil_function()">
    <!-- I am interpreted for EVIL! -->
   <a href="javascript:evil_function()">a link</a>
   <a href="#" onclick="evil_function()">another link</a>
   <p onclick="evil_function()">a paragraph</p>
   <div style="display: none">secret EVIL!</div>
   <object> of EVIL! </object>
   <iframe src="evil-site"></iframe>
   <form action="evil-site">
     Password: <input type="password" name="password">
   </form>
   <blink>annoying EVIL!</blink>
   <a href="evil-site">spam spam SPAM!</a>
   <image src="evil!">
 </body>
</html>'''

The result:

<html>
  <body>
    <div>
      <style>/* deleted */</style>
      <a href="">a link</a>
      <a href="#">another link</a>
      <p>a paragraph</p>
      <div>secret EVIL!</div>
      of EVIL!
      Password:
      annoying EVIL!
      <a href="evil-site">spam spam SPAM!</a>
      <img src="evil!">
    </div>
  </body>
</html>

You can customize which elements are cleaned, and how, via the Cleaner class.
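
For instance, here's a minimal sketch of a whitelist-style Cleaner (my own example, not from the original answer; note that allow_tags requires remove_unknown_tags=False):

from lxml.html.clean import Cleaner

cleaner = Cleaner(allow_tags=['p', 'a', 'strong', 'em'],
                  remove_unknown_tags=False,
                  safe_attrs_only=True)  # also drops on* handlers, style, etc.
print(cleaner.clean_html('<p onclick="evil()">hi <blink>there</blink></p>'))
# roughly: '<p>hi there</p>' -- blink dropped, onclick stripped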

Solution 2 - Python

Here's a simple solution using BeautifulSoup:

from bs4 import BeautifulSoup

VALID_TAGS = ['strong', 'em', 'p', 'ul', 'li', 'br']

def sanitize_html(value):
    soup = BeautifulSoup(value, 'html.parser')
    for tag in soup.find_all(True):
        if tag.name not in VALID_TAGS:
            tag.hidden = True  # suppress the tag's own markup but keep its children
    return soup.decode_contents()

If you want to remove the contents of the invalid tags as well, call tag.extract() instead of setting tag.hidden.
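
For example (my own quick illustration, not from the original answer), note that a hidden tag's text still leaks through:

print(sanitize_html('<p>hello</p><script>alert("evil")</script>'))
# -> '<p>hello</p>alert("evil")' -- the markup is gone, the script text remains
# with tag.extract() instead, the output is just '<p>hello</p>'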

You might also look into using lxml and Tidy.

Solution 3 - Python

The above solutions via Beautiful Soup will not work as-is. You might be able to hack something together with Beautiful Soup beyond them, because it does give you access to the parse tree. I think I'll try to solve the problem properly at some point, but it's a week-long project or so, and I don't have a free week soon.

Just to be specific: not only will Beautiful Soup throw exceptions for some parsing errors which the above code doesn't catch, but there are also plenty of very real XSS vulnerabilities that it doesn't catch, like:

<<script>script> alert("Haha, I hacked your page."); </</script>script>
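
To see why this survives naive filtering, consider this sketch (my own example, not from the original answer): a single pass that deletes <script> and </script> substrings splices the leftovers back into a working tag.

import re

payload = '<<script>script> alert("Haha, I hacked your page."); </</script>script>'
stripped = re.sub(r'</?script>', '', payload)
print(stripped)
# -> '<script> alert("Haha, I hacked your page."); </script>' -- the attack survives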

Probably the best thing you can do instead is to escape every < character as &lt;, prohibiting all HTML, and then use a restricted subset like Markdown to render formatting properly. You can then re-introduce a few common bits of HTML with regexes. Here's what the process looks like, roughly:

import re
from markdown import markdown  # the Python "markdown" package

_lt_  = re.compile('<')
_tc_  = '~(lt)~'          # or whatever, so long as markdown doesn't mangle it
_tcp_ = re.escape(_tc_)   # escape the placeholder for use inside patterns (the parens!)
_ok_ = re.compile(_tcp_ + r'(/?(?:u|b|i|em|strong|sup|sub|p|br|q|blockquote|code))>', re.I)
_sqrt_ = re.compile(_tcp_ + r'sqrt>', re.I)     # just to give an example of extending
_endsqrt_ = re.compile(_tcp_ + r'/sqrt>', re.I) # html syntax with your own elements
_tcre_ = re.compile(_tcp_)

def sanitize(text):
    text = _lt_.sub(_tc_, text)     # neutralize every literal '<'
    text = markdown(text)           # render the Markdown formatting
    text = _ok_.sub(r'<\1>', text)  # re-enable the whitelisted tags
    text = _sqrt_.sub(r'&radic;<span style="text-decoration:overline;">', text)
    text = _endsqrt_.sub(r'</span>', text)
    return _tcre_.sub('&lt;', text) # escape whatever placeholders remain

I haven't tested that code thoroughly, so there may be bugs. But you see the general idea: you have to blacklist all HTML in general before you whitelist the OK stuff.
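
Assuming the sanitize() above and the markdown package, a quick run would look roughly like this (my own example):

print(sanitize('*hello* <b>bold</b> <script>evil()</script>'))
# roughly: '<p><em>hello</em> <b>bold</b> &lt;script>evil()&lt;/script></p>'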

Solution 4 - Python

Here is what I use in my own project. The acceptable_elements/attributes lists come from feedparser, and BeautifulSoup does the work.

from bs4 import BeautifulSoup

acceptable_elements = ['a', 'abbr', 'acronym', 'address', 'area', 'b', 'big',
      'blockquote', 'br', 'button', 'caption', 'center', 'cite', 'code', 'col',
      'colgroup', 'dd', 'del', 'dfn', 'dir', 'div', 'dl', 'dt', 'em',
      'font', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'hr', 'i', 'img', 
      'ins', 'kbd', 'label', 'legend', 'li', 'map', 'menu', 'ol', 
      'p', 'pre', 'q', 's', 'samp', 'small', 'span', 'strike',
      'strong', 'sub', 'sup', 'table', 'tbody', 'td', 'tfoot', 'th',
      'thead', 'tr', 'tt', 'u', 'ul', 'var']

acceptable_attributes = ['abbr', 'accept', 'accept-charset', 'accesskey',
  'action', 'align', 'alt', 'axis', 'border', 'cellpadding', 'cellspacing',
  'char', 'charoff', 'charset', 'checked', 'cite', 'clear', 'cols',
  'colspan', 'color', 'compact', 'coords', 'datetime', 'dir', 
  'enctype', 'for', 'headers', 'height', 'href', 'hreflang', 'hspace',
  'id', 'ismap', 'label', 'lang', 'longdesc', 'maxlength', 'method',
  'multiple', 'name', 'nohref', 'noshade', 'nowrap', 'prompt', 
  'rel', 'rev', 'rows', 'rowspan', 'rules', 'scope', 'shape', 'size',
  'span', 'src', 'start', 'summary', 'tabindex', 'target', 'title', 'type',
  'usemap', 'valign', 'value', 'vspace', 'width']

def clean_html(fragment):
    while True:
        soup = BeautifulSoup(fragment, 'html.parser')
        removed = False
        for tag in soup.find_all(True):  # find all tags
            if tag.name not in acceptable_elements:
                tag.extract()  # remove the bad ones
                removed = True
            else:  # it might have bad attributes
                for attr in list(tag.attrs):
                    if attr not in acceptable_attributes:
                        del tag[attr]

        # turn it back into html
        fragment = str(soup)

        if removed:
            # removing tags can splice leftover markup into new, unvetted
            # tags, so we reparse the html until it stops changing
            continue  # next round

        return fragment

Some small tests to make sure this behaves correctly:

tests = [   #text should work
            ('<p>this is text</p>but this too', '<p>this is text</p>but this too'),
            # make sure we cant exploit removal of tags
            ('<<script></script>script> alert("Haha, I hacked your page."); <<script></script>/script>', ''),
            # try the same trick with attributes, gives an Exception
            ('<div on<script></script>load="alert("Haha, I hacked your page.");">1</div>',  Exception),
             # no tags should be skipped
            ('<script>bad</script><script>bad</script><script>bad</script>', ''),
            # leave valid tags but remove bad attributes
            ('<a href="good" onload="bad" onclick="bad" alt="good">1</div>', '<a href="good" alt="good">1</a>'),
]

for text, out in tests:
    if isinstance(out, type) and issubclass(out, Exception):
        try:
            clean_html(text)
            raise AssertionError("expected an exception for %r" % text)
        except out:
            pass  # the expected exception type was raised
    else:
        res = clean_html(text)
        assert res == out, "%s => %s != %s" % (text, res, out)

Solution 5 - Python

Bleach does a better job and has more useful options. It's built on html5lib and ready for production. Check the documentation for the bleach.clean function. Its default configuration escapes unsafe tags like <script> while allowing useful tags like <a>.

import bleach
bleach.clean("<script>evil</script> <a href='http://example.com'>example</a>")
# '&lt;script&gt;evil&lt;/script&gt; <a href="http://example.com">example</a>'
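
You can also pass your own whitelist; a short sketch (my own example) using bleach.clean's tags, attributes and strip parameters:

import bleach

html = '<p class="x" onclick="evil()">hi <a href="/ok" onmouseover="evil()">link</a></p>'
print(bleach.clean(html,
                   tags=['p', 'a'],                      # allowed tags
                   attributes={'a': ['href'], 'p': []},  # allowed attributes per tag
                   strip=True))                          # drop disallowed tags instead of escaping them
# -> '<p>hi <a href="/ok">link</a></p>'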

Solution 6 - Python

I modified Bryan's solution with BeautifulSoup to address the problem raised by Chris Drost. A little crude, but it does the job:

from bs4 import BeautifulSoup, Comment

VALID_TAGS = {'strong': [],
              'em': [],
              'p': [],
              'ol': [],
              'ul': [],
              'li': [],
              'br': [],
              'a': ['href', 'title']
              }

def sanitize_html(value, valid_tags=VALID_TAGS):
    soup = BeautifulSoup(value, 'html.parser')
    # strip comments such as <!-- I am interpreted for EVIL! -->
    for comment in soup.find_all(text=lambda text: isinstance(text, Comment)):
        comment.extract()
    # Some markup can be crafted to slip through BeautifulSoup's parser, so
    # we run this repeatedly until it generates the same output twice.
    newoutput = soup.decode_contents()
    while True:
        oldoutput = newoutput
        soup = BeautifulSoup(newoutput, 'html.parser')
        for tag in soup.find_all(True):
            if tag.name not in valid_tags:
                tag.hidden = True
            else:
                tag.attrs = {attr: val for attr, val in tag.attrs.items()
                             if attr in valid_tags[tag.name]}
        newoutput = soup.decode_contents()
        if oldoutput == newoutput:
            break
    return newoutput

Edit: Updated to support valid attributes.
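
A quick check (my own example, not from the original answer):

print(sanitize_html('<p title="x">ok <a href="/y" onclick="evil()">link</a></p>'
                    '<!-- sneaky --><script>bad()</script>'))
# roughly: '<p>ok <a href="/y">link</a></p>bad()'
# hidden tags keep their children, so the script's text survives as plain text;
# extract() the tag instead if its contents should go too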

Solution 7 - Python

I use FilterHTML. It's simple and lets you define a well-controlled whitelist, scrub URLs, and even match attribute values against regexes or apply custom filtering functions per attribute. Used carefully, it can be a safe solution. Here's a simplified example from the readme:

import FilterHTML

# only allow:
#  <a> tags with valid href URLs
#  <img> tags with valid src URLs and measurements
whitelist = {
    'a': {
        'href': 'url',
        'target': ['_blank', '_self'],
        'class': ['button'],
    },
    'img': {
        'src': 'url',
        'width': 'measurement',
        'height': 'measurement',
    },
}

filtered_html = FilterHTML.filter_html(unfiltered_html, whitelist)

Solution 8 - Python

You could use html5lib, which includes a whitelist-based sanitizer.

An example:

import html5lib
from html5lib.filters.sanitizer import Filter as SanitizerFilter

def clean_html(buf):
    """Cleans HTML of dangerous tags and content."""
    buf = buf.strip()
    if not buf:
        return buf

    # parse as a fragment, then push it through the sanitizer filter
    # (html5lib 1.x API; older releases exposed a sanitizing tokenizer instead)
    dom_tree = html5lib.parseFragment(buf, treebuilder="dom")
    walker = html5lib.getTreeWalker("dom")
    stream = SanitizerFilter(walker(dom_tree))

    s = html5lib.serializer.HTMLSerializer(
            omit_optional_tags=False,
            quote_attr_values="always")
    return s.render(stream)
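
For example (my own illustration): the sanitizer escapes disallowed tags rather than deleting them, much like Bleach, which builds on it.

print(clean_html('<p onclick="evil()">hi</p><script>bad()</script>'))
# roughly: '<p>hi</p>&lt;script&gt;bad()&lt;/script&gt;'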

Solution 9 - Python

I prefer the lxml.html.clean solution, as nosklo points out above. Here's a version that also removes some empty tags:

from lxml import etree
from lxml.html import clean, fromstring, tostring

remove_attrs = ['class']
remove_tags = ['table', 'tr', 'td']
nonempty_tags = ['a', 'p', 'span', 'div']

cleaner = clean.Cleaner(remove_tags=remove_tags)

def squeaky_clean(html):
    cleaned = cleaner.clean_html(html)
    # now remove the useless empty tags
    root = fromstring(cleaned)
    context = etree.iterwalk(root)  # by default, just the 'end' tag event
    for action, elem in context:
        clean_text = elem.text and elem.text.strip(' \t\r\n')
        if elem.tag in nonempty_tags and \
                not (len(elem) or clean_text):  # no children and no text
            elem.getparent().remove(elem)
            continue
        elem.text = clean_text  # if you want the stripped text
        # and if you also want to remove some attributes:
        for badattr in remove_attrs:
            if badattr in elem.attrib:
                del elem.attrib[badattr]
    return tostring(root)
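
For example (my own quick run; the exact serialization may differ):

html = '<div class="c"><p>text</p><span>   </span><a href="#">x</a></div>'
print(squeaky_clean(html))
# roughly: b'<div><p>text</p><a href="#">x</a></div>'
# the empty <span> is gone and the class attribute was stripped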

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Everett Toews | View Question on Stackoverflow
Solution 1 - Python | nosklo | View Answer on Stackoverflow
Solution 2 - Python | bryan | View Answer on Stackoverflow
Solution 3 - Python | Chris Drost | View Answer on Stackoverflow
Solution 4 - Python | Jochen Ritzel | View Answer on Stackoverflow
Solution 5 - Python | chuangbo | View Answer on Stackoverflow
Solution 6 - Python | Kiran Jonnalagadda | View Answer on Stackoverflow
Solution 7 - Python | codedvillain | View Answer on Stackoverflow
Solution 8 - Python | Brian Neal | View Answer on Stackoverflow
Solution 9 - Python | ducu | View Answer on Stackoverflow