Remove a tag using BeautifulSoup but keep its contents

Python, BeautifulSoup

Python Problem Overview


Currently I have code that does something like this:

soup = BeautifulSoup(value)

for tag in soup.findAll(True):
    if tag.name not in VALID_TAGS:
        tag.extract()
soup.renderContents()

Except I don't want to throw away the contents inside the invalid tag. How do I get rid of the tag but keep the contents inside when calling soup.renderContents()?

Python Solutions


Solution 1 - Python

Current versions of the BeautifulSoup library have an undocumented method on Tag objects called replaceWithChildren(). So, you could do something like this:

html = "<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>"
invalid_tags = ['b', 'i', 'u']
soup = BeautifulSoup(html)
for tag in invalid_tags: 
    for match in soup.findAll(tag):
        match.replaceWithChildren()
print soup

This behaves the way you want and the code is fairly straightforward, although it makes one pass through the DOM per invalid tag; that could easily be optimized.
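In BeautifulSoup 4 the same operation is available as unwrap(), and find_all() accepts a list of tag names, so all invalid tags can be collected with a single call. What follows is only a minimal sketch, assuming the bs4 package and the built-in "html.parser"; the helper name unwrap_tags is made up for illustration and is not part of the original answer.

```python
from bs4 import BeautifulSoup


def unwrap_tags(html, invalid_tags):
    """Remove every tag named in invalid_tags but keep its children."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all() with a list matches any of the given names and returns
    # a snapshot of the matches, so editing the tree in the loop is safe.
    for match in soup.find_all(invalid_tags):
        match.unwrap()
    return str(soup)


html = "<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>"
result = unwrap_tags(html, ["b", "i", "u"])
print(result)
```

With "html.parser" no html or body wrapper is added, so the result stays a fragment.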

Solution 2 - Python

The strategy I used is to replace an invalid tag with its contents: children that are NavigableString objects are kept as they are, and any other child is processed recursively and then converted to a string. Try this:

from BeautifulSoup import BeautifulSoup, NavigableString

def strip_tags(html, invalid_tags):
    soup = BeautifulSoup(html)

    for tag in soup.findAll(True):
        if tag.name in invalid_tags:
            s = ""

            for c in tag.contents:
                if not isinstance(c, NavigableString):
                    c = strip_tags(unicode(c), invalid_tags)
                s += unicode(c)

            tag.replaceWith(s)

    return soup

html = "<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>"
invalid_tags = ['b', 'i', 'u']
print strip_tags(html, invalid_tags)

The result is:

<p>Good, bad, and ugly</p>

I gave this same answer on another question. It seems to come up a lot.

Solution 3 - Python

Although this has already been mentioned by other people in the comments, I thought I'd post a full answer showing how to do it with Mozilla's Bleach. Personally, I think this is a lot nicer than using BeautifulSoup for this.

import bleach
html = "<b>Bad</b> <strong>Ugly</strong> <script>Evil()</script>"
clean = bleach.clean(html, tags=[], strip=True)
print clean # Should print: "Bad Ugly Evil()"

Solution 4 - Python

I have a simpler solution but I don't know if there's a drawback to it.

UPDATE: there's a drawback, see Jesse Dhillon's comment. Also, another solution will be to use Mozilla's Bleach instead of BeautifulSoup.

from BeautifulSoup import BeautifulSoup

VALID_TAGS = ['div', 'p']

value = '<div><p>Hello <b>there</b> my friend!</p></div>'

soup = BeautifulSoup(value)

for tag in soup.findAll(True):
    if tag.name not in VALID_TAGS:
        tag.replaceWith(tag.renderContents())

print soup.renderContents()

This will also print <div><p>Hello there my friend!</p></div> as desired.

Solution 5 - Python

You'll presumably have to move tag's children to be children of tag's parent before you remove the tag -- is that what you mean?

If so, then, while inserting the contents in the right place is tricky, something like this should work:

from BeautifulSoup import BeautifulSoup

VALID_TAGS = 'div', 'p'

value = '<div><p>Hello <b>there</b> my friend!</p></div>'

soup = BeautifulSoup(value)

for tag in soup.findAll(True):
    if tag.name not in VALID_TAGS:
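        # Locate the tag by identity; == compares tags by content and
        # could match an earlier, identical-looking sibling instead.
        for i, x in enumerate(tag.parent.contents):
            if x is tag: break
        else:
            print "Can't find", tag, "in", tag.parent
            continue
        for r in reversed(tag.contents):
            tag.parent.insert(i, r)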
        tag.extract()
print soup.renderContents()

With the example value, this prints <div><p>Hello there my friend!</p></div> as desired.

Solution 6 - Python

You can use soup.text.

.text removes all tags and concatenates all of the text, so it fits only when no tags need to be kept.
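To make the difference from selective stripping concrete, here is a small sketch, assuming BeautifulSoup 4 (bs4) with the built-in "html.parser" and reusing the sample markup from the other answers. In bs4, .text is shorthand for get_text().

```python
from bs4 import BeautifulSoup

value = "<div><p>Hello <b>there</b> my friend!</p></div>"
soup = BeautifulSoup(value, "html.parser")

# get_text() returns only the text nodes, so the <div> and <p>
# wrappers disappear along with the <b>.
text_only = soup.get_text()
print(text_only)
```

Nothing of the markup survives, including tags that a whitelist would have kept.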

Solution 7 - Python

Use unwrap().

unwrap() removes a single occurrence of a tag while keeping its contents.

Example:

>>> soup = BeautifulSoup('Hi. This is a <nobr> nobr </nobr>')
>>> soup
<html><body><p>Hi. This is a <nobr> nobr </nobr></p></body></html>
>>> soup.nobr.unwrap()
<nobr></nobr>
>>> soup
<html><body><p>Hi. This is a  nobr </p></body></html>
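Since unwrap() handles one tag at a time, covering a whole document means calling it in a loop. The following sketch applies it to the whitelist from the question. It assumes BeautifulSoup 4 (bs4) and the built-in "html.parser"; the sample value is borrowed from the other answers with an extra <i> tag added for illustration.

```python
from bs4 import BeautifulSoup

VALID_TAGS = {"div", "p"}

value = "<div><p>Hello <b>there</b> my <i>friend</i>!</p></div>"
soup = BeautifulSoup(value, "html.parser")

# find_all(True) matches every tag; the result is a snapshot, so
# unwrapping tags while looping over it is safe.
for tag in soup.find_all(True):
    if tag.name not in VALID_TAGS:
        tag.unwrap()

cleaned = str(soup)
print(cleaned)
```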

Solution 8 - Python

None of the proposed answers seemed to work with BeautifulSoup for me. Here's a version that works with BeautifulSoup 3.2.1, and also inserts a space when joining content from different tags instead of concatenating words.

import re

from BeautifulSoup import BeautifulSoup

def strip_tags(html, whitelist=()):
    """
    Strip all HTML tags except for a list of whitelisted tags.
    """
    soup = BeautifulSoup(html)

    for tag in soup.findAll(True):
        if tag.name not in whitelist:
            tag.append(' ')
            tag.replaceWithChildren()

    result = unicode(soup)

    # Clean up any repeated spaces and spaces like this: '<a>test </a> '
    result = re.sub(' +', ' ', result)
    result = re.sub(r' (<[^>]*> )', r'\1', result)
    return result.strip()

Example:

strip_tags('<h2><a><span>test</span></a> testing</h2><p>again</p>', ['a'])
# result: u'<a>test</a> testing again'

Solution 9 - Python

Here is a simpler solution, without boilerplate code, for filtering out tags while keeping the content. Say you want to remove all child tags within a parent tag and keep only the text. Then you can simply do:

# div_tags is assumed to be a Tag (or the soup itself) that was found earlier
for p_tags in div_tags.find_all("p"):
    print(p_tags.get_text())

That's all: whatever br, i, or b tags sit inside the parent tags are dropped, and you get the clean text.

Solution 10 - Python

Here is a Python 3 friendly version of this function:

from bs4 import BeautifulSoup

invalidTags = ['br', 'b', 'font']

def stripTags(html, invalid_tags):
    soup = BeautifulSoup(html, "lxml")
    for tag in soup.find_all(True):
        if tag.name in invalid_tags:
            # unwrap() removes the tag but leaves its children in place.
            # Rebuilding the contents as a string and passing that to
            # replaceWith() would make bs4 escape any nested markup.
            tag.unwrap()
    return soup

Solution 11 - Python

This is an old question, but there are better ways to do this now. First of all, BeautifulSoup 3 is no longer being developed, so you should use BeautifulSoup 4 (bs4) instead.

Also, lxml has exactly the function you need: the Cleaner class has a remove_tags attribute, which you can set to the tags that should be removed while their content is pulled up into the parent tag.
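A minimal sketch of the Cleaner approach, with assumptions stated up front: it needs lxml installed, and in recent lxml releases the cleaner lives in the separate lxml_html_clean package, so the import below tries both locations. The sample markup is borrowed from the other answers. A Cleaner built this way also applies its default clean-up options (for example, removing scripts), not only remove_tags.

```python
try:
    # Newer lxml releases ship the cleaner as a separate package.
    from lxml_html_clean import Cleaner
except ImportError:
    # Older lxml releases bundle it under lxml.html.clean.
    from lxml.html.clean import Cleaner

html = "<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>"

# Tags listed in remove_tags are dropped, but their text and children
# are merged into the parent element.
cleaner = Cleaner(remove_tags=["b", "i", "u"])
cleaned = cleaner.clean_html(html)
print(cleaned)
```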

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Jason Christa | View Question on Stackoverflow
Solution 1 - Python | slacy | View Answer on Stackoverflow
Solution 2 - Python | Jesse Dhillon | View Answer on Stackoverflow
Solution 3 - Python | corford | View Answer on Stackoverflow
Solution 4 - Python | Etienne | View Answer on Stackoverflow
Solution 5 - Python | Alex Martelli | View Answer on Stackoverflow
Solution 6 - Python | jimmy | View Answer on Stackoverflow
Solution 7 - Python | Bishwas Mishra | View Answer on Stackoverflow
Solution 8 - Python | Olof Sjöbergh | View Answer on Stackoverflow
Solution 9 - Python | robus gauli | View Answer on Stackoverflow
Solution 10 - Python | Dom DaFonte | View Answer on Stackoverflow
Solution 11 - Python | Tommz | View Answer on Stackoverflow