Rendered HTML to plain text using Python

Python Problem Overview

I'm trying to convert a chunk of HTML text with BeautifulSoup. Here is an example:

<div>
    <p>
        Some text
        <span>more text</span>
        even more text
    </p>
    <ul>
        <li>list item</li>
        <li>yet another list item</li>
    </ul>
</div>
<p>Some other text</p>
<ul>
    <li>list item</li>
    <li>yet another list item</li>
</ul>

I tried doing something like:

def parse_text(contents_string)
    Newlines = re.compile(r'[\r\n]\s+')
    bs = BeautifulSoup.BeautifulSoup(contents_string, convertEntities=BeautifulSoup.BeautifulSoup.HTML_ENTITIES)
    txt = bs.getText('\n')
    return Newlines.sub('\n', txt)

...but that way my span element is always on a new line. This is of course a simple example. Is there a way to get the text in the HTML page as the way it will be rendered in the browser (no css rules required, just the regular way div, span, li, etc. elements are rendered) in Python?

Python Solutions

Solution 1 - Python

BeautifulSoup is a scraping library, so it's probably not the best choice for doing HTML rendering. If it's not essential to use BeautifulSoup, you should take a look at html2text. For example:

import html2text
html = open("foobar.html").read()
print html2text.html2text(html)

This outputs:

Some text more text even more text

list item
yet another list item

Some other text

list item
yet another list item

Solution 2 - Python

I was encountering the same problem trying to parse the rendered HTML. Basically it seems that BS is not the ideal package for this. @Del gives the great html2text solution.

On a differet SO question: https://stackoverflow.com/questions/10524387/beautifulsoup-get-text-does-not-strip-all-tags-and-javascript @Helge mentioned using nltk. Unfortunately nltk appears to be discontinuing this method.

I tried both html2text and nltk.clean_html and was surprised by the timing results so thought they warranted an answer for posterity. Of course, the speeds highly depend on the contents of the data...

Answer from @Helge (nltk).

import nltk

%timeit nltk.clean_html(html)
was returning 153 us per loop

It worked really well to return a string with rendered html. This nltk module was faster than even html2text, though perhaps html2text is more robust.

Answer above from @del

betterHTML = html.decode(errors='ignore')
%timeit html2text.html2text(betterHTML)
%3.09 ms per loop

Content Type	Original Author	Original Content on Stackoverflow
Question	btatarov	View Question on Stackoverflow
Solution 1 - Python	del	View Answer on Stackoverflow
Solution 2 - Python	Paul	View Answer on Stackoverflow

Rendered HTML to plain text using Python

Python Problem Overview

Python Solutions

Solution 1 - Python

Solution 2 - Python

What is the purpose of NGINX and Gunicorn running in parallel?

SGEN: An attempt was made to load an assembly with an incorrect format

Attributions