Rendered HTML to plain text using Python

PythonBeautifulsoup

Python Problem Overview


I'm trying to convert a chunk of HTML text with BeautifulSoup. Here is an example:

<div>
    <p>
        Some text
        <span>more text</span>
        even more text
    </p>
    <ul>
        <li>list item</li>
        <li>yet another list item</li>
    </ul>
</div>
<p>Some other text</p>
<ul>
    <li>list item</li>
    <li>yet another list item</li>
</ul>

I tried doing something like:

def parse_text(contents_string)
    Newlines = re.compile(r'[\r\n]\s+')
    bs = BeautifulSoup.BeautifulSoup(contents_string, convertEntities=BeautifulSoup.BeautifulSoup.HTML_ENTITIES)
    txt = bs.getText('\n')
    return Newlines.sub('\n', txt)

...but that way my span element is always on a new line. This is of course a simple example. Is there a way to get the text in the HTML page as the way it will be rendered in the browser (no css rules required, just the regular way div, span, li, etc. elements are rendered) in Python?

Python Solutions


Solution 1 - Python

BeautifulSoup is a scraping library, so it's probably not the best choice for doing HTML rendering. If it's not essential to use BeautifulSoup, you should take a look at html2text. For example:

import html2text
html = open("foobar.html").read()
print html2text.html2text(html)

This outputs:

Some text more text even more text

  • list item
  • yet another list item

Some other text

  • list item
  • yet another list item

Solution 2 - Python

I was encountering the same problem trying to parse the rendered HTML. Basically it seems that BS is not the ideal package for this. @Del gives the great html2text solution.

On a differet SO question: https://stackoverflow.com/questions/10524387/beautifulsoup-get-text-does-not-strip-all-tags-and-javascript @Helge mentioned using nltk. Unfortunately nltk appears to be discontinuing this method.

I tried both html2text and nltk.clean_html and was surprised by the timing results so thought they warranted an answer for posterity. Of course, the speeds highly depend on the contents of the data...

Answer from @Helge (nltk).

import nltk

%timeit nltk.clean_html(html)
was returning 153 us per loop

It worked really well to return a string with rendered html. This nltk module was faster than even html2text, though perhaps html2text is more robust.

Answer above from @del

betterHTML = html.decode(errors='ignore')
%timeit html2text.html2text(betterHTML)
%3.09 ms per loop

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionbtatarovView Question on Stackoverflow
Solution 1 - PythondelView Answer on Stackoverflow
Solution 2 - PythonPaulView Answer on Stackoverflow