Python BeautifulSoup extract text between element


Python Problem Overview


I am trying to extract "THIS IS MY TEXT" from the following HTML:

<html>
<body>
<table>
   <td class="MYCLASS">
      <!-- a comment -->
      <a hef="xy">Text</a>
      <p>something</p>
      THIS IS MY TEXT
      <p>something else</p>
      </br>
   </td>
</table>
</body>
</html>

I tried it this way:

soup = BeautifulSoup(html)

for hit in soup.findAll(attrs={'class' : 'MYCLASS'}):
    print hit.text

But this gives me the text of all the nested tags, plus the comment.

Can anyone help me to just get "THIS IS MY TEXT" out of this?

Python Solutions


Solution 1 - Python

Learn more about how to navigate the parse tree in BeautifulSoup. The parse tree is made up of Tags and NavigableStrings (such as THIS IS MY TEXT). An example:

from BeautifulSoup import BeautifulSoup 
doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
soup = BeautifulSoup(''.join(doc))

print soup.prettify()
# <html>
#  <head>
#   <title>
#    Page title
#   </title>
#  </head>
#  <body>
#   <p id="firstpara" align="center">
#    This is paragraph
#    <b>
#     one
#    </b>
#    .
#   </p>
#   <p id="secondpara" align="blah">
#    This is paragraph
#    <b>
#     two
#    </b>
#    .
#   </p>
#  </body>
# </html>

To move down the parse tree you have contents and string.

  • > contents is an ordered list of the Tag and NavigableString objects contained within a page element

  • > if a tag has only one child node, and that child node is a string, the child node is made available as tag.string, as well as tag.contents[0]

For the document above, this means you can write:

soup.b.string
# u'one'
soup.b.contents[0]
# u'one'

When a tag has several child nodes, you get for instance:

pTag = soup.p
pTag.contents
# [u'This is paragraph ', <b>one</b>, u'.']

Here you can index into contents to pick out the child you want.

You can also iterate over a Tag directly, as a shortcut. For instance,

for i in soup.body:
    print i
# <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
# <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
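
Applied to the markup from the question, the same attributes can be inspected as in the sketch below. It assumes the bs4 package on Python 3 and the built-in html.parser; the HTML is a trimmed copy of the question's snippet, and the helper name describe_children is made up for illustration.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Trimmed copy of the question's markup (an assumption for this sketch).
HTML = """
<table>
   <td class="MYCLASS">
      <!-- a comment -->
      <a href="xy">Text</a>
      <p>something</p>
      THIS IS MY TEXT
      <p>something else</p>
   </td>
</table>
"""


def describe_children(tag):
    """Return one (kind, value) pair per direct child of tag."""
    described = []
    for child in tag.contents:
        if isinstance(child, Tag):
            described.append(("tag", child.name))
        elif isinstance(child, Comment):
            # Comment is a subclass of NavigableString, so test it first.
            described.append(("comment", child.strip()))
        elif isinstance(child, NavigableString):
            described.append(("string", child.strip()))
    return described


soup = BeautifulSoup(HTML, "html.parser")
td = soup.find("td", class_="MYCLASS")

# The cell has several children, so .string is None here,
# while the <a> tag has a single string child.
print(td.string)
print(soup.a.string)

children = describe_children(td)
for kind, value in children:
    print(kind, repr(value))
```

Most of the "string" entries are whitespace between tags, which is why the later answers filter or strip them.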

Solution 2 - Python

Use .children instead:

from bs4 import NavigableString, Comment
print ''.join(unicode(child) for child in hit.children 
    if isinstance(child, NavigableString) and not isinstance(child, Comment))

Yes, this is a bit of a dance.

Output:

>>> for hit in soup.findAll(attrs={'class' : 'MYCLASS'}):
...     print ''.join(unicode(child) for child in hit.children 
...         if isinstance(child, NavigableString) and not isinstance(child, Comment))
... 




      THIS IS MY TEXT
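
The snippet above is written for Python 2 (print statement, unicode). A rough Python 3 equivalent is sketched below, assuming bs4 with the built-in html.parser; the markup is a trimmed copy of the question's HTML and direct_text is a hypothetical helper name.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

# Trimmed copy of the question's markup (an assumption for this sketch).
HTML = """
<table>
   <td class="MYCLASS">
      <!-- a comment -->
      <a href="xy">Text</a>
      <p>something</p>
      THIS IS MY TEXT
      <p>something else</p>
   </td>
</table>
"""


def direct_text(tag):
    """Join the strings that are direct children of tag, skipping comments."""
    parts = (
        str(child)
        for child in tag.children
        # Comment subclasses NavigableString, so it must be excluded explicitly.
        if isinstance(child, NavigableString) and not isinstance(child, Comment)
    )
    return "".join(parts).strip()


soup = BeautifulSoup(HTML, "html.parser")
for hit in soup.find_all(attrs={"class": "MYCLASS"}):
    print(direct_text(hit))
```

Stripping the joined result removes the whitespace-only strings that sit between the nested tags.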
      

Solution 3 - Python

You can use .contents:

>>> for hit in soup.findAll(attrs={'class' : 'MYCLASS'}):
...     print hit.contents[6].strip()
... 
THIS IS MY TEXT
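
Keep in mind that a fixed index depends on the whitespace in the markup, because every run of whitespace between tags becomes its own string in contents. The sketch below (assuming bs4 on Python 3 with html.parser, and two hand-written variants of the question's cell) shows the index shifting when the same content is written on one line.

```python
from bs4 import BeautifulSoup

# The cell as formatted in the question (trimmed; an assumption for this sketch).
FORMATTED = """<td class="MYCLASS">
      <!-- a comment -->
      <a href="xy">Text</a>
      <p>something</p>
      THIS IS MY TEXT
      <p>something else</p>
</td>"""

# The same content with no whitespace between the nodes.
COMPACT = (
    '<td class="MYCLASS"><!-- a comment --><a href="xy">Text</a>'
    '<p>something</p>THIS IS MY TEXT<p>something else</p></td>'
)

formatted_td = BeautifulSoup(FORMATTED, "html.parser").td
compact_td = BeautifulSoup(COMPACT, "html.parser").td

# Whitespace-only strings count as children, so the target sits at index 6 ...
print(formatted_td.contents[6].strip())
# ... but at index 3 once the whitespace between the nodes is gone.
print(compact_td.contents[3])
```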

Solution 4 - Python

With your own soup object:

soup.p.next_sibling.strip()
  1. Grab the <p> directly with soup.p (*this relies on it being the first <p> in the parse tree).
  2. Then use next_sibling on the Tag object that soup.p returns, since the desired text sits at the same level of the parse tree as the <p>.
  3. .strip() is just a Python str method that removes leading and trailing whitespace.

*Otherwise, find the element using your choice of filter(s).

In the interpreter this looks something like:

In [4]: soup.p
Out[4]: <p>something</p>

In [5]: type(soup.p)
Out[5]: bs4.element.Tag

In [6]: soup.p.next_sibling
Out[6]: u'\n      THIS IS MY TEXT\n      '

In [7]: type(soup.p.next_sibling)
Out[7]: bs4.element.NavigableString

In [8]: soup.p.next_sibling.strip()
Out[8]: u'THIS IS MY TEXT'

In [9]: type(soup.p.next_sibling.strip())
Out[9]: unicode
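
If the <p> is not guaranteed to be the first one in the document, one option is to scope the lookup to the cell first. This is a minimal sketch assuming bs4 on Python 3 with html.parser and a trimmed copy of the question's markup; on Python 3, strip() returns a plain str rather than unicode.

```python
from bs4 import BeautifulSoup, NavigableString

# Trimmed copy of the question's markup (an assumption for this sketch).
HTML = """
<table>
   <td class="MYCLASS">
      <!-- a comment -->
      <a href="xy">Text</a>
      <p>something</p>
      THIS IS MY TEXT
      <p>something else</p>
   </td>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Find the cell by class, then the first <p> inside that cell only.
cell = soup.find("td", class_="MYCLASS")
first_p = cell.find("p")

# next_sibling may be a Tag or None in other documents, so check the type.
sibling = first_p.next_sibling
target = sibling.strip() if isinstance(sibling, NavigableString) else None
print(target)
```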

Solution 5 - Python

Short answer: soup.findAll('p')[0].next.next

Real answer: You need an invariant reference point from which you can get to your target.

You mention in your comment to Haidro's answer that the text you want is not always in the same place. Find a sense in which it is in the same place relative to some element. Then figure out how to make BeautifulSoup navigate the parse tree following that invariant path.

For example, in the HTML you provide in the original post, the target string appears immediately after the first paragraph element, and that paragraph is not empty. Since findAll('p') returns a list of the paragraph elements, soup.findAll('p')[0] will be the first paragraph element.

You could in this case use soup.find('p'), but soup.findAll('p')[n] is more general, since your actual scenario may need the 5th paragraph or something like that.

The next attribute is the next parsed element in the tree, including children. So soup.findAll('p')[0].next contains the text of the paragraph, and soup.findAll('p')[0].next.next will return your target in the HTML provided.
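
The same walk can be sketched for Python 3 as follows, assuming bs4, where this attribute is documented under the name next_element, and a trimmed copy of the question's markup parsed with html.parser.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup (an assumption for this sketch).
HTML = """
<table>
   <td class="MYCLASS">
      <!-- a comment -->
      <a href="xy">Text</a>
      <p>something</p>
      THIS IS MY TEXT
      <p>something else</p>
   </td>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")
first_p = soup.find_all("p")[0]

# The element parsed right after the <p> tag is its own text ...
inside = first_p.next_element
# ... and the element parsed after that is the string following the paragraph.
after = inside.next_element

print(repr(inside))
print(repr(after.strip()))
```

This only works while the paragraph contains exactly one string; an empty or more deeply nested paragraph would change the number of steps needed.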

Solution 6 - Python

soup = BeautifulSoup(html)
for hit in soup.findAll(attrs={'class' : 'MYCLASS'}):
  hit = hit.text.strip()
  print hit

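This strips the surrounding whitespace, but note that hit.text gathers the text of every nested tag, so the output also contains Text, something and something else, not only THIS IS MY TEXT.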
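
A quick way to check what .text returns is sketched below, assuming bs4 on Python 3 with html.parser and a trimmed copy of the question's markup. The strings of the nested <a> and <p> tags come back along with the target.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup (an assumption for this sketch).
HTML = """
<table>
   <td class="MYCLASS">
      <!-- a comment -->
      <a href="xy">Text</a>
      <p>something</p>
      THIS IS MY TEXT
      <p>something else</p>
   </td>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")
hit = soup.find(attrs={"class": "MYCLASS"})

# .text concatenates the strings of all descendants, not just direct children.
pieces = hit.text.split()
print(pieces)
```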

Solution 7 - Python

The BeautifulSoup documentation provides an example of removing objects from a document with the extract method. In the following example, the aim is to remove all comments from the document:

Removing Elements

> Once you have a reference to an element, you can rip it out of the tree with the extract method. This code removes all the comments from a document:

from BeautifulSoup import BeautifulSoup, Comment
soup = BeautifulSoup("""1<!--The loneliest number-->
                    <a>2<!--Can be as bad as one--><b>3""")
comments = soup.findAll(text=lambda text:isinstance(text, Comment))
[comment.extract() for comment in comments]
print soup
# 1
# <a>2<b>3</b></a>
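
The same idea can be applied to the question's markup: extract the comment and the nested tags, and whatever text is left is the target. This is a minimal sketch assuming bs4 4.4 or later on Python 3 (for the string= argument) with html.parser, and a trimmed copy of the question's HTML. Note that extract modifies the tree in place.

```python
from bs4 import BeautifulSoup, Comment

# Trimmed copy of the question's markup (an assumption for this sketch).
HTML = """
<table>
   <td class="MYCLASS">
      <!-- a comment -->
      <a href="xy">Text</a>
      <p>something</p>
      THIS IS MY TEXT
      <p>something else</p>
   </td>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")
cell = soup.find("td", class_="MYCLASS")

# Remove every comment below the cell.
for comment in cell.find_all(string=lambda s: isinstance(s, Comment)):
    comment.extract()

# Remove the tags that are direct children of the cell, along with their text.
for child in cell.find_all(True, recursive=False):
    child.extract()

# Only the cell's own strings remain; strip=True drops the whitespace-only ones.
remaining = cell.get_text(strip=True)
print(remaining)
```

If the rest of the document is still needed afterwards, parse a second copy or use one of the non-destructive approaches above.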

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | ɥɔǝnq ɹǝƃloɥ | View Question on Stackoverflow
Solution 1 - Python | kiriloff | View Answer on Stackoverflow
Solution 2 - Python | Martijn Pieters | View Answer on Stackoverflow
Solution 3 - Python | TerryA | View Answer on Stackoverflow
Solution 4 - Python | Gregory Kremler | View Answer on Stackoverflow
Solution 5 - Python | Bennett Brown | View Answer on Stackoverflow
Solution 6 - Python | Naiswita | View Answer on Stackoverflow
Solution 7 - Python | alireza sanaee | View Answer on Stackoverflow