Get all text inside a tag in lxml

Python Problem Overview

I'd like to write a code snippet that would grab all of the text inside the <content> tag, in lxml, in all three instances below, including the code tags. I've tried tostring(getchildren()) but that would miss the text in between the tags. I didn't have very much luck searching the API for a relevant function. Could you help me out?

<!--1-->
<content>
<div>Text inside tag</div>
</content>
#should return "<div>Text inside tag</div>"

<!--2-->
<content>
Text with no tag
</content>
#should return "Text with no tag"


<!--3-->
<content>
Text outside tag <div>Text inside tag</div>
</content>
#should return "Text outside tag <div>Text inside tag</div>"

Python Solutions

Solution 1 - Python

Just use the node.itertext() method, as in:

 ''.join(node.itertext())
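As a minimal sketch of what this returns, here it is applied to the third example from the question (parsing with etree.fromstring is an assumption made for this illustration). Note that itertext() yields only text, so the <div> tags themselves do not appear in the result.

```python
from lxml import etree

# Third example from the question, without the surrounding newlines.
node = etree.fromstring(
    "<content>Text outside tag <div>Text inside tag</div></content>"
)

# itertext() walks the element and its descendants in document order,
# yielding each text and tail string; the markup itself is skipped.
text = ''.join(node.itertext())
print(text)  # Text outside tag Text inside tag
```

If the inner tags must be kept, one of the tostring-based answers below is the better fit.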

Solution 2 - Python

Does text_content() do what you need?

Solution 3 - Python

Try:

def stringify_children(node):
    from lxml.etree import tostring
    from itertools import chain
    parts = ([node.text] +
            list(chain(*([c.text, tostring(c), c.tail] for c in node.getchildren()))) +
            [node.tail])
    # filter removes possible Nones in texts and tails
    return ''.join(filter(None, parts))

Example:

from lxml import etree
node = etree.fromstring("""<content>
Text outside tag <div>Text <em>inside</em> tag</div>
</content>""")
stringify_children(node)

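Intended output: '\nText outside tag <div>Text <em>inside</em> tag</div>\n'

As written, the function emits each child's text and tail a second time, because tostring(c) already includes both; Solutions 4 and 10 address this.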

Solution 4 - Python

A version of albertov's stringify_children that fixes the bugs reported by hoju:

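def stringify_children(node):
    from lxml.etree import tostring
    from itertools import chain
    # encoding='unicode' makes tostring return str instead of bytes (needed on Python 3);
    # with_tail=False keeps each child's tail out of the markup so it is added only once
    return ''.join(
        chunk for chunk in chain(
            (node.text,),
            chain(*((tostring(child, with_tail=False, encoding='unicode'), child.tail) for child in node)),
            (node.tail,)) if chunk)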

Solution 5 - Python

The following snippet, which uses Python generators, is short and efficient. Note that it returns only the text, without the inner tags:

''.join(node.itertext()).strip()

Solution 6 - Python

Defining stringify_children this way may be less complicated:

from lxml import etree

def stringify_children(node):
	s = node.text
	if s is None:
		s = ''
	for child in node:
		s += etree.tostring(child, encoding='unicode')
	return s

or in one line

return (node.text if node.text is not None else '') + ''.join((etree.tostring(child, encoding='unicode') for child in node))

The rationale is to leave the serialization of child nodes to lxml. Each child's own tail is kept because tostring() includes it by default. The tail of node itself isn't interesting in this case since it is "behind" the end tag. Note that the encoding argument may be changed according to one's needs.

Another possible solution is to serialize the node itself and afterwards, strip the start and end tag away:

def stringify_children(node):
	s = etree.tostring(node, encoding='unicode', with_tail=False)
	return s[s.index(node.tag) + 1 + len(node.tag): s.rindex(node.tag) - 2]

which is somewhat horrible. This code is correct only if node has no attributes, and I don't think anyone would want to use it even then.

Solution 7 - Python

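from urllib.request import urlopen  # urllib2 on Python 2
from lxml import etree

url = 'some_url'

# fetch the page
page = urlopen(url).read()

# parse the HTML
tree = etree.HTML(page)

# xpath() returns a list of matches, so take the first one;
# tostring() then gives all the HTML code of the table, including the table tag
table = tree.xpath("xpath_here")[0]
res = etree.tostring(table, encoding='unicode')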

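res is the HTML code of the table; this did the job for me.

So you can extract a tag's text with an XPath expression, and a tag including its content with tostring(). Note that xpath() returns a list of matches, so pick an element before serializing it:

div = tree.xpath("//div")[0]
div_res = etree.tostring(div, encoding='unicode')

text = tree.xpath("string(//content)")

or, for a list of the text nodes directly inside <content>: text = tree.xpath("//content/text()")

div_3 = tree.xpath("//content")[0]
s = etree.tostring(div_3, encoding='unicode', with_tail=False)
div_3_res = s[len('<content>'):-len('</content>')]

Slicing the tags off in the last line is not nice, but it works as long as <content> has no attributes.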

Solution 8 - Python

One of the simplest snippets that actually worked for me, and that follows the documentation at http://lxml.de/tutorial.html#using-xpath-to-find-text, is

etree.tostring(html, method="text")

where html is the node/tag whose complete text you are trying to read. Be aware that it doesn't get rid of the contents of script and style tags, though.
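A small sketch of this call, assuming Python 3 and a made-up sample document. Passing encoding="unicode" is an addition here so that the result is a str rather than bytes.

```python
from lxml import etree

# Made-up sample page; etree.HTML() parses it with the HTML parser.
html = etree.HTML(
    "<html><head><style>p { color: red; }</style></head>"
    "<body><p>Hello <b>world</b></p></body></html>"
)

# method="text" serializes only the text nodes and drops the markup.
# Text inside <style> and <script> is not filtered out by this call.
text = etree.tostring(html, method="text", encoding="unicode")
print(text)
```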

Solution 9 - Python

Just a quick enhancement as the answer has been given. If you want to clean the inside text:

clean_string = ' '.join([n.strip() for n in node.itertext()]).strip()
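As a sketch of how this behaves (the sample markup is made up), note that whitespace-only pieces turn into empty strings and leave a double space in the middle. The second variant, which skips empty pieces, is a suggestion and not part of the answer above.

```python
from lxml import etree

node = etree.fromstring("<content> one <div> </div> two </content>")

# Version from the answer: strip each piece, then join with single spaces.
clean_string = ' '.join([n.strip() for n in node.itertext()]).strip()

# Variant: also drop pieces that are empty after stripping.
pieces = (n.strip() for n in node.itertext())
compact_string = ' '.join(p for p in pieces if p)

print(repr(clean_string))    # 'one  two'
print(repr(compact_string))  # 'one two'
```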

Solution 10 - Python

In response to @Richard's comment above, if you patch stringify_children to read:

 parts = ([node.text] +
--            list(chain(*([c.text, tostring(c), c.tail] for c in node.getchildren()))) +
++            list(chain(*([tostring(c)] for c in node.getchildren()))) +
           [node.tail])

it seems to avoid the duplication he refers to.

Solution 11 - Python

I know that this is an old question, but this is a common problem and I have a solution that seems simpler than the ones suggested so far:

import lxml.html

def stringify_children(node):
    """Given an lxml tag, return its contents as a string.

       >>> html = "<p><strong>Sample sentence</strong> with tags.</p>"
       >>> node = lxml.html.fragment_fromstring(html)
       >>> stringify_children(node)
       '<strong>Sample sentence</strong> with tags.'
    """
    if node is None or (len(node) == 0 and not getattr(node, 'text', None)):
        return ""
    # Side effect: removes the node's own attributes so the tag lengths below are exact
    node.attrib.clear()
    opening_tag = len(node.tag) + 2
    closing_tag = -(len(node.tag) + 3)
    # encoding='unicode' returns str; with_tail=False keeps trailing text out of the slice
    return lxml.html.tostring(node, encoding='unicode', with_tail=False)[opening_tag:closing_tag]

Unlike some of the other answers to this question, this solution preserves all of the tags contained within the node, and it attacks the problem from a different angle than the other working solutions.

Solution 12 - Python

Here is a working solution. We can get content with a parent tag and then cut the parent tag from output.

import re
from lxml import etree

def _tostr_with_tags(parent_element, html_entities=False):
    RE_CUT = r'^<([\w-]+)>(.*)</([\w-]+)>$'
    # tostring() returns ASCII bytes by default, with non-ASCII characters
    # written as numeric entities such as &#233;
    content_with_parent = etree.tostring(parent_element, with_tail=False).decode('ascii')

    def _replace_html_entities(s):
        RE_ENTITY = r'&#(\d+);'

        def repl(m):
            return chr(int(m.group(1)))  # unichr on Python 2

        return re.sub(RE_ENTITY, repl, s)

    if not html_entities:
        content_with_parent = _replace_html_entities(content_with_parent)

    content_with_parent = content_with_parent.strip()  # remove whitespace at both ends

    # RE_CUT only matches when the parent tag has no attributes
    start_tag, content_without_parent, end_tag = re.findall(RE_CUT, content_with_parent, flags=re.DOTALL)[0]

    if start_tag != end_tag:
        raise Exception('Start tag does not match end tag while getting content with tags.')

    return content_without_parent

parent_element must be an Element.

Please note that if you want the text content as characters (not as numeric HTML entities), leave the html_entities parameter as False.

Solution 13 - Python

lxml has a method for that (available on elements parsed with lxml.html):

node.text_content()
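A minimal sketch, assuming the markup is parsed with lxml.html (the sample fragment is made up). Like itertext(), this returns the text only and drops the inner tags.

```python
import lxml.html

# text_content() is provided by lxml.html elements.
node = lxml.html.fragment_fromstring(
    "<div>Text outside tag <p>Text inside tag</p></div>"
)

text = node.text_content()
print(text)  # Text outside tag Text inside tag
```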

Solution 14 - Python

If this is an <a> tag, you can try the following, which returns the tag's attribute values (such as href) rather than its text:

node.values()
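To make clear what this call gives back, here is a small sketch with a made-up link (the URL and title are placeholders). values() lists the attribute values, while the link text is in node.text.

```python
from lxml import etree

# Hypothetical <a> element used only for illustration.
node = etree.fromstring(
    '<a href="https://example.com" title="Example">link text</a>'
)

attribute_values = node.values()  # attribute values, in document order
href = node.get('href')           # a single attribute by name
link_text = node.text             # the text inside the tag

print(attribute_values)  # ['https://example.com', 'Example']
print(link_text)         # link text
```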

Solution 15 - Python

import re
from lxml import etree

node = etree.fromstring("""
<content>Text before inner tag
    <div>Text
        <em>inside</em>
        tag
    </div>
    Text after inner tag
</content>""")

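# encoding='unicode' makes tostring return str, which the str pattern needs on Python 3
print(re.search(r"\A<[^<>]*>(.*)</[^<>]*>\Z", etree.tostring(node, encoding='unicode'), re.DOTALL).group(1))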

Content Type	Original Author	Original Content on Stackoverflow
Question	Kevin Burke	View Question on Stackoverflow
Solution 1 - Python	Arthur Debert	View Answer on Stackoverflow
Solution 2 - Python	Ed Summers	View Answer on Stackoverflow
Solution 3 - Python	albertov	View Answer on Stackoverflow
Solution 4 - Python	anana	View Answer on Stackoverflow
Solution 5 - Python	Sandeep	View Answer on Stackoverflow
Solution 6 - Python	Percival Ulysses	View Answer on Stackoverflow
Solution 7 - Python	d3day	View Answer on Stackoverflow
Solution 8 - Python	Deepan Prabhu Babu	View Answer on Stackoverflow
Solution 9 - Python	inverted_index	View Answer on Stackoverflow
Solution 10 - Python	bwingenroth	View Answer on Stackoverflow
Solution 11 - Python	Joshmaker	View Answer on Stackoverflow
Solution 12 - Python	sergzach	View Answer on Stackoverflow
Solution 13 - Python	Hrabal	View Answer on Stackoverflow
Solution 14 - Python	David	View Answer on Stackoverflow
Solution 15 - Python	kazufusa	View Answer on Stackoverflow