Get all text inside a tag in lxml

Python, Parsing, Lxml

Python Problem Overview


I'd like to write a code snippet that would grab all of the text inside the <content> tag, in lxml, in all three instances below, including the code tags. I've tried tostring(getchildren()) but that would miss the text in between the tags. I didn't have very much luck searching the API for a relevant function. Could you help me out?

<!--1-->
<content>
<div>Text inside tag</div>
</content>
#should return "<div>Text inside tag</div>"

<!--2-->
<content>
Text with no tag
</content>
#should return "Text with no tag"


<!--3-->
<content>
Text outside tag <div>Text inside tag</div>
</content>
#should return "Text outside tag <div>Text inside tag</div>"

Python Solutions


Solution 1 - Python

Just use the node.itertext() method, as in:

 ''.join(node.itertext())
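As a small illustration of what itertext() yields, here is a sketch using the third case from the question (written on one line, without the surrounding newlines, to keep the result easy to read). It suggests that this approach returns the text only and drops the inner tags, which may or may not be what you need.

```python
from lxml import etree

# Third case from the question, without the surrounding newlines.
node = etree.fromstring(
    "<content>Text outside tag <div>Text inside tag</div></content>"
)

# itertext() yields every text and tail string in document order.
pieces = list(node.itertext())
text_only = "".join(pieces)

print(pieces)     # the individual text chunks
print(text_only)  # the chunks joined together; the <div> tags are gone
```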

Solution 2 - Python

Does text_content() do what you need?
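A minimal sketch of text_content(), assuming the markup is parsed with lxml.html (the method is defined on lxml.html elements, not on plain lxml.etree elements). The div/span markup is only an illustrative stand-in for the question's example. Like itertext(), it returns text without the inner tags.

```python
import lxml.html
from lxml import etree

# Illustrative HTML fragment (an assumption, not the question's exact markup).
node = lxml.html.fragment_fromstring(
    "<div>Text outside tag <span>Text inside tag</span></div>"
)
text_only = node.text_content()
print(text_only)

# Elements created by lxml.etree do not provide text_content().
plain = etree.fromstring("<content>Text with no tag</content>")
has_method = hasattr(plain, "text_content")
print(has_method)
```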

Solution 3 - Python

Try:

def stringify_children(node):
    from lxml.etree import tostring
    from itertools import chain
    parts = ([node.text] +
            list(chain(*([c.text, tostring(c), c.tail] for c in node.getchildren()))) +
            [node.tail])
    # filter removes possible Nones in texts and tails
    return ''.join(filter(None, parts))

Example:

from lxml import etree
node = etree.fromstring("""<content>
Text outside tag <div>Text <em>inside</em> tag</div>
</content>""")
stringify_children(node)

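Intended result: '\nText outside tag <div>Text <em>inside</em> tag</div>\n'

Note that, as written, the function adds c.text and c.tail even though tostring(c) already serializes the child's text and, by default, its tail, so those pieces appear twice in the output (Solutions 4 and 10 address this). On Python 3, tostring() also returns bytes unless encoding='unicode' is passed, so the join fails there.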

Solution 4 - Python

A version of albertov's stringify_children that solves the bugs reported by hoju:

def stringify_children(node):
    from lxml.etree import tostring
    from itertools import chain
    # encoding='unicode' makes tostring() return str rather than bytes on Python 3;
    # with_tail=False keeps each child's tail from being emitted twice.
    return ''.join(
        chunk for chunk in chain(
            (node.text,),
            chain(*((tostring(child, encoding='unicode', with_tail=False), child.tail) for child in node)),
            (node.tail,)) if chunk)

Solution 5 - Python

The following snippet, which relies on the itertext() generator, returns the concatenated text (without the tags) and strips leading and trailing whitespace:

''.join(node.itertext()).strip()

Solution 6 - Python

Defining stringify_children this way may be less complicated:

from lxml import etree

def stringify_children(node):
	s = node.text
	if s is None:
		s = ''
	for child in node:
		s += etree.tostring(child, encoding='unicode')
	return s

or in one line

return (node.text if node.text is not None else '') + ''.join((etree.tostring(child, encoding='unicode') for child in node))

Rationale is the same as in this answer: leave the serialization of child nodes to lxml. The tail part of node in this case isn't interesting since it is "behind" the end tag. Note that the encoding argument may be changed according to one's needs.

Another possible solution is to serialize the node itself and afterwards, strip the start and end tag away:

def stringify_children(node):
	s = etree.tostring(node, encoding='unicode', with_tail=False)
	return s[s.index(node.tag) + 1 + len(node.tag): s.rindex(node.tag) - 2]

which is somewhat horrible. This code is correct only if node has no attributes, and I don't think anyone would want to use it even then.

Solution 7 - Python

from urllib.request import urlopen  # urllib2.urlopen on Python 2
from lxml import etree

url = 'some_url'

# fetch the page
page = urlopen(url).read()

# parse the HTML
tree = etree.HTML(page)

# xpath() returns a list of matches, so take the first element
table = tree.xpath("xpath_here")[0]
# serialize the table, including the table tag itself
res = etree.tostring(table)

res is the HTML code of the table; this did the job for me.

So you can extract a tag's direct text with an XPath text() query, and a tag together with its content using tostring(). Note that xpath() returns a list, so pick an element before serializing it:

div = tree.xpath("//div")[0]
div_res = etree.tostring(div)

text = tree.xpath("//content/text()")

div_3 = tree.xpath("//content")[0]
div_3_res = etree.tostring(div_3, encoding='unicode', with_tail=False)
div_3_res = div_3_res[len('<content>'):-len('</content>')]

Slicing off the tags in these last lines is not nice, and it assumes that <content> has no attributes and is not empty, but it works.
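The text() query above returns only the text nodes that are direct children of <content>. A short sketch of the difference between /text() and //text() follows; the <root> wrapper and the sample markup are assumptions for illustration.

```python
from lxml import etree

tree = etree.fromstring(
    "<root><content>Text outside tag <div>Text inside tag</div></content></root>"
)

# Direct text children of <content> only.
direct = tree.xpath("//content/text()")
# All descendant text nodes, including those inside <div>.
nested = tree.xpath("//content//text()")

print(direct)
print(nested)
```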

Solution 8 - Python

One of the simplest code snippets that actually worked for me, and which follows the documentation at http://lxml.de/tutorial.html#using-xpath-to-find-text, is

etree.tostring(html, method="text")

where html is the node/tag whose complete text you are trying to read. Be aware that it does not get rid of the text inside script and style tags, though.
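A small sketch of the call above, assuming Python 3 and a sample document chosen for illustration. Without an encoding argument, tostring() returns bytes; passing encoding="unicode" returns a str instead.

```python
from lxml import etree

html = etree.fromstring(
    "<content>Text outside tag <div>Text inside tag</div></content>"
)

# Default encoding: the result is a bytes object on Python 3.
as_bytes = etree.tostring(html, method="text")
# encoding="unicode": the result is a str.
as_text = etree.tostring(html, method="text", encoding="unicode")

print(as_bytes)
print(as_text)
```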

Solution 9 - Python

Just a quick enhancement, as the answer has already been given. If you want to clean up the inner text:

clean_string = ' '.join([n.strip() for n in node.itertext()]).strip()
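One thing to watch for with the line above: whitespace-only chunks between tags become empty strings after strip(), which can leave double spaces in the joined result. A possible variant (a sketch, with sample markup assumed for illustration) filters out the empty chunks first.

```python
from lxml import etree

node = etree.fromstring(
    "<content>\n  <div>First</div>\n  <div>Second</div>\n</content>"
)

# Original approach: empty chunks still contribute a separator.
joined = ' '.join([n.strip() for n in node.itertext()]).strip()

# Variant: drop chunks that are empty after stripping.
filtered = ' '.join(s for s in (n.strip() for n in node.itertext()) if s)

print(repr(joined))
print(repr(filtered))
```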

Solution 10 - Python

In response to @Richard's comment above, if you patch stringify_children to read:

 parts = ([node.text] +
--            list(chain(*([c.text, tostring(c), c.tail] for c in node.getchildren()))) +
++            list(chain(*([tostring(c)] for c in node.getchildren()))) +
           [node.tail])

it seems to avoid the duplication he refers to.

Solution 11 - Python

I know that this is an old question, but this is a common problem and I have a solution that seems simpler than the ones suggested so far:

import lxml.html

def stringify_children(node):
    """Given an lxml tag, return its contents as a string.

       >>> html = "<p><strong>Sample sentence</strong> with tags.</p>"
       >>> node = lxml.html.fragment_fromstring(html)
       >>> stringify_children(node)
       '<strong>Sample sentence</strong> with tags.'
    """
    if node is None or (len(node) == 0 and not getattr(node, 'text', None)):
        return ""
    node.attrib.clear()  # note: this removes the node's attributes in place
    opening_tag = len(node.tag) + 2     # length of "<tag>"
    closing_tag = -(len(node.tag) + 3)  # length of "</tag>"
    # encoding='unicode' returns str; with_tail=False keeps tail text out of the slice
    return lxml.html.tostring(node, encoding='unicode', with_tail=False)[opening_tag:closing_tag]

Unlike some of the other answers to this question, this solution preserves all of the tags contained within the node, and it attacks the problem from a different angle than the other working solutions.

Solution 12 - Python

Here is a working solution. We can get content with a parent tag and then cut the parent tag from output.

import re
from lxml import etree

def _tostr_with_tags(parent_element, html_entities=False):
	RE_CUT = r'^<([\w-]+)>(.*)</([\w-]+)>$'
	# tostring() returns ASCII bytes with numeric character references; decode to str
	content_with_parent = etree.tostring(parent_element).decode('ascii')

	def _replace_html_entities(s):
		RE_ENTITY = r'&#(\d+);'

		def repl(m):
			return chr(int(m.group(1)))  # unichr on Python 2

		replaced = re.sub(RE_ENTITY, repl, s, flags=re.MULTILINE|re.UNICODE)

		return replaced

	if not html_entities:
		content_with_parent = _replace_html_entities(content_with_parent)

	content_with_parent = content_with_parent.strip() # remove whitespace at both ends

	# the pattern only matches a parent tag without attributes
	start_tag, content_without_parent, end_tag = re.findall(RE_CUT, content_with_parent, flags=re.UNICODE|re.MULTILINE|re.DOTALL)[0]

	if start_tag != end_tag:
		raise Exception('Start tag does not match end tag while getting content with tags.')

	return content_without_parent

parent_element must be an Element.

Please note that if you want the text content (rather than numeric HTML entities in the text), leave the html_entities parameter as False.

Solution 13 - Python

lxml has a method for that (available on elements parsed with lxml.html):

node.text_content()

Solution 14 - Python

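If this is an <a> tag, you can try the following, which returns the values of the node's attributes (such as href) rather than its text: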

node.values()

Solution 15 - Python

import re
from lxml import etree

node = etree.fromstring("""
<content>Text before inner tag
    <div>Text
        <em>inside</em>
        tag
    </div>
    Text after inner tag
</content>""")

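# encoding='unicode' makes tostring() return str so the str pattern can match it
print(re.search(r"\A<[^<>]*>(.*)</[^<>]*>\Z", etree.tostring(node, encoding='unicode'), re.DOTALL).group(1))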

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Kevin Burke | View Question on Stackoverflow
Solution 1 - Python | Arthur Debert | View Answer on Stackoverflow
Solution 2 - Python | Ed Summers | View Answer on Stackoverflow
Solution 3 - Python | albertov | View Answer on Stackoverflow
Solution 4 - Python | anana | View Answer on Stackoverflow
Solution 5 - Python | Sandeep | View Answer on Stackoverflow
Solution 6 - Python | Percival Ulysses | View Answer on Stackoverflow
Solution 7 - Python | d3day | View Answer on Stackoverflow
Solution 8 - Python | Deepan Prabhu Babu | View Answer on Stackoverflow
Solution 9 - Python | inverted_index | View Answer on Stackoverflow
Solution 10 - Python | bwingenroth | View Answer on Stackoverflow
Solution 11 - Python | Joshmaker | View Answer on Stackoverflow
Solution 12 - Python | sergzach | View Answer on Stackoverflow
Solution 13 - Python | Hrabal | View Answer on Stackoverflow
Solution 14 - Python | David | View Answer on Stackoverflow
Solution 15 - Python | kazufusa | View Answer on Stackoverflow