How to get text of an element in Selenium WebDriver, without including child element text?

PythonHtmlSeleniumSelenium Webdriver

Python Problem Overview


<div id="a">This is some
   <div id="b">text</div>
</div>

Getting "This is some" is non-trivial. For instance, this returns "This is some text":

driver.find_element_by_id('a').text

How does one, in a general way, get the text of a specific element without including the text of it's children?

(I'm providing an answer below but will leave the question open in case someone can come up with a less hideous solution).

Python Solutions


Solution 1 - Python

Here's a general solution:

def get_text_excluding_children(driver, element):
    return driver.execute_script("""
    return jQuery(arguments[0]).contents().filter(function() {
        return this.nodeType == Node.TEXT_NODE;
    }).text();
    """, element)

The element passed to the function can be something obtained from the find_element...() methods (i.e. it can be a WebElement object).

Or if you don't have jQuery or don't want to use it you can replace the body of the function above above with this:

return self.driver.execute_script("""
var parent = arguments[0];
var child = parent.firstChild;
var ret = "";
while(child) {
    if (child.nodeType === Node.TEXT_NODE)
        ret += child.textContent;
    child = child.nextSibling;
}
return ret;
""", element) 

I'm actually using this code in a test suite.

Solution 2 - Python

In the HTML which you have shared:

<div id="a">This is some
   <div id="b">text</div>
</div>

The text This is some is within a text node. To depict the text node in a structured way:

<div id="a">
    This is some
   <div id="b">text</div>
</div>

This Usecase

To extract and print the text This is some from the text node using Selenium's [tag:python] client you have 2 ways as follows:

  • Using splitlines(): You can identify the parent element i.e. <div id="a">, extract the innerHTML and then use splitlines() as follows:

  • using xpath:

     	print(driver.find_element_by_xpath("//div[@id='a']").get_attribute("innerHTML").splitlines()[0])
    
  • using xpath:

     	print(driver.find_element_by_css_selector("div#a").get_attribute("innerHTML").splitlines()[0])
    
  • Using execute_script(): You can also use the execute_script() method which can synchronously execute JavaScript in the current window/frame as follows:

  • using xpath and firstChild:

     	parent_element = driver.find_element_by_xpath("//div[@id='a']")
     	print(driver.execute_script('return arguments[0].firstChild.textContent;', parent_element).strip())
    
  • using xpath and childNodes[n]:

     	parent_element = driver.find_element_by_xpath("//div[@id='a']")
     	print(driver.execute_script('return arguments[0].childNodes[1].textContent;', parent_element).strip())
    

Solution 3 - Python

def get_true_text(tag):
    children = tag.find_elements_by_xpath('*')
    original_text = tag.text
    for child in children:
        original_text = original_text.replace(child.text, '', 1)
    return original_text

Solution 4 - Python

You don't have to do a replace, you can get the length of the children text and subtract that from the overall length, and slice into the original text. That should be substantially faster.

Solution 5 - Python

Unfortunately, Selenium was only built to work with Elements, not Text nodes.

If you try to use a function like get_element_by_xpath to target the text nodes, Selenium will throw an InvalidSelectorException.

One workaround is to grab the relevant HTML with Selenium and then use an HTML parsing library like BeautifulSoup that can handle text nodes more elegantly.

import bs4
from bs4 import BeautifulSoup

inner_html = driver.find_elements_by_css_selector('#a')[0].get_attribute("innerHTML")
inner_soup = BeautifulSoup(inner_html, 'html.parser')

outer_html = driver.find_elements_by_css_selector('#a')[0].get_attribute("outerHTML")
outer_soup = BeautifulSoup(outer_html, 'html.parser')

From there, there are several ways to search for the Text content. You'll have to experiment to see what works best for your use case.

Here's a simple one-liner that may be sufficient:

inner_soup.find(text=True)

If that doesn't work, then you can loop through the element's child nodes with .contents() and check their object type.

BeautifulSoup has four types of elements, and the one that you'll be interested in is the NavigableString type, which is produced by Text nodes. By contrast, Elements will have a type of Tag.

contents = inner_soup.contents

for bs4_object in contents:

    if (type(bs4_object) == bs4.Tag):
	    print("This object is an Element.")

    elif (type(bs4_object) == bs4.NavigableString):
	    print("This object is a Text node.")

Note that BeautifulSoup doesn't support Xpath expressions. If you need those, then you can use some of the workarounds in this thread.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionjoshView Question on Stackoverflow
Solution 1 - PythonLouisView Answer on Stackoverflow
Solution 2 - Pythonundetected SeleniumView Answer on Stackoverflow
Solution 3 - PythonjoshView Answer on Stackoverflow
Solution 4 - PythonkreativiteaView Answer on Stackoverflow
Solution 5 - PythonPikamander2View Answer on Stackoverflow