lxml etree xmlparser remove unwanted namespace

Python, Lxml, Xml Parsing, Elementtree

Python Problem Overview


I have an XML document that I am trying to parse using lxml.etree:

<Envelope xmlns="http://www.example.com/zzz/yyy">
  <Header>
    <Version>1</Version>
  </Header>
  <Body>
    some stuff
  </Body>
</Envelope>

My code is:

path = "path to xml file"
from lxml import etree as ET
parser = ET.XMLParser(ns_clean=True)
dom = ET.parse(path, parser)
dom.getroot()

When I try to get dom.getroot() I get:

<Element {http://www.example.com/zzz/yyy}Envelope at 28adacac>

However I only want:

<Element Envelope at 28adacac>

When I do

dom.getroot().find("Body")

I get nothing returned. However, when I call

dom.getroot().find("{http://www.example.com/zzz/yyy}Body") 

I get a result.

I thought passing ns_clean=True to the parser would prevent this.

Any ideas?

Python Solutions


Solution 1 - Python

import io
import lxml.etree as ET

# bytes literal: io.BytesIO() requires bytes on Python 3
content = b'''\
<Envelope xmlns="http://www.example.com/zzz/yyy">
  <Header>
    <Version>1</Version>
  </Header>
  <Body>
    some stuff
  </Body>
</Envelope>
'''    
dom = ET.parse(io.BytesIO(content))

You can find namespace-aware nodes using the xpath method:

body = dom.xpath('//ns:Body', namespaces={'ns': 'http://www.example.com/zzz/yyy'})
print(body)
# [<Element {http://www.example.com/zzz/yyy}Body at 90b2d4c>]
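
The same kind of prefix map also works with find() and findtext(), which is closer to what the question was trying to do. A minimal sketch, assuming the document from the question; the prefix 'ns' and the name NS are arbitrary choices made here, not anything lxml requires:

```python
import io
import lxml.etree as ET

# 'ns' is an arbitrary prefix picked for this example
NS = {'ns': 'http://www.example.com/zzz/yyy'}

content = b'''\
<Envelope xmlns="http://www.example.com/zzz/yyy">
  <Header>
    <Version>1</Version>
  </Header>
  <Body>
    some stuff
  </Body>
</Envelope>
'''
root = ET.parse(io.BytesIO(content)).getroot()

# find()/findtext() accept a namespaces mapping, just like xpath()
body = root.find('ns:Body', namespaces=NS)
version = root.findtext('ns:Header/ns:Version', namespaces=NS)

print(body.tag)
print(version)
```

The tree itself is unchanged; only the lookup becomes shorter to write.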

If you really want to remove namespaces, you could use an XSL transformation:

# http://wiki.tei-c.org/index.php/Remove-Namespaces.xsl
xslt = b'''<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="no"/>

<xsl:template match="/|comment()|processing-instruction()">
    <xsl:copy>
      <xsl:apply-templates/>
    </xsl:copy>
</xsl:template>

<xsl:template match="*">
    <xsl:element name="{local-name()}">
      <xsl:apply-templates select="@*|node()"/>
    </xsl:element>
</xsl:template>

<xsl:template match="@*">
    <xsl:attribute name="{local-name()}">
      <xsl:value-of select="."/>
    </xsl:attribute>
</xsl:template>
</xsl:stylesheet>
'''

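xslt_doc = ET.parse(io.BytesIO(xslt))  # parse the stylesheet
transform = ET.XSLT(xslt_doc)          # compile it into a callable transform
dom = transform(dom)                   # apply it; returns a new result tree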

Here we see the namespace has been removed:

print(ET.tostring(dom).decode())  # tostring() returns bytes on Python 3
# <Envelope>
#   <Header>
#     <Version>1</Version>
#   </Header>
#   <Body>
#     some stuff
#   </Body>
# </Envelope>

So you can now find the Body node this way:

print(dom.find("Body"))
# <Element Body at 8506cd4>
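
If XSLT feels heavy, a similar result can probably be had in plain Python by rewriting each tag to its local name. This is a different technique from the stylesheet above, sketched under the assumption that only element names matter; strip_namespaces is a hypothetical helper name, and unlike the XSLT it does not touch namespaced attributes:

```python
import io
import lxml.etree as ET

def strip_namespaces(root):
    """Rewrite every element tag in place to its local name (hypothetical helper)."""
    for el in root.iter():
        # Comments and processing instructions have a non-string tag; skip them.
        if isinstance(el.tag, str):
            el.tag = ET.QName(el).localname
    # Remove namespace declarations that are no longer used.
    ET.cleanup_namespaces(root)
    return root

content = b'''\
<Envelope xmlns="http://www.example.com/zzz/yyy">
  <Header>
    <Version>1</Version>
  </Header>
  <Body>
    some stuff
  </Body>
</Envelope>
'''
root = strip_namespaces(ET.parse(io.BytesIO(content)).getroot())

print(root.tag)  # Envelope
print(root.find('Body'))
```

Note that this mutates the parsed tree, so keep a copy if the namespaced version is still needed.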

Solution 2 - Python

Try using XPath with local-name(), which matches elements by name regardless of their namespace:

dom.xpath("//*[local-name() = 'Body']")

Taken (and simplified) from this page, under "The xpath() method" section
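
For completeness, here is that expression run against the document from the question. A small sketch only; keep in mind that xpath() returns a list, so you index into it (or loop) to get at the element:

```python
import io
import lxml.etree as ET

content = b'''\
<Envelope xmlns="http://www.example.com/zzz/yyy">
  <Header>
    <Version>1</Version>
  </Header>
  <Body>
    some stuff
  </Body>
</Envelope>
'''
dom = ET.parse(io.BytesIO(content))

# local-name() compares only the part of the tag after the namespace
matches = dom.xpath("//*[local-name() = 'Body']")

# xpath() always returns a list, even for a single hit
for el in matches:
    print(el.tag, repr(el.text.strip()))
```

The element still carries its namespace in el.tag; only the query ignores it.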

Solution 3 - Python

The last solution from https://bitbucket.org/olauzanne/pyquery/issue/17 can help you avoid namespaces with little effort:

> apply xml.replace(' xmlns:', ' xmlnamespace:') to your xml before using pyquery so lxml will ignore namespaces

In your case, try xml.replace(' xmlns="', ' xmlnamespace="'). However, a plain string replace also rewrites that text wherever it occurs in element content, so you might need something more careful if it can appear in the bodies as well.
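
A rough sketch of that replace applied to the document from the question. The count argument of 1 is an addition made here to limit the change to the first declaration; it assumes the default namespace is declared once, on the root element:

```python
import lxml.etree as ET

xml = '''<Envelope xmlns="http://www.example.com/zzz/yyy">
  <Header>
    <Version>1</Version>
  </Header>
  <Body>
    some stuff
  </Body>
</Envelope>
'''

# Rename the first default-namespace declaration so the parser
# sees an ordinary attribute instead of a namespace.
patched = xml.replace(' xmlns="', ' xmlnamespace="', 1)
root = ET.fromstring(patched)

print(root.tag)  # Envelope
print(root.find('Body'))
print(root.get('xmlnamespace'))  # the old namespace URI, now a plain attribute
```

The renamed attribute stays on the root element, so it will show up again if you serialize the tree.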

Solution 4 - Python

Another not-too-bad option is to use the QName helper and wrap it in a function with a default namespace:

from lxml import etree

DEFAULT_NS = 'http://www.example.com/zzz/yyy'

def tag(name, ns=DEFAULT_NS):
    return etree.QName(ns, name)

dom = etree.parse(path)
body = dom.getroot().find(tag('Body'))

Solution 5 - Python

You're showing the result of the repr() call. When you programmatically move through the tree, you can simply choose to ignore the namespace.
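
One way to do that, as a sketch: compare on the local part of each tag while walking the tree. The helper name local is made up for this example; QName.localname is what does the work:

```python
import lxml.etree as ET

content = b'''\
<Envelope xmlns="http://www.example.com/zzz/yyy">
  <Header>
    <Version>1</Version>
  </Header>
  <Body>
    some stuff
  </Body>
</Envelope>
'''

def local(el):
    """Return the tag of el without its namespace (hypothetical helper)."""
    return ET.QName(el).localname

root = ET.fromstring(content)

# Walk the direct children and look only at the local names
names = [local(child) for child in root]
print(names)

# Pick a child by local name, ignoring whatever namespace it is in
body = next(child for child in root if local(child) == 'Body')
print(body.text.strip())
```

This leaves the document untouched, which matters if you later write it back out and the namespace needs to survive.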

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Mark | View Question on Stackoverflow
Solution 1 - Python | unutbu | View Answer on Stackoverflow
Solution 2 - Python | dusan | View Answer on Stackoverflow
Solution 3 - Python | Andrei | View Answer on Stackoverflow
Solution 4 - Python | Tom | View Answer on Stackoverflow
Solution 5 - Python | robert | View Answer on Stackoverflow