Find everything between two XML tags with RegEx

Java, PHP, Regex, XML, Perl

Java Problem Overview


Using a regex, I want to match an XML tag and everything inside it, as in the following:

<primaryAddress>
	<addressLine>280 Flinders Mall</addressLine>
	<geoCodeGranularity>PROPERTY</geoCodeGranularity>
	<latitude>-19.261365</latitude>
	<longitude>146.815585</longitude>
	<postcode>4810</postcode>
	<state>QLD</state>
	<suburb>Townsville</suburb>
	<type>PHYSICAL</type>
</primaryAddress>

I want to match the primaryAddress tag and everything inside it, and delete the whole thing.

The content inside primaryAddress varies, but I want to remove the entire element, sub-tags included, wherever primaryAddress appears.

Does anyone have any idea how to do that?

Java Solutions


Solution 1 - Java

It is not a good idea to use regex for HTML/XML parsing...

However, if you want to do it anyway, search for the regex pattern

<primaryAddress>[\s\S]*?<\/primaryAddress>

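and replace each match with an empty string. The lazy quantifier *? stops at the first closing tag, and [\s\S] matches any character including newlines, so no dot-all flag is needed.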
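In Java, the search-and-replace described above could look roughly like the following. This is a minimal sketch, assuming the opening tag carries no attributes and primaryAddress elements are never nested; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class PrimaryAddressStripper {
    // Lazy match from the opening tag to the nearest closing tag.
    // [\s\S] matches every character, line terminators included.
    private static final Pattern PRIMARY_ADDRESS =
            Pattern.compile("<primaryAddress>[\\s\\S]*?</primaryAddress>");

    // Returns the input with every primaryAddress element removed.
    static String strip(String xml) {
        return PRIMARY_ADDRESS.matcher(xml).replaceAll("");
    }

    public static void main(String[] args) {
        // Hypothetical wrapper element around the fragment from the question.
        String xml = "<customer>\n"
                + "<primaryAddress>\n"
                + "\t<postcode>4810</postcode>\n"
                + "\t<state>QLD</state>\n"
                + "</primaryAddress>\n"
                + "</customer>";
        System.out.println(strip(xml));
    }
}
```

The forward slash does not need escaping in a Java string; the \/ in the pattern above only matters in languages that delimit regex literals with slashes.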

Solution 2 - Java

You should be able to match it with: /<primaryAddress>(.+?)<\/primaryAddress>/

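The content between the tags will be in the first capture group. Note that . does not match newlines by default, so a multi-line element like the one in the question needs the dot-all flag (/s in Perl and PHP, Pattern.DOTALL in Java).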
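To illustrate the capture group in Java, here is a small sketch that collects the content of each primaryAddress element. It assumes Pattern.DOTALL is wanted so that the group can span several lines; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PrimaryAddressExtractor {
    // DOTALL lets '.' match line terminators, so the lazy group can span lines.
    private static final Pattern PRIMARY_ADDRESS =
            Pattern.compile("<primaryAddress>(.+?)</primaryAddress>", Pattern.DOTALL);

    // Returns the text between each pair of primaryAddress tags, in document order.
    static List<String> extract(String xml) {
        List<String> bodies = new ArrayList<>();
        Matcher m = PRIMARY_ADDRESS.matcher(xml);
        while (m.find()) {
            // group(1) is the content only; group(0) would include the tags.
            bodies.add(m.group(1));
        }
        return bodies;
    }
}
```

Calling extract on the fragment from the question would return a single string holding the sub-tags, including the surrounding whitespace.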

Solution 3 - Java

Using regex for this is not a good approach, but if you really want to extract the content with one:

<primaryAddress.*>((.|\n)*?)<\/primaryAddress>

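The accepted answer matches the tags as well, whereas this pattern captures only the content between the tags (group 1). Be aware that the .* in the opening tag is greedy: when the whole element sits on one line it can swallow the content you meant to capture, so [^>]* is the safer choice there.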

Solution 4 - Java

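This can capture the outermost pair of tags in most cases, even when the tags carry attributes or are self-closing (no separate end tag). It relies on recursion via (?R), which PCRE (PHP) and Perl support but java.util.regex does not.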

(<!--((?!-->).)*-->|<\w*((?!\/<).)*\/>|<(?<tag>\w+)[^>]*>(?>[^<]|(?R))*<\/\k<tag>\s*>)

Edit: as mentioned in the comments above, regex is never enough to parse XML. Modifying the pattern to cover more cases only makes it longer, and it still remains unreliable.
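Given that limitation, a parser-based removal may be the more robust route in Java. The following is a minimal sketch, assuming the JDK's built-in DOM and transformer APIs are acceptable and the input is well-formed XML; the class and method names are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class DomElementRemover {
    // Parses the XML, removes every element named tagName, and serializes the rest.
    static String removeElements(String xml, String tagName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            NodeList matches = doc.getElementsByTagName(tagName);
            // The NodeList is live, so walk it backwards while removing nodes.
            for (int i = matches.getLength() - 1; i >= 0; i--) {
                Node node = matches.item(i);
                node.getParentNode().removeChild(node);
            }

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            // Malformed input or a parser/transformer configuration problem.
            throw new IllegalArgumentException("Could not process XML", e);
        }
    }
}
```

Unlike the regex approaches, this handles attributes, nesting, comments, and CDATA sections the same way the parser does, at the cost of requiring well-formed input.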

Solution 5 - Java

In our case, we receive XML as a String and need to deal with values that contain "special" characters such as &, < and >. Basically, someone can send us XML in this form:

<notes>
  <note>
     <to>jenice & carl </to>
     <from>your neighbor <; </from>
  </note>
</notes>

So I need to find the values jenice & carl and your neighbor <; in that String and properly escape & and < (otherwise the XML is invalid when you later pass it to an engine that shall remain unnamed).

Doing this with regex is a rather dumb idea to begin with, but it's cheap and easy. So for the brave ones who would like to do the same thing I did, here you go:

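    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    String xml = ...
    // Group 1: tag name, group 2: value on the same line, group 3: closing tag name.
    Pattern p = Pattern.compile("<(.+)>(?!\\R<)(.+)</(\\1)>");
    Matcher m = p.matcher(xml);
    // Matcher.replaceAll(Function) requires Java 9 or later.
    // Read groups from the MatchResult (mr) passed to the lambda, not from the outer Matcher.
    String result = m.replaceAll(mr -> {
        String value = mr.group(2);
        if (value.contains("&")) {
            value = value + "+ some change"; // placeholder for the real escaping logic
        }
        // quoteReplacement stops '$' and '\' in the value from being read as group references.
        return Matcher.quoteReplacement("<" + mr.group(1) + ">" + value + "</" + mr.group(3) + ">");
    });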

Attributions

All content for this solution is sourced from the original question on Stack Overflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stack Overflow
Question | Doz | View Question on Stack Overflow
Solution 1 - Java | Ωmega | View Answer on Stack Overflow
Solution 2 - Java | doublesharp | View Answer on Stack Overflow
Solution 3 - Java | saman | View Answer on Stack Overflow
Solution 4 - Java | Valen | View Answer on Stack Overflow
Solution 5 - Java | Eugene | View Answer on Stack Overflow