Can you provide some examples of why it is hard to parse XML and HTML with a regex?

Html Problem Overview

One mistake I see people making over and over again is trying to parse XML or HTML with a regex. Here are a few of the reasons parsing XML and HTML is hard:

People want to treat a file as a sequence of lines, but this is valid:

<tag
attr="5"
/>

People want to treat < or <tag as the start of a tag, but stuff like this exists in the wild:

<img src="imgtag.gif" alt="<img>" />

People often want to match starting tags to ending tags, but XML and HTML allow tags to contain themselves (which traditional regexes cannot handle at all):

<span id="outer"><span id="inner">foo</span></span>

People often want to match against the content of a document (such as the famous "find all phone numbers on a given page" problem), but the data may be marked up (even if it appears to be normal when viewed):

<span class="phonenum">(<span class="area code">703</span>)
<span class="prefix">348</span>-<span class="linenum">3020</span></span>

Comments may contain poorly formatted or incomplete tags:

<a href="foo">foo</a>
<!-- FIXME:
    <a href="
-->
<a href="bar">bar</a>

What other gotchas are you aware of?

Html Solutions

Solution 1 - Html

Here's some fun valid XML for you:

<!DOCTYPE x [ <!ENTITY y "a]>b"> ]>
<x>
    <a b="&y;>" />
    <![CDATA[[a>b <a>b <a]]>
    <?x <a> <!-- <b> ?> c --> d
</x>

And this little bundle of joy is valid HTML:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd" [
    <!ENTITY % e "href='hello'">
    <!ENTITY e "<a %e;>">
]>
    <title>x</TITLE>
</head>
    <p id  =  a:b center>
    <span / hello </span>
    &amp<br left>
    <!---- >t<!---> < -->
    &e link </a>
</body>

Not to mention all the browser-specific parsing for invalid constructs.

Good luck pitting regex against that!

EDIT (Jörg W Mittag): Here is another nice piece of well-formed, valid HTML 4.01:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
  "http://www.w3.org/TR/html4/strict.dtd"> 
<HTML/
  <HEAD/
    <TITLE/>/
    <P/>

Solution 2 - Html

Actually

<img src="imgtag.gif" alt="<img>" />

is not valid HTML, and is not valid XML either.

It is not valid XML because the '<' and '>' are not valid characters inside attribute strings. They need to be escaped using the corresponding XML entities < and >

It is not valid HTML either because the short closing form is not allowed in HTML (but is correct in XML and XHTML). The 'img' tag is also an implicitly closed tag as per the HTML 4.01 specification. This means that manually closing it is actually wrong, and is equivalent to closing any other tag twice.

The correct version in HTML is

<img src="imgtag.gif" alt="&lt;img&gt;">

and the correct version in XHTML and XML is

<img src="imgtag.gif" alt="&lt;img&gt;"/>

The following example you gave is also invalid

<
tag
attr="5"
/>

This is not valid HTML or XML either. The name of the tag must be right behind the '<', although the attributes and the closing '>' may be wherever they want. So the valid XML is actually

<tag
attr="5"
/>

And here's another funkier one: you can actually choose to use either " or ' as your attribute quoting character

<img src="image.gif" alt='This is single quoted AND valid!'>

All the other reasons that were posted are correct, but the biggest problem with parsing HTML is that people usually don't understand all the syntax rules correctly. The fact that your browser interprets your tagsoup as HTML doesn't means that you have actually written valid HTML.

Edit: And even stackoverflow.com agrees with me regarding the definition of valid and invalid. Your invalid XML/HTML is not highlighted, while my corrected version is.

Basically, XML is not made to be parsed with regexps. But there is also no reason to do so. There are many, many XML parsers for each and every language. You have the choice between SAX parsers, DOM parsers and Pull parsers. All of these are guaranteed to be much faster than parsing with a regexp and you may then use cool technologies like XPath or XSLT on the resulting DOM tree.

My reply is therefore: not only is parsing XML with regexps hard, but it is also a bad idea. Just use one of the millions of existing XML parsers, and take advantage of all the advanced features of XML.

HTML is just too hard to even try parsing on your own. First the legal syntax has many little subtleties that you may not be aware of, and second, HTML in the wild is just a huge stinking pile of (you get my drift). There are a variety of lax parser libraries that do a good job at handling HTML like tag soup, just use these.

Solution 3 - Html

I wrote an entire blog entry on this subject: Regular Expression Limitations

The crux of the issue is that HTML and XML are recursive structures which require counting mechanisms in order to properly parse. A true regex is not capable of counting. You must have a context free grammar in order to count.

The previous paragraph comes with a slight caveat. Certain regex implementations now support the idea of recursion. However once you start adding recursion into your regex expressions, you are really stretching the boundaries and should consider a parser.

Solution 4 - Html

One gotcha not on your list is that attributes can appear in any order, so if your regex is looking for a link with the href "foo" and the class "bar", they can come in any order, and have any number of other things between them.

Solution 5 - Html

It depends on what you mean by "parsing". Generally speaking, XML cannot be parsed using regex since XML grammar is by no means regular. To put it simply, regexes cannot count (well, Perl regexes might actually be able to count things) so you cannot balance open-close tags.

Solution 6 - Html

Are people actually making a mistake by using a regex, or is it simply good enough for the task they're trying to achieve?

I totally agree that parsing html and xml using a regex is not possible as other people have answered.

However, if your requirement is not to parse html/xml but to just get at one small bit of data in a "known good" bit of html / xml then maybe a regular expression or even an even simpler "substring" is good enough.

Solution 7 - Html

I'm tempted to say "don't re-invent the wheel". Except that XML is a really, really complex format. So maybe I should say "don't reinvent the synchrotron."

Perhaps the correct cliche starts "when all you have is a hammer..." You know how to use regular expressions, regular expression are good at parsing, so why bother to learn an XML parsing library?

Because parsing XML is hard. Any effort you save by not having to learn to use an XML parsing library will be more than made up by the amount of creative work and bug-swatting you will have to do. For your own sake, google "XML library" and leverage somebody else's work.

Solution 8 - Html

People normally default to writing greedy patterns, often enough leading to an un-thought-through .* slurping large chunks of file into the largest possible <foo>.*</foo>.

Solution 9 - Html

I think the problems boil down to:

The regex is almost invariably incorrect. There are legitimate inputs which it will fail to match correctly. If you work hard enough you can make it 99% correct, or 99.999%, but making it 100% correct is almost impossible, if only because of the weird things that XML allows by using entities.
If the regex is incorrect, even for 0.00001% of inputs, then you have a security problem, because someone can discover the one input that will break your application.
If the regex is correct enough to cover 99.99% of cases then it is going to be thoroughly unreadable and unmaintainable.
It's very likely that a regex will perform very badly on moderate-sized input files. My very first encounter with XML was to replace a Perl script that (incorrectly) parsed incoming XML documents with a proper XML parser, and we not only replaced 300 lines of unreadable code with 100 lines that anyone could understand, but we improved user response time from 10 seconds to about 0.1 seconds.

Solution 10 - Html

I believe this classic has the information you are looking for. You can find the point in one of the comments there:

> I think the flaw here is that HTML is a Chomsky Type 2 grammar > (context free grammar) and RegEx is a Chomsky Type 3 grammar (regular > expression). Since a Type 2 grammar is fundamentally more complex than > a Type 3 grammar - you can't possibly hope to make this work. But many > will try, some will claim success and others will find the fault and > totally mess you up.

Some more info from Wikipedia: Chomsky Hierarchy

Solution 11 - Html

I gave a simplified answer to this problem here. While it doesn't account for the 100% mark, I explain how it's possible if you're willing to do some pre-processing work.

Solution 12 - Html

> Generally speaking, XML cannot be parsed using regex since XML grammar is by no means regular. To put it simply, regexes cannot count (well, Perl regexes might actually be able to count things) so you cannot balance open-close tags.

I disagree. If you will use recursive in regex, you can easily find open and close tags.

Here I showed example of regex to avoid parsing errors of examples in first message.

Content Type	Original Author	Original Content on Stackoverflow
Question	Chas. Owens	View Question on Stackoverflow
Solution 1 - Html	bobince	View Answer on Stackoverflow
Solution 2 - Html	LordOfThePigs	View Answer on Stackoverflow
Solution 3 - Html	JaredPar	View Answer on Stackoverflow
Solution 4 - Html	AmbroseChapel	View Answer on Stackoverflow
Solution 5 - Html	Anton Gogolev	View Answer on Stackoverflow
Solution 6 - Html	Robin Day	View Answer on Stackoverflow
Solution 7 - Html	Isaac Rabinovitch	View Answer on Stackoverflow
Solution 8 - Html	chaos	View Answer on Stackoverflow
Solution 9 - Html	Michael Kay	View Answer on Stackoverflow
Solution 10 - Html	Adam Arold	View Answer on Stackoverflow
Solution 11 - Html	Erutan409	View Answer on Stackoverflow
Solution 12 - Html	Maxim Suslov	View Answer on Stackoverflow