What characters do I need to escape in XML documents?


Xml Problem Overview


What characters must be escaped in XML documents, or where could I find such a list?

Xml Solutions


Solution 1 - Xml

If you use an appropriate class or library, they will do the escaping for you. Many XML issues are caused by string concatenation.
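As a rough illustration of that point, the sketch below compares naive string concatenation with Python's standard `xml.etree.ElementTree`. The element name `note`, the attribute name `title`, and the sample text are made up for the example.

```python
import xml.etree.ElementTree as ET

text = 'Tom & Jerry <3 "quotes"'

# Naive concatenation: the raw & and < make the result ill-formed.
broken = "<note>" + text + "</note>"
try:
    ET.fromstring(broken)
    concat_ok = True
except ET.ParseError:
    concat_ok = False

# Letting the library serialize the same string escapes it automatically,
# both as element text and as an attribute value.
note = ET.Element("note")
note.text = text
note.set("title", text)
serialized = ET.tostring(note, encoding="unicode")

# Parsing the serialized form gives back the original string.
round_trip = ET.fromstring(serialized)
print("concatenation parsed:", concat_ok)
print(serialized)
print("round trip intact:", round_trip.text == text)
```

The library decides which characters need escaping in each context, so the caller never has to.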

XML escape characters

There are only five:

"   &quot;
'   &apos;
<   &lt;
>   &gt;
&   &amp;
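If a library serializer is not an option, the five replacements can be applied with Python's `xml.sax.saxutils`. This is a minimal sketch: `escape()` only handles `&`, `<` and `>` by default, so the two quote entities are passed in explicitly. The names `QUOTES` and `UNQUOTES` are just for this example.

```python
from xml.sax.saxutils import escape, unescape

# escape() covers &, < and > on its own; the quote entities are opt-in.
QUOTES = {'"': "&quot;", "'": "&apos;"}
UNQUOTES = {entity: char for char, entity in QUOTES.items()}

raw = """<a href="x">Tom & Jerry's</a>"""
escaped = escape(raw, QUOTES)
restored = unescape(escaped, UNQUOTES)

print(escaped)
print(restored == raw)
```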

Escaping characters depends on where the special character is used.

The examples can be validated at the W3C Markup Validation Service.

Text

The safe way is to escape all five characters in text. However, the three characters ", ' and > needn't be escaped in text:

<?xml version="1.0"?>
<valid>"'></valid>

Attributes

The safe way is to escape all five characters in attributes. However, the > character needn't be escaped in attributes:

<?xml version="1.0"?>
<valid attribute=">"/>

The ' character needn't be escaped in attributes if the quotes are ":

<?xml version="1.0"?>
<valid attribute="'"/>

Likewise, the " needn't be escaped in attributes if the quotes are ':

<?xml version="1.0"?>
<valid attribute='"'/>
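The quoting choices above can be left to a helper. As a sketch, Python's `xml.sax.saxutils.quoteattr` returns the value together with its surrounding quotes, picking whichever quote style needs the least escaping. The sample values are invented for the example.

```python
from xml.sax.saxutils import quoteattr
import xml.etree.ElementTree as ET

samples = ["it's", 'say "hi"', """both ' and " here""", "a > b"]

results = []
for value in samples:
    quoted = quoteattr(value)                  # includes the surrounding quotes
    doc = "<valid attribute=%s/>" % quoted
    parsed = ET.fromstring(doc).get("attribute")
    results.append((quoted, parsed == value))
    print(quoted, parsed == value)
```

Only the value containing both quote characters ends up with a `&quot;` reference; the others keep their quotes literal.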

Comments

None of the five special characters should be escaped in comments; entity references are not recognized there, so an escape would be kept as literal text:

<?xml version="1.0"?>
<valid>
<!-- "'<>& -->
</valid>

CDATA

None of the five special characters should be escaped in CDATA sections; entity references are not recognized there, so an escape would be kept as literal text:

<?xml version="1.0"?>
<valid>
<![CDATA["'<>&]]>
</valid>
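To see why escaping inside CDATA is counterproductive, here is a small check using Python's `xml.etree.ElementTree` (chosen only for illustration). It parses the CDATA example above and compares an entity reference inside CDATA with the same reference in ordinary text.

```python
import xml.etree.ElementTree as ET

# Inside CDATA the five characters are taken literally...
literal = ET.fromstring("""<valid><![CDATA["'<>&]]></valid>""").text

# ...and an entity reference is NOT expanded; it stays as five characters.
unexpanded = ET.fromstring("<valid><![CDATA[&amp;]]></valid>").text

# In ordinary text the same reference is expanded to a single &.
expanded = ET.fromstring("<valid>&amp;</valid>").text

print(repr(literal), repr(unexpanded), repr(expanded))
```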

Processing instructions

None of the five special characters should be escaped in XML processing instructions; entity references are not recognized there, so an escape would be kept as literal text:

<?xml version="1.0"?>
<?process <"'&> ?>
<valid/>

XML vs. HTML

HTML has its own set of escape codes which cover a lot more characters.

Solution 2 - Xml

Perhaps this will help:

List of XML and HTML character entity references:

> In SGML, HTML and XML documents, the logical constructs known as character data and attribute values consist of sequences of characters, in which each character can manifest directly (representing itself), or can be represented by a series of characters called a character reference, of which there are two types: a numeric character reference and a character entity reference. This article lists the character entity references that are valid in HTML and XML documents.

That article lists the following five predefined XML entities:

quot  "
amp   &
apos  '
lt    <
gt    >

Solution 3 - Xml

According to the XML specification of the World Wide Web Consortium (W3C), five characters have predefined entities. Of these, < and & must not appear in their literal form in an XML document, except when used as markup delimiters or within a comment, a processing instruction, or a CDATA section; the other three need escaping only in certain contexts, such as a quote character inside an attribute value delimited by the same quote. Where escaping is needed, replace the character with either the corresponding entity or the numeric reference from the following table:

Original character   XML entity replacement   XML numeric replacement
<                    &lt;                     &#60;
>                    &gt;                     &#62;
"                    &quot;                   &#34;
&                    &amp;                    &#38;
'                    &apos;                   &#39;

Note that these entities can also be used in HTML, with the exception of &apos;, which was introduced with XHTML 1.0 and is not declared in HTML 4. For this reason, and to ensure backward compatibility, the XHTML specification recommends using &#39; instead.
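The table can be checked mechanically. The sketch below, using Python's `xml.etree.ElementTree`, parses each entity and its numeric reference and confirms both decode to the same character. The hexadecimal form (`&#x3C;` and so on) is not in the table above but is also allowed by XML, so it is included as an extra check.

```python
import xml.etree.ElementTree as ET

TABLE = {"<": "lt", ">": "gt", '"': "quot", "&": "amp", "'": "apos"}

checks = {}
for char, name in TABLE.items():
    by_name = ET.fromstring("<v>&%s;</v>" % name).text
    by_number = ET.fromstring("<v>&#%d;</v>" % ord(char)).text
    by_hex = ET.fromstring("<v>&#x%X;</v>" % ord(char)).text
    checks[char] = (by_name == by_number == by_hex == char)
    print(char, ord(char), checks[char])
```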

Solution 4 - Xml

New, simplified answer to an old, commonly asked question...

Simplified XML Escaping (prioritized, 100% complete)

  1. Always (90% important to remember)

    • Escape < as &lt; unless < is starting a <tag/> or other markup.
    • Escape & as &amp; unless & is starting an &entity;.
  2. Attribute Values (9% important to remember)

    • attr=" 'Single quotes' are ok within double quotes."
    • attr=' "Double quotes" are ok within single quotes.'
    • Escape " as &quot; and ' as &apos; otherwise.
  3. Comments, CDATA, and Processing Instructions (0.9% important to remember)

    • <!-- Within comments --> nothing has to be escaped but no -- strings are allowed.
    • <![CDATA[ Within CDATA ]]> nothing has to be escaped, but no ]]> strings are allowed.
    • <?PITarget Within PIs ?> nothing has to be escaped, but no ?> strings are allowed.
  4. Esoterica (0.1% important to remember)
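The CDATA restriction in item 3 is often worked around by ending the section in the middle of any `]]>` and immediately opening a new one. The following is a sketch of that technique only, not something the answer above prescribes; `to_cdata` is a hypothetical helper name and the payload is invented.

```python
import xml.etree.ElementTree as ET

def to_cdata(text: str) -> str:
    """Wrap text in CDATA, splitting any ]]> across two adjacent sections."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"

payload = "if (a[b[0]]>c && d<e) {}"
doc = "<script>" + to_cdata(payload) + "</script>"

print(doc)
print(ET.fromstring(doc).text == payload)
```

The parser joins the adjacent sections back into one string, so the reader of the document sees the original payload.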

Solution 5 - Xml

Escaping characters is different for tags and attributes.

For tags:

 < &lt;
 > &gt; (only for compatibility, read below)
 & &amp;

For attributes:

" &quot;
' &apos;

From Character Data and Markup:

> The ampersand character (&) and the left angle bracket (<) must not appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they must be escaped using either numeric character references or the strings " &amp; " and " &lt; " respectively. The right angle bracket (>) may be represented using the string " &gt; ", and must, for compatibility, be escaped using either " &gt; " or a character reference when it appears in the string " ]]> " in content, when that string is not marking the end of a CDATA section.
>
> To allow attribute values to contain both single and double quotes, the apostrophe or single-quote character (') may be represented as " &apos; ", and the double-quote character (") as " &quot; ".

Solution 6 - Xml

In addition to the commonly known five characters [<, >, &, ", and '], I would also escape the vertical tab character (0x0B). It is valid UTF-8, but not valid XML 1.0, and even many libraries (including the highly portable (ANSI C) library libxml2) miss it and silently output invalid XML.

Solution 7 - Xml

Abridged from: XML, Escaping

There are five predefined entities:

&lt; represents "<"
&gt; represents ">"
&amp; represents "&"
&apos; represents '
&quot; represents "

"All permitted Unicode characters may be represented with a numeric character reference." For example:

&#20013;
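As a quick check of that reference, the sketch below uses Python's `xml.etree.ElementTree` to decode it (20013 is U+4E2D in hexadecimal) and to show the reverse direction: when serializing to ASCII bytes, which is ElementTree's default for `tostring`, a non-ASCII character is written back as a numeric reference.

```python
import xml.etree.ElementTree as ET

# Decimal and hexadecimal references to the same code point (U+4E2D).
decimal = ET.fromstring("<v>&#20013;</v>").text
hexadecimal = ET.fromstring("<v>&#x4E2D;</v>").text
print(decimal == hexadecimal == "\u4e2d")

# Serializing to ASCII bytes turns the character back into a reference.
as_bytes = ET.tostring(ET.fromstring("<v>\u4e2d</v>"))
print(as_bytes)
```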

Most of the control characters and other Unicode ranges are specifically excluded, meaning (I think) they can't occur either escaped or direct:

Valid characters in XML

Solution 8 - Xml

The accepted answer is not correct. It is best to use a library for escaping XML.

As mentioned in this other question

"Basically, the control characters and characters out of the Unicode ranges are not allowed. This means also that calling for example the character entity  is forbidden."

If you only escape the five characters, you can run into problems like https://stackoverflow.com/questions/5742543/an-invalid-xml-character-unicode-0xc-was-found
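One way to guard against this, when a library does not do it for you, is to remove characters that XML 1.0 does not allow before serializing. The sketch below builds a regular expression from the complement of the XML 1.0 `Char` production; `strip_invalid_xml_chars` is a hypothetical helper name, and silently dropping characters is just one possible policy (replacing them with a marker is another).

```python
import re

# Complement of the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML10 = re.compile(
    "[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def strip_invalid_xml_chars(text: str, replacement: str = "") -> str:
    """Remove (or replace) characters that cannot appear in XML 1.0."""
    return _INVALID_XML10.sub(replacement, text)

dirty = "page\x0cbreak and\x0bvertical tab, but\ttab is fine"
clean = strip_invalid_xml_chars(dirty)
print(repr(clean))
```

This step is independent of entity escaping: the cleaned string still needs `&` and `<` escaped before it goes into a document.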

Solution 9 - Xml

It depends on the context. For content, it is < and &, plus the sequence ]]> (a three-character string rather than a single character).

For attribute values, it is <, &, ", and '.

For CDATA, it is ]]>.
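The content and attribute rules above could be sketched as two small functions. This is an illustration under assumptions, not a replacement for a real serializer: `escape_content` and `escape_attribute` are hypothetical names, and attribute whitespace normalization and invalid control characters are deliberately not handled.

```python
import xml.etree.ElementTree as ET

def escape_content(text: str) -> str:
    # & goes first so the entities produced below are not escaped twice.
    return (text.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace("]]>", "]]&gt;"))

def escape_attribute(value: str) -> str:
    # Escaping both quote characters works for either quoting style.
    return (value.replace("&", "&amp;")
                 .replace("<", "&lt;")
                 .replace('"', "&quot;")
                 .replace("'", "&apos;"))

text = 'x < y && a[b[0]]> "z"'
doc = '<v a="%s">%s</v>' % (escape_attribute(text), escape_content(text))
node = ET.fromstring(doc)

print(doc)
print(node.text == text, node.get("a") == text)
```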

Solution 10 - Xml

Only < and & are required to be escaped if they are to be treated as character data and not markup:

2.4 Character Data and Markup

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type        Original Author        Original Content on Stackoverflow
Question            Julius A               View Question on Stackoverflow
Solution 1 - Xml    Welbog                 View Answer on Stackoverflow
Solution 2 - Xml    Andrew Hare            View Answer on Stackoverflow
Solution 3 - Xml    Albz                   View Answer on Stackoverflow
Solution 4 - Xml    kjhughes               View Answer on Stackoverflow
Solution 5 - Xml    Peter Bartels          View Answer on Stackoverflow
Solution 6 - Xml    Charon ME              View Answer on Stackoverflow
Solution 7 - Xml    Tim Cooper             View Answer on Stackoverflow
Solution 8 - Xml    Gabriel Furstenheim    View Answer on Stackoverflow
Solution 9 - Xml    把友情留在无盐         View Answer on Stackoverflow
Solution 10 - Xml   Questionless           View Answer on Stackoverflow