What are invalid characters in XML

Xml Problem Overview


I am working with some XML that holds strings like:

<node>This is a string</node>

Some of the strings that I am passing to the nodes will have characters like &, #, $, etc.:

<node>This is a string & so is this</node>

This is not valid due to &.

I cannot wrap these strings in CDATA as they need to stay as they are. I tried looking for a list of characters that cannot be put in XML nodes without being in a CDATA section.

Can someone point me in the direction of one or provide me with a list of illegal characters?

Xml Solutions


Solution 1 - Xml

OK, let's separate the question of the characters that:

  1. aren't valid at all in any XML document.
  2. need to be escaped.

The answer provided by @dolmen in "https://stackoverflow.com/questions/730133/invalid-characters-in-xml/5110103#5110103" is still valid but needs to be updated with the XML 1.1 specification.

1. Invalid characters

The lists below give all the characters that are allowed in an XML document; anything outside them is invalid.

1.1. In XML 1.0

  • Reference: see [XML recommendation 1.0, §2.2 Characters][1]

The global list of allowed characters is:

> [2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

Basically, control characters (other than tab, line feed and carriage return) and code points outside the Unicode ranges above are not allowed. This also means that a character reference such as &#x3; is forbidden.
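The Char production above translates almost directly into code. The following is a minimal Java sketch, not taken from any answer here: the class name Xml10Chars and its methods are hypothetical names chosen for illustration. It works on code points rather than UTF-16 chars so that characters above #xFFFF are handled correctly.

```java
final class Xml10Chars {
    /** True if the code point matches the XML 1.0 Char production. */
    static boolean isValid(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns a copy of the input with every code point outside Char removed. */
    static String strip(String text) {
        StringBuilder out = new StringBuilder(text.length());
        // codePoints() yields an unpaired surrogate as its own value,
        // which isValid() rejects, so lone surrogates are dropped as well.
        text.codePoints().filter(Xml10Chars::isValid).forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        // U+0003 is removed, the tab is kept.
        System.out.println(strip("a\u0003b\tc").length());
    }
}
```

Stripping is only one possible policy; depending on the data it may be better to reject the input or to replace the offending characters with a visible marker.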

1.2. In XML 1.1

  • Reference: see [XML recommendation 1.1, §2.2 Characters][2], and [1.3 Rationale and list of changes for XML 1.1][3]

The global list of allowed characters is:

> [2] Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

> [2a] RestrictedChar ::= [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | [#x86-#x9F]

This revision of the XML recommendation extends the allowed characters so that control characters are allowed (those listed in RestrictedChar only as character references), and takes into account a new revision of the Unicode standard. These are still not allowed: NUL (x00), xFFFE, xFFFF...

However, the use of control characters and undefined Unicode characters is discouraged.

Note also that not all parsers support this, so XML documents with control characters may be rejected.
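To make the difference with XML 1.0 concrete, here is a small Java sketch of the two XML 1.1 productions quoted above. The class and method names (Xml11Chars, encodeRestricted) are made up for this example. It assumes the common reading of the 1.1 recommendation that RestrictedChar characters may only appear as numeric character references, and it does not escape & or <, which still has to be done separately.

```java
final class Xml11Chars {
    /** True if the code point matches the XML 1.1 Char production. */
    static boolean isChar(int cp) {
        return (cp >= 0x1 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** True if the code point matches the XML 1.1 RestrictedChar production. */
    static boolean isRestricted(int cp) {
        return (cp >= 0x1 && cp <= 0x8)
            || cp == 0xB || cp == 0xC
            || (cp >= 0xE && cp <= 0x1F)
            || (cp >= 0x7F && cp <= 0x84)
            || (cp >= 0x86 && cp <= 0x9F);
    }

    /** Drops code points outside Char and writes restricted ones as &#x...; references. */
    static String encodeRestricted(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            if (!isChar(cp)) {
                return; // NUL, surrogates, FFFE, FFFF: never allowed
            }
            if (isRestricted(cp)) {
                out.append("&#x").append(Integer.toHexString(cp)).append(';');
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encodeRestricted("a\u0001b"));
    }
}
```

A document produced this way must declare version="1.1" in its XML declaration, and as noted above, some parsers may still reject it.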

2. Characters that need to be escaped (to obtain a well-formed document):

The < must be escaped as &#60; (or the predefined entity &lt;), since it is assumed to be the beginning of a tag.

The & must be escaped as &#38; (or &amp;), since it is assumed to be the beginning of an entity reference.

The > should be escaped as &#62; (or &gt;). It is not mandatory -- it depends on the context -- but it is strongly advised to escape it.

The ' should be escaped as &#39; (or &apos;) -- mandatory in attributes delimited by single quotes, but it is strongly advised to always escape it.

The " should be escaped as &#34; (or &quot;) -- mandatory in attributes delimited by double quotes, but it is strongly advised to always escape it.
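The five rules above can be sketched as a single pass over the string. This is an illustrative Java example, not a library API: XmlEscaper and escape are hypothetical names, and it emits the predefined named entities, which are equivalent to the numeric references. It always escapes all five characters, so the result is safe in both text content and attribute values.

```java
final class XmlEscaper {
    /** Escapes the five XML special characters using the predefined entities. */
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println("<node>" + escape("This is a string & so is this") + "</node>");
    }
}
```

Escaping does not make invalid characters (such as most control characters) valid; those still have to be removed or encoded beforehand.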

[1]: http://www.w3.org/TR/xml/#charsets "2.2 Characters"
[2]: http://www.w3.org/TR/xml11/#charsets "2.2 Characters"
[3]: http://www.w3.org/TR/xml11/#sec-xml11 "§1.3 Rationale and list of changes for XML 1.1"

Solution 2 - Xml

The list of valid characters is in the XML specification:

Char	   ::=   	#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]	/* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

Solution 3 - Xml

The only characters that need escaping are &, < and > (as well as " or ' in attributes, depending on which character is used to delimit the attribute value: attr="must use &quot; here, ' is allowed" and attr='must use &apos; here, " is allowed').

They're escaped using XML entities; in this case you want &amp; for &.

Really, though, you should use a tool or library that writes XML for you and abstracts this kind of thing away, so you don't have to worry about it.

Solution 4 - Xml

This C# code removes the invalid XML characters from a string and returns a new, valid string.

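public static string CleanInvalidXmlChars(string text)
{
    // Requires: using System.Text.RegularExpressions;
    // From the XML spec, valid chars are:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    // any Unicode character, excluding the surrogate blocks, FFFE, and FFFF.
    // \u takes exactly four hex digits and .NET strings are UTF-16, so the
    // [#x10000-#x10FFFF] range shows up as surrogate pairs. Remove forbidden
    // control characters, FFFE/FFFF and unpaired surrogates; keep valid pairs.
    string re = @"[\x00-\x08\x0B\x0C\x0E-\x1F\uFFFE\uFFFF]"
              + @"|[\uD800-\uDBFF](?![\uDC00-\uDFFF])"
              + @"|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]";
    return Regex.Replace(text, re, "");
}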

Solution 5 - Xml

The characters that have predefined entities are:

& < > " '

See "What are the special characters in XML?" for more information.

Solution 6 - Xml

In addition to potame's answer, here is what applies if you do want to use a CDATA block instead of escaping.

If you put your text in a CDATA block then you don't need to use escaping. In that case you can use all characters in the following range:

[Image: graphical representation of possible characters]

Note: On top of that, you're not allowed to use the ]]> character sequence, because it would be read as the end of the CDATA block.

If there are still invalid characters (e.g. control characters), then probably it's better to use some kind of encoding (e.g. base64).
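Both points can be illustrated with a short Java sketch; CdataWriter, wrap and toBase64 are hypothetical names used only here. The wrap method uses the usual workaround for ]]>, which is to end the section after ]] and open a new one before the >. The toBase64 method shows the fallback for text that contains characters no XML form can carry.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class CdataWriter {
    /** Wraps text in CDATA, splitting any "]]>" across two adjacent sections. */
    static String wrap(String text) {
        return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>";
    }

    /** Encodes text as Base64 over its UTF-8 bytes, for content with invalid characters. */
    static String toBase64(String text) {
        return Base64.getEncoder().encodeToString(text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        System.out.println(wrap("if (a[b[0]]>1 && c<2)"));
        System.out.println(toBase64("a\u000Bb"));
    }
}
```

A parser concatenates adjacent CDATA sections, so the reader sees the original text again. With Base64, the consumer has to know that the element content must be decoded.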

Solution 7 - Xml

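Another way to remove invalid XML characters in C# is to use XmlConvert.IsXmlChar (available since .NET Framework 4.0; the snippets below also need using System.Linq). Note that it tests a single UTF-16 char, so characters stored as surrogate pairs need XmlConvert.IsXmlSurrogatePair instead.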

public static string RemoveInvalidXmlChars(string content)
{
   return new string(content.Where(ch => System.Xml.XmlConvert.IsXmlChar(ch)).ToArray());
}

or you may check that all characters are XML-valid:

public static bool CheckValidXmlChars(string content)
{
   return content.All(ch => System.Xml.XmlConvert.IsXmlChar(ch));
}

For example, the vertical tab character (\v) is valid in UTF-8 but not in XML 1.0, and many libraries (including libxml2) miss it and silently output invalid XML.

Solution 8 - Xml

Another easy way to escape potentially unwanted XML / XHTML chars in C# is:

WebUtility.HtmlEncode(stringWithStrangeChars)

Solution 9 - Xml

For Java folks, Apache has a utility class (StringEscapeUtils) that has a helper method escapeXml which can be used for escaping characters in a string using XML entities.

Solution 10 - Xml

"XmlWriter and lower ASCII characters" worked for me

// No commas inside the character class: a comma there would be matched (and removed) literally.
string code = Regex.Replace(item.Code, @"[\u0000-\u0008\u000B\u000C\u000E-\u001F]", "");

Solution 11 - Xml

In summary, valid characters in the text are:

  • tab, line-feed and carriage-return.
  • all non-control characters are valid except & and <.
  • > is not valid if following ]].

Sections 2.2 and 2.4 of the XML specification provide the answer in detail:

Characters

> Legal characters are tab, carriage return, line feed, and the legal characters of Unicode and ISO/IEC 10646

Character data

> The ampersand character (&) and the left angle bracket (<) must not appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they must be escaped using either numeric character references or the strings "&amp;" and "&lt;" respectively. The right angle bracket (>) may be represented using the string "&gt;", and must, for compatibility, be escaped using either "&gt;" or a character reference when it appears in the string "]]>" in content, when that string is not marking the end of a CDATA section.

Solution 12 - Xml

In the Woodstox XML processor, invalid characters are classified by this code:

if (c == 0) {
    throw new IOException("Invalid null character in text to output");
}
if (c < ' ' || (c >= 0x7F && c <= 0x9F)) {
    String msg = "Invalid white space character (0x" + Integer.toHexString(c) + ") in text to output";
    if (mXml11) {
        msg += " (can only be output using character entity)";
    }
    throw new IOException(msg);
}
if (c > 0x10FFFF) {
    throw new IOException("Illegal unicode character point (0x" + Integer.toHexString(c) + ") to output; max is 0x10FFFF as per RFC");
}
/*
 * Surrogate pair in non-quotable (not text or attribute value) content, and non-unicode encoding (ISO-8859-x,
 * Ascii)?
 */
if (c >= SURR1_FIRST && c <= SURR2_LAST) {
    throw new IOException("Illegal surrogate pair -- can only be output via character entities, which are not allowed in this content");
}
throw new IOException("Invalid XML character (0x"+Integer.toHexString(c)+") in text to output");

Source from here

Solution 13 - Xml

ampersand (&) is escaped to &amp;

double quotes (") are escaped to &quot;

single quotes (') are escaped to &apos; 

less than (<) is escaped to &lt; 

greater than (>) is escaped to &gt;

In C#, use System.Security.SecurityElement.Escape or System.Net.WebUtility.HtmlEncode to escape these illegal characters.

string xml = "<node>it's my \"node\" & i like it 0x12 x09 x0A  0x09 0x0A <node>";
string encodedXml1 = System.Security.SecurityElement.Escape(xml);
string encodedXml2= System.Net.WebUtility.HtmlEncode(xml);
    
   
encodedXml1
"&lt;node&gt;it&apos;s my &quot;node&quot; &amp; i like it 0x12 x09 x0A  0x09 0x0A &lt;node&gt;"
        
encodedXml2
"&lt;node&gt;it&#39;s my &quot;node&quot; &amp; i like it 0x12 x09 x0A  0x09 0x0A &lt;node&gt;"

Solution 14 - Xml

Has anyone tried System.Security.SecurityElement.Escape(yourstring)? It replaces the XML special characters (<, >, &, " and ') in a string with their escaped equivalents.

Solution 15 - Xml

For XSL (on really lazy days) I use:

capture="&amp;(?!amp;)" capturereplace="&amp;amp;"

to translate all &-signs that aren't followed by amp; into proper ones.

We have cases where the input is in CDATA but the system which uses the XML doesn't take it into account. It's a sloppy fix, beware...
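The same negative-lookahead idea can be tried outside XSL. The Java sketch below is an illustration with hypothetical names (BareAmpersands, fix), not the configuration used above. It goes one step further than the pattern in this answer by also leaving the other predefined entities and numeric character references alone, which is an added assumption about what the input may contain.

```java
import java.util.regex.Pattern;

final class BareAmpersands {
    // An ampersand that does not start a predefined entity
    // or a decimal/hexadecimal character reference.
    private static final Pattern BARE_AMP = Pattern.compile(
        "&(?!(?:amp|lt|gt|quot|apos);|#[0-9]+;|#x[0-9A-Fa-f]+;)");

    /** Replaces every bare ampersand with &amp; and leaves existing references alone. */
    static String fix(String text) {
        return BARE_AMP.matcher(text).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        System.out.println(fix("Tom & Jerry &amp; friends"));
    }
}
```

Like the original, this is a heuristic: it cannot tell whether text such as &lt; was meant literally, so it is best kept as a repair step for input that is known to be inconsistently escaped.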

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

| Content Type | Original Author | Original Content on Stackoverflow |
| --- | --- | --- |
| Question | RailsSon | View Question on Stackoverflow |
| Solution 1 - Xml | potame | View Answer on Stackoverflow |
| Solution 2 - Xml | dolmen | View Answer on Stackoverflow |
| Solution 3 - Xml | Welbog | View Answer on Stackoverflow |
| Solution 4 - Xml | mathifonseca | View Answer on Stackoverflow |
| Solution 5 - Xml | cgp | View Answer on Stackoverflow |
| Solution 6 - Xml | bvdb | View Answer on Stackoverflow |
| Solution 7 - Xml | Alex Vazhev | View Answer on Stackoverflow |
| Solution 8 - Xml | tiands | View Answer on Stackoverflow |
| Solution 9 - Xml | A Null Pointer | View Answer on Stackoverflow |
| Solution 10 - Xml | Kalpesh Popat | View Answer on Stackoverflow |
| Solution 11 - Xml | rghome | View Answer on Stackoverflow |
| Solution 12 - Xml | Jerome Saint-Yves | View Answer on Stackoverflow |
| Solution 13 - Xml | live-love | View Answer on Stackoverflow |
| Solution 14 - Xml | klaydze | View Answer on Stackoverflow |
| Solution 15 - Xml | Samson Wiklund | View Answer on Stackoverflow |