Why is "&reg" being rendered as "®" without the bounding semicolon


Html Problem Overview


I've been running into a problem that was revealed through our Google AdWords-driven marketing campaign. One of the standard parameters used is "region". When a user searches and clicks on a sponsored link, Google generates a long URL to track the click and sends a bunch of stuff along in the referrer. We capture this for our records, and we've noticed that the "region" parameter is coming through incorrectly. What should be

http://ravercats.com/meow?foo=bar&region=catnip

is instead coming through as:

http://ravercats.com/meow?foo=bar®ion=catnip

I've verified that this occurs in all browsers. It's my understanding that HTML entity syntax is defined as follows:

&VALUE;

where the leading boundary is the ampersand and the closing boundary is the semicolon. Seems straightforward enough. The problem is that this isn't being respected for the &reg entity (®), and it's wreaking all kinds of havoc throughout our system.

Does anyone know why this is occurring? Is it a bug in the DTD? (I'm looking for the current HTML DTD to see if I can make sense of it) I'm trying to figure out what would be common across browsers to make this happen, thus my looking for the DTD.

Here is a proof you can use. Take this code, make an HTML file out of it and render it in a browser:

<html>
<a href="http://foo.com/bar?foo=bar&region=US&register=lowpass&reg_test=fail&trademark=correct">http://foo.com/bar?foo=bar&region=US&register=lowpass&reg_test=fail&trademark=correct</a>
</html>

EDIT: To everyone who's suggesting that I need to escape the entire URL, the example URLs above are exactly that, examples. The real URL is coming directly from Google and I have no control over how it is constructed. These suggestions, while valid, don't answer the question: "Why is this happening".

Html Solutions


Solution 1 - Html

Although valid character references always have a semicolon at the end, some invalid named character references without a semicolon are, for backward compatibility reasons, recognized by modern browsers' HTML parsers.

Either know that entire list, or follow the HTML5 rules for when & is valid without being escaped (e.g. when followed by a space); otherwise, always escape & as &amp; whenever in doubt.

For reference, the full list of named character references that are recognized without a semicolon is:

> AElig, AMP, Aacute, Acirc, Agrave, Aring, Atilde, Auml, COPY, Ccedil, ETH, Eacute, Ecirc, Egrave, Euml, GT, Iacute, Icirc, Igrave, Iuml, LT, Ntilde, Oacute, Ocirc, Ograve, Oslash, Otilde, Ouml, QUOT, REG, THORN, Uacute, Ucirc, Ugrave, Uuml, Yacute, aacute, acirc, acute, aelig, agrave, amp, aring, atilde, auml, brvbar, ccedil, cedil, cent, copy, curren, deg, divide, eacute, ecirc, egrave, eth, euml, frac12, frac14, frac34, gt, iacute, icirc, iexcl, igrave, iquest, iuml, laquo, lt, macr, micro, middot, nbsp, not, ntilde, oacute, ocirc, ograve, ordf, ordm, oslash, otilde, ouml, para, plusmn, pound, quot, raquo, reg, sect, shy, sup1, sup2, sup3, szlig, thorn, times, uacute, ucirc, ugrave, uml, uuml, yacute, yen, yuml

Note, however, that in an attribute value (and only there), the named character references in the above list are not processed as such by conforming HTML5 parsers if the next character is a = or an alphanumeric ASCII character.

For the full list of named character references with or without ending semicolons, see the named character references table in the HTML specification.
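As a way to cross-check the list above, here is a small sketch in Python. It assumes that the standard library's `html.entities.html5` table mirrors the HTML5 named character references, with semicolon-less names stored as keys that lack the trailing `;`. The variable name `legacy_names` is made up for this illustration.

```python
from html.entities import html5

# html5 maps reference names to characters. A name that is recognized
# without a trailing semicolon appears as a key without ";".
legacy_names = sorted(name for name in html5 if not name.endswith(";"))

print(", ".join(legacy_names))

# "reg" is in the semicolon-less set, which is why &region is affected.
print("reg" in legacy_names)
# "trade" is not ("trade;" exists only with its semicolon), so &trademark is safe.
print("trade" in legacy_names)
```

Any query parameter whose name starts with one of the printed names (region, notify, copyright, and so on) is a candidate for the same problem when the URL is written unescaped into text content.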

Solution 2 - Html

This is a very messy business and depends on context (text content vs. attribute value).

Formally, by HTML specs up to and including HTML 4.01, an entity reference may appear without trailing semicolon, if the next character is not a name character. So e.g. &region= would be syntactically correct but undefined, as entity region has not been defined. XHTML makes the trailing semicolon required.

Browsers have traditionally played by other rules, though. Due to the common syntax of query URLs, they parse e.g. href="http://ravercats.com/meow?foo=bar&region=catnip" so that &region is not treated as an entity reference but as just text data. And authors mostly used such constructs, even though they are formally incorrect.

Contrary to what the question seems to be saying, href="http://ravercats.com/meow?foo=bar&region=catnip" actually works well. Problems arise when the string is not in an attribute value but inside text content, which is rather uncommon: we don’t normally write URLs in text. In text, &region= gets processed so that &reg is recognized as an entity reference (for “®”) and the rest is just character data. Such odd behavior is being made official in HTML5 CR, where clause 8.2.4.69 Tokenizing character references describes the “double standard”:

> If the character reference is being consumed as part of an attribute, and the last character matched is not a ";" (U+003B) character, and the next character is either a "=" (U+003D) character or in the range ASCII digits, uppercase ASCII letters, or lowercase ASCII letters, then, for historical reasons, all the characters that were matched after the U+0026 AMPERSAND character (&) must be unconsumed, and nothing is returned.

Thus, in an attribute value, even &reg= would not be treated as containing a character reference, and still less &region=. (But reg_test= is a different case, due to the underscore character.)

In text content, other rules apply. There, the construct &region= causes a parse error (by HTML5 CR rules), but with well-defined error handling: &reg is recognized as a character reference.
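The difference between the two contexts can be illustrated with a simplified Python sketch. This is not a full HTML tokenizer: it handles named references only, and the function name `decode_named_refs` and its `in_attribute` flag are invented for this example. It assumes Python's `html.entities.html5` table as the source of reference names.

```python
import re
from html.entities import html5  # contains both "reg" and "reg;" style keys

# Longest names first, so the longest possible name is matched
# (for example "&notin;" wins over "&not").
_NAMES = sorted(html5, key=len, reverse=True)
_NAMED_REF = re.compile("&(" + "|".join(re.escape(n) for n in _NAMES) + ")")


def decode_named_refs(s: str, in_attribute: bool) -> str:
    """Decode named character references in s (simplified sketch)."""

    def replace(m: re.Match) -> str:
        name = m.group(1)
        nxt = s[m.end():m.end() + 1]  # next character, or "" at end of input
        followed_by_name_char = nxt == "=" or (nxt.isascii() and nxt.isalnum())
        if in_attribute and not name.endswith(";") and followed_by_name_char:
            # Historical attribute rule: leave the matched text untouched.
            return m.group(0)
        return html5[name]

    return _NAMED_REF.sub(replace, s)


query = "foo=bar&region=US&register=lowpass&reg_test=fail&trademark=correct"
print(decode_named_refs(query, in_attribute=False))
print(decode_named_refs(query, in_attribute=True))
```

In the text-content call, &region, &register and &reg_test all turn into ® followed by the rest of the name. In the attribute call, only &reg_test is affected, because the underscore is neither "=" nor an ASCII letter or digit.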

Solution 3 - Html

Maybe try replacing your & with &amp;? Ampersands must be escaped in HTML as well, because they are reserved to mark the start of entities.

Solution 4 - Html

1: The following markup is invalid in the first place (use the W3C Markup Validation Service to verify):

<a href="http://foo.com/bar?foo=bar&region=US&register=lowpass&reg_test=fail&trademark=correct"></a>

In the above example, the & character should be encoded as &amp;, like so:

<a href="http://foo.com/bar?foo=bar&amp;region=US&amp;register=lowpass&amp;reg_test=fail&amp;trademark=correct"></a>

2: Browsers are tolerant; they try to make sense out of broken HTML. In your case, everything that looks like a valid HTML entity is converted to the corresponding character.

Solution 5 - Html

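Here is a simple workaround, though it may not work in all instances: reorder the query string so that region is the first parameter. It then follows the ? rather than an &.

So go from this: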

http://ravercats.com/meow?status=Online&region=Atlantis

To This:

http://ravercats.com/meow?region=Atlantis&status=Online

This works because &reg, as we know, triggers the special character ®, whereas ?region contains no ampersand.

Caveat: If you have no control over the order of your URL query string parameters then you'll have to change your variable name to something else.
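If you do control how the URL is built, the reordering could be sketched like this in Python, using only `urllib.parse` from the standard library. The helper name `move_param_first` is hypothetical. Note that `urlencode` re-encodes the values, so unusual characters may be percent-encoded differently than in the input.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def move_param_first(url: str, name: str) -> str:
    """Return url with the query parameter `name` moved to the front."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    # Stable sort: pairs named `name` sort first, the rest keep their order.
    pairs.sort(key=lambda pair: pair[0] != name)
    return urlunsplit(parts._replace(query=urlencode(pairs)))


print(move_param_first(
    "http://ravercats.com/meow?status=Online&region=Atlantis", "region"))
```

This only hides the symptom for one parameter; any other parameter whose name begins with a semicolon-less entity name would still need escaping.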

Solution 6 - Html

Escape your output!

Simply put, you need to HTML-encode the URL before writing it into the page so that it is represented accurately (ideally with a template engine's variable-escaping function, but barring that, with htmlspecialchars($url) or htmlentities($url) in PHP).

See your test case and then the correctly encoded html at this jsfiddle: http://jsfiddle.net/tchalvakspam/Fp3W6/

Inactive code here:

<div>
Unescaped:
<br>
<a href="">http://foo.com/bar?foo=bar&region=US&register=lowpass&reg_test=fail&trademark=correct</a>
</div>

<div>
Correctly escaped:
<br>
http://foo.com/bar?foo=bar&amp;region=US&amp;register=lowpass&amp;reg_test=fail&amp;trademark=correct
</div>
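For readers not using PHP, the same escaping step might look like this in Python. This is a sketch that assumes the standard library's `html.escape` and `html.unescape`, where `html.unescape` stands in for what a browser does when it decodes text content.

```python
import html

url = ("http://foo.com/bar?foo=bar&region=US&register=lowpass"
       "&reg_test=fail&trademark=correct")

# html.escape replaces & with &amp; (and also escapes <, > and quotes).
escaped = html.escape(url)
link = f'<a href="{escaped}">{escaped}</a>'
print(link)

# Decoding the escaped text gives back the original URL...
print(html.unescape(escaped) == url)
# ...whereas decoding the unescaped URL corrupts it (&reg becomes ®).
print(html.unescape(url) == url)
```

The escaped form is used both in the href attribute and in the link text, so the URL survives in either context.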

Solution 7 - Html

It seems to me that what you have received from Google is not an actual URL but a variable which refers to a URL (a query string). That's why it's being parsed as the registered trademark sign when rendered.

I would say you ought to URL-encode it, and decode it whenever you process it, like any other variable containing special entities.

Solution 8 - Html

To prevent this from happening, you should encode URLs, which replaces characters like the ampersand with a % followed by a hexadecimal number.
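A minimal sketch of that idea in Python, assuming the captured URL is stored or displayed as an opaque value and decoded again before it is used. It relies on `quote` and `unquote` from the standard library's `urllib.parse`.

```python
from urllib.parse import quote, unquote

url = "http://ravercats.com/meow?foo=bar&region=catnip"

# safe="" percent-encodes every reserved character, including & (as %26).
encoded = quote(url, safe="")
print(encoded)

# Decoding restores the original URL exactly.
print(unquote(encoded) == url)
```

Because the encoded string contains no literal ampersand, an HTML parser has nothing to misread as a character reference. The trade-off is that the encoded value is no longer a usable link until it has been decoded.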

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

| Content Type | Original Author | Original Content on Stackoverflow |
| --- | --- | --- |
| Question | Spanky | View Question on Stackoverflow |
| Solution 1 - Html | Alohci | View Answer on Stackoverflow |
| Solution 2 - Html | Jukka K. Korpela | View Answer on Stackoverflow |
| Solution 3 - Html | jchapa | View Answer on Stackoverflow |
| Solution 4 - Html | Salman A | View Answer on Stackoverflow |
| Solution 5 - Html | Frank Tudor | View Answer on Stackoverflow |
| Solution 6 - Html | Kzqai | View Answer on Stackoverflow |
| Solution 7 - Html | jjyepez | View Answer on Stackoverflow |
| Solution 8 - Html | Skullyhoofd | View Answer on Stackoverflow |