Which characters need to be escaped in HTML?


Html Problem Overview


Are they the same as XML, perhaps plus the space one (&nbsp;)?

I've found some huge lists of HTML escape characters but I don't think they must be escaped. I want to know what needs to be escaped.

Html Solutions


Solution 1 - Html

If you're inserting text content in your document in a location where text content is expected [1], you typically only need to escape the same characters as you would in XML. Inside of an element, this just includes the entity escape ampersand & and the element delimiter less-than and greater-than signs < >:

& becomes &amp;
< becomes &lt;
> becomes &gt;

Inside of attribute values you must also escape the quote character you're using:

" becomes &quot;
' becomes &#39;

In some cases it may be safe to skip escaping some of these characters, but I encourage you to escape all five in all cases to reduce the chance of making a mistake.
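
As a minimal sketch of the five replacements listed above (the helper name escapeHtml is made up for illustration and is not part of any library): the ampersand is replaced first, so that the ampersands introduced by the later replacements are not escaped a second time.

```javascript
// Hypothetical helper: escapes the five characters listed above.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, '&amp;')   // must run first, see the note above
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

// The same escaped value is safe both as element text and inside a
// quoted attribute value.
const userName = 'Tom & "Jerry" <cat>';
const html = '<p title="' + escapeHtml(userName) + '">' + escapeHtml(userName) + '</p>';
console.log(html);
```

This only covers the "text content is expected" locations described in this answer. It is not suitable for script or style content, or for element and attribute names.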

If your document encoding does not support all of the characters that you're using, such as if you're trying to use emoji in an ASCII-encoded document, you also need to escape those. Most documents these days are encoded using the fully Unicode-supporting UTF-8 encoding where this won't be necessary.
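
If you do need to target an ASCII-only document, one possible approach is to replace each non-ASCII character with a numeric character reference. The sketch below assumes a hypothetical helper named escapeNonAscii. It does not escape the markup characters discussed above, and it makes no attempt to handle malformed input such as lone surrogates.

```javascript
// Hypothetical helper: replaces every non-ASCII code point with a
// hexadecimal numeric character reference such as &#x1F600;.
function escapeNonAscii(text) {
  let result = '';
  // for...of iterates by code point, so surrogate pairs stay together.
  for (const character of String(text)) {
    const codePoint = character.codePointAt(0);
    result += codePoint > 0x7f
      ? '&#x' + codePoint.toString(16).toUpperCase() + ';'
      : character;
  }
  return result;
}

console.log(escapeNonAscii('café 😀')); // caf&#xE9; &#x1F600;
```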

In general, you should not escape spaces as &nbsp;. &nbsp; is not a normal space, it's a non-breaking space. You can use these instead of normal spaces to prevent a line break from being inserted between two words, or to insert          extra        space       without it being automatically collapsed, but this is usually a rare case. Don't do this unless you have a design constraint that requires it.


[1] By "a location where text content is expected", I mean inside of an element or quoted attribute value where normal parsing rules apply. For example: <p>HERE</p> or <p title="HERE">...</p>. What I wrote above does not apply to content that has special parsing rules or meaning, such as inside of a script or style tag, or as an element or attribute name. For example: <NOT-HERE>...</NOT-HERE>, <script>NOT-HERE</script>, <style>NOT-HERE</style>, or <p NOT-HERE="...">...</p>.

In these contexts, the rules are more complicated and it's much easier to introduce a security vulnerability. I strongly discourage you from ever inserting dynamic content in any of these locations. I have seen teams of competent security-aware developers introduce vulnerabilities by assuming that they had encoded these values correctly, but missing an edge case. There's usually a safer alternative, such as putting the dynamic value in an attribute and then handling it with JavaScript.

If you must, please read the Open Web Application Security Project's XSS Prevention Rules to help understand some of the concerns you will need to keep in mind.
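
One way to picture the safer alternative mentioned above (putting the dynamic value in an attribute and handling it with JavaScript) is sketched here. This is only an illustration under assumptions: the names escapeAttribute and renderUserData, the data-user attribute, and the element id are all made up.

```javascript
// Hypothetical helper: escapes a string for use inside a quoted attribute value.
function escapeAttribute(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

// Server side: serialize the dynamic value as JSON and place it in a
// data attribute instead of writing it into a <script> block.
function renderUserData(user) {
  const json = JSON.stringify(user);
  return '<div id="user" data-user="' + escapeAttribute(json) + '"></div>';
}

console.log(renderUserData({ name: '</script><b>"x"' }));

// Browser side (a static script with no dynamic content in it):
//   const user = JSON.parse(document.getElementById('user').dataset.user);
```

The point of the design is that the dynamic value only ever appears in a quoted attribute value, where the ordinary five-character escaping applies, and the script itself stays static.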

Solution 2 - Html

It depends upon the context. Some possible contexts in HTML:

  • document body
  • inside common attributes
  • inside script tags
  • inside style tags
  • several more!

See OWASP's Cross Site Scripting Prevention Cheat Sheet, especially the "Why Can't I Just HTML Entity Encode Untrusted Data?" and "XSS Prevention Rules" sections. However, it's best to read the whole document.

Solution 3 - Html

Basically, there are three main characters that should always be escaped in your HTML and XML files so that they don't interfere with the surrounding markup. As you would probably expect, two of them are the tag delimiters < and >. All three are listed below:

 1)  &lt; (<)
    
 2)  &gt; (>)
    
 3)  &amp; (&)

We may also write the double quote (") as &quot; and the single quote (') as &apos;

Avoid putting dynamic content in <script> and <style>. These rules do not apply to them. For example, if you have to include JSON in a

Solution 4 - Html

The exact answer depends on the context. In general, these characters must not be present (HTML 5.2 §3.2.4.2.5):

> Text nodes and attribute values must consist of Unicode characters, must not contain U+0000 characters, must not contain permanently undefined Unicode characters (noncharacters), and must not contain control characters other than space characters. This specification includes extra constraints on the exact value of Text nodes and attribute values depending on their precise context.
>
> For elements in HTML, the constraints of the Text content model also depends on the kind of element. For instance, an "<" inside a textarea element does not need to be escaped in HTML because textarea is an escapable raw text element.

These restrictions are scattered across the specification. E.g., attribute values (§8.1.2.3) must not contain an ambiguous ampersand and be either (i) empty, (ii) within single quotes (and thus must not contain U+0027 APOSTROPHE character '), (iii) within double quotes (must not contain U+0022 QUOTATION MARK character "), or (iv) unquoted — with the following restrictions:

> ... must not contain any literal space characters, any U+0022 QUOTATION MARK characters ("), U+0027 APOSTROPHE characters ('), U+003D EQUALS SIGN characters (=), U+003C LESS-THAN SIGN characters (<), U+003E GREATER-THAN SIGN characters (>), or U+0060 GRAVE ACCENT characters (`), and must not be the empty string.
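
To make the unquoted case concrete, here is a rough check built directly from the restrictions quoted above. The function name canBeUnquoted is hypothetical, and the check deliberately ignores the separate ambiguous-ampersand rule, so treat it as a sketch rather than a validator.

```javascript
// Hypothetical check based on the quoted restrictions: the value must be
// non-empty and contain none of the space characters (space, tab, LF, FF,
// CR), nor any of " ' = < > or `.
// It does not check for ambiguous ampersands.
function canBeUnquoted(value) {
  return /^[^ \t\n\f\r"'=<>`]+$/.test(value);
}

console.log(canBeUnquoted('main-nav'));   // true
console.log(canBeUnquoted('two words'));  // false
console.log(canBeUnquoted('a=b'));        // false
console.log(canBeUnquoted(''));           // false
```

In practice it is usually simpler to always quote attribute values and escape the quote character you are using.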

Solution 5 - Html

If you want to escape a string of markup using JavaScript there is:

or, if you don't want to pull in a dependency, here is the same thing, though slightly slower because it uses split/map/join instead of charCodeAt/substring.

function escapeMarkup (dangerousInput) {
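  // Coerce to a string so that numbers, null, and so on are handled uniformly.
  const dangerousString = String(dangerousInput);
  const matchHtmlRegExp = /["'&<>]/;
  // Fast path: nothing needs escaping, so return the string unchanged.
  if (!matchHtmlRegExp.test(dangerousString)) {
    return dangerousString;
  }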

  const encodedSymbolMap = {
    '"': '&quot;',
    '\'': '&#39;',
    '&': '&amp;',
    '<': '&lt;',
    '>': '&gt;'
  };
  const dangerousCharacters = dangerousString.split('');
  const safeCharacters = dangerousCharacters.map(function (character) {
    return encodedSymbolMap[character] || character;
  });
  const safeString = safeCharacters.join('');
  return safeString;
}

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Ahmet | View Question on Stackoverflow
Solution 1 - Html | Jeremy | View Answer on Stackoverflow
Solution 2 - Html | daxelrod | View Answer on Stackoverflow
Solution 3 - Html | Alireza | View Answer on Stackoverflow
Solution 4 - Html | Andrey | View Answer on Stackoverflow
Solution 5 - Html | Jaredcheeda | View Answer on Stackoverflow