What is the recommended way to escape HTML symbols in plain Java?

JavaHtmlEscaping

Java Problem Overview


Is there a recommended way to escape <, >, " and & characters when outputting HTML in plain Java code? (Other than manually doing the following, that is).

String source = "The less than sign (<) and ampersand (&) must be escaped before using them in HTML";
String escaped = source.replace("<", "&lt;").replace("&", "&amp;"); // ...

Java Solutions


Solution 1 - Java

StringEscapeUtils from Apache Commons Lang:

import static org.apache.commons.lang.StringEscapeUtils.escapeHtml;
// ...
String source = "The less than sign (<) and ampersand (&) must be escaped before using them in HTML";
String escaped = escapeHtml(source);

For version 3:

import static org.apache.commons.lang3.StringEscapeUtils.escapeHtml4;
// ...
String escaped = escapeHtml4(source);

Solution 2 - Java

An alternative to Apache Commons: Use Spring's HtmlUtils.htmlEscape(String input) method.

Solution 3 - Java

Nice short method:

public static String escapeHTML(String s) {
	StringBuilder out = new StringBuilder(Math.max(16, s.length()));
	for (int i = 0; i < s.length(); i++) {
		char c = s.charAt(i);
		if (c > 127 || c == '"' || c == '\'' || c == '<' || c == '>' || c == '&') {
			out.append("&#");
			out.append((int) c);
			out.append(';');
		} else {
			out.append(c);
		}
	}
	return out.toString();
}

Based on https://stackoverflow.com/a/8838023/1199155 (the amp is missing there). The four characters checked in the if clause are the only ones below 128, according to http://www.w3.org/TR/html4/sgml/entities.html

Solution 4 - Java

There is a newer version of the Apache Commons Lang library and it uses a different package name (org.apache.commons.lang3). The StringEscapeUtils now has different static methods for escaping different types of documents (http://commons.apache.org/proper/commons-lang/javadocs/api-3.0/index.html). So to escape HTML version 4.0 string:

import static org.apache.commons.lang3.StringEscapeUtils.escapeHtml4;

String output = escapeHtml4("The less than sign (<) and ampersand (&) must be escaped before using them in HTML");

Solution 5 - Java

For those who use Google Guava:

import com.google.common.html.HtmlEscapers;
[...]
String source = "The less than sign (<) and ampersand (&) must be escaped before using them in HTML";
String escaped = HtmlEscapers.htmlEscaper().escape(source);

Solution 6 - Java

On android (API 16 or greater) you can:

Html.escapeHtml(textToScape);

or for lower API:

TextUtils.htmlEncode(textToScape);

Solution 7 - Java

Be careful with this. There are a number of different 'contexts' within an HTML document: Inside an element, quoted attribute value, unquoted attribute value, URL attribute, javascript, CSS, etc... You'll need to use a different encoding method for each of these to prevent Cross-Site Scripting (XSS). Check the OWASP XSS Prevention Cheat Sheet for details on each of these contexts. You can find escaping methods for each of these contexts in the OWASP ESAPI library -- https://github.com/ESAPI/esapi-java-legacy.

Solution 8 - Java

For some purposes, HtmlUtils:

import org.springframework.web.util.HtmlUtils;
[...]
HtmlUtils.htmlEscapeDecimal("&"); //gives &#38;
HtmlUtils.htmlEscape("&"); //gives &amp;

Solution 9 - Java

org.apache.commons.lang3.StringEscapeUtils is now deprecated. You must now use org.apache.commons.text.StringEscapeUtils by

	<dependency>
		<groupId>org.apache.commons</groupId>
		<artifactId>commons-text</artifactId>
		<version>${commons.text.version}</version>
	</dependency>

Solution 10 - Java

While @dfa answer of org.apache.commons.lang.StringEscapeUtils.escapeHtml is nice and I have used it in the past it should not be used for escaping HTML (or XML) attributes otherwise the whitespace will be normalized (meaning all adjacent whitespace characters become a single space).

I know this because I have had bugs filed against my library (JATL) for attributes where whitespace was not preserved. Thus I have a drop in (copy n' paste) class (of which I stole some from JDOM) that differentiates the escaping of attributes and element content.

While this may not have mattered as much in the past (proper attribute escaping) it is increasingly become of greater interest given the use use of HTML5's data- attribute usage.

Solution 11 - Java

Java 8+ Solution:

public static String escapeHTML(String str) {
    return str.chars().mapToObj(c -> c > 127 || "\"'<>&".indexOf(c) != -1 ?
       "&#" + c + ";" : String.valueOf((char) c)).collect(Collectors.joining());
}

String#chars returns an IntStream of the char values from the String. We can then use mapToObj to escape the characters with a character code greater than 127 (non-ASCII characters) as well as the double quote ("), single quote ('), left angle bracket (<), right angle bracket (>), and ampersand (&). Collectors.joining concatenates the Strings back together.

To better handle Unicode characters, String#codePoints can be used instead.

public static String escapeHTML(String str) {
    return str.codePoints().mapToObj(c -> c > 127 || "\"'<>&".indexOf(c) != -1 ?
            "&#" + c + ";" : new String(Character.toChars(c)))
       .collect(Collectors.joining());
}

Solution 12 - Java

The most of libraries offer escaping everything they can including hundreds of symbols and thousands of non-ASCII characters which is not what you want in UTF-8 world.

Also, as Jeff Williams noted, there's no single “escape HTML” option, there are several contexts.

Assuming you never use unquoted attributes, and keeping in mind that different contexts exist, it've written my own version:

private static final long TEXT_ESCAPE =
        1L << '&' | 1L << '<';
private static final long DOUBLE_QUOTED_ATTR_ESCAPE =
        TEXT_ESCAPE | 1L << '"';
private static final long SINGLE_QUOTED_ATTR_ESCAPE =
        TEXT_ESCAPE | 1L << '\'';
private static final long ESCAPES =
        DOUBLE_QUOTED_ATTR_ESCAPE | SINGLE_QUOTED_ATTR_ESCAPE;

// 'quot' and 'apos' are 1 char longer than '#34' and '#39'
// which I've decided to use
private static final String REPLACEMENTS = "&#34;&amp;&#39;&lt;";
private static final int REPL_SLICES = /*  [0,   5,   10,  15, 19) */
        5<<5 | 10<<10 | 15<<15 | 19<<20;
// These 5-bit numbers packed into a single int
// are indices within REPLACEMENTS which is a 'flat' String[]

private static void appendEscaped(
        Appendable builder, CharSequence content, long escapes) {
    try {
        int startIdx = 0, len = content.length();
        for (int i = 0; i < len; i++) {
            char c = content.charAt(i);
            long one;
            if (((c & 63) == c) && ((one = 1L << c) & escapes) != 0) {
            // -^^^^^^^^^^^^^^^   -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
            // |                  | take only dangerous characters
            // | java shifts longs by 6 least significant bits,
            // | e. g. << 0b110111111 is same as >> 0b111111.
            // | Filter out bigger characters

                int index = Long.bitCount(ESCAPES & (one - 1));
                builder.append(content, startIdx, i /* exclusive */).append(
                        REPLACEMENTS,
                        REPL_SLICES >>> (5 * index) & 31,
                        REPL_SLICES >>> (5 * (index + 1)) & 31
                );
                startIdx = i + 1;
            }
        }
        builder.append(content, startIdx, len);
    } catch (IOException e) {
        // typically, our Appendable is StringBuilder which does not throw;
        // also, there's no way to declare 'if A#append() throws E,
        // then appendEscaped() throws E, too'
        throw new UncheckedIOException(e);
    }
}

Consider copy-pasting from Gist without line length limit.

UPD: As another answer suggests, > escaping is not necessary; also, " within attr='…' is allowed, too. I've updated the code accordingly.

You may check it out yourself:

<!DOCTYPE html>
<html lang="en">
<head><title>Test</title></head>
<body>

<p title="&lt;&#34;I'm double-quoted!&#34;>">&lt;"Hello!"></p>
<p title='&lt;"I&#39;m single-quoted!">'>&lt;"Goodbye!"></p>

</body>
</html>

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionBen LingsView Question on Stackoverflow
Solution 1 - JavadfaView Answer on Stackoverflow
Solution 2 - JavaAdamskiView Answer on Stackoverflow
Solution 3 - JavaBruno EberhardView Answer on Stackoverflow
Solution 4 - JavaMartin DimitrovView Answer on Stackoverflow
Solution 5 - JavaokraszView Answer on Stackoverflow
Solution 6 - JavaOriolJView Answer on Stackoverflow
Solution 7 - JavaJeff WilliamsView Answer on Stackoverflow
Solution 8 - JavaAUUView Answer on Stackoverflow
Solution 9 - JavaLuca StancapianoView Answer on Stackoverflow
Solution 10 - JavaAdam GentView Answer on Stackoverflow
Solution 11 - JavaUnmitigatedView Answer on Stackoverflow
Solution 12 - JavaMiha_x64View Answer on Stackoverflow