Why can some ASCII characters not be expressed in the form '\uXXXX' in Java source code?

Java Problem Overview


I stumbled over this (again) today:

class Test {
    char ok = '\n';
    char okAsWell = '\u000B';
    char error = '\u000A';
}

It does not compile:

> Invalid character constant in line 4.

The compiler seems to insist that I write '\n' instead. I see no reason for this, yet it's very annoying.

Is there a logical explanation why characters that have a special notation (like \t, \n, \r) must be expressed in that form in Java source?

Java Solutions


Solution 1 - Java

Unicode escapes are replaced by the characters they denote before the rest of the source is analysed, so the compiler effectively sees your line as:

char error = '
';

which is not valid Java, because a character literal cannot contain a line break.

This is dictated by the Language Specification:

> A compiler for the Java programming language ("Java compiler") first recognizes Unicode escapes in its input, translating the ASCII characters \u followed by four hexadecimal digits to the UTF-16 code unit (§3.1) of the indicated hexadecimal value, and passing all other characters unchanged. Representing supplementary characters requires two consecutive Unicode escapes. This translation step results in a sequence of Unicode input characters.
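
To make the quoted translation step concrete, here is a rough model of it in plain Java. It is a sketch, not how javac is implemented: the class and method names are made up for illustration, and malformed escapes are simply copied through, whereas a real compiler rejects them. It does follow the JLS rules that a backslash only starts an escape when it is preceded by an even number of backslashes, and that more than one u may follow it.

```java
public class UnicodeEscapeTranslation {

    // Simplified model of the first lexical translation step (JLS 3.3).
    // Assumption: malformed escapes are copied through unchanged here,
    // whereas a real compiler reports an error for them.
    static String translate(String source) {
        StringBuilder out = new StringBuilder();
        int backslashes = 0; // length of the backslash run directly before index i
        int i = 0;
        while (i < source.length()) {
            char c = source.charAt(i);
            boolean eligible = c == '\\' && backslashes % 2 == 0;
            if (eligible && i + 1 < source.length() && source.charAt(i + 1) == 'u') {
                int j = i + 1;
                while (j < source.length() && source.charAt(j) == 'u') {
                    j++; // the JLS allows one or more 'u' characters
                }
                if (j + 4 <= source.length()
                        && source.substring(j, j + 4).matches("[0-9a-fA-F]{4}")) {
                    // Replace the whole escape with the UTF-16 code unit it names.
                    out.append((char) Integer.parseInt(source.substring(j, j + 4), 16));
                    i = j + 4;
                    backslashes = 0;
                    continue;
                }
            }
            backslashes = (c == '\\') ? backslashes + 1 : 0;
            out.append(c);
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The string literal holds the six characters backslash, u, 0, 0, 0, A.
        String line = "char error = '\\u000A';";
        System.out.println(translate(line));
    }
}
```

Running main prints the declaration split across two lines, which is the text the later compiler phases actually get to see.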

This early translation can lead to surprising results. For example, the following is a valid Java program (it contains hidden Unicode characters), courtesy of Peter Lawrey:

public static void main(String[] args) {
    for (char c‮h = 0; c‮h < Character.MAX_VALUE; c‮h++) {
        if (Character.isJavaIdentifierPart(c‮h) && !Character.isJavaIdentifierStart(c‮h)) {
            System.out.printf("%04x <%s>%n", (int) c‮h, "" + c‮h);
        }
    }
}

Solution 2 - Java

Unicode escape sequences such as \u000a are replaced by the characters they represent before the Java compiler does anything else with the source code. Your program therefore ends up as

char ch = '
';

The \u000a in your source code is replaced by a line feed character before the compiler tokenizes the source, so the character literal ends up split across two lines.
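
Because the replacement happens this early, it also applies inside comments. The sketch below (class and method names are made up for illustration) relies on that: the escape ends the line comment, so the statement after it is compiled as ordinary code.

```java
public class CommentEscape {

    static int count() {
        int n = 0;
        // This comment ends at the escape: \u000A n++;
        return n;
    }

    public static void main(String[] args) {
        System.out.println(count());
    }
}
```

Here count() returns 1 rather than 0, since the increment sits on its own line once the escape has been translated.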

Referring to the Java Language Specification:

> It is a compile-time error for a line terminator (§3.4) to appear after the opening ' and before the closing '.

And as we all know by heart, LF (the character that \n and \u000a both denote) is a line terminator, quoting §3.4:

 LineTerminator:
    the ASCII LF character, also known as "newline"
    the ASCII CR character, also known as "return"
    the ASCII CR character followed by the ASCII LF character

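Other characters that cause similar problems when written as Unicode escapes include the carriage return (\u000D), the backslash (\u005C), the single quote (\u0027) and the double quote (\u0022).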
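
For reference, these characters can still be written safely with escape sequences or numeric constants, which are handled after the Unicode translation step. A minimal sketch (the class and constant names are made up for illustration):

```java
public class SafeCharLiterals {

    // Escape sequences are processed when the literal itself is read,
    // long after Unicode escapes have been translated.
    static final char LINE_FEED = '\n';
    static final char CARRIAGE_RETURN = '\r';
    static final char BACKSLASH = '\\';
    static final char SINGLE_QUOTE = '\'';
    static final char DOUBLE_QUOTE = '"';

    // A constant int that fits into a char may be assigned directly.
    static final char LINE_FEED_HEX = 0x000A;

    // Octal escape: 12 in octal is 10 in decimal.
    static final char LINE_FEED_OCTAL = '\12';

    public static void main(String[] args) {
        System.out.println((int) LINE_FEED);
        System.out.println((int) BACKSLASH);
    }
}
```

If the hexadecimal code point should stay visible in the source, the numeric form is the closest equivalent to the escape notation.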

Solution 3 - Java

I think the reason is that \uXXXX sequences are expanded before the code is tokenized and parsed; see JLS §3.2, Lexical Translations.

Solution 4 - Java

This is described in JLS §3.3, Unicode Escapes: http://docs.oracle.com/javase/specs/jls/se7/html/jls-3.html. javac first finds the \uXXXX sequences in the .java file and replaces them with the actual characters, and only then compiles. In the case of

char error = '\u000A';

\u000A will be replaced with the newline character (code 10), and the actual text will be

char error = '
';

Solution 5 - Java

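Because the compiler treats Unicode escapes exactly as if the characters they denote had been typed directly into the source.

For example, this is valid code; it declares a class named É: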

 class \u00C9 {}
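
The same holds for keywords and ordinary ASCII identifiers. In the sketch below (names are made up for illustration), \u0069 is the letter i and \u0061 is the letter a, so the compiler sees the keyword int and a variable called value:

```java
public class EscapedTokens {

    // The return type below is the keyword int with its first letter escaped.
    static \u0069nt answer() {
        int v\u0061lue = 42; // declares a variable named value
        return value;        // the same identifier, typed directly
    }

    public static void main(String[] args) {
        System.out.println(answer());
    }
}
```

The escaped and unescaped spellings refer to the same tokens, which is why the method can declare the variable one way and return it the other way.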

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Durandal | View Question on Stackoverflow
Solution 1 - Java | assylias | View Answer on Stackoverflow
Solution 2 - Java | poitroae | View Answer on Stackoverflow
Solution 3 - Java | NPE | View Answer on Stackoverflow
Solution 4 - Java | Evgeniy Dorofeev | View Answer on Stackoverflow
Solution 5 - Java | McDowell | View Answer on Stackoverflow