Why can some ASCII characters not be expressed in the form '\uXXXX' in Java source code?
JavaJava Problem Overview
I stumbled over this (again) today:
class Test {
char ok = '\n';
char okAsWell = '\u000B';
char error = '\u000A';
}
It does not compile:
> Invalid character constant in line 4.
The compiler seems to insist that I write '\n' instead. I see no reason for this, yet it's very annoying.
Is there a logical explanation why characters that have a special notation (like \t
, \n
, \r
) must be expressed in that form in Java source?
Java Solutions
Solution 1 - Java
Unicode characters are replaced by their value, so your line is replaced by the compiler with:
char error = '
';
which is not a valid Java statement.
This is dictated by the Language Specification:
> A compiler for the Java programming language ("Java compiler") first recognizes Unicode escapes in its input, translating the ASCII characters \u followed by four hexadecimal digits to the UTF-16 code unit (§3.1) of the indicated hexadecimal value, and passing all other characters unchanged. Representing supplementary characters requires two consecutive Unicode escapes. This translation step results in a sequence of Unicode input characters.
This can lead to surprising stuff, for example, this is a valid Java program (it contains hidden unicode characters) - courtesy of Peter Lawrey:
public static void main(String[] args) {
for (char ch = 0; ch < Character.MAX_VALUE; ch++) {
if (Character.isJavaIdentifierPart(ch) && !Character.isJavaIdentifierStart(ch)) {
System.out.printf("%04x <%s>%n", (int) ch, "" + ch);
}
}
}
Solution 2 - Java
Unicode escape sequences like \u000a
are replaced by the actual characters they represent before the Java compiler does anything else with the source code. And so, your program eventually ends up at
char ch = '
';
So the \u000a
in your source code is replaced internally by a linefeed character. Note that this happens before the compiler actually reads and interprets your source code.
Referring to the Java Language Specification:
> It is a compile-time error for a line terminator (§3.4) to appear after the opening ' and before the closing '.
And as well all know by heart, \n
is a line terminator, quoting:
LineTerminator:
the ASCII LF character, also known as "newline"
the ASCII CR character, also known as "return"
the ASCII CR character followed by the ASCII LF character
Other symbols that could cause problems are \
, '
and "
for example.
Solution 3 - Java
I think the reason is that \uXXXX
sequences are expanded when the code is being parsed, see JLS §3.2. Lexical Translations.
Solution 4 - Java
It is described in 3.3. Unicode Escapes http://docs.oracle.com/javase/specs/jls/se7/html/jls-3.html. Javac first finds \uxxxx sequences in .java and replaces them with real characters then compiles. In case of
char error = '\u000A';
\u000A will be replace with newline
character code (10) and the actual text will be
char error = '
';
Solution 5 - Java
Because the compiler treats them the same as unescaped text.
This is valid code:
class \u00C9 {}