What are "connecting characters" in Java identifiers?

Java, Unicode, Identifier, SCJP

Java Problem Overview


I am reading for SCJP and I have a question regarding this line:

> Identifiers must start with a letter, a currency character ($), or a connecting character such as the underscore ( _ ). Identifiers cannot start with a number!

It states that a valid identifier name can start with a connecting character such as underscore. I thought underscores were the only valid option? What other connecting characters are there?

Java Solutions


Solution 1 - Java

Here is a list of connecting characters. These are characters used to connect words.

http://www.fileformat.info/info/unicode/category/Pc/list.htm

U+005F _ LOW LINE
U+203F ‿ UNDERTIE
U+2040 ⁀ CHARACTER TIE
U+2054 ⁔ INVERTED UNDERTIE
U+FE33 ︳ PRESENTATION FORM FOR VERTICAL LOW LINE
U+FE34 ︴ PRESENTATION FORM FOR VERTICAL WAVY LOW LINE
U+FE4D ﹍ DASHED LOW LINE
U+FE4E ﹎ CENTRELINE LOW LINE
U+FE4F ﹏ WAVY LOW LINE
U+FF3F ＿ FULLWIDTH LOW LINE

This compiles on Java 7.

int _, ‿, ⁀, ⁔, ︳, ︴, ﹍, ﹎, ﹏, ＿;

An example. In this case tp is the name of a column and the value for a given row.

Column<Double> ︴tp︴ = table.getColumn("tp", double.class);

double tp = row.getDouble(︴tp︴);

The following

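for (int i = Character.MIN_CODE_POINT; i <= Character.MAX_CODE_POINT; i++) {
    // Non-alphabetic code points that may start an identifier.
    if (Character.isJavaIdentifierStart(i) && !Character.isAlphabetic(i))
        // toChars, not a (char) cast, so code points above U+FFFF print correctly.
        System.out.print(new String(Character.toChars(i)) + " ");
}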

prints

$ _ ¢ £ ¤ ¥ ؋ ৲ ৳ ৻ ૱ ௹ ฿ ៛ ‿ ⁀ ⁔ ₠ ₡ ₢ ₣ ₤ ₥ ₦ ₧ ₨ ₩ ₪ ₫ € ₭ ₮ ₯ ₰ ₱ ₲ ₳ ₴ ₵ ₶ ₷ ₸ ₹ ꠸ ﷼ ︳ ︴ ﹍ ﹎ ﹏ ﹩ ＄ ＿ ￠ ￡ ￥ ￦

Solution 2 - Java

Iterate through all 65,536 char values and ask Character.isJavaIdentifierStart(c) for each one. One character this turns up is the "undertie", decimal 8255 (U+203F).
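A minimal sketch of that scan, narrowed to connector punctuation so the output stays short. The class and method names (ConnectorScan, connectorStarts) are made up for illustration, and Character.getName requires Java 7 or later.

```java
import java.util.ArrayList;
import java.util.List;

class ConnectorScan {
    // Collects every BMP char value that may start an identifier
    // and has the general category Connector_Punctuation (Pc).
    static List<Integer> connectorStarts() {
        List<Integer> found = new ArrayList<>();
        for (int c = Character.MIN_VALUE; c <= Character.MAX_VALUE; c++) {
            if (Character.isJavaIdentifierStart(c)
                    && Character.getType(c) == Character.CONNECTOR_PUNCTUATION) {
                found.add(c);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        for (int c : connectorStarts()) {
            // Print the code point, its Unicode name and its decimal value.
            System.out.printf("U+%04X %s (decimal %d)%n", c, Character.getName(c), c);
        }
    }
}
```

The exact list depends on the Unicode version your JDK ships with, so treat the output as a property of the runtime rather than of the language.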

Solution 3 - Java

The definitive specification of a legal Java identifier can be found in the Java Language Specification.
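As a rough illustration of what the specification's rule amounts to, here is a sketch that checks a string the way the JLS describes an identifier: a "Java letter" first, then "Java letters or digits". The class and method names (IdentifierCheck, looksLikeIdentifier) are hypothetical; javax.lang.model.SourceVersion is the standard library helper that does a similar check and can also reject keywords.

```java
import javax.lang.model.SourceVersion;

class IdentifierCheck {
    // Hand-rolled check: the first code point must satisfy isJavaIdentifierStart,
    // every following code point must satisfy isJavaIdentifierPart.
    // Keywords and literals such as "class" or "true" are NOT rejected here.
    static boolean looksLikeIdentifier(String s) {
        if (s.isEmpty()) {
            return false;
        }
        int first = s.codePointAt(0);
        if (!Character.isJavaIdentifierStart(first)) {
            return false;
        }
        for (int i = Character.charCount(first); i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!Character.isJavaIdentifierPart(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        String[] samples = {"_foo", "\u203Fx", "$x", "1abc", "class"};
        for (String s : samples) {
            System.out.println(s + " -> sketch: " + looksLikeIdentifier(s)
                    + ", SourceVersion.isName: " + SourceVersion.isName(s));
        }
    }
}
```

SourceVersion.isName differs from the sketch for keywords: it returns false for "class", while the hand-rolled check only looks at the characters.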

Solution 4 - Java

Here is a list of connector characters in Unicode. Apart from the underscore, you will not find them on your keyboard.

U+005F LOW LINE _
U+203F UNDERTIE ‿
U+2040 CHARACTER TIE ⁀
U+2054 INVERTED UNDERTIE ⁔
U+FE33 PRESENTATION FORM FOR VERTICAL LOW LINE ︳
U+FE34 PRESENTATION FORM FOR VERTICAL WAVY LOW LINE ︴
U+FE4D DASHED LOW LINE ﹍
U+FE4E CENTRELINE LOW LINE ﹎
U+FE4F WAVY LOW LINE ﹏
U+FF3F FULLWIDTH LOW LINE ＿

Solution 5 - Java

A connecting character is used to connect two characters.

In Java, a connecting character is the one for which Character.getType(int codePoint)/Character.getType(char ch) returns a value equal to Character.CONNECTOR_PUNCTUATION.

Note that in Java, the character information is based on Unicode standard which identifies connecting characters by assigning them the general category Pc, which is an alias for Connector_Punctuation.

The following code snippet,

for (int i = Character.MIN_CODE_POINT; i <= Character.MAX_CODE_POINT; i++) {
    if (Character.getType(i) == Character.CONNECTOR_PUNCTUATION
            && Character.isJavaIdentifierStart(i)) {
        System.out.println("character: " + String.valueOf(Character.toChars(i))
                + ", codepoint: " + i + ", hexcode: " + Integer.toHexString(i));
    }
}

prints the connecting characters that can be used to start an identifier on jdk1.6.0_45:

character: _, codepoint: 95, hexcode: 5f
character: ‿, codepoint: 8255, hexcode: 203f
character: ⁀, codepoint: 8256, hexcode: 2040
character: ⁔, codepoint: 8276, hexcode: 2054
character: ・, codepoint: 12539, hexcode: 30fb
character: ︳, codepoint: 65075, hexcode: fe33
character: ︴, codepoint: 65076, hexcode: fe34
character: ﹍, codepoint: 65101, hexcode: fe4d
character: ﹎, codepoint: 65102, hexcode: fe4e
character: ﹏, codepoint: 65103, hexcode: fe4f
character: ＿, codepoint: 65343, hexcode: ff3f
character: ･, codepoint: 65381, hexcode: ff65

The following compiles on jdk1.6.0_45,

int _, ‿, ⁀, ⁔, ・, ︳, ︴, ﹍, ﹎, ﹏, ＿, ･ = 0;

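Apparently, the above declaration fails to compile on jdk1.7.0_80 and jdk1.8.0_51 because of the following two characters (backward compatibility...oops!!!). Current Unicode versions classify both as Other_Punctuation (Po) rather than Pc, which would explain why newer JDKs no longer accept them: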

character: ・, codepoint: 12539, hexcode: 30fb
character: ･, codepoint: 65381, hexcode: ff65
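To see how a particular JDK treats these two characters, a small check along the following lines can help. This is only a sketch; the class and method names (MiddleDotCheck, describe) are invented for the example, and U+203F is included for comparison.

```java
class MiddleDotCheck {
    // Reports whether the running JDK treats a code point as connector
    // punctuation and whether it may start an identifier.
    static String describe(int cp) {
        boolean connector = Character.getType(cp) == Character.CONNECTOR_PUNCTUATION;
        return String.format("U+%04X connector=%b identifierStart=%b",
                cp, connector, Character.isJavaIdentifierStart(cp));
    }

    public static void main(String[] args) {
        System.out.println(describe(0x30FB)); // KATAKANA MIDDLE DOT
        System.out.println(describe(0xFF65)); // HALFWIDTH KATAKANA MIDDLE DOT
        System.out.println(describe(0x203F)); // UNDERTIE, for comparison
    }
}
```

On Java 7 or later I would expect the two middle dots to report false for both properties and the undertie to report true, in line with the compile failures described above.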

Anyway, details aside, the exam focuses only on the Basic Latin character set.

Also, for legal identifiers in Java, refer to the spec. Use the Character class APIs to get more details.

Solution 6 - Java

One of the most, well, fun characters that are allowed in Java identifiers (though not at the start) is the Unicode character named "Zero Width Non Joiner" (&zwnj;, U+200C, https://en.wikipedia.org/wiki/Zero-width_non-joiner).

I had this once in a piece of XML inside an attribute value holding a reference to another piece of that XML. Since the ZWNJ is "zero width" it cannot be seen (except when walking along with the cursor, it is displayed right on the character before). It also couldn't be seen in the logfile and/or console output. But it was there all the time: copy & paste into search fields got it and thus did not find the referred position. Typing the (visible part of the) string into the search field however found the referred position. Took me a while to figure this out.

Typing a Zero-Width-Non-Joiner is actually quite easy (too easy) when using the European keyboard layout, at least in its German variant, e.g. "Europatastatur 2.02" - it is reachable with AltGr + ".", two keys which unfortunately are located directly next to each other on most keyboards and can easily be hit together accidentally.

Back to Java: I thought well, you could write some code like this:

void foo() {
    int i = 1;
    int i‌ = 2;
}

with the second i followed by a zero-width non-joiner (can't do that in the above code snippet in Stack Overflow's editor), but that didn't work. IntelliJ (16.3.3) did not complain, but javac (Java 8) did complain about an already defined identifier. It seems javac actually allows the ZWNJ character as part of an identifier, but when using reflection to see what it does, the ZWNJ character is stripped off the identifier, something that does not happen to characters like ‿.
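The Character class offers a hint about this behaviour: it classifies U+200C as an "ignorable" identifier character. The sketch below queries that classification and strips ignorable characters from a name. It is not javac's implementation, only an illustration of treating such characters as absent; the names ZwnjCheck and withoutIgnorables are hypothetical, and String.codePoints() needs Java 8 or later.

```java
class ZwnjCheck {
    static final int ZWNJ = 0x200C;      // ZERO WIDTH NON-JOINER
    static final int UNDERTIE = 0x203F;  // a connector, for comparison

    // Returns the identifier with all "ignorable" code points removed.
    static String withoutIgnorables(String id) {
        StringBuilder sb = new StringBuilder();
        id.codePoints()
          .filter(cp -> !Character.isIdentifierIgnorable(cp))
          .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("ZWNJ part:      " + Character.isJavaIdentifierPart(ZWNJ));
        System.out.println("ZWNJ start:     " + Character.isJavaIdentifierStart(ZWNJ));
        System.out.println("ZWNJ ignorable: " + Character.isIdentifierIgnorable(ZWNJ));
        System.out.println("‿ ignorable:    " + Character.isIdentifierIgnorable(UNDERTIE));
        // "i" followed by ZWNJ collapses to plain "i" once ignorables are dropped.
        System.out.println(withoutIgnorables("i\u200C").equals("i"));
    }
}
```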

Solution 7 - Java

The list of characters you can use inside your identifiers (rather than just at the start) is much more fun:

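for (int i = Character.MIN_CODE_POINT; i <= Character.MAX_CODE_POINT; i++)
    if (Character.isJavaIdentifierPart(i) && !Character.isAlphabetic(i))
        // toChars, not a (char) cast, so code points above U+FFFF print correctly.
        System.out.print(new String(Character.toChars(i)) + " ");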

The list is:

I wanted to post the output, but it's forbidden by the SO spam filter. That's how fun it is!

It includes most of the control characters! I mean bells and stuff! You can make your source code ring the fn bell! Or use characters which will only be displayed sometimes, like the soft hyphen.
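Since the raw output is hard to post, one alternative is to summarise it by Unicode general category instead of printing the characters themselves. The following is a sketch under that assumption; the class and method names (IdentifierPartSummary, countByType) are made up, and the map keys are the numeric constants returned by Character.getType (for example Character.CONTROL or Character.FORMAT).

```java
import java.util.Map;
import java.util.TreeMap;

class IdentifierPartSummary {
    // Counts the non-alphabetic code points allowed inside identifiers,
    // grouped by the value of Character.getType.
    static Map<Integer, Integer> countByType() {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            if (Character.isJavaIdentifierPart(cp) && !Character.isAlphabetic(cp)) {
                counts.merge(Character.getType(cp), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        countByType().forEach((type, n) ->
                System.out.println("type " + type + ": " + n + " code points"));
        // Spot checks for the two examples mentioned above.
        System.out.println("BEL (U+0007) allowed:         " + Character.isJavaIdentifierPart(0x07));
        System.out.println("soft hyphen (U+00AD) allowed: " + Character.isJavaIdentifierPart(0xAD));
    }
}
```

Most counts depend on the Unicode version of the JDK, so they will vary between releases.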

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | LuckyLuke | View Question on Stackoverflow
Solution 1 - Java | Peter Lawrey | View Answer on Stackoverflow
Solution 2 - Java | Markus Mikkolainen | View Answer on Stackoverflow
Solution 3 - Java | Greg Hewgill | View Answer on Stackoverflow
Solution 4 - Java | Simulant | View Answer on Stackoverflow
Solution 5 - Java | sxnamit | View Answer on Stackoverflow
Solution 6 - Java | Ulrich Grepel | View Answer on Stackoverflow
Solution 7 - Java | Aleksandr Dubinsky | View Answer on Stackoverflow