What's the point of UTF-16?

Utf 8Character EncodingUtf 16UtfUtf 32

Utf 8 Problem Overview


I've never understood the point of UTF-16 encoding. If you need to be able to treat strings as random access (i.e. a code point is the same as a code unit) then you need UTF-32, since UTF-16 is still variable length. If you don't need this, then UTF-16 seems like a colossal waste of space compared to UTF-8. What are the advantages of UTF-16 over UTF-8 and UTF-32 and why do Windows and Java use it as their native encoding?

Utf 8 Solutions


Solution 1 - Utf 8

When Windows NT was designed UTF-16 didn't exist (NT 3.51 was born in 1993, while UTF-16 was born in 1996 with the Unicode 2.0 standard); there was instead UCS-2, which, at that time, was enough to hold every character available in Unicode, so the 1 code point = 1 code unit equivalence was actually true - no variable-length logic needed for strings.

They moved to UTF-16 later, to support the whole Unicode character set; however they couldn't move to UTF-8 or to UTF-32, because this would have broken binary compatibility in the API interface (among the other things).

As for Java, I'm not really sure; since it was released in ~1995 I suspect that UTF-16 was already in the air (even if it wasn't standardized yet), but I think that compatibility with NT-based operating systems may have played some role in their choice (continuous UTF-8 <-> UTF-16 conversions for every call to Windows APIs can introduce some slowdown).


Edit

Wikipedia explains that even for Java it went in the same way: it originally supported UCS-2, but moved to UTF-16 in J2SE 5.0.

So, in general when you see UTF-16 used in some API/Framework it is because it started as UCS-2 (to avoid complications in the string-management algorithms) but it moved to UTF-16 to support the code points outside the BMP, still maintaining the same code unit size.

Solution 2 - Utf 8

None of the replies indicating an advantage of UTF-16 over UTF-8 make any sense, except for the backwards-compatibility reply.

Well, there are two caveats to my comment.

Erik states: "UTF-16 covers the entire BMP with single units - So unless you have a need for the rarer characters outside the BMP, UTF-16 is effectively 2 bytes per character."

Caveat 1)

If you can be certain that your application will NEVER need any character outside of the BMP, and that any library code you write for use with it will NEVER be used with any application that will ever need a character outside the BMP, then you could use UTF-16, and write code that makes the implicit assumption that every character will be exactly two bytes in length.

That seems exceedingly dangerous (actually, stupid).

If your code assumes that all UTF-16 characters are two bytes in length, and your program interacts with an application or library where there is a single character outside of the BMP, then your code will break. Code that examines or manipulates UTF-16 must be written to handle the case of a UTF-16 character requiring more than 2 bytes; therefore, I am "dismissing" this caveat.

UTF-16 is not simpler to code for than UTF-8 (code for both must handle variable-length characters).

Caveat 2)

UTF-16 MIGHT be more computationally efficient, under some circumstances, if suitably written.

Like this: Suppose that certain long strings are seldom modified, but often examined (or better, never modified once built - i.e., a string builder creating unmodifiable strings). A flag could be set for each string, indicating whether the string contains only "fixed length" characters (i.e., contains no characters that are not exactly two bytes in length). Strings for which the flag is true could be examined with optimized code that assumes fixed length (2 byte) characters.

How about space-efficiency?

UTF-16 is, obviously, more efficient for A) characters for which UTF-16 requires fewer bytes to encode than does UTF-8.

UTF-8 is, obviously, more efficient for B) characters for which UTF-8 requires fewer bytes to encode than does UTF-16.

Except for very "specialized" text, it's likely that count(B) far exceeds count(A).

Solution 3 - Utf 8

UTF-16 covers the entire BMP with single units - So unless you have a need for the rarer characters outside the BMP, UTF-16 is effectively 2 bytes per character. UTF-32 takes more space, UTF-8 requires variable-length support.

Solution 4 - Utf 8

UTF-16 allows all of the basic multilingual plane (BMP) to be represented as single code units. Unicode code points beyond U+FFFF are represented by surrogate pairs.

The interesting thing is that Java and Windows (and other systems that use UTF-16) all operate at the code unit level, not the Unicode code point level. So the string consisting of the single character U+1D122 (MUSICAL SYMBOL F CLEF) gets encoded in Java as "\ud824\udd22" and "\ud824\udd22".length() == 2 (not 1). So it's kind of a hack, but it turns out that characters are not variable length.

The advantage of UTF-16 over UTF-8 is that one would give up too much if the same hack were used with UTF-8.

Solution 5 - Utf 8

UTF16 is generally used as a direct mapping to multi-byte character sets, ie onyl the original 0-0xFFFF assigned characters.

This gives you the best of both worlds, you have fixed character size but can still print all the characters anyone is likely to use (orthodox Klingon religous scripts excepted)

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestiondsimchaView Question on Stackoverflow
Solution 1 - Utf 8Matteo ItaliaView Answer on Stackoverflow
Solution 2 - Utf 8user3162129View Answer on Stackoverflow
Solution 3 - Utf 8ErikView Answer on Stackoverflow
Solution 4 - Utf 8Ted HoppView Answer on Stackoverflow
Solution 5 - Utf 8Martin BeckettView Answer on Stackoverflow