Do UTF-8, UTF-16, and UTF-32 differ in the number of characters they can store?

Unicode, Character Encoding, Utf

Unicode Problem Overview


Okay. I know this looks like the typical "Why didn't he just Google it or go to www.unicode.org and look it up?" question, but for such a simple question the answer still eludes me after checking both sources.

I am pretty sure that all three of these encoding systems support all of the Unicode characters, but I need to confirm it before I make that claim in a presentation.

Bonus question: Do these encodings differ in the number of characters they can be extended to support?

Unicode Solutions


Solution 1 - Unicode

There is no Unicode character that can be stored in one encoding but not another. This is simply because the valid Unicode characters have been restricted to what can be stored in UTF-16 (which has the smallest capacity of the three encodings). In other words, UTF-8 and UTF-32 could be used to represent a wider range of characters than UTF-16, but they aren't. Read on for more details.

UTF-8

UTF-8 is a variable-length code. Some characters require 1 byte, some require 2, some 3 and some 4. The bytes for each character are simply written one after another as a continuous stream of bytes.

While some UTF-8 characters can be 4 bytes long, UTF-8 cannot encode 2^32 characters. It's not even close. I'll try to explain the reasons for this.

The software that reads a UTF-8 stream just gets a sequence of bytes. How is it supposed to decide whether the next 4 bytes are a single 4-byte character, or two 2-byte characters, or four 1-byte characters (or some other combination)? Basically this is done by deciding that certain 1-byte sequences aren't valid characters, and certain 2-byte sequences aren't valid characters, and so on. When these invalid sequences appear, it is assumed that they form part of a longer sequence.

You've seen a rather different example of this, I'm sure: it's called escaping. In many programming languages it is decided that the \ character in a string's source code doesn't translate to any valid character in the string's "compiled" form. When a \ is found in the source, it is assumed to be part of a longer sequence, like \n or \xFF. Note that \x is an invalid 2-character sequence, and \xF is an invalid 3-character sequence, but \xFF is a valid 4-character sequence.

Basically, there's a trade-off between having many characters and having shorter characters. If you want 2^32 characters, they need to be on average 4 bytes long. If you want all your characters to be 2 bytes or less, then you can't have more than 2^16 characters. UTF-8 gives a reasonable compromise: all ASCII characters (ASCII 0 to 127) are given 1-byte representations, which is great for compatibility, but many more characters are allowed.

Like most variable-length encodings, including the kinds of escape sequences shown above, UTF-8 is an instantaneous code. This means that the decoder just reads byte by byte, and as soon as it reaches the last byte of a character, it knows what the character is (and it knows that it isn't the beginning of a longer character).

For instance, the character 'A' is represented using the byte 65, and there are no two/three/four-byte characters whose first byte is 65. Otherwise the decoder wouldn't be able to tell those characters apart from an 'A' followed by something else.
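To make the "instantaneous" property concrete, here is a minimal Python sketch of how a decoder might find character boundaries from the first byte alone. The function names are hypothetical, and the lead-byte ranges assume the 4-byte limit of RFC 3629; this is an illustration, not a full validating decoder (it does not check the continuation bytes).

```python
def utf8_sequence_length(lead_byte):
    """Return the length in bytes of the UTF-8 sequence that starts with lead_byte."""
    if lead_byte < 0x80:             # 0xxxxxxx: 1-byte (ASCII) character
        return 1
    if 0xC2 <= lead_byte <= 0xDF:    # 110xxxxx: start of a 2-byte character
        return 2
    if 0xE0 <= lead_byte <= 0xEF:    # 1110xxxx: start of a 3-byte character
        return 3
    if 0xF0 <= lead_byte <= 0xF4:    # 11110xxx: start of a 4-byte character
        return 4
    # 0x80-0xBF are continuation bytes; 0xC0, 0xC1 and 0xF5-0xFF never appear.
    raise ValueError(f"0x{lead_byte:02X} cannot start a UTF-8 sequence")


def split_characters(data):
    """Split a UTF-8 byte string into one bytes object per character."""
    chunks = []
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        chunks.append(data[i:i + n])
        i += n
    return chunks


# 'A', e-acute, euro sign, and an emoji: 1, 2, 3 and 4 bytes respectively.
text = "A\u00e9\u20ac\U0001F600"
chunks = split_characters(text.encode("utf-8"))
lengths = [len(chunk) for chunk in chunks]
print(lengths)
```

The decoder never has to look ahead or back up: the first byte of each character says how many bytes follow.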

But UTF-8 is restricted even further. It ensures that the encoding of a shorter character never appears anywhere within the encoding of a longer character. For instance, none of the bytes in a 4-byte character can be 65.

Since UTF-8 has 128 different 1-byte characters (whose byte values are 0-127), all 2, 3 and 4-byte characters must be composed solely of bytes in the range 128-255. That's a big restriction. However, it allows byte-oriented string functions to work with little or no modification. For instance, C's strstr() function always works as expected if its inputs are valid UTF-8 strings.
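The following Python sketch checks both claims on a small sample: every byte of a multi-byte character has its high bit set, and a plain byte-level search (the analogue of strstr()) lands on a real character boundary. The helper name and the sample string are made up for illustration.

```python
def multibyte_bytes_are_high(text):
    """True if every byte of every multi-byte UTF-8 character in text is >= 128."""
    for ch in text:
        encoded_char = ch.encode("utf-8")
        if len(encoded_char) > 1 and any(b < 0x80 for b in encoded_char):
            return False
    return True


# "naive" with a diaeresis, "cafe" with an acute accent, an emoji, and an 'A'.
text = "na\u00efve caf\u00e9 \U0001F600 A"
needle = "caf\u00e9"

encoded = text.encode("utf-8")
all_high = multibyte_bytes_are_high(text)

# Byte-oriented search, with no knowledge of UTF-8 at all.
byte_index = encoded.find(needle.encode("utf-8"))
# Character-oriented search on the decoded string.
char_index = text.find(needle)

# The bytes before the match decode cleanly to exactly char_index characters.
prefix_chars = len(encoded[:byte_index].decode("utf-8"))
print(all_high, byte_index, char_index, prefix_chars)
```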

UTF-16

UTF-16 is also a variable-length code; its characters consume either 2 or 4 bytes. 2-byte values in the range 0xD800-0xDFFF are reserved for constructing 4-byte characters, and every 4-byte character consists of a 2-byte value in the range 0xD800-0xDBFF followed by a 2-byte value in the range 0xDC00-0xDFFF. For this reason, Unicode does not assign any characters in the range U+D800-U+DFFF.
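As a rough illustration of how a 4-byte UTF-16 character is built, the sketch below splits a code point above U+FFFF into its two reserved 2-byte values. The function name is hypothetical; the arithmetic is the standard surrogate-pair calculation, and the result is compared against Python's own UTF-16 encoder.

```python
def to_surrogate_pair(code_point):
    """Split a code point in U+10000..U+10FFFF into (high, low) UTF-16 values."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits, lands in 0xD800-0xDBFF
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits, lands in 0xDC00-0xDFFF
    return high, low


high, low = to_surrogate_pair(0x1F600)
manual_bytes = high.to_bytes(2, "big") + low.to_bytes(2, "big")
codec_bytes = chr(0x1F600).encode("utf-16-be")
print(hex(high), hex(low), manual_bytes == codec_bytes)
```

Two 10-bit halves give 2^20 combinations, which is why the highest code point reachable this way is 0x10000 + 0xFFFFF = 0x10FFFF.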

UTF-32

UTF-32 is a fixed-length code, with each character being 4 bytes long. While this allows the encoding of 2^32 different characters, only values between 0 and 0x10FFFF are allowed in this scheme.
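A short Python sketch of both points: each UTF-32 unit is simply the code point stored in 4 bytes, and a value above 0x10FFFF is rejected even though it fits in 32 bits. The helper names are made up for this example, and the rejection shown is the behaviour of Python's built-in codec.

```python
def utf32_code_units(text):
    """Return the 4-byte units of text's big-endian UTF-32 encoding as integers."""
    data = text.encode("utf-32-be")
    return [int.from_bytes(data[i:i + 4], "big") for i in range(0, len(data), 4)]


def decodes_as_utf32(value):
    """True if the 32-bit value is accepted by Python's UTF-32 decoder."""
    try:
        value.to_bytes(4, "big").decode("utf-32-be")
        return True
    except UnicodeDecodeError:
        return False


sample = "A\u20ac\U0001F600"
units = utf32_code_units(sample)
print([hex(u) for u in units])
print(decodes_as_utf32(0x10FFFF), decodes_as_utf32(0x110000))
```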

Capacity comparison:

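  • UTF-8: 2,097,152 (the 4-byte form carries 21 bits; counting every bit pattern across the four lengths gives 128 + 2,048 + 65,536 + 2,097,152 = 2,164,864, but due to design details some of them map to the same thing)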
  • UTF-16: 1,112,064
  • UTF-32: 4,294,967,296 (but restricted to the first 1,114,112)
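The headline figures in this list can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch in Python; the variable names are invented for the example.

```python
# Code points reserved for surrogates: U+D800..U+DFFF.
SURROGATES = 0xDFFF - 0xD800 + 1                      # 2,048

# UTF-8: a 4-byte sequence carries 3 + 6 + 6 + 6 = 21 payload bits.
utf8_capacity = 2 ** 21                               # 2,097,152

# UTF-16: single 2-byte values that are not surrogates,
# plus every (high, low) pair of 1,024 x 1,024 surrogates.
utf16_single_units = 2 ** 16 - SURROGATES             # 63,488
utf16_pairs = 1024 * 1024                             # 1,048,576
utf16_capacity = utf16_single_units + utf16_pairs     # 1,112,064

# UTF-32: 32 raw bits, but only U+0000..U+10FFFF is allowed.
utf32_raw = 2 ** 32                                   # 4,294,967,296
unicode_code_space = 0x10FFFF + 1                     # 1,114,112

print(f"{utf8_capacity:,} {utf16_capacity:,} {utf32_raw:,} {unicode_code_space:,}")
```

Note that 1,114,112 minus the 2,048 surrogate code points is exactly the UTF-16 figure of 1,112,064.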

The most restricted is therefore UTF-16! The formal Unicode definition has limited the Unicode characters to those that can be encoded with UTF-16 (i.e. the range U+0000 to U+10FFFF excluding U+D800 to U+DFFF). UTF-8 and UTF-32 support all of these characters.

The UTF-8 system is in fact "artificially" limited to 4 bytes. It can be extended to 8 bytes without violating the restrictions I outlined earlier, and this would yield a capacity of 2^42. The original UTF-8 specification in fact allowed up to 6 bytes, which gives a capacity of 2^31. But RFC 3629 limited it to 4 bytes, since that is how much is needed to cover all of what UTF-16 does.
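The capacities quoted here follow from counting payload bits. The sketch below assumes the usual UTF-8 pattern (a lead byte with n one-bits followed by a zero, then n-1 continuation bytes carrying 6 bits each) is simply continued up to 8 bytes; that continuation is hypothetical and not part of any current standard.

```python
def payload_bits(n_bytes):
    """Payload bits in an n-byte sequence under the (extended) UTF-8 bit pattern."""
    if n_bytes == 1:
        return 7                          # 0xxxxxxx
    # The lead byte spends n one-bits and a zero on the length marker,
    # leaving 7 - n payload bits (none at all once n reaches 7).
    lead_bits = max(7 - n_bytes, 0)
    return lead_bits + 6 * (n_bytes - 1)  # each continuation byte is 10xxxxxx


bits_by_length = {n: payload_bits(n) for n in range(1, 9)}
for n, bits in bits_by_length.items():
    print(f"{n} byte(s): {bits} bits")
```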

There are other (mainly historical) Unicode encoding schemes, notably UCS-2 (which is only capable of encoding U+0000 to U+FFFF).

Solution 2 - Unicode

No, they're simply different encoding methods. They all support encoding the same set of characters.

UTF-8 uses anywhere from one to four bytes per character depending on what character you're encoding. Characters within the ASCII range take only one byte, while characters outside the Basic Multilingual Plane (above U+FFFF) take four.

UTF-32 uses four bytes per character regardless of what character it is, so it never uses less space than UTF-8 to encode the same string, and almost always uses more. The only advantage is that you can calculate the number of code points in a UTF-32 string just by counting bytes and dividing by four.

UTF-16 uses two bytes for most characters, four bytes for unusual ones.
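To see the size differences on real strings, here is a small Python sketch. The function name is made up, and the little-endian codec variants are used only so that no byte order mark is added to the count.

```python
def encoded_sizes(text):
    """Return the encoded size of text in bytes for each of the three encodings."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),   # "-le" avoids a BOM
        "utf-32": len(text.encode("utf-32-le")),
    }


ascii_sizes = encoded_sizes("hello")
euro_sizes = encoded_sizes("\u20ac")               # euro sign
mixed_sizes = encoded_sizes("hi \U0001F600")       # ASCII plus one emoji

for label, sizes in [("hello", ascii_sizes), ("euro", euro_sizes), ("mixed", mixed_sizes)]:
    print(label, sizes)
```

The euro sign is a case where UTF-16 is the most compact of the three: it takes three bytes in UTF-8 but only two in UTF-16.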

http://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings

Solution 3 - Unicode

UTF-8, UTF-16, and UTF-32 all support the full set of Unicode code points. There are no characters that are supported by one but not another.
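If you want to check this empirically before a presentation, one possible approach is to round-trip every code point through each encoding. The sketch below does this with Python's built-in codecs; the helper names are invented, and surrogate code points (U+D800 to U+DFFF) are skipped because they are not characters and the codecs refuse to encode them.

```python
def all_scalar_values():
    """Build one string containing every code point except the surrogates."""
    return "".join(
        chr(cp) for cp in range(0x110000) if not 0xD800 <= cp <= 0xDFFF
    )


def round_trips(text, encoding):
    """True if encoding and then decoding text gives back the same string."""
    return text.encode(encoding).decode(encoding) == text


everything = all_scalar_values()
results = {
    encoding: round_trips(everything, encoding)
    for encoding in ("utf-8", "utf-16-le", "utf-32-le")
}
print(len(everything), results)
```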

As for the bonus question "Do these encodings differ in the number of characters they can be extended to support?" Yes and no. The way UTF-8 and UTF-16 are encoded limits the total number of code points they can support to less than 2^32. However, the Unicode Consortium will not add code points to UTF-32 that cannot be represented in UTF-8 or UTF-16. Doing so would violate the spirit of the encoding standards, and make it impossible to guarantee a one-to-one mapping from UTF-32 to UTF-8 (or UTF-16).

Solution 4 - Unicode

I personally always check Joel's post about Unicode, encodings and character sets when in doubt.

Solution 5 - Unicode

All of the UTF-8/16/32 encodings can map all Unicode characters. See Wikipedia's Comparison of Unicode Encodings.

The IBM article "Encode your XML documents in UTF-8" is very helpful. It indicates that, if you have the choice, it's better to choose UTF-8, mainly because of wide tool support and because UTF-8 can usually pass through systems that are unaware of Unicode.

From the section "What the specs say" in the IBM article:

> Both the W3C and the IETF have recently become more adamant about choosing UTF-8 first, last, and sometimes only. The W3C Character Model for the World Wide Web 1.0: Fundamentals states, "When a unique character encoding is required, the character encoding MUST be UTF-8, UTF-16 or UTF-32. US-ASCII is upwards-compatible with UTF-8 (an US-ASCII string is also a UTF-8 string, see [RFC 3629]), and UTF-8 is therefore appropriate if compatibility with US-ASCII is desired." In practice, compatibility with US-ASCII is so useful it's almost a requirement. The W3C wisely explains, "In other situations, such as for APIs, UTF-16 or UTF-32 may be more appropriate. Possible reasons for choosing one of these include efficiency of internal processing and interoperability with other processes."

Solution 6 - Unicode

As everyone has said, UTF-8, UTF-16, and UTF-32 can all encode all of the Unicode code points. However, the UCS-2 (sometimes mistakenly referred to as UCS-16) variant can't, and this is the one that you find e.g. in Windows XP/Vista.

See Wikipedia for more information.

Edit: I am wrong about Windows, NT was the only one to support UCS-2. However, many Windows applications will assume a single word per code point as in UCS-2, so you are likely to find bugs. See another Wikipedia article. (Thanks JasonTrue)
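The kind of bug meant here can be shown in a few lines. This is only an illustrative Python sketch with made-up helper names: it compares the number of 16-bit words in a string's UTF-16 form with the number of code points.

```python
def utf16_code_units(text):
    """Number of 16-bit words in the UTF-16 encoding of text."""
    return len(text.encode("utf-16-le")) // 2


def code_points(text):
    """Number of Unicode code points in text (Python strings index by code point)."""
    return len(text)


sample = "A\U0001F600"          # 'A' followed by one emoji above U+FFFF
units = utf16_code_units(sample)
points = code_points(sample)
print(units, points)
```

Code that assumes one word per code point would treat this two-character string as three characters, and could split the emoji in half.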

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | JohnFx | View Question on Stackoverflow
Solution 1 - Unicode | Artelius | View Answer on Stackoverflow
Solution 2 - Unicode | skoob | View Answer on Stackoverflow
Solution 3 - Unicode | Derek Park | View Answer on Stackoverflow
Solution 4 - Unicode | Atanas Korchev | View Answer on Stackoverflow
Solution 5 - Unicode | Robert Paulson | View Answer on Stackoverflow
Solution 6 - Unicode | Mark Ransom | View Answer on Stackoverflow