Unicode encoding for string literals in C++11


C++ Problem Overview


Following a related question, I'd like to ask about the new character and string literal types in C++11. It seems that we now have four sorts of characters and five sorts of string literals. The character types:

char     a =  '\x30';         // character, no semantics
wchar_t  b = L'\xFFEF';       // wide character, no semantics
char16_t c = u'\u00F6';       // 16-bit, assumed UTF16?
char32_t d = U'\U0010FFFF';   // 32-bit, assumed UCS-4

And the string literals:

char     A[] =  "Hello\x0A";         // byte string, "narrow encoding"
wchar_t  B[] = L"Hell\xF6\x0A";      // wide string, impl-def'd encoding
char16_t C[] = u"Hell\u00F6";        // (1)
char32_t D[] = U"Hell\U000000F6\U0010FFFF"; // (2)
char     E[] = u8"\u00F6\U0010FFFF"; // (3)

The question is this: Are the \x/\u/\U character references freely combinable with all string types? Are all the string types fixed-width, i.e. the arrays contain precisely as many elements as appear in the literal, or do \x/\u/\U references get expanded into a variable number of bytes? Do u"" and u8"" strings have encoding semantics, e.g. can I say char16_t x[] = u"\U0010FFFF", and the non-BMP code point gets encoded into a two-unit UTF-16 sequence? And similarly for u8? In (1), can I write lone surrogates with \u? Finally, are any of the string functions encoding aware (i.e. are they character-aware and can they detect invalid byte sequences)?

This is a bit of an open-ended question, but I'd like to get as complete a picture as possible of the new UTF-encoding and type facilities of the new C++11.

C++ Solutions


Solution 1 - C++

> Are the \x/\u/\U character references freely combinable with all string types?

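Not quite. \x can be used in anything. \u and \U are also accepted in every kind of literal, but only the UTF-encoded literals (u8"", u"" and U"") guarantee how the named code point is encoded; in a narrow or wide literal the result depends on the implementation's execution character set. Within a UTF-encoded string, \u and \U can be used as you see fit.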
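As a small sketch of how the two escape forms differ inside a UTF-8 literal, the checks below compare literal sizes at compile time. They assume a C++11 or later compiler; sizeof counts the terminating null, and the literal built with \xF6 is deliberately not valid UTF-8.

```cpp
// \x writes one code unit with exactly the value given; nothing is encoded.
static_assert(sizeof(u8"\xF6") == 2, "one raw code unit plus the null");

// \u names the code point U+00F6, which UTF-8 encodes as 0xC3 0xB6.
static_assert(sizeof(u8"\u00F6") == 3, "two code units plus the null");

// In a UTF-16 literal both spellings happen to give a single code unit.
static_assert(sizeof(u"\xF6") == sizeof(u"\u00F6"),
              "one char16_t plus the null in each literal");
```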

> Are all the string types fixed-width, i.e. the arrays contain precisely as many elements as appear in the literal, or do \x/\u/\U references get expanded into a variable number of bytes?

Not in the way you mean. \u and \U name a code point, which is converted according to the encoding of the containing string, so the number of "code units" it occupies (in Unicode terms, a char16_t holds one UTF-16 code unit) depends on that encoding. \x is different: it always inserts a single code unit with the given value. The literal u8"\u1024" creates a string containing 3 chars plus a null terminator, because U+1024 takes three bytes in UTF-8. The literal u"\u1024" creates a string containing 1 char16_t plus a null terminator.

The number of code units used is based on the Unicode encoding.
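The counts can be checked at compile time. The following is a minimal sketch, assuming a C++11 or later compiler; code_units is a hypothetical helper introduced only for this illustration, not a standard function.

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// excluding the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// U+1024 lies between U+0800 and U+FFFF, so it needs three UTF-8
// code units but only one UTF-16 or UTF-32 code unit.
static_assert(code_units(u8"\u1024") == 3, "UTF-8: three code units");
static_assert(code_units(u"\u1024") == 1, "UTF-16: one code unit");
static_assert(code_units(U"\u1024") == 1, "UTF-32: one code unit");
```

The helper is a template on the element type so that it accepts u8"" literals whether their elements are char (C++11 to C++17) or char8_t (C++20).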

> Do u"" and u8"" strings have encoding semantics, e.g. can I say char16_t x[] = u"\U0010FFFF", and the non-BMP codepoint gets encoded into a two-unit UTF16 sequence?

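Yes. u"" creates a UTF-16 encoded string and u8"" creates a UTF-8 encoded string, each encoded per the Unicode specification. So u"\U0010FFFF" holds the surrogate pair 0xDBFF 0xDFFF, and u8"\U0010FFFF" holds the four bytes 0xF4 0x8F 0xBF 0xBF.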
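To make the encoding of a non-BMP code point concrete, here is a short compile-time sketch, assuming a C++11 or later compiler. The array name max_utf16 is made up for the example.

```cpp
// U+10FFFF is outside the BMP, so UTF-16 needs a surrogate pair for it.
constexpr char16_t max_utf16[] = u"\U0010FFFF";

static_assert(sizeof(max_utf16) / sizeof(char16_t) == 3,
              "two code units plus the null");
static_assert(max_utf16[0] == 0xDBFF, "high surrogate");
static_assert(max_utf16[1] == 0xDFFF, "low surrogate");

// The same code point takes four code units in UTF-8 and one in UTF-32.
static_assert(sizeof(u8"\U0010FFFF") == 5, "four bytes plus the null");
static_assert(sizeof(U"\U0010FFFF") / sizeof(char32_t) == 2,
              "one code unit plus the null");
```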

> In (1), can I write lone surrogates with \u?

Absolutely not. The specification expressly forbids using the UTF-16 surrogate code points (0xD800-0xDFFF) as code points for \u or \U.
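The restriction applies to \u and \U only, not to the array type itself: a char16_t is just an integer, so a surrogate value can still be stored as a raw code unit. A minimal sketch, assuming a C++11 or later compiler, where is_surrogate and lone are hypothetical names used for illustration:

```cpp
// Hypothetical helper: true if the value lies in the surrogate range
// 0xD800-0xDFFF, which \u and \U are not allowed to name.
constexpr bool is_surrogate(char32_t c) {
    return c >= 0xD800 && c <= 0xDFFF;
}

// Writing the same value with \u inside a u"" literal would be ill-formed.
// An array initializer stores it as a plain code unit instead, which
// leaves the array holding ill-formed UTF-16.
constexpr char16_t lone[] = { 0xD800, 0 };

static_assert(is_surrogate(lone[0]), "the array holds a lone high surrogate");
```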

> Finally, are any of the string functions encoding aware (i.e. they are character-aware and can detect invalid byte sequences)?

Absolutely not. Well, allow me to rephrase that.

std::basic_string doesn't deal with Unicode encodings. Its specializations can certainly store UTF-encoded strings, but they can only think of them as sequences of char, char16_t, or char32_t; they can't think of them as a sequence of Unicode code points that are encoded with a particular mechanism. basic_string::length() will return the number of code units, not code points. And obviously, the C standard library string functions are totally useless for this.
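Since the library offers no code point count, one has to be written by hand. The sketch below is one way to do it for UTF-8, under the assumption that the input is already valid UTF-8 (it performs no validation); count_code_points is a hypothetical name, not a standard function.

```cpp
#include <cstddef>
#include <string>

// Counts code points in a string assumed to hold valid UTF-8.
// Every code point starts with exactly one byte that is not a
// continuation byte (continuation bytes have the form 10xxxxxx).
inline std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the string "Hell\xC3\xB6" (the UTF-8 bytes of "Hellö"), length() reports 6 code units while count_code_points reports 5 code points.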

It should be noted, however, that the "length" of a Unicode string as a reader perceives it is not the number of code points either. Some code points are combining "characters" (an unfortunate name), which combine with the previous code point. So multiple code points can map to a single visual character.
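A compile-time sketch of the combining case, assuming a C++11 or later compiler: the same visual character "é" can be spelled with one code point or with two. The array names are made up for the example.

```cpp
// 'e' followed by U+0301 COMBINING ACUTE ACCENT: two code points that
// render as the single visual character "é".
constexpr char32_t decomposed[] = U"e\u0301";

// U+00E9 LATIN SMALL LETTER E WITH ACUTE: the same visual character
// as a single code point.
constexpr char32_t precomposed[] = U"\u00E9";

// In UTF-32 one code unit is one code point, so the element count
// (minus the null) is the code point count.
static_assert(sizeof(decomposed) / sizeof(char32_t) - 1 == 2,
              "two code points");
static_assert(sizeof(precomposed) / sizeof(char32_t) - 1 == 1,
              "one code point");
```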

Iostreams can in fact read/write Unicode-encoded values. To do so, you will have to use a locale to specify the encoding and properly imbue it into the various places. This is easier said than done, and I don't have any code on me to show you how.
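One possible shape for this, offered as a sketch rather than a recommended recipe: imbue a wide file stream with a locale carrying the C++11 std::codecvt_utf8 facet, so that wchar_t values are written out as UTF-8 bytes. Note that <codecvt> was deprecated in C++17, so newer compilers may warn. The function names write_utf8_file and read_raw_bytes are hypothetical.

```cpp
#include <codecvt>   // std::codecvt_utf8 (C++11, deprecated since C++17)
#include <fstream>
#include <iterator>
#include <locale>
#include <string>

// Writes `text` to `path` as UTF-8 by imbuing the stream with a locale
// whose codecvt facet converts wchar_t values to UTF-8 bytes.
inline void write_utf8_file(const char* path, const std::wstring& text) {
    std::wofstream out;
    // Imbue before opening so the facet is in place for the first write.
    // The locale takes ownership of the facet allocated here.
    out.imbue(std::locale(std::locale::classic(),
                          new std::codecvt_utf8<wchar_t>));
    out.open(path, std::ios::out | std::ios::binary);
    out << text;
}

// Reads the file back as raw bytes, with no conversion applied.
inline std::string read_raw_bytes(const char* path) {
    std::ifstream in(path, std::ios::in | std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

Writing L"Hell\u00F6" this way should leave the six bytes "Hell\xC3\xB6" in the file; reading uses the same imbue step on a std::wifstream.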

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Kerrek SB | View Question on Stackoverflow
Solution 1 - C++ | Nicol Bolas | View Answer on Stackoverflow