Unicode in C++11

C++C++11UnicodeUtf 8Utf 16

C++ Problem Overview


I've been doing a bit of reading around the subject of Unicode -- specifically, UTF-8 -- (non) support in C++11, and I was hoping the gurus on Stack Overflow could reassure me that my understanding is correct, or point out where I've misunderstood or missed something if that is the case.

A short summary

First, the good: you can define UTF-8, UTF-16 and UCS-4 literals in your source code. Also, the <locale> header contains several std::codecvt implementations which can convert between any of UTF-8, UTF-16, UCS-4 and the platform multibyte encoding (although the API seems, to put it mildly, less than than straightforward). These codecvt implementations can be imbue()'d on streams to allow you to do conversion as you read or write a file (or other stream).

[EDIT: Cubbi points out in the comments that I neglected to mention the <codecvt> header, which provides std::codecvt implementations which do not depend on a locale. Also, the std::wstring_convert and wbuffer_convert functions can use these codecvts to convert strings and buffers directly, not relying on streams.]

C++11 also includes the C99/C11 <uchar.h> header which contains functions to convert individual characters from the platform multibyte encoding (which may or may not be UTF-8) to and from UCS-2 and UCS-4.

However, that's about the extent of it. While you can of course store UTF-8 text in a std::string, there are no ways that I can see to do anything really useful with it. For example, other than defining a literal in your code, you can't validate an array of bytes as containing valid UTF-8, you can't find out the length (i.e. number of Unicode characters, for some definition of "character") of a UTF-8-containing std::string, and you can't iterate over a std::string in any way other than byte-by-byte.

Similarly, even the C++11 addition of std::u16string doesn't really support UTF-16, but only the older UCS-2 -- it has no support for surrogate pairs, leaving you with just the BMP.

Observations

Given that UTF-8 is the standard way of handling Unicode on pretty much every Unix-derived system (including Mac OS X and* Linux) and has largely become the de-facto standard on the web, the lack of support in modern C++ seems like a pretty severe omission. Even on Windows, the fact that the new std::u16string doesn't really support UTF-16 seems somewhat regrettable.

* As pointed out in the comments and made clear here, the BSD-derived parts of Mac OS use UTF-8 while Cocoa uses UTF-16.

Questions

If you managed to read all that, thanks! Just a couple of quick questions, as this is Stack Overflow after all...

  • Is the above analysis correct, or are there any other Unicode-supporting facilities I'm missing?

  • The standards committee has done a fantastic job in the last couple of years moving C++ forward at a rapid pace. They're all smart people and I assume they're well aware of the above shortcomings. Is there a particular well-known reason that Unicode support remains so poor in C++?

  • Going forward, does anybody know of any proposals to rectify the situation? A quick search on isocpp.org didn't seem to reveal anything.

EDIT: Thanks everybody for your responses. I have to confess that I find them slightly disheartening -- it looks like the status quo is unlikely to change in the near future. If there is a consensus among the cognoscenti, it seems to be that complete Unicode support is just too hard, and that any solution must reimplement most of ICU to be considered useful.

I personally don't agree with this; I think there is valuable middle ground to be found. For example, the validation and normalisation algorithms for UTF-8 and UTF-16 are well-specified by the Unicode consortium, and could be supplied by the standard library as free functions in, say, a std::unicode namespace. These alone would be a great help for C++ programmes which need to interface with libraries expecting Unicode input. But based on the answer below (tinged, it must be said, with a hint of bitterness) it seems Puppy's proposal for just this sort of limited functionality was not well-received.

C++ Solutions


Solution 1 - C++

> Is the above analysis correct

Let's see.

> you can't validate an array of bytes as containing valid UTF-8

Incorrect. std::codecvt_utf8<char32_t>::length(start, end, max_lenght) returns the number of valid bytes in the array.

> you can't find out the length

Partially correct. One can convert to char32_t and find out the length of the result. There is no easy way to find out the length without doing the actual conversion (but see below). I must say that need to count characters (in any sense) arises rather infrequently.

> you can't iterate over a std::string in any way other than byte-by-byte

Incorrect. std::codecvt_utf8<char32_t>::length(start, end, 1) gives you a possibility to iterate over UTF-8 "characters" (Unicode code units), and of course determine their number (that's not an "easy" way to count the number of characters, but it's a way).

> doesn't really support UTF-16

Incorrect. One can convert to and from UTF-16 with e.g. std::codecvt_utf8_utf16<char16_t>. A result of conversion to UTF-16 is, well, UTF-16. It is not restricted to BMP.

Demo that illustrates these points.

If I have missed some other "you can't", please point it out and I will address it.

Important addendum. These facilities are deprecated in C++17. This probably means they will go away in some future version of C++. Use them at your own risk. All these things enumerated in original question now cannot (safely) be done again, using only the standard library.

Solution 2 - C++

> Is the above analysis correct, or are there any other > Unicode-supporting facilities I'm missing?

You're also missing the utter failure of UTF-8 literals. They don't have a distinct type to narrow-character literals, that may have a totally unrelated (e.g. codepages) encoding. So not only did they not add any serious new facilities in C++11, they broke what little there was because now you can't even assume that a char* is in narrow-string-encoding for your platform unless UTF-8 is the narrow string encoding. So the new feature here is "We totally broke char-based strings on every platform where UTF-8 isn't the existing narrow string encoding".

> The standards committee has done a fantastic job in the last couple of > years moving C++ forward at a rapid pace. They're all smart people and > I assume they're well aware of the above shortcomings. Is there a > particular well-known reason that Unicode support remains so poor in > C++?

The Committee simply doesn't seem to give a shit about Unicode.

Also, many of the Unicode support algorithms are just that- algorithms. This means that to offer a decent interface, we need ranges. And we all know that the Committee can't figure out what they want w.r.t. ranges. The new Iterables thing from Eric Niebler may have a shot.

> Going forward, does anybody know of any proposals to rectify the > situation? A quick search on isocpp.org didn't seem to reveal > anything.

There was N3572, which I authored. But when I went to Bristol and presented it, there were a number of problems.

Firstly, it turns out that the Committee don't bother to feedback on non-Committee-member-authored proposals between meetings, resulting in months of lost work when you iterate on a design they don't want.

Secondly, it turns out that it's voted on by whoever happens to wander by at the time. This means that if your paper gets rescheduled, you have a relatively random bunch of people who may or may not know anything about the subject matter. Or indeed, anything at all.

Thirdly, for some reason they don't seem to view the current situation as a serious problem. You can get endless discussion about how exactly optional<T>'s comparison operations should be defined, but dealing with user input? Who cares about that?

Fourthly, each paper needs a champion, effectively, to present and maintain it. Given the previous issues, plus the fact that there's no way I could afford to travel to other meetings, it was certainly not going to be me, will not be me in the future unless you want to donate all my travel expenses and pay a salary on top, and nobody else seemed to care enough to put the effort in.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionTristan BrindleView Question on Stackoverflow
Solution 1 - C++n. 1.8e9-where's-my-share m.View Answer on Stackoverflow
Solution 2 - C++PuppyView Answer on Stackoverflow