What do I need to know about Unicode?

Unicode, Internationalization

Unicode Problem Overview


As an application developer, do I need to know Unicode?

Unicode Solutions


Solution 1 - Unicode

Unicode is a standard that defines numeric codes for the characters used in written communication. Or, as the Unicode Consortium puts it:

> The standard for digital representation of the characters used in writing all of the world's languages. Unicode provides a uniform means for storing, searching, and interchanging text in any language. It is used by all modern computers and is the foundation for processing text on the Internet. Unicode is developed and maintained by the Unicode Consortium.

There are many common, yet easily avoided, programming errors committed by developers who don't bother to educate themselves about Unicode and its encodings.

Some of the key concepts you should be aware of are:

  • Glyphs: the concrete graphics used to render written characters.
  • Composition: combining characters, such as a base letter and an accent mark, so that they display as a single glyph.
  • Encoding: converting Unicode code points to a stream of bytes.
  • Collation: locale-sensitive comparison of Unicode strings.
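Two of the concepts listed above, composition and encoding, can be sketched in a few lines of Python using only the standard library. The variable names are illustrative, and the sample character is an arbitrary choice, not something taken from the answer.

```python
import unicodedata

# "é" written two ways: one precomposed code point, or "e" plus a combining accent.
precomposed = "\u00e9"
decomposed = "e\u0301"

# They usually render identically but are different code point sequences.
same_before = precomposed == decomposed

# Normalizing both to NFC (the composed form) makes them compare equal.
same_after = (unicodedata.normalize("NFC", precomposed)
              == unicodedata.normalize("NFC", decomposed))

# Encoding turns code points into bytes; the bytes depend on the encoding chosen.
utf8_bytes = precomposed.encode("utf-8")
utf16_bytes = precomposed.encode("utf-16-be")

print(same_before, same_after, utf8_bytes, utf16_bytes)
```

Comparing strings without normalizing them first is one of the easily avoided errors: two strings that look the same on screen may not be equal.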

Solution 2 - Unicode

At the risk of just adding another link, unicode.org is a spectacular resource.

In short, it's a replacement for ASCII that's designed to handle, literally, every character ever used by humans. Unicode has several encoding schemes to handle all those characters. UTF-8, which is more or less the standard these days, uses a single byte for every ASCII character and two to four bytes for everything else, so it is byte-for-byte identical to ASCII for the first 128 code points.
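As a rough illustration of that ASCII compatibility, the following Python sketch compares the bytes produced for plain English text and for a short math expression. The sample strings are made up for the example.

```python
ascii_text = "Hello, world"

# Pure ASCII text produces exactly the same bytes under ASCII and UTF-8.
identical = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# A short expression with math symbols: x² ≥ 0
math_text = "x\u00b2 \u2265 0"
char_count = len(math_text)                  # number of code points
byte_count = len(math_text.encode("utf-8"))  # number of UTF-8 bytes

# ASCII simply cannot represent the math symbols.
try:
    math_text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(identical, char_count, byte_count, ascii_ok)
```

The character count and the byte count differ for the math text, which is why code that assumes one byte per character tends to break as soon as non-ASCII input shows up.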

(As an addendum, there's a popular misconception amongst programmers that you only need to know about Unicode if you're going to be doing internationalization. While that's certainly one use, it's not the only one. For example, I'm working on a project that will only ever use English text - but with a huge number of fancy math symbols. Moving the whole project over to be fully Unicode solved more problems than I can count.)

Solution 3 - Unicode

Unicode is an industry-agreed standard for consistently representing text, with the capacity to represent the world's writing systems. All developers need to know about it, as globalization is a growing concern.

Solution 4 - Unicode

One (open) source of code for handling Unicode is ICU, the International Components for Unicode. It includes ICU4J for Java and ICU4C for C and C++ (it presents a C interface but uses a C++ compiler).

Solution 5 - Unicode

Unicode is a character set that, unlike ASCII (which covers only the letters used in English: 128 characters, about a quarter of them non-printable control characters), has room for roughly 1.1 million code points, well over 100,000 of which are assigned. It includes characters for nearly every written language in use (Chinese, Russian, Greek, Arabic, etc.) and some languages you have probably never even heard of (even lots of dead-language symbols no longer in use, but useful for archiving ancient documents).

So instead of dealing with dozens of different character sets, you have one character set for all of them (which also makes it easier to mix characters from different languages within a single text string, as you don't need to switch the encoding somewhere in the middle of the string). There is still plenty of room left: most of the roughly 1.1 million available code points are unassigned, so the Unicode Consortium could easily add symbols for another 100 languages without even starting to fear running out of space.
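For a sense of scale, here is a small Python sketch that reports the size of the code space and estimates how much of it is assigned. The assigned count depends on the Unicode version bundled with the interpreter, so it is printed rather than stated here; treating the "Cn", "Co", and "Cs" categories as unassigned is a simplification made for this example.

```python
import sys
import unicodedata

# Highest code point is U+10FFFF, so the code space has this many slots.
code_space_size = sys.maxunicode + 1

# Count code points that are not unassigned (Cn), private use (Co), or surrogates (Cs).
assigned = sum(
    1
    for cp in range(code_space_size)
    if unicodedata.category(chr(cp)) not in ("Cn", "Co", "Cs")
)

print(f"code space: {code_space_size:,}")
print(f"assigned in Unicode {unicodedata.unidata_version}: {assigned:,}")
```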

Pretty much any book in any language you can find in a library today can be expressed in Unicode. Unicode is the name of the character set itself; how it is expressed as bytes is a different issue. There are several ways to encode Unicode characters: UTF-8 (one to four bytes per character, depending on the code point: English text is almost always one byte per character, accented Latin, Greek, and Cyrillic letters are typically two, and Chinese/Japanese characters are typically three), UTF-16 (most characters are two bytes, some rarely used ones are four), and UTF-32 (every character is four bytes). There are others, but these are the dominant ones.
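The byte counts described above can be checked with a short Python sketch. The helper `encoded_lengths` is a hypothetical name for this example, and the little-endian codec variants are used only so that no byte order mark is added to the output.

```python
def encoded_lengths(ch):
    """Return the number of bytes a single character takes in each encoding."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

samples = {
    "A": "Latin capital A",
    "\u00e9": "e with acute accent",
    "\u4e2d": "a common Chinese character",
    "\U0001F600": "an emoji outside the Basic Multilingual Plane",
}

for ch, description in samples.items():
    print(f"U+{ord(ch):04X} ({description}): {encoded_lengths(ch)}")
```

UTF-8 grows from one to four bytes across these samples, UTF-16 uses two bytes until the character falls outside the Basic Multilingual Plane, and UTF-32 stays at four throughout.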

Unicode is the default for many newer OSes (in Mac OS X almost anything is Unicode) and programming languages (Java strings are sequences of UTF-16 code units, and Python 3 strings are sequences of Unicode code points). If you ever plan to write an app that should display, store, or process anything other than plain English text, you'd better get used to Unicode, the sooner the better.

Solution 6 - Unicode

You don't need to learn Unicode in depth to use it; it's a hell of a complex standard. You just need to know the main issues and how your programming tools deal with them. To learn that, check Galwegian's link and the documentation for your programming language and IDE.

For example:

You can convert any character from Latin-1 to Unicode, but the reverse does not work for all characters. PHP lets you know that some functions (like stristr) do not work with Unicode. Python 2 declares a Unicode string this way: u"Hello World" (in Python 3, every string is already Unicode).

That's the kind of thing you must know.

Knowing that, if you do not have a GOOD reason not to use Unicode, then just use it.
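The Latin-1 point can be sketched in Python 3 as follows. The helper `to_latin1` is a hypothetical name for this example; returning `None` on failure is just one possible way to signal that a string does not fit.

```python
# Every possible Latin-1 byte decodes to a Unicode code point and back again.
all_bytes = bytes(range(256))
round_trips = all_bytes.decode("latin-1").encode("latin-1") == all_bytes

def to_latin1(text):
    """Encode text as Latin-1, or return None if some character does not fit."""
    try:
        return text.encode("latin-1")
    except UnicodeEncodeError:
        return None

fits = to_latin1("caf\u00e9")      # "café": every character exists in Latin-1
does_not_fit = to_latin1("\u20ac")  # the euro sign is not part of Latin-1

print(round_trips, fits, does_not_fit)
```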

Solution 7 - Unicode

Unicode is a standard that enumerates characters, and gives them unique numeric IDs (called "code points"). It includes a very large, and growing, set of characters for most modern written languages, and also a lot of exotic things like ancient Greek musical notation.
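To make the idea of code points concrete, here is a minimal Python sketch that looks up the numeric ID and official name of a few characters. The helper `describe` is a hypothetical name, and the sample characters are arbitrary.

```python
import unicodedata

def describe(ch):
    """Format a character's code point in U+XXXX notation with its Unicode name."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

for ch in ("A", "\u00e9", "\u20ac"):
    print(describe(ch))

# chr() goes the other way: from a numeric code point back to the character.
euro = chr(0x20AC)
```

Note that nothing here involves bytes: `ord` and `chr` deal only with the numeric IDs.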

Unlike other character encoding schemes (like ASCII or the ISO-8859 standards), Unicode does not say anything about representing these characters in bytes; it just gives a universal set of IDs to characters. So it is wrong to say that Unicode is "a 16-bit replacement for ASCII".

There are various encoding schemes that can represent arbitrary Unicode characters in bytes, including UTF-8, UTF-16, and others.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

| Content Type | Original Author | Original Content on Stackoverflow |
| --- | --- | --- |
| Question | yesraaj | View Question on Stackoverflow |
| Solution 1 - Unicode | erickson | View Answer on Stackoverflow |
| Solution 2 - Unicode | Electrons_Ahoy | View Answer on Stackoverflow |
| Solution 3 - Unicode | jmcd | View Answer on Stackoverflow |
| Solution 4 - Unicode | Jonathan Leffler | View Answer on Stackoverflow |
| Solution 5 - Unicode | Mecki | View Answer on Stackoverflow |
| Solution 6 - Unicode | e-satis | View Answer on Stackoverflow |
| Solution 7 - Unicode | eli | View Answer on Stackoverflow |