What's the difference between Unicode and UTF-8?


Unicode Problem Overview


Consider:

[image not available]

Is it true that Unicode = UTF-16?

Many people say that Unicode is a standard, not an encoding, but most editors support save as 'Unicode' encoding actually.

Unicode Solutions


Solution 1 - Unicode

As Rasmus states in his article "The difference between UTF-8 and Unicode?":

> If asked the question, "What is the difference between UTF-8 and Unicode?", would you confidently reply with a short and precise answer? In these days of internationalization all developers should be able to do that. I suspect many of us do not understand these concepts as well as we should. If you feel you belong to this group, you should read this ultra short introduction to character sets and encodings.
>
> Actually, comparing UTF-8 and Unicode is like comparing apples and oranges:
>
> UTF-8 is an encoding - Unicode is a character set
>
> A character set is a list of characters with unique numbers (these numbers are sometimes referred to as "code points"). For example, in the Unicode character set, the number for A is 41 [hexadecimal, written U+0041; 65 in decimal].
>
> An encoding on the other hand, is an algorithm that translates a list of numbers to binary so it can be stored on disk. For example UTF-8 would translate the number sequence 1, 2, 3, 4 like this:
>
>     00000001 00000010 00000011 00000100
>
> Our data is now translated into binary and can now be saved to disk.
>
> All together now
>
> Say an application reads the following from the disk:
>
>     1101000 1100101 1101100 1101100 1101111
>
> The app knows this data represent a Unicode string encoded with UTF-8 and must show this as text to the user. First step, is to convert the binary data to numbers. The app uses the UTF-8 algorithm to decode the data. In this case, the decoder returns this:
>
>     104 101 108 108 111
>
> Since the app knows this is a Unicode string, it can assume each number represents a character. We use the Unicode character set to translate each number to a corresponding character. The resulting string is "hello".
>
> Conclusion
>
> So when somebody asks you "What is the difference between UTF-8 and Unicode?", you can now confidently answer short and precise:
>
> UTF-8 (Unicode Transformation Format) and Unicode cannot be compared. UTF-8 is an encoding used to translate numbers into binary data. Unicode is a character set used to translate characters into numbers.
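The split between "character set" and "encoding" described in the quote can be tried out directly. The following is a minimal Python sketch, not part of the quoted article; the variable names are made up for illustration. It also shows a non-ASCII character, where the code point and the stored bytes stop being the same number.

```python
# Character set: each character has a number (a code point).
code_point = ord("A")  # 65 in decimal, 0x41 in hexadecimal, written U+0041

# Encoding: UTF-8 turns code points into bytes that can be stored on disk.
hello_bytes = "hello".encode("utf-8")
hello_numbers = list(hello_bytes)  # the byte values as integers

# Decoding reverses the process: bytes -> code points -> characters.
decoded = hello_bytes.decode("utf-8")

# Outside ASCII, the code point and the UTF-8 bytes are different numbers.
e_acute = "\u00e9"  # the character U+00E9 (decimal 233)
e_acute_bytes = list(e_acute.encode("utf-8"))  # two bytes for one code point

print(hex(code_point), hello_numbers, decoded, e_acute_bytes)
```

For plain ASCII text such as "hello" the byte values and the code points coincide, which is why the two concepts are so easy to confuse.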

Solution 2 - Unicode

> most editors support save as ‘Unicode’ encoding actually.

This is an unfortunate misnaming perpetrated by Windows.

Because Windows uses UTF-16LE encoding internally as the memory storage format for Unicode strings, it considers this to be the natural encoding of Unicode text. In the Windows world, there are ANSI strings (the system codepage on the current machine, subject to total unportability) and there are Unicode strings (stored internally as UTF-16LE).

This was all devised in the early days of Unicode, before we realised that UCS-2 wasn't enough, and before UTF-8 was invented. This is why Windows's support for UTF-8 is all-round poor.

This misguided naming scheme became part of the user interface. A text editor that uses Windows's encoding support to provide a range of encodings will automatically and inappropriately describe UTF-16LE as “Unicode”, and UTF-16BE, if provided, as “Unicode big-endian”.

(Other editors that do encodings themselves, like Notepad++, don't have this problem.)

If it makes you feel any better about it, ‘ANSI’ strings aren't based on any ANSI standard, either.
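To make the naming issue in this answer concrete, here is a small Python sketch that encodes the same two-character string in the encodings Windows labels "Unicode" and "Unicode big-endian", next to UTF-8. The choice of cp1252 as the "ANSI" code page is an assumption for illustration only; the actual system code page depends on the machine.

```python
text = "h\u00e9"  # "h" followed by U+00E9

as_utf16_le = text.encode("utf-16-le")  # what Windows labels "Unicode"
as_utf16_be = text.encode("utf-16-be")  # what Windows labels "Unicode big-endian"
as_utf8 = text.encode("utf-8")
as_cp1252 = text.encode("cp1252")       # one possible "ANSI" code page (assumed)

def hex_dump(data):
    """Format bytes as space-separated two-digit hex values."""
    return " ".join(f"{b:02x}" for b in data)

for label, data in [
    ("UTF-16LE", as_utf16_le),
    ("UTF-16BE", as_utf16_be),
    ("UTF-8", as_utf8),
    ("cp1252", as_cp1252),
]:
    print(f"{label:9} {hex_dump(data)}")
```

All four byte sequences represent the same two Unicode characters; only the encoding differs.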

Solution 3 - Unicode

It's not that simple.

UTF-16 is a 16-bit, variable-width encoding. Simply calling something "Unicode" is ambiguous, since "Unicode" refers to an entire set of standards for character encoding. Unicode is not an encoding!

http://en.wikipedia.org/wiki/Unicode#Unicode_Transformation_Format_and_Universal_Character_Set

and of course, the obligatory Joel On Software - The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) link.
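The "variable-width" part of the UTF-16 description above can be checked with a short Python sketch. The helper name `utf16_code_units` is hypothetical; it simply splits the UTF-16 bytes of one character into 16-bit code units.

```python
import struct

def utf16_code_units(ch):
    """Return the 16-bit code units that UTF-16 uses for one character."""
    data = ch.encode("utf-16-be")
    return list(struct.unpack(f">{len(data) // 2}H", data))

bmp_char = "A"              # U+0041, inside the Basic Multilingual Plane
astral_char = "\U0001F600"  # U+1F600, outside the Basic Multilingual Plane

# One code unit for the first character, a surrogate pair for the second.
print([hex(u) for u in utf16_code_units(bmp_char)])
print([hex(u) for u in utf16_code_units(astral_char)])
```

A character beyond U+FFFF needs two 16-bit code units (a surrogate pair), so UTF-16 is not a fixed two bytes per character.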

Solution 4 - Unicode

There's a lot of misunderstanding being displayed here. Unicode isn't an encoding, but the Unicode standard is devoted primarily to encoding anyway.

ISO 10646 is the international character set you (probably) care about. It defines a mapping between a set of named characters (e.g., "Latin small letter a" or "Greek small letter alpha") and a set of code points (a number assigned to each -- for example, 61 hexadecimal and 3B1 hexadecimal for those two respectively; for Unicode code points, the standard notation would be U+0061 and U+03B1).

At one time, Unicode defined its own character set, more or less as a competitor to ISO 10646. That was a 16-bit character set, but it was not UTF-16; it was known as UCS-2. It included a rather controversial technique to try to keep the number of necessary characters to a minimum (Han Unification -- basically treating Chinese, Japanese and Korean characters that were quite a bit alike as being the same character).

Since then, the Unicode Consortium has tacitly admitted that that wasn't going to work, and now concentrates primarily on ways to encode the ISO 10646 character set. The primary methods are UTF-8, UTF-16 and UCS-4 (aka UTF-32). UTF-16 and UTF-32 also have LE (little-endian) and BE (big-endian) variants.

By itself, "Unicode" could refer to almost any of the above (though we can probably eliminate the others that it shows explicitly, such as UTF-8). Unqualified use of "Unicode" probably happens the most often on Windows, where it will almost certainly refer to UTF-16. Early versions of Windows NT adopted Unicode when UCS-2 was current. After UCS-2 was declared obsolete (around Win2k, if memory serves), they switched to UTF-16, which is the most similar to UCS-2 (in fact, it's identical for characters in the "basic multilingual plane", which covers a lot, including all the characters for most Western European languages).
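As a rough illustration of the mapping this answer describes, the Python sketch below looks up the character names for U+0061 and U+03B1 and counts how many bytes each takes in the three encoding forms. The helper `describe` is a made-up name; the names come from Python's bundled `unicodedata` database.

```python
import unicodedata

def describe(ch):
    """Return (code point, character name, UTF-8 size, UTF-16 size, UTF-32 size)."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),
        len(ch.encode("utf-32-le")),
    )

for ch in ["a", "\u03b1"]:
    print(describe(ch))

# Inside the Basic Multilingual Plane, the single UTF-16 code unit is the
# code point itself, which is why UTF-16 matches UCS-2 for such characters.
alpha_unit = int.from_bytes("\u03b1".encode("utf-16-be"), "big")
print(hex(alpha_unit))
```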

Solution 5 - Unicode

UTF-16 and UTF-8 are both encodings of Unicode. They are both Unicode; one is not more Unicode than the other.

Don't let an unfortunate historical artifact from Microsoft confuse you.

Solution 6 - Unicode

> The development of Unicode was aimed at creating a new standard for mapping the characters in a great majority of languages that are being used today, along with other characters that are not that essential but might be necessary for creating the text. UTF-8 is only one of the many ways that you can encode the files because there are many ways you can encode the characters inside a file into Unicode.

Source:

http://www.differencebetween.net/technology/difference-between-unicode-and-utf-8/

Solution 7 - Unicode

In addition to Trufa's comment, Unicode explicitly isn't UTF-16. When they were first looking into Unicode, it was speculated that a 16-bit integer might be enough to store any code, but in practice that turned out not to be the case. However, UTF-16 is another valid encoding of Unicode - alongside the 8-bit and 32-bit variants - and I believe is the encoding that Microsoft use in memory at runtime on the NT-derived operating systems.

Solution 8 - Unicode

Let's start by keeping in mind that data is stored as bytes. Unicode is a character set in which characters are mapped to code points (unique integers), and we need something to translate those code points into bytes. That's where UTF-8, a so-called encoding, comes in. Simple!
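The translation from code points to bytes can be sketched by hand. The function below is an illustrative Python version of the UTF-8 bit layout, written for this page rather than taken from any library; in practice `str.encode("utf-8")` does the same job, and the loop at the end checks the sketch against it.

```python
def encode_utf8(code_point):
    """Encode one Unicode code point (an integer) as UTF-8 bytes."""
    if code_point < 0 or code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode scalar value")
    if code_point < 0x80:        # 1 byte:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([
            0xC0 | (code_point >> 6),
            0x80 | (code_point & 0x3F),
        ])
    if code_point < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (code_point >> 12),
            0x80 | ((code_point >> 6) & 0x3F),
            0x80 | (code_point & 0x3F),
        ])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (code_point >> 18),
        0x80 | ((code_point >> 12) & 0x3F),
        0x80 | ((code_point >> 6) & 0x3F),
        0x80 | (code_point & 0x3F),
    ])

# Compare the hand-written encoder with Python's built-in one.
for ch in ["A", "\u03b1", "\u20ac", "\U0001F600"]:
    assert encode_utf8(ord(ch)) == ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {encode_utf8(ord(ch)).hex()}")
```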

Solution 9 - Unicode

It's weird. Unicode is a standard, not an encoding. As it is possible to specify the endianness, I guess it's effectively UTF-16 or maybe UTF-32.

Where does this menu come from?
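On the endianness point in this answer: byte order only matters for encodings whose code units span more than one byte, such as UTF-16 and UTF-32. A file can announce its byte order with a byte order mark (BOM). The Python sketch below builds two tiny in-memory "files" (the variable names are made up) and lets the generic `utf-16` codec work out the byte order from the BOM.

```python
import codecs

payload = "A"

# The same text stored in both byte orders, each prefixed with its BOM.
le_file = codecs.BOM_UTF16_LE + payload.encode("utf-16-le")
be_file = codecs.BOM_UTF16_BE + payload.encode("utf-16-be")

# The generic "utf-16" codec reads the BOM, picks the byte order, and drops the BOM.
decoded_le = le_file.decode("utf-16")
decoded_be = be_file.decode("utf-16")

print(le_file.hex(), be_file.hex(), decoded_le, decoded_be)
```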

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

| Content Type | Original Author | Original Content on Stackoverflow |
| --- | --- | --- |
| Question | ollydbg | View Question on Stackoverflow |
| Solution 1 - Unicode | vikas devde | View Answer on Stackoverflow |
| Solution 2 - Unicode | bobince | View Answer on Stackoverflow |
| Solution 3 - Unicode | Matt Ball | View Answer on Stackoverflow |
| Solution 4 - Unicode | Jerry Coffin | View Answer on Stackoverflow |
| Solution 5 - Unicode | Mark Ransom | View Answer on Stackoverflow |
| Solution 6 - Unicode | Trufa | View Answer on Stackoverflow |
| Solution 7 - Unicode | Tommy | View Answer on Stackoverflow |
| Solution 8 - Unicode | mrehan | View Answer on Stackoverflow |
| Solution 9 - Unicode | MatTheCat | View Answer on Stackoverflow |