What's the difference between a character, a code point, a glyph and a grapheme?

String Problem Overview


Trying to understand the subtleties of modern Unicode is making my head hurt. In particular, the distinction between code points, characters, glyphs and graphemes - concepts which in the simplest case, when dealing with English text using ASCII characters, all have a one-to-one relationship with each other - is causing me trouble.

Seeing how these terms get used in documents like Mathias Bynens' "JavaScript has a Unicode problem" or Wikipedia's piece on Han unification, I've gathered that these concepts are not the same thing and that it's dangerous to conflate them, but I'm struggling to grasp what each term means.

The Unicode Consortium offers a glossary to explain this stuff, but it's full of "definitions" like this:

> Abstract Character. A unit of information used for the organization, control, or representation of textual data. ...
>
> ...
>
> Character. ... (2) Synonym for abstract character. (3) The basic unit of encoding for the Unicode character encoding. ...
>
> ...
>
> Glyph. (1) An abstract form that represents one or more glyph images. (2) A synonym for glyph image. In displaying Unicode character data, one or more glyphs may be selected to depict a particular character.
>
> ...
>
> Grapheme. (1) A minimally distinctive unit of writing in the context of a particular writing system. ...

Most of these definitions possess the quality of sounding very academic and formal, but lack the quality of meaning anything, or else defer the problem of definition to yet another glossary entry or section of the standard.

So I seek the arcane wisdom of those more learned than I. How exactly does each of these concepts differ from the others, and in what circumstances would they not have a one-to-one relationship with each other?

String Solutions


Solution 1 - String

  • Character is an overloaded term that can mean many things.

  • A code point is the atomic unit of information. Text is a sequence of code points. Each code point is a number which is given meaning by the Unicode standard.

  • A code unit is the unit of storage for part of an encoded code point. In UTF-8 a code unit is 8 bits; in UTF-16 it is 16 bits. A single code unit may represent a full code point, or only part of one. For example, the snowman glyph (☃, U+2603) is a single code point but 3 UTF-8 code units, and 1 UTF-16 code unit.

  • A grapheme is a sequence of one or more code points that are displayed as a single, graphical unit that a reader recognizes as a single element of the writing system. For example, both a and ä are graphemes, but they may consist of multiple code points (e.g. ä may be two code points, one for the base character a followed by one for the diaeresis; but there's also an alternative, legacy, single code point representing this grapheme). Some code points are never part of any grapheme (e.g. the zero-width non-joiner, or directional overrides).

  • A glyph is an image, usually stored in a font (which is a collection of glyphs), used to represent graphemes or parts thereof. Fonts may compose multiple glyphs into a single representation, for example, if the above ä is a single code point, a font may choose to render that as two separate, spatially overlaid glyphs. For OTF, the font's GSUB and GPOS tables contain substitution and positioning information to make this work. A font may contain multiple alternative glyphs for the same grapheme, too.
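The relationship between code points and code units described above can be checked directly. Here is a minimal Python sketch, assuming Python 3.3 or later, where `len()` on a `str` counts code points. The helper name `count_units` is made up for illustration, and the U+1F600 example is an addition that does not appear in the original answer.

```python
def count_units(text: str) -> dict:
    """Count code points and encoded code units for a piece of text."""
    return {
        # A Python str is a sequence of code points.
        "code_points": len(text),
        # Each UTF-8 code unit is one byte.
        "utf8_code_units": len(text.encode("utf-8")),
        # Each UTF-16 code unit is two bytes; "utf-16-le" avoids a BOM.
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
    }


snowman = "\u2603"       # SNOWMAN, inside the Basic Multilingual Plane
grinning = "\U0001F600"  # GRINNING FACE, outside the BMP (added example)

snowman_counts = count_units(snowman)
grinning_counts = count_units(grinning)

print(snowman_counts)
print(grinning_counts)
```

The snowman is one code point that needs three UTF-8 code units but only one UTF-16 code unit, while a code point outside the Basic Multilingual Plane needs four UTF-8 code units and a surrogate pair (two code units) in UTF-16.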

Solution 2 - String

Outside the Unicode standard a character is an individual unit of text composed of one or more graphemes. What the Unicode standard defines as "characters" is actually a mix of graphemes and characters. Unicode provides rules for the interpretation of juxtaposed graphemes as individual characters.

A Unicode code point is a unique number assigned to each Unicode character (which is either a character or a grapheme).

Unfortunately, the Unicode rules allow some juxtaposed graphemes to be interpreted as other graphemes that already have their own code points (precomposed forms). This means that there is more than one way in Unicode to represent a character. Unicode normalization addresses this issue.
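As a small illustration of the point about normalization, the sketch below uses Python's standard `unicodedata` module to compare the precomposed and decomposed spellings of "ä". The variable names are chosen for this example only.

```python
import unicodedata

precomposed = "\u00e4"  # LATIN SMALL LETTER A WITH DIAERESIS, one code point
decomposed = "a\u0308"  # "a" followed by COMBINING DIAERESIS, two code points

# The two strings look the same when rendered but are different
# code point sequences, so a naive comparison reports them as unequal.
naive_equal = precomposed == decomposed

# NFC composes sequences into precomposed forms where one exists;
# NFD decomposes precomposed forms into base + combining marks.
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed
nfd_equal = unicodedata.normalize("NFD", precomposed) == decomposed

print(len(precomposed), len(decomposed), naive_equal, nfc_equal, nfd_equal)
```

Normalizing both sides to the same form before comparing is the usual way to make the two spellings compare equal.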

A glyph is the visual representation of a character. A font provides a set of glyphs for a certain set of characters (not Unicode characters). For every character, there is an infinite number of possible glyphs.

A Reply to Mark Amery

First, as I stated, there is an infinite number of possible glyphs for each character so no, a character is not "always represented by a single glyph". Unicode doesn't concern itself much with glyphs, and the things it defines in its code charts are certainly not glyphs. The problem is that neither are they all characters. So what are they?

Which is the greater entity, the grapheme or the character? What does one call those graphic elements in text that are not letters or punctuation? One term that springs quickly to mind is "grapheme". It's a word that precisely conjures up the idea of "a graphical unit in a text". I offer this definition: A grapheme is the smallest distinct component in a written text.

One could go the other way and say that graphemes are composed of characters, but then they would be called "Chinese graphemes", and all those bits and pieces Chinese graphemes are composed of would have to be called "characters" instead. However, that's all backwards. Graphemes are the distinct little bits and pieces. Characters are more developed. The phrase "glyphs are composable", would be better stated in the context of Unicode as "characters are composable".

Unicode defines characters but it also defines graphemes that are to be composed with other graphemes or characters. Those monstrosities you composed are a fine example of this. If they catch on maybe they'll get their own code points in a later version of Unicode ;)

There's a recursive element to all this. At higher levels, graphemes become characters become graphemes, but it's graphemes all the way down.

A Reply to T S

Chapter 1 of the standard states: "The Unicode character encoding treats alphabetic characters, ideographic characters, and symbols equivalently, which means they can be used in any mixture and with equal facility". Given this statement, we should be prepared for some conflation of terms in the standard. Sometimes the proper terminology only becomes clear in retrospect as a standard develops.

It often happens in formal definitions of a language that two fundamental things are defined in terms of each other. For example, in XML an element is defined as a starting tag possibly followed by content, followed by an ending tag. Content is defined in turn as either an element, character data, or a few other possible things. A pattern of self-referential definitions is also implicit in the Unicode standard:

> A grapheme is a code point or a character.
>
> A character is composed from a sequence of one or more graphemes.

When first confronted with these two definitions the reader might object to the first definition on the grounds that a code point is a character, but that's not always true. A sequence of two code points sometimes encodes a single code point under normalization, and that encoded code point represents the character, as illustrated in figure 2.7. Sequences of code points that encode other code points: this is getting a little tricky, and we haven't even reached the layer where character encoding schemes such as UTF-8 are used to encode code points into byte sequences.

In some contexts, for example a scholarly article on diacritics, an individual part of a character might show up in the text by itself. In that context, the individual character part could be considered a character, so it makes sense that the Unicode standard remain flexible as well.

As Mark Amery pointed out, a character can be composed into a more complex thing. That is, each character can serve as a grapheme if desired. The final result of all composition is a thing that "the user thinks of as a character". There doesn't seem to be any real resistance, either in the standard or in this discussion, to the idea that at the highest level there are these things in the text that the user thinks of as individual characters. To avoid overloading that term, we can use "grapheme" in all cases where we want to refer to parts used to compose a character.

At times the Unicode standard is all over the place with its terminology. For example, Chapter 3 defines UTF-8 as an "encoding form" whereas the glossary defines "encoding form" as something else, and UTF-8 as a "Character Encoding Scheme". Another example is "Grapheme_Base" and "Grapheme_Extend", which are acknowledged to be mistakes but that persist because purging them is a bit of a task. There is still work to be done to tighten up the terminology employed by the standard.

The Proposal for addition of COMBINING GRAPHEME JOINER got it wrong when it stated that "Graphemes are sequences of one or more encoded characters that correspond to what users think of as characters." It should instead read, "A sequence of one or more graphemes composes what the user thinks of as a character." Then it could use the term "grapheme sequence" distinctly from the term "character sequence". Both terms are useful. "grapheme sequence" neatly implies the process of building up a character from smaller pieces. "character sequence" means what we all typically intuit it to mean: "A sequence of things the user thinks of as characters."

Sometimes a programmer really does want to operate at the level of grapheme sequences, so mechanisms to inspect and manipulate those sequences should be available, but generally, when processing text, it is sufficient to operate on "character sequences" (what the user thinks of as a character) and let the system manage the lower-level details.
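To make the two levels concrete, here is a rough Python sketch that inspects the individual pieces of a string and then groups them into approximate "things the user thinks of as characters". The function `approx_user_characters` is hypothetical and deliberately simplistic: it only attaches code points with a non-zero canonical combining class to the preceding base, and it is not an implementation of the Unicode text segmentation rules (UAX #29), so it will mis-split emoji sequences, Hangul jamo, and other cases.

```python
import unicodedata


def approx_user_characters(text: str) -> list[str]:
    """Group each base code point with the combining marks that follow it.

    Approximation only: relies on the canonical combining class and
    ignores the full grapheme cluster rules of UAX #29.
    """
    units = []
    for ch in text:
        if units and unicodedata.combining(ch) != 0:
            # Combining mark: attach it to the previous unit.
            units[-1] += ch
        else:
            # Anything else starts a new unit.
            units.append(ch)
    return units


word = "nai\u0308ve"  # "naive" with a decomposed i + COMBINING DIAERESIS

# Lower level: the individual pieces, by their Unicode names.
pieces = [unicodedata.name(ch) for ch in word]

# Higher level: what a reader would count as characters (approximately).
units = approx_user_characters(word)

print(len(word), pieces)
print(len(units), units)
```

In the terminology of this answer, `pieces` corresponds to the lower-level sequence and `units` to the "character sequence"; production code would normally rely on a library that implements the full segmentation rules.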

In every case covered so far in this discussion, it's cleaner to use "grapheme" to refer to the indivisible components and "character" to refer to the composed entity. This usage also better reflects the long-established meanings of both terms.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

| Content Type | Original Author | Original Content on Stackoverflow |
| --- | --- | --- |
| Question | Mark Amery | View Question on Stackoverflow |
| Solution 1 - String | Kerrek SB | View Answer on Stackoverflow |
| Solution 2 - String | Poor Yorick | View Answer on Stackoverflow |