What's the complete range for Chinese characters in Unicode?


Unicode Problem Overview


U+4E00..U+9FFF is part of the complete set, but not all of it.

Unicode Solutions


Solution 1 - Unicode

You may find a complete list through the CJK Unicode FAQ (which covers "Chinese, Japanese, and Korean" characters).

The "East Asian Script" document does mention:

> Blocks Containing Han Ideographs
>
> Han ideographic characters are found in five main blocks of the Unicode Standard, as shown in Table 12-2

Table 12-2. Blocks Containing Han Ideographs

Block                                   Range       Comment
CJK Unified Ideographs                  4E00–9FFF   Common
CJK Unified Ideographs Extension A      3400–4DBF   Rare
CJK Unified Ideographs Extension B      20000–2A6DF Rare, historic
CJK Unified Ideographs Extension C      2A700–2B73F Rare, historic
CJK Unified Ideographs Extension D      2B740–2B81F Uncommon, some in current use
CJK Unified Ideographs Extension E      2B820–2CEAF Rare, historic
CJK Compatibility Ideographs            F900–FAFF   Duplicates, unifiable variants, corporate characters
CJK Compatibility Ideographs Supplement 2F800–2FA1F Unifiable variants

Note: the block ranges can evolve over time; the latest are listed in CJK Unified Ideographs.

See also Wikipedia:

Solution 2 - Unicode

Unicode currently has 74605 CJK characters. CJK characters include not only the characters used in Chinese, but also Japanese Kanji, Korean Hanja, and Vietnamese Chu Nom. Some CJK characters are not Chinese characters.

1) 20941 characters from the CJK Unified Ideographs block.

Code points U+4E00 to U+9FCC.

  1. U+4E00 - U+62FF
  2. U+6300 - U+77FF
  3. U+7800 - U+8CFF
  4. U+8D00 - U+9FCC
2) 6582 characters from the CJKUI Ext A block.

Code points U+3400 to U+4DB5. Unicode 3.0 (1999).

3) 42711 characters from the CJKUI Ext B block.

Code points U+20000 to U+2A6D6. Unicode 3.1 (2001).

  1. U+20000 - U+215FF
  2. U+21600 - U+230FF
  3. U+23100 - U+245FF
  4. U+24600 - U+260FF
  5. U+26100 - U+275FF
  6. U+27600 - U+290FF
  7. U+29100 - U+2A6DF
4) 4149 characters from the CJKUI Ext C block.

Code points U+2A700 to U+2B734. Unicode 5.2 (2009).

5) 222 characters from the CJKUI Ext D block.

Code points U+2B740 to U+2B81D. Unicode 6.0 (2010).

6) CJKUI Ext E block.

Coming soon
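The per-block counts above follow from the first and last assigned code points quoted in this answer (last - first + 1). Here is a small JavaScript sketch to check the arithmetic. The `assigned` table and its field names are only illustrative, and the figures reflect the Unicode version this answer describes, not the current standard.

```javascript
// First and last assigned code points as quoted in this answer.
// These reflect an older Unicode version; newer versions assign more.
const assigned = [
  { name: "CJK Unified Ideographs", first: 0x4E00, last: 0x9FCC },
  { name: "CJKUI Ext A", first: 0x3400, last: 0x4DB5 },
  { name: "CJKUI Ext B", first: 0x20000, last: 0x2A6D6 },
  { name: "CJKUI Ext C", first: 0x2A700, last: 0x2B734 },
  { name: "CJKUI Ext D", first: 0x2B740, last: 0x2B81D },
];

// Size of an inclusive range: last - first + 1.
const counts = assigned.map(({ name, first, last }) => ({
  name,
  count: last - first + 1,
}));

const total = counts.reduce((sum, { count }) => sum + count, 0);

for (const { name, count } of counts) {
  console.log(`${name}: ${count}`);
}
console.log(`Total: ${total}`); // Total: 74605
```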

If the above is not spaghetti enough, take a look at known issues. Have fun =)

Solution 3 - Unicode

The exact ranges for Chinese characters (except the extensions) are [\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC\uF900-\uFAAD].

  1. [\u2e80-\u2fd5]

> CJK Radicals Supplement is a Unicode block containing alternative, often positional, forms of the Kangxi radicals. They are used as headers in dictionary indices and other CJK ideograph collections organized by radical-stroke.

(This range also spans the Kangxi Radicals block, U+2F00 to U+2FD5.)

  2. [\u3190-\u319f]

> Kanbun is a Unicode block containing annotation characters used in Japanese copies of classical Chinese texts, to indicate reading order.

  3. [\u3400-\u4DBF]

> CJK Unified Ideographs Extension-A is a Unicode block containing rare Han ideographs.

  4. [\u4E00-\u9FCC]

> CJK Unified Ideographs is a Unicode block containing the most common CJK ideographs used in modern Chinese and Japanese.

  5. [\uF900-\uFAAD]

> CJK Compatibility Ideographs is a Unicode block created to contain Han characters that were encoded in multiple locations in other established character encodings, in addition to their CJK Unified Ideographs assignments, in order to retain round-trip compatibility between Unicode and those encodings.

For details, please refer to here; the extensions are covered in other answers.
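As a sketch of how the character class above might be used, assuming JavaScript (the answer does not name a language): the `\uXXXX` escapes work as written for these BMP ranges, while the supplementary-plane extensions from the other answers need the `u` flag and the `\u{...}` form. The names `bmpHan` and `withExtB` are only illustrative.

```javascript
// BMP-only character class, copied from the answer above.
const bmpHan = /[\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC\uF900-\uFAAD]/;

// The same class plus Extension B (U+20000 to U+2A6DF) as an example of a
// supplementary-plane range. The `u` flag makes the regex operate on
// code points instead of UTF-16 code units.
const withExtB = /[\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC\uF900-\uFAAD\u{20000}-\u{2A6DF}]/u;

console.log(bmpHan.test("\u4E2D"));      // true  (U+4E2D is in 4E00-9FCC)
console.log(bmpHan.test("a"));           // false
console.log(withExtB.test("\u{20000}")); // true
console.log(bmpHan.test("\u{20000}"));   // false (only sees surrogate halves)
```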

Solution 4 - Unicode

Unicode version 11.0.0

In Unicode the Chinese, Japanese and Korean (CJK) scripts share a common background, collectively known as CJK characters.

These ranges often contain unassigned or reserved code points (such as U+2E9A and U+2EF4 to U+2EFF).

Chinese characters

bottom	top	    reference (also have a look at wiki page)	block name
4E00	9FEF	http://www.unicode.org/charts/PDF/U4E00.pdf	CJK Unified Ideographs
3400	4DBF	http://www.unicode.org/charts/PDF/U3400.pdf	CJK Unified Ideographs Extension A
20000	2A6DF	http://www.unicode.org/charts/PDF/U20000.pdf	CJK Unified Ideographs Extension B
2A700	2B73F	http://www.unicode.org/charts/PDF/U2A700.pdf	CJK Unified Ideographs Extension C
2B740	2B81F	http://www.unicode.org/charts/PDF/U2B740.pdf	CJK Unified Ideographs Extension D
2B820	2CEAF	http://www.unicode.org/charts/PDF/U2B820.pdf	CJK Unified Ideographs Extension E
2CEB0	2EBEF	https://www.unicode.org/charts/PDF/U2CEB0.pdf	CJK Unified Ideographs Extension F
3007	3007	https://zh.wiktionary.org/wiki/%E3%80%87	in block CJK Symbols and Punctuation
		        

  • In the CJK Unified Ideographs block, many answers use the upper bound 9FCC, but U+9FCD (鿍) is also a Chinese character. All characters in this block are Chinese characters (also used in Japanese, Korean, etc.).
  • Most characters in the CJK Unified Ideographs extensions are traditional Chinese characters that are rarely used in China. Extension F is the exception: only 17% of its characters are Chinese characters.
  • 〇 is the Chinese character form of zero and is still in use today.

Therefore the range is

> [0x3007,0x3007],[0x3400,0x4DBF],[0x4E00,0x9FEF],[0x20000,0x2EBEF]
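These ranges translate directly into a code point check. A minimal JavaScript sketch follows. The names `chineseRanges`, `isChineseCodePoint`, and `isChineseText` are hypothetical, and the last upper bound is taken from the Extension F row of the table (2EBEF). Note that the merged range from 0x20000 upward also admits the unassigned gaps between the extension blocks.

```javascript
// Inclusive [first, last] code point ranges from this answer (Unicode 11.0).
const chineseRanges = [
  [0x3007, 0x3007],   // ideographic number zero
  [0x3400, 0x4DBF],   // Extension A
  [0x4E00, 0x9FEF],   // CJK Unified Ideographs
  [0x20000, 0x2EBEF], // Extensions B through F, including gaps between blocks
];

// True if the code point falls inside any of the ranges.
function isChineseCodePoint(cp) {
  return chineseRanges.some(([first, last]) => cp >= first && cp <= last);
}

// True if every code point of a non-empty string is in range.
// Spreading a string iterates by code point, not by UTF-16 code unit.
function isChineseText(text) {
  return text.length > 0 &&
    [...text].every((ch) => isChineseCodePoint(ch.codePointAt(0)));
}

console.log(isChineseText("\u4E2D\u6587"));  // true
console.log(isChineseText("\u3007"));        // true
console.log(isChineseText("\u4E2D\u6587!")); // false
```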

CJK characters but never used in Chinese

They are Common Han used only for compatibility.

It is almost impossible to see them appear in any Chinese books, articles, writings etc.

All characters here have one corresponding glyph-identical Chinese character, such as 金 (U+F90A) and 金 (U+91D1), which are identical glyphs.

F900	FAFF	https://www.unicode.org/charts/PDF/UF900.pdf	CJK Compatibility Ideographs
2F800	2FA1F	https://www.unicode.org/charts/PDF/U2F800.pdf	CJK Compatibility Ideographs Supplement
2E80	2EFF	http://www.unicode.org/charts/PDF/U2E80.pdf	CJK Radicals Supplement
2F00	2FDF	http://www.unicode.org/charts/PDF/U2F00.pdf	Kangxi Radicals 
2FF0	2FFF	https://unicode.org/charts/PDF/U2FF0.pdf	Ideographic Description Character
3000	303F 	https://www.unicode.org/charts/PDF/U3000.pdf	CJK Symbols and Punctuation
3100	312f	https://unicode.org/charts/PDF/U3100.pdf	Bopomofo
31A0	31BF	https://unicode.org/charts/PDF/U31A0.pdf	Bopomofo Extended
31C0	31EF	http://www.unicode.org/charts/PDF/U31C0.pdf	CJK Strokes
3200	32FF	https://unicode.org/charts/PDF/U3200.pdf	Enclosed CJK Letters and Months
3300	33FF	https://unicode.org/charts/PDF/U3300.pdf	CJK Compatibility
FE30	FE4F	https://www.unicode.org/charts/PDF/UFE30.pdf	CJK Compatibility Forms
FF00	FFEF	https://www.unicode.org/charts/PDF/UFF00.pdf	Halfwidth and Fullwidth Forms
1F200	1F2FF	https://www.unicode.org/charts/PDF/U1F200.pdf	Enclosed Ideographic Supplement
  • Some blocks, such as Hangul Compatibility Jamo, are excluded because they have no relation to Chinese.
  • Kangxi Radicals are not Chinese characters; they are graphical components of Chinese characters, used specifically to represent radicals, e.g. ⼻ (U+2F3B) versus 彳 (U+5F73), and ⻜ (U+2EDC) versus 飞 (U+98DE).
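The relationship between these compatibility code points and the ordinary ideographs can be observed through Unicode normalization. Here is a minimal sketch using JavaScript's built-in `String.prototype.normalize`, assuming an engine with standard Unicode data. CJK Compatibility Ideographs carry canonical mappings, so NFC already replaces them. Kangxi Radicals carry compatibility mappings, so they only change under NFKC. The helper `toCodePoints` is hypothetical.

```javascript
// Format a string's code points as "U+XXXX" for display.
function toCodePoints(text) {
  return [...text]
    .map((ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"))
    .join(" ");
}

const compatIdeograph = "\uF90A"; // CJK Compatibility Ideograph from the text
const kangxiRadical = "\u2F3B";   // Kangxi Radical from the text

console.log(toCodePoints(compatIdeograph.normalize("NFC"))); // U+91D1
console.log(toCodePoints(kangxiRadical.normalize("NFC")));   // U+2F3B
console.log(toCodePoints(kangxiRadical.normalize("NFKC")));  // U+5F73
```

If text is normalized before matching, the compatibility ranges may not need to be listed at all; whether that is acceptable depends on whether the original code points must round-trip.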

Other common punctuation appearing in Chinese

This is a wide range: some of this punctuation may never be used, while marks such as ……”“ are used very frequently in Chinese.

0000	007F	https://unicode.org/charts/PDF/U0000.pdf	C0 Controls and Basic Latin 
2000	206F	https://unicode.org/charts/PDF/U2000.pdf	General Punctuation
……

There are also many Chinese-related symbols, such as Yijing Hexagram Symbols or Kanbun, but those are off-topic. I list the non-Chinese characters in CJK to better explain what Chinese characters are. The ranges above already cover almost all characters that appear in Chinese writing, except math and other specialty notation.

Supplementary

CJK Symbols and Punctuation

 、。〃〄々〆〇〈〉《》「」『』【】〒〓〔〕〖〗〘〙〚〛〜〝〞〟〠〡〢〣〤〥〦〧〨〩〪〭〮〯〫〬〰〱〲〳〴〵〶〷〸〹〺〻〼〽 〾 〿

Halfwidth and Fullwidth Forms

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~⦅⦆。「」、・ヲァィゥェォャュョッーアイウエオカキクケコサシスセソタチツテトナニヌネノハヒフヘホマミムメモヤユヨラリルレロワン゙゚ᄀᄁᆪᄂᆬᆭᄃᄄᄅᆰᆱᆲᆳᆴᆵᄚᄆᄇᄈᄡᄉᄊᄋᄌᄍᄎᄏᄐᄑ하ᅢᅣᅤᅥᅦᅧᅨᅩᅪᅫᅬᅭᅮᅯᅰᅱᅲᅳᅴᅵ¢£¬ ̄¦¥₩│←↑→↓■○

References

  1. https://zh.wikipedia.org/wiki/%E6%B1%89%E5%AD%97 (in Chinese; note the right sidebar)
  2. https://zh.wikipedia.org/wiki/%E4%B8%AD%E6%97%A5%E9%9F%93%E7%9B%B8%E5%AE%B9%E8%A1%A8%E6%84%8F%E6%96%87%E5%AD%97 (note the table at the bottom)
  3. http://www.unicode.org

Solution 5 - Unicode

The Unicode code blocks that the other answers gave certainly cover most of the Chinese Unicode characters, but check out some of these other code blocks, too.

CJK_UNIFIED_IDEOGRAPHS
CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A
CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B
CJK_UNIFIED_IDEOGRAPHS_EXTENSION_C
CJK_UNIFIED_IDEOGRAPHS_EXTENSION_D
CJK_UNIFIED_IDEOGRAPHS_EXTENSION_E
CJK_COMPATIBILITY
CJK_COMPATIBILITY_FORMS
CJK_COMPATIBILITY_IDEOGRAPHS
CJK_COMPATIBILITY_IDEOGRAPHS_SUPPLEMENT
CJK_RADICALS_SUPPLEMENT
CJK_STROKES
CJK_SYMBOLS_AND_PUNCTUATION
ENCLOSED_CJK_LETTERS_AND_MONTHS
ENCLOSED_IDEOGRAPHIC_SUPPLEMENT
KANGXI_RADICALS
IDEOGRAPHIC_DESCRIPTION_CHARACTERS

See my fuller discussion here. And this site is convenient for browsing Unicode.

Solution 6 - Unicode

Unicode continually evolves, with the current goal to have "A new major version of the standard will be released each year. Starting with Unicode 14.0, each of those releases is targeted for the third quarter of each year."

There is no single community wiki that someone regularly updates, so if you want to keep up with corrections and additional extensions, be sure to also double-check the latest standard, always found at: https://www.unicode.org/versions/latest/ Look for the East Asia chapter (unless that one day gets split as well).
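One way to reduce this maintenance burden in code is to avoid hard-coded ranges. This is offered as an alternative, not something this answer proposes. JavaScript regular expressions support Unicode property escapes (ES2018) such as `\p{Script=Han}`. Which characters match depends on the Unicode version built into the engine. The Han script also covers radicals and a few symbols, so it is broader than "Chinese characters" in the narrow sense. The helper name `isAllHan` is hypothetical.

```javascript
// Matches a non-empty string made up only of code points whose Unicode
// Script property is Han. The `u` flag is required for \p{...} escapes.
const allHanPattern = /^\p{Script=Han}+$/u;

// Hypothetical helper wrapping the pattern.
function isAllHan(text) {
  return allHanPattern.test(text);
}

console.log(isAllHan("\u4E2D\u6587"));  // true
console.log(isAllHan("\u{20000}"));     // true  (Extension B)
console.log(isAllHan("\u4E2D\u6587a")); // false (Latin letter)
console.log(isAllHan("\u3042"));        // false (Hiragana)
```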

As of this initial writing, the latest is v14, and Ch 18 "presents scripts used in East Asia. This includes major writing systems associated with Chinese, Japanese, and Korean. It also includes several scripts for minority languages". The first table reviews Blocks Containing Han Ideographs where we see they've gone up to Extension G:

Block                                   Range       Comment
-----------------------------------------------------------
CJK Unified Ideographs                  4E00–9FFF   Common
CJK Unified Ideographs Extension A      3400–4DBF   Rare
CJK Unified Ideographs Extension B      20000–2A6DF Rare, historic
CJK Unified Ideographs Extension C      2A700–2B73F Rare, historic
CJK Unified Ideographs Extension D      2B740–2B81F Uncommon, some in current use
CJK Unified Ideographs Extension E      2B820–2CEAF Rare, historic
CJK Unified Ideographs Extension F      2CEB0–2EBEF Rare, historic
CJK Unified Ideographs Extension G      30000–3134F Rare, historic
CJK Compatibility Ideographs            F900–FAFF   Duplicates, unifiable variants, corporate characters
CJK Compatibility Ideographs Supplement 2F800–2FA1F Unifiable variants

The second table Small Extensions to CJK Blocks notes additions: "The repertoire in the CJK Unified Ideographs block has subsequently been extended with small sets of unified ideographs or ideographic components needed for interoperability with various standards, or for other reasons, as shown in Table 18-2", some of which "have involved reserved ranges at the end of other CJK blocks."
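The table of Han ideograph blocks can be turned into a simple lookup. The sketch below is JavaScript, with hypothetical names `hanBlocks` and `hanBlockOf`, and uses the Unicode 14 ranges as given. It matches whole block ranges, so characters later assigned in a block's reserved tail are still reported under that block.

```javascript
// Blocks containing Han ideographs, per the Unicode 14 table above.
// Ranges are inclusive and listed in code point order.
const hanBlocks = [
  { name: "CJK Unified Ideographs Extension A", first: 0x3400, last: 0x4DBF },
  { name: "CJK Unified Ideographs", first: 0x4E00, last: 0x9FFF },
  { name: "CJK Compatibility Ideographs", first: 0xF900, last: 0xFAFF },
  { name: "CJK Unified Ideographs Extension B", first: 0x20000, last: 0x2A6DF },
  { name: "CJK Unified Ideographs Extension C", first: 0x2A700, last: 0x2B73F },
  { name: "CJK Unified Ideographs Extension D", first: 0x2B740, last: 0x2B81F },
  { name: "CJK Unified Ideographs Extension E", first: 0x2B820, last: 0x2CEAF },
  { name: "CJK Unified Ideographs Extension F", first: 0x2CEB0, last: 0x2EBEF },
  { name: "CJK Compatibility Ideographs Supplement", first: 0x2F800, last: 0x2FA1F },
  { name: "CJK Unified Ideographs Extension G", first: 0x30000, last: 0x3134F },
];

// Return the block name for the first code point of `ch`, or null if it
// is not in any of the blocks listed above.
function hanBlockOf(ch) {
  const cp = ch.codePointAt(0);
  const block = hanBlocks.find(({ first, last }) => cp >= first && cp <= last);
  return block ? block.name : null;
}

console.log(hanBlockOf("\u4E2D"));    // CJK Unified Ideographs
console.log(hanBlockOf("\u{30000}")); // CJK Unified Ideographs Extension G
console.log(hanBlockOf("a"));         // null
```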

For additional related blocks such as punctuation and other syllabaries (including for J+K) which should be more stable, check out that unicode chapter further as well as other answers around here, and https://en.wikipedia.org/wiki/Han_unification#Unicode_ranges. https://blog.miniasp.com/post/2019/01/02/Common-Regex-patterns-for-Unicode-characters has some interesting discussion as well even though it was written in 2019.

For fonts that try to render these, see https://en.wikipedia.org/wiki/List_of_CJK_fonts, but note that coverage information is sparse. You'll have to dig around to see those details, e.g. Adobe/Google's Source Han/Noto fonts don't cover all extensions or compatibility ideographs.

Solution 7 - Unicode

To summarize, it sounds like these are them:

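// Inclusive code point ranges collected from the answers above.
// Duplicate and adjacent sub-ranges are merged; the covered set is unchanged.
const blocks = [
  [0x2E80, 0x2FD5],   // CJK Radicals Supplement and Kangxi Radicals
  [0x3190, 0x319F],   // Kanbun
  [0x3400, 0x4DBF],   // CJK Unified Ideographs Extension A
  [0x4E00, 0x9FCC],   // CJK Unified Ideographs
  [0xF900, 0xFAAD],   // CJK Compatibility Ideographs
  [0x20000, 0x2A6DF], // CJK Unified Ideographs Extension B
  [0x2A700, 0x2B734], // CJK Unified Ideographs Extension C
  [0x2B740, 0x2B81D], // CJK Unified Ideographs Extension D
];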
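A list of range pairs like this can be compiled into a regular expression instead of being checked by hand. Here is a minimal sketch, with the hypothetical helper `rangesToRegExp` and a shortened `sampleRanges` list so the block stands on its own.

```javascript
// Build a code-point-aware RegExp character class from inclusive
// [first, last] pairs. The `u` flag is needed for \u{...} escapes.
function rangesToRegExp(ranges) {
  const body = ranges
    .map(([first, last]) => `\\u{${first.toString(16)}}-\\u{${last.toString(16)}}`)
    .join("");
  return new RegExp(`[${body}]`, "u");
}

// A few of the ranges from the list above, for demonstration only.
const sampleRanges = [
  [0x3400, 0x4DBF],
  [0x4E00, 0x9FCC],
  [0x20000, 0x2A6DF],
];

const sampleHan = rangesToRegExp(sampleRanges);

console.log(sampleHan.test("\u4E2D"));    // true
console.log(sampleHan.test("\u{20000}")); // true
console.log(sampleHan.test("z"));         // false
```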

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type           Original Author   Original Content on Stackoverflow
Question               omg               View Question on Stackoverflow
Solution 1 - Unicode   VonC              View Answer on Stackoverflow
Solution 2 - Unicode   Pacerier          View Answer on Stackoverflow
Solution 3 - Unicode   Lerner Zhang      View Answer on Stackoverflow
Solution 4 - Unicode   Voyager           View Answer on Stackoverflow
Solution 5 - Unicode   Suragch           View Answer on Stackoverflow
Solution 6 - Unicode   qix               View Answer on Stackoverflow
Solution 7 - Unicode   Lance             View Answer on Stackoverflow