The Unicode codespace is divided into seventeen planes, each comprising 65,536 code points, arranged as 256 rows of 256 code points:
Plane | Range | Description | Abbreviation |
---|---|---|---|
0 | 0000–FFFF | Basic Multilingual Plane | BMP |
1 | 10000–1FFFF | Supplementary Multilingual Plane | SMP |
2 | 20000–2FFFF | Supplementary Ideographic Plane | SIP |
3 | 30000–3FFFF | Tertiary Ideographic Plane | TIP |
4 to 13 | 40000–DFFFF | currently unassigned | |
14 | E0000–EFFFF | Supplementary Special-purpose Plane | SSP |
15 | F0000–FFFFF | Supplementary Private Use Area-A | |
16 | 100000–10FFFF | Supplementary Private Use Area-B | |
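
Because every plane holds exactly 0x10000 code points, the plane number can be read straight off the bits above the low 16. The following is a minimal sketch of that arithmetic; the function names `plane_of` and `row_and_cell` are made up for illustration and do not come from any library.

```python
def plane_of(code_point: int) -> int:
    """Return the plane number (0-16) that contains code_point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"outside the Unicode codespace: {code_point:#x}")
    # Each plane spans 0x10000 (65,536) code points, so the plane
    # number is whatever remains after dropping the low 16 bits.
    return code_point >> 16


def row_and_cell(code_point: int) -> tuple:
    """Split the low 16 bits into (row, cell): 256 rows of 256 code points."""
    offset = code_point & 0xFFFF
    return offset >> 8, offset & 0xFF


print(plane_of(0x20AC))      # 0  (BMP)
print(plane_of(0x1F600))     # 1  (SMP)
print(plane_of(0x10ABCD))    # 16 (Supplementary Private Use Area-B)
print(row_and_cell(0x20AC))  # (32, 172), i.e. row 0x20, cell 0xAC
```

The range check mirrors the table: anything above 10FFFF lies outside all seventeen planes.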
The bits of a Unicode code point are distributed into the low-order bit positions of the UTF-8 bytes, with the lowest bit going into the last bit of the last byte. In this table, x represents the lowest 8 bits of the Unicode value, y represents the next higher 8 bits, and z represents the bits higher than that:
Unicode | Byte1 | Byte2 | Byte3 | Byte4 | example |
---|---|---|---|---|---|
U+0000-U+007F | 0xxxxxxx | | | | '$' U+0024 → 00100100 → 0x24 |
U+0080-U+07FF | 110yyyxx | 10xxxxxx | | | '¢' U+00A2 → 11000010,10100010 → 0xC2,0xA2 |
U+0800-U+FFFF | 1110yyyy | 10yyyyxx | 10xxxxxx | | '€' U+20AC → 11100010,10000010,10101100 → 0xE2,0x82,0xAC |
U+10000-U+10FFFF | 11110zzz | 10zzyyyy | 10yyyyxx | 10xxxxxx | U+10ABCD → 11110100,10001010,10101111,10001101 → 0xF4,0x8A,0xAF,0x8D |
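
The bit layout in the table can be written as a handful of shifts and masks. This is a hand-rolled sketch for illustration only (in practice `str.encode("utf-8")` does the job); the name `utf8_encode` is hypothetical, and the surrogate check follows RFC 3629 rather than anything stated in the table.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode a single code point into UTF-8 bytes using the table's bit layout."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError(f"outside the Unicode codespace: {code_point:#x}")
    if 0xD800 <= code_point <= 0xDFFF:
        # Assumption taken from RFC 3629: surrogates are not encodable.
        raise ValueError(f"surrogate code point: {code_point:#x}")

    if code_point <= 0x7F:
        # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 110yyyxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:
        # 1110yyyy 10yyyyxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110zzz 10zzyyyy 10yyyyxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


print(utf8_encode(0x24).hex(" "))      # 24
print(utf8_encode(0xA2).hex(" "))      # c2 a2
print(utf8_encode(0x20AC).hex(" "))    # e2 82 ac
print(utf8_encode(0x10ABCD).hex(" "))  # f4 8a af 8d
```

The four printed values match the example column of the table, and the results agree with Python's built-in encoder for any non-surrogate code point.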
Each possible byte value falls into one of the following ranges:
binary | hex | decimal | notes |
---|---|---|---|
00000000-01111111 | 00-7F | 0-127 | US-ASCII (single byte) |
10000000-10111111 | 80-BF | 128-191 | Second, third, or fourth byte of a multi-byte sequence |
11000000-11000001 | C0-C1 | 192-193 | Overlong encoding: start of a 2-byte sequence, but code point <= 127 |
11000010-11011111 | C2-DF | 194-223 | Start of 2-byte sequence |
11100000-11101111 | E0-EF | 224-239 | Start of 3-byte sequence |
11110000-11110100 | F0-F4 | 240-244 | Start of 4-byte sequence |
11110101-11110111 | F5-F7 | 245-247 | Restricted by RFC 3629: start of 4-byte sequence for code point above 10FFFF |
11111000-11111011 | F8-FB | 248-251 | Restricted by RFC 3629: start of 5-byte sequence |
11111100-11111101 | FC-FD | 252-253 | Restricted by RFC 3629: start of 6-byte sequence |
11111110-11111111 | FE-FF | 254-255 | Invalid: not defined by original UTF-8 specification |
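
The byte ranges above translate directly into a chain of comparisons. The sketch below uses a hypothetical helper name, `classify_byte`, and looks at one byte in isolation; it does not perform full validation, which would also have to inspect the continuation bytes that follow a lead byte.

```python
def classify_byte(b: int) -> str:
    """Describe the role a single byte value can play in a UTF-8 stream."""
    if not 0 <= b <= 0xFF:
        raise ValueError(f"not a byte value: {b}")
    if b <= 0x7F:
        return "ASCII (single byte)"
    if b <= 0xBF:
        return "continuation byte"
    if b <= 0xC1:
        return "invalid: overlong 2-byte lead"
    if b <= 0xDF:
        return "lead of 2-byte sequence"
    if b <= 0xEF:
        return "lead of 3-byte sequence"
    if b <= 0xF4:
        return "lead of 4-byte sequence"
    # F5-FD are restricted by RFC 3629; FE-FF were never defined.
    return "invalid: never appears in UTF-8"


for byte in "€".encode("utf-8"):
    print(f"{byte:02X}: {classify_byte(byte)}")
# E2: lead of 3-byte sequence
# 82: continuation byte
# AC: continuation byte
```

Collapsing F5-FF into one result is a simplification of this sketch; the table distinguishes the historical reasons for each sub-range.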
The Unicode code chart for CJK Unified Ideographs (http://unicode.org/charts/PDF/U4E00.pdf) covers the block from U+4E00 to U+9FFF. That block lies entirely within the U+0800-U+FFFF range, so each of these CJK characters takes 3 bytes in UTF-8.
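
A quick way to check the 3-byte claim is to encode every code point in the block with Python's built-in encoder and collect the lengths. The constants `CJK_FIRST` and `CJK_LAST` are names chosen for this sketch, assuming the block bounds U+4E00 and U+9FFF given in the Unicode code chart.

```python
CJK_FIRST = 0x4E00  # start of the CJK Unified Ideographs block
CJK_LAST = 0x9FFF   # end of the block (assumed from the code chart)

# Collect the distinct UTF-8 lengths seen across the whole block.
lengths = {len(chr(cp).encode("utf-8"))
           for cp in range(CJK_FIRST, CJK_LAST + 1)}
print(lengths)  # {3}

# A single character as a concrete case: '中' is U+4E2D.
encoded = "中".encode("utf-8")
print(encoded.hex(" "))  # e4 b8 ad
```

Ideographs outside this block, such as those in the Supplementary Ideographic Plane, lie above U+FFFF and therefore need 4 bytes.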