The Unicode codespace is divided into seventeen planes, each comprising 65,536 code points (256 rows of 256 code points):
| Plane | Range | Description | Abbreviation |
|---|---|---|---|
| 0 | 0000–FFFF | Basic Multilingual Plane | BMP |
| 1 | 10000–1FFFF | Supplementary Multilingual Plane | SMP |
| 2 | 20000–2FFFF | Supplementary Ideographic Plane | SIP |
| 3 | 30000–3FFFF | Tertiary Ideographic Plane (assigned since Unicode 13.0) | TIP |
| 4 to 13 | 40000–DFFFF | currently unassigned | |
| 14 | E0000–EFFFF | Supplementary Special-purpose Plane | SSP |
| 15 | F0000–FFFFF | Supplementary Private Use Area-A | |
| 16 | 100000–10FFFF | Supplementary Private Use Area-B | |
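
Because every plane holds exactly 0x10000 code points, the plane number is simply the code point shifted right by 16 bits. A minimal sketch of that lookup follows; the function name `plane_of` and the `PLANE_ABBREVIATIONS` dictionary are illustrative names, not part of any standard library.

```python
# Abbreviations taken from the plane table above; planes without one are omitted.
PLANE_ABBREVIATIONS = {0: "BMP", 1: "SMP", 2: "SIP", 14: "SSP"}


def plane_of(code_point: int) -> int:
    """Return the plane number (0 to 16) that contains code_point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"outside the Unicode codespace: {code_point:#x}")
    # Each plane spans 0x10000 code points, so the plane is the value above bit 15.
    return code_point >> 16


if __name__ == "__main__":
    for cp in (0x0024, 0x20AC, 0x1F600, 0x2000B, 0xE0001, 0x10FFFF):
        plane = plane_of(cp)
        label = PLANE_ABBREVIATIONS.get(plane, "")
        print(f"U+{cp:04X} is in plane {plane} {label}".rstrip())
```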
The bits of a Unicode code point are distributed into the low-order bit positions of the UTF-8 bytes, with the lowest bit going into the last bit of the last byte. In this table, x represents the lowest 8 bits of the Unicode value, y represents the next higher 8 bits, and z represents the bits above those:
| Unicode | Byte1 | Byte2 | Byte3 | Byte4 | example |
|---|---|---|---|---|---|
| U+0000-U+007F | 0xxxxxxx | | | | '$' U+0024 → 00100100 → 0x24 |
| U+0080-U+07FF | 110yyyxx | 10xxxxxx | | | '¢' U+00A2 → 11000010,10100010 → 0xC2,0xA2 |
| U+0800-U+FFFF | 1110yyyy | 10yyyyxx | 10xxxxxx | | '€' U+20AC → 11100010,10000010,10101100 → 0xE2,0x82,0xAC |
| U+10000-U+10FFFF | 11110zzz | 10zzyyyy | 10yyyyxx | 10xxxxxx | U+10ABCD → 11110100,10001010,10101111,10001101 → 0xF4,0x8A,0xAF,0x8D |
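
The bit layout in the table can be sketched as a small hand-written encoder. This is only an illustration of the shifting and masking involved, not a replacement for Python's built-in `str.encode`; the function name `utf8_encode` is an assumption made for this example. The surrogate check is not shown in the table, but RFC 3629 excludes U+D800 to U+DFFF from UTF-8.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point as UTF-8 following the bit layout in the table."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"outside the Unicode codespace: {cp:#x}")
    if 0xD800 <= cp <= 0xDFFF:
        # Surrogates are reserved for UTF-16 and are not valid in UTF-8.
        raise ValueError(f"surrogate code point: {cp:#x}")
    if cp <= 0x7F:
        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 110yyyxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 1110yyyy 10yyyyxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110zzz 10zzyyyy 10yyyyxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


if __name__ == "__main__":
    # The four examples from the table, cross-checked against the built-in codec.
    for cp in (0x24, 0xA2, 0x20AC, 0x10ABCD):
        encoded = utf8_encode(cp)
        assert encoded == chr(cp).encode("utf-8")
        print(f"U+{cp:04X} -> {encoded.hex(' ')}")
```

Every continuation byte carries exactly six payload bits, which is why each step masks with `0x3F` after shifting by a multiple of 6.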
| binary | hex | decimal | notes |
|---|---|---|---|
| 00000000-01111111 | 00-7F | 0-127 | US-ASCII (single byte) |
| 10000000-10111111 | 80-BF | 128-191 | Second, third, or fourth byte of a multi-byte sequence |
| 11000000-11000001 | C0-C1 | 192-193 | Overlong encoding: start of a 2-byte sequence, but code point <= 127 |
| 11000010-11011111 | C2-DF | 194-223 | Start of 2-byte sequence |
| 11100000-11101111 | E0-EF | 224-239 | Start of 3-byte sequence |
| 11110000-11110100 | F0-F4 | 240-244 | Start of 4-byte sequence |
| 11110101-11110111 | F5-F7 | 245-247 | Restricted by RFC 3629: start of 4-byte sequence for codepoint above 10FFFF |
| 11111000-11111011 | F8-FB | 248-251 | Restricted by RFC 3629: start of 5-byte sequence |
| 11111100-11111101 | FC-FD | 252-253 | Restricted by RFC 3629: start of 6-byte sequence |
| 11111110-11111111 | FE-FF | 254-255 | Invalid: not defined by original UTF-8 specification |
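
The byte ranges in this table translate directly into a chain of comparisons. The sketch below classifies a single byte value; the function name `classify_byte` and the returned label strings are made up for this example, and it looks only at one byte in isolation, so it does not validate whole sequences.

```python
def classify_byte(b: int) -> str:
    """Classify one byte value (0 to 255) according to the table above."""
    if not 0 <= b <= 0xFF:
        raise ValueError(f"not a byte value: {b}")
    if b <= 0x7F:
        return "ascii"            # 00-7F: single-byte US-ASCII
    if b <= 0xBF:
        return "continuation"     # 80-BF: second, third, or fourth byte
    if b <= 0xC1:
        return "overlong"         # C0-C1: 2-byte lead for a code point <= 127
    if b <= 0xDF:
        return "lead-2"           # C2-DF: start of a 2-byte sequence
    if b <= 0xEF:
        return "lead-3"           # E0-EF: start of a 3-byte sequence
    if b <= 0xF4:
        return "lead-4"           # F0-F4: start of a 4-byte sequence
    if b <= 0xFD:
        return "restricted"       # F5-FD: restricted by RFC 3629
    return "invalid"              # FE-FF: never defined for UTF-8


if __name__ == "__main__":
    # The euro sign encodes to E2 82 AC: one 3-byte lead and two continuations.
    print([classify_byte(b) for b in "€".encode("utf-8")])
```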
The Unicode code chart for CJK Unified Ideographs (http://unicode.org/charts/PDF/U4E00.pdf) maps U+4E00 to U+9FCF. That range lies entirely within U+0800-U+FFFF, so each of these CJK characters takes 3 bytes in UTF-8.
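
The 3-byte claim is easy to check with Python's built-in codec. In this short sketch, the helper name `utf8_lengths` is an assumption for illustration, and the range bounds are the ones quoted above.

```python
def utf8_lengths(first: int, last: int) -> set[int]:
    """Return the set of UTF-8 byte lengths used by code points first..last."""
    return {len(chr(cp).encode("utf-8")) for cp in range(first, last + 1)}


if __name__ == "__main__":
    # Every code point from U+4E00 to U+9FCF should need exactly 3 bytes.
    print(utf8_lengths(0x4E00, 0x9FCF))
```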