[info]Unicode & UTF-8

Unicode defines a codespace of 1,114,112 code points in the range 0 to 10FFFF (hexadecimal).
The Unicode codespace is divided into seventeen planes, each comprising 65,536 code points (256 rows of 256 code points):

Plane    Range (hex)     Description                           Abbreviation
0        0000–FFFF       Basic Multilingual Plane              BMP
1        10000–1FFFF     Supplementary Multilingual Plane      SMP
2        20000–2FFFF     Supplementary Ideographic Plane       SIP
3        30000–3FFFF     Tertiary Ideographic Plane            TIP
4 to 13  40000–DFFFF     currently unassigned
14       E0000–EFFFF     Supplementary Special-purpose Plane   SSP
15       F0000–FFFFF     Supplementary Private Use Area-A      SPUA-A
16       100000–10FFFF   Supplementary Private Use Area-B      SPUA-B
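
Because every plane spans exactly 65,536 (10000 hex) code points, the plane number is simply the code point shifted right by 16 bits. The following is a minimal sketch of that arithmetic; the function names plane_of and plane_range are made up for illustration and do not come from any library.

```python
MAX_CODE_POINT = 0x10FFFF  # last code point of plane 16


def plane_of(code_point: int) -> int:
    """Return the plane number (0 to 16) containing code_point."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    # Each plane holds 0x10000 code points, so the plane is the value above bit 15.
    return code_point >> 16


def plane_range(plane: int) -> tuple[int, int]:
    """Return the (first, last) code point of the given plane."""
    if not 0 <= plane <= 16:
        raise ValueError(f"not a Unicode plane: {plane}")
    first = plane << 16
    return first, first + 0xFFFF


if __name__ == "__main__":
    print(plane_of(0x20AC))    # the euro sign lies in plane 0 (BMP)
    print(plane_of(0x10ABCD))  # a private-use code point in plane 16
    first, last = plane_range(14)
    print(f"{first:X}-{last:X}")
```

Seventeen planes of 65,536 code points each account for the 1,114,112 code points of the whole codespace.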

The bits of a Unicode code point are distributed into the lower bit positions of the UTF-8 bytes, with the lowest bit going into the last bit of the last byte. In the following table, x represents the lowest 8 bits of the Unicode value, y represents the next higher 8 bits, and z represents the bits above those:

Unicode            Byte1     Byte2     Byte3     Byte4
U+0000-U+007F      0xxxxxxx
U+0080-U+07FF      110yyyxx  10xxxxxx
U+0800-U+FFFF      1110yyyy  10yyyyxx  10xxxxxx
U+10000-U+10FFFF   11110zzz  10zzyyyy  10yyyyxx  10xxxxxx

Example        UTF-8 bytes (binary)                  UTF-8 bytes (hex)
'$' U+0024     00100100                              0x24
'¢' U+00A2     11000010,10100010                     0xC2,0xA2
'€' U+20AC     11100010,10000010,10101100            0xE2,0x82,0xAC
U+10ABCD       11110100,10001010,10101111,10001101   0xF4,0x8A,0xAF,0x8D
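
The bit layout above translates fairly directly into shifts and masks. The following is a minimal sketch of an encoder for a single code point; the function name utf8_encode is hypothetical, and the rejection of surrogates (U+D800 to U+DFFF) is an addition based on RFC 3629 rather than something the table itself shows.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (illustrative sketch)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    if 0xD800 <= code_point <= 0xDFFF:
        # Assumption beyond the table: RFC 3629 forbids encoding surrogates.
        raise ValueError(f"surrogate code point: {code_point:#x}")

    if code_point <= 0x7F:
        # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 110yyyxx 10xxxxxx
        return bytes([
            0xC0 | (code_point >> 6),
            0x80 | (code_point & 0x3F),
        ])
    if code_point <= 0xFFFF:
        # 1110yyyy 10yyyyxx 10xxxxxx
        return bytes([
            0xE0 | (code_point >> 12),
            0x80 | ((code_point >> 6) & 0x3F),
            0x80 | (code_point & 0x3F),
        ])
    # 11110zzz 10zzyyyy 10yyyyxx 10xxxxxx
    return bytes([
        0xF0 | (code_point >> 18),
        0x80 | ((code_point >> 12) & 0x3F),
        0x80 | ((code_point >> 6) & 0x3F),
        0x80 | (code_point & 0x3F),
    ])


if __name__ == "__main__":
    for cp in (0x24, 0xA2, 0x20AC, 0x10ABCD):
        print(f"U+{cp:04X} -> {utf8_encode(cp).hex()}")
```

For the four example code points the sketch should reproduce the byte sequences listed above, and for any non-surrogate code point it can be cross-checked against Python's built-in chr(cp).encode("utf-8").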

Binary             Hex    Decimal  Notes
00000000-01111111  00-7F  0-127    US-ASCII (single byte)
10000000-10111111  80-BF  128-191  Second, third, or fourth byte of a multi-byte sequence
11000000-11000001  C0-C1  192-193  Overlong encoding: start of a 2-byte sequence, but code point <= 127
11000010-11011111  C2-DF  194-223  Start of 2-byte sequence
11100000-11101111  E0-EF  224-239  Start of 3-byte sequence
11110000-11110100  F0-F4  240-244  Start of 4-byte sequence
11110101-11110111  F5-F7  245-247  Restricted by RFC 3629: start of 4-byte sequence for code point above 10FFFF
11111000-11111011  F8-FB  248-251  Restricted by RFC 3629: start of 5-byte sequence
11111100-11111101  FC-FD  252-253  Restricted by RFC 3629: start of 6-byte sequence
11111110-11111111  FE-FF  254-255  Invalid: not defined by original UTF-8 specification
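
The byte ranges above can be turned into a small lookup that reports the role of any single byte. This is only a sketch: the function name classify_byte and the short category labels are invented for illustration, and looking at one byte alone does not validate a complete sequence.

```python
def classify_byte(byte: int) -> str:
    """Classify a single byte value (0 to 255) by its role in UTF-8."""
    if not 0 <= byte <= 0xFF:
        raise ValueError(f"not a byte value: {byte}")
    # Upper bound of each range from the table, checked in ascending order.
    ranges = [
        (0x7F, "ascii"),          # 00-7F: US-ASCII, single byte
        (0xBF, "continuation"),   # 80-BF: 2nd, 3rd, or 4th byte of a sequence
        (0xC1, "overlong-lead"),  # C0-C1: 2-byte lead for code point <= 127
        (0xDF, "lead-2"),         # C2-DF: start of 2-byte sequence
        (0xEF, "lead-3"),         # E0-EF: start of 3-byte sequence
        (0xF4, "lead-4"),         # F0-F4: start of 4-byte sequence
        (0xF7, "restricted-4"),   # F5-F7: would encode above 10FFFF
        (0xFB, "restricted-5"),   # F8-FB: start of 5-byte sequence
        (0xFD, "restricted-6"),   # FC-FD: start of 6-byte sequence
        (0xFF, "invalid"),        # FE-FF: never defined in UTF-8
    ]
    for upper, label in ranges:
        if byte <= upper:
            return label
    raise AssertionError("unreachable: every byte value is covered")


if __name__ == "__main__":
    # Walk the bytes of the euro sign: one 3-byte lead, then two continuations.
    for b in "€".encode("utf-8"):
        print(f"{b:#04x} {classify_byte(b)}")
```

Only the categories ascii, lead-2, lead-3, lead-4, and continuation can appear in well-formed UTF-8; any other result indicates a byte that a strict decoder would reject.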