geHype: [info]Unicode & UTF-8

Unicode defines a codespace of 1,114,112 code points in the range 0 to 10FFFF (hexadecimal). The codespace is divided into seventeen planes, each comprising 65,536 code points, arranged as 256 rows of 256 code points:

Plane	Range	Description	Abbreviation
0	0000–FFFF	Basic Multilingual Plane	BMP
1	10000–1FFFF	Supplementary Multilingual Plane	SMP
2	20000–2FFFF	Supplementary Ideographic Plane	SIP
3	30000–3FFFF	Tertiary Ideographic Plane	TIP
4 to 13	40000–DFFFF	currently unassigned
14	E0000–EFFFF	Supplementary Special-purpose Plane	SSP
15	F0000–FFFFF	Supplementary Private Use Area-A	SPUA-A
16	100000–10FFFF	Supplementary Private Use Area-B	SPUA-B
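
Because every plane holds exactly 10000 (hex) code points, the plane number is simply the part of the code point above its low 16 bits, and the row within the plane is the next 8 bits down. A minimal Python sketch of that arithmetic follows; the function names `plane_of` and `row_of` are made up for this illustration and do not come from any library.

```python
def plane_of(code_point: int) -> int:
    """Return the plane number (0-16) containing the given code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"outside the Unicode codespace: {code_point:#x}")
    # Each plane spans 0x10000 (65,536) code points, so dropping the
    # low 16 bits leaves the plane number.
    return code_point >> 16


def row_of(code_point: int) -> int:
    """Return the row (0-255) inside the plane: bits 8-15 of the code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"outside the Unicode codespace: {code_point:#x}")
    return (code_point >> 8) & 0xFF
```

For example, `plane_of(0x20AC)` is 0 (the euro sign lives in the BMP) and `plane_of(0x10ABCD)` is 16.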

The bits of a Unicode code point are distributed into the lower bit positions of the UTF-8 bytes, with the lowest bit going into the last bit of the last byte. In this table, x represents the lowest 8 bits of the code point value, y represents the next higher 8 bits, and z represents the bits above those:

Code point range	Byte 1	Byte 2	Byte 3	Byte 4	Example
`U+0000-U+007F`	`0xxxxxxx`				'$' `U+0024` → `00100100` → `0x24`
`U+0080-U+07FF`	`110yyyxx`	`10xxxxxx`			'¢' `U+00A2` → `11000010,10100010` → `0xC2,0xA2`
`U+0800-U+FFFF`	`1110yyyy`	`10yyyyxx`	`10xxxxxx`		'€' `U+20AC` → `11100010,10000010,10101100` → `0xE2,0x82,0xAC`
`U+10000-U+10FFFF`	`11110zzz`	`10zzyyyy`	`10yyyyxx`	`10xxxxxx`	`U+10ABCD` → `11110100,10001010,10101111,10001101` → `0xF4,0x8A,0xAF,0x8D`
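
The bit layout in the table can be turned into code with a few shifts and masks. The following is a rough Python sketch, not a replacement for the built-in `str.encode("utf-8")`; the name `utf8_encode` is invented for this example. It also rejects the surrogate range D800–DFFF, which the table does not mention but which RFC 3629 excludes from UTF-8.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point as UTF-8, following the table's bit layout."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"outside the Unicode codespace: {code_point:#x}")
    if 0xD800 <= code_point <= 0xDFFF:
        # Assumption beyond the table: surrogates are not encodable in UTF-8.
        raise ValueError(f"surrogate code point: {code_point:#x}")

    if code_point <= 0x7F:
        # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 110yyyxx 10xxxxxx
        return bytes([
            0xC0 | (code_point >> 6),
            0x80 | (code_point & 0x3F),
        ])
    if code_point <= 0xFFFF:
        # 1110yyyy 10yyyyxx 10xxxxxx
        return bytes([
            0xE0 | (code_point >> 12),
            0x80 | ((code_point >> 6) & 0x3F),
            0x80 | (code_point & 0x3F),
        ])
    # 11110zzz 10zzyyyy 10yyyyxx 10xxxxxx
    return bytes([
        0xF0 | (code_point >> 18),
        0x80 | ((code_point >> 12) & 0x3F),
        0x80 | ((code_point >> 6) & 0x3F),
        0x80 | (code_point & 0x3F),
    ])
```

Calling `utf8_encode(0x20AC).hex()` gives `e282ac`, matching the '€' row above, and the result should agree with `chr(0x20AC).encode("utf-8")`.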

Binary	Hex	Decimal	Notes
`00000000`-`01111111`	`00`-`7F`	`0`-`127`	US-ASCII (single byte)
`10000000`-`10111111`	`80`-`BF`	`128`-`191`	Second, third, or fourth byte of a multi-byte sequence
`11000000`-`11000001`	`C0`-`C1`	`192`-`193`	Overlong encoding: start of a 2-byte sequence, but code point <= `127`
`11000010`-`11011111`	`C2`-`DF`	`194`-`223`	Start of 2-byte sequence
`11100000`-`11101111`	`E0`-`EF`	`224`-`239`	Start of 3-byte sequence
`11110000`-`11110100`	`F0`-`F4`	`240`-`244`	Start of 4-byte sequence
`11110101`-`11110111`	`F5`-`F7`	`245`-`247`	Restricted by RFC 3629: start of 4-byte sequence for code point above `10FFFF`
`11111000`-`11111011`	`F8`-`FB`	`248`-`251`	Restricted by RFC 3629: start of 5-byte sequence
`11111100`-`11111101`	`FC`-`FD`	`252`-`253`	Restricted by RFC 3629: start of 6-byte sequence
`11111110`-`11111111`	`FE`-`FF`	`254`-`255`	Invalid: not defined by original UTF-8 specification
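
The byte ranges above map directly onto a chain of comparisons. Here is a small Python sketch that labels a single byte according to the table; the function name `classify_byte` and the label strings are invented for this example. Note that it looks at one byte in isolation, so it cannot catch errors that depend on neighbouring bytes.

```python
def classify_byte(b: int) -> str:
    """Label one byte value (0-255) according to the UTF-8 byte table."""
    if not 0 <= b <= 0xFF:
        raise ValueError(f"not a byte value: {b}")
    if b <= 0x7F:
        return "ascii"          # 00-7F: single-byte US-ASCII
    if b <= 0xBF:
        return "continuation"   # 80-BF: 2nd, 3rd, or 4th byte of a sequence
    if b <= 0xC1:
        return "overlong"       # C0-C1: 2-byte lead for a code point <= 127
    if b <= 0xDF:
        return "lead-2"         # C2-DF: start of 2-byte sequence
    if b <= 0xEF:
        return "lead-3"         # E0-EF: start of 3-byte sequence
    if b <= 0xF4:
        return "lead-4"         # F0-F4: start of 4-byte sequence
    if b <= 0xFD:
        return "restricted"     # F5-FD: restricted by RFC 3629
    return "invalid"            # FE-FF: never defined in UTF-8
```

For instance, `[classify_byte(b) for b in "€".encode("utf-8")]` yields `['lead-3', 'continuation', 'continuation']`.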
