A guide to UTF-8 and related encodings.
UTF-8 can be understood as consisting of three types of bytes: ASCII,
start, and continuation bytes.
Each ASCII byte has the same meaning as in ASCII.
Each continuation byte has the octal form 2xx and appends the two octal
digits xx (six bits) to the end of the character's code point. The start
byte determines how many continuation bytes must follow.
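The start bytes are in the octal 3xx range. The bits that each one
contributes are shown in the table below, in octal, with an x for every
digit still to be filled. The row gives the first two octal digits of the
start byte and the column gives the last. Each continuation byte fills two
of the x positions, so the number of continuation bytes is half the number
of x's shown.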
     0        1        2        3        4        5        6        7
30x  xx       1xx      2xx      3xx      4xx      5xx      6xx      7xx
31x  10xx     11xx     12xx     13xx     14xx     15xx     16xx     17xx
32x  20xx     21xx     22xx     23xx     24xx     25xx     26xx     27xx
33x  30xx     31xx     32xx     33xx     34xx     35xx     36xx     37xx
34x  xxxx     1xxxx    2xxxx    3xxxx    4xxxx    5xxxx    6xxxx    7xxxx
35x  10xxxx   11xxxx   12xxxx   13xxxx   14xxxx   15xxxx   16xxxx   17xxxx
36x  xxxxxx   1xxxxxx  2xxxxxx  3xxxxxx  4xxxxxx
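
As an illustration of how the table is read, here is a minimal Python
sketch that decodes the first character of a byte string. The function
name decode_first is hypothetical, and the sketch deliberately skips the
validity checks discussed later (overlong forms, surrogates, values above
octal 4177777), so it should not be taken as a complete decoder.

```python
def decode_first(data: bytes) -> tuple:
    """Return (code point, bytes consumed) for the first character.

    Illustrative only: no overlong, surrogate, or range checks.
    """
    first = data[0]
    if first < 0o200:                  # ASCII byte: it is the code point
        return first, 1
    if 0o300 <= first <= 0o337:        # rows 30x-33x: one continuation byte
        count, value = 1, first & 0o37
    elif 0o340 <= first <= 0o357:      # rows 34x-35x: two continuation bytes
        count, value = 2, first & 0o17
    elif 0o360 <= first <= 0o364:      # row 36x: three continuation bytes
        count, value = 3, first & 0o07
    else:                              # 2xx or 365 and up
        raise ValueError("not a start byte: %o" % first)

    tail = data[1:1 + count]
    if len(tail) != count:
        raise ValueError("sequence is truncated")
    for byte in tail:
        if byte & 0o300 != 0o200:      # every continuation byte is 2xx
            raise ValueError("expected a continuation byte, got %o" % byte)
        value = (value << 6) | (byte & 0o77)   # append the two digits xx
    return value, count + 1
```

For example, the euro sign is the bytes \342 \202 \254. The start byte
\342 contributes 2xxxx, the continuation bytes fill in 02 and 54, and
decode_first(b"\342\202\254") returns (0o20254, 3), which is U+20AC.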
Overlong Encodings:
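An overlong encoding uses more bytes than the code point needs. It
consists of the start byte \300 or \301 followed by any continuation byte,
\340 followed by a continuation byte from \200 to \237, or \360 followed
by a continuation byte from \200 to \217. Each of these produces a code
point with leading zeros that a shorter sequence could have encoded, so it
is invalid. Since the start bytes \300 and \301 could only be used to
produce overlong encodings of ASCII characters, they are always invalid.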
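
One way to test for this, sketched below in Python, is to compare the
decoded value with the smallest code point that actually needs a sequence
of that length. The names MIN_FOR_LENGTH and is_overlong are hypothetical,
and the function assumes it is handed exactly one structurally well-formed
sequence of one to four bytes.

```python
# Smallest code point that genuinely needs a sequence of each total length.
MIN_FOR_LENGTH = {1: 0o0, 2: 0o200, 3: 0o4000, 4: 0o200000}

def is_overlong(sequence: bytes) -> bool:
    """True if a shorter sequence could encode the same code point.

    Assumes `sequence` is a single start byte plus its continuation bytes.
    """
    length = len(sequence)
    if length == 1:
        return False
    # Keep only the payload bits of the start byte (5, 4, or 3 of them).
    value = sequence[0] & (0o377 >> (length + 1))
    for byte in sequence[1:]:
        value = (value << 6) | (byte & 0o77)
    return value < MIN_FOR_LENGTH[length]
```

With this sketch, is_overlong(b"\340\237\277") is True (the value is
octal 3777, which fits in two bytes), while is_overlong(b"\340\240\200")
is False (octal 4000 is the first value that needs three bytes).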
WTF-8 / CESU-8:
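Many programs that accept or produce UTF-8 allow a character above U+FFFF
to be encoded as its UTF-16 surrogate pair, with each surrogate written as
its own three-byte UTF-8 sequence. 0x10000 is subtracted from the code
point, the resulting twenty-bit value is split into two 10-bit parts, the
first part is added to 0xD800 and the second to 0xDC00, and each result is
encoded as a UTF-8 sequence beginning with \355. CESU-8 encodes every such
character this way; WTF-8 proper only permits surrogates that are not part
of a pair.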
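
The arithmetic can be sketched in Python as follows. The function names
encode_three_bytes and surrogate_pair_bytes are hypothetical, and the
sketch assumes the input is a code point from U+10000 to U+10FFFF.

```python
def encode_three_bytes(unit: int) -> bytes:
    """Encode a 16-bit value as a start byte 34x/35x plus two 2xx bytes."""
    return bytes([
        0o340 | (unit >> 12),
        0o200 | ((unit >> 6) & 0o77),
        0o200 | (unit & 0o77),
    ])

def surrogate_pair_bytes(code_point: int) -> bytes:
    """Encode a supplementary code point as two three-byte sequences."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF have a surrogate pair")
    offset = code_point - 0x10000          # twenty-bit value
    high = 0xD800 + (offset >> 10)         # first 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # last 10 bits
    return encode_three_bytes(high) + encode_three_bytes(low)
```

For U+1F600 the surrogates are 0xD83D and 0xDE00, and
surrogate_pair_bytes(0x1F600) yields the six bytes
\355 \240 \275 \355 \270 \200, whereas standard UTF-8 uses the four bytes
\360 \237 \230 \200.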
UTF-8-CP-1252:
In some programs that handle UTF-8, lone continuation bytes without a
start byte, or start bytes without enough continuation bytes, are
interpreted instead as if the encoding were Windows code page 1252. Since
text in CP-1252 would almost never form valid UTF-8 sequences, free mixing
of these encodings is rarely problematic.
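
A minimal Python sketch of this fallback, built on the standard codecs
error-handler mechanism, is shown below. The handler name
"cp1252-fallback" and the function decode_mixed are hypothetical, and this
is only one possible way to get the behavior, not a description of how any
particular program does it. Bytes that CP-1252 leaves undefined are turned
into U+FFFD here, which is an assumption of the sketch.

```python
import codecs

def _cp1252_fallback(error):
    """Reinterpret the bytes the UTF-8 decoder rejected as CP-1252."""
    if not isinstance(error, UnicodeDecodeError):
        raise error
    bad = error.object[error.start:error.end]
    # Resume decoding as UTF-8 right after the rejected bytes.
    return bad.decode("cp1252", errors="replace"), error.end

codecs.register_error("cp1252-fallback", _cp1252_fallback)

def decode_mixed(data: bytes) -> str:
    """Decode as UTF-8, falling back to CP-1252 for malformed bytes."""
    return data.decode("utf-8", errors="cp1252-fallback")
```

For instance, decode_mixed(b"caf\xe9") returns "café", because the lone
start byte \351 has no continuation bytes and is read as the CP-1252
letter é, while valid UTF-8 input passes through unchanged.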
Note on invalid bytes:
The bytes from \365 up are not valid in current UTF-8, but in the earlier
standard the table of start bytes would have continued past \364:
     +0           +1           +2           +3
360  xxxxxx       1xxxxxx      2xxxxxx      3xxxxxx
364  4xxxxxx      5xxxxxx      6xxxxxx      7xxxxxx
370  xxxxxxxx     1xxxxxxxx    2xxxxxxxx    3xxxxxxxx
374  xxxxxxxxxx   1xxxxxxxxxx
This would have allowed any 31-bit value to be represented, since 2^31-1
is octal 17777777777. With a regular extension, one could theoretically
use \376 as xxxxxxxxxxxx, allowing 6 continuation bytes. Twelve octal
digits are 36 bits, so this would support codes up to 2^36-1.
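
The regular pattern can be sketched as a single Python loop. The function
name encode_extended is hypothetical. It follows the historical scheme
and the theoretical \376 extension described above, so it happily emits
bytes that current UTF-8 forbids and performs no surrogate or range checks
beyond the 36-bit limit. With n continuation bytes the start byte carries
6-n payload bits, so the \376 form carries 6 x 6 = 36 bits in total.

```python
def encode_extended(value: int) -> bytes:
    """Encode with the shortest form of the extended (non-standard) scheme."""
    if value < 0:
        raise ValueError("value must be non-negative")
    if value < 0o200:
        return bytes([value])              # ASCII byte
    for count in range(1, 7):              # number of continuation bytes
        payload_bits = 6 - count           # bits held by the start byte
        if value < 1 << (6 * count + payload_bits):
            # 300, 340, 360, 370, 374, 376 for count = 1..6
            prefix = (0o377 << (7 - count)) & 0o377
            start = prefix | (value >> (6 * count))
            tail = [0o200 | ((value >> (6 * i)) & 0o77)
                    for i in range(count - 1, -1, -1)]
            return bytes([start] + tail)
    raise ValueError("value needs more than 36 bits")
```

Under these assumptions, encode_extended(0o17777777777) equals \375
followed by five \277 bytes, and encode_extended(2**36 - 1) equals \376
followed by six \277 bytes. For values that current UTF-8 can represent,
the output matches the standard encoding.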