A guide to UTF-8 and related encodings.

UTF-8 can be understood as consisting of three types of bytes: ASCII bytes,
start bytes, and continuation bytes.
Each ASCII byte (octal 0xx or 1xx) has the same meaning as in ASCII.
Each continuation byte is octal 2xx, and appends the two octal digits xx
(six bits) to the end of the character's code point. The start byte
determines how many continuation bytes must follow.
The start bytes are in the octal 3xx range. The bits that they contribute are
shown in the table below, in octal: the row label gives the first two digits
of the start byte and the column gives its last digit. Each x stands for an
octal digit supplied by a continuation byte. Each continuation byte fills two
of the x's, so the number of continuation bytes follows from the number of x's.
          0       1       2       3       4       5       6       7
30x      xx     1xx     2xx     3xx     4xx     5xx     6xx     7xx
31x    10xx    11xx    12xx    13xx    14xx    15xx    16xx    17xx
32x    20xx    21xx    22xx    23xx    24xx    25xx    26xx    27xx
33x    30xx    31xx    32xx    33xx    34xx    35xx    36xx    37xx
34x    xxxx   1xxxx   2xxxx   3xxxx   4xxxx   5xxxx   6xxxx   7xxxx
35x  10xxxx  11xxxx  12xxxx  13xxxx  14xxxx  15xxxx  16xxxx  17xxxx
36x  xxxxxx 1xxxxxx 2xxxxxx 3xxxxxx 4xxxxxx
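
The table can be turned into a decoder almost mechanically. The following is
a minimal sketch in Python, not a complete validator: the name decode_one is
hypothetical, and the sketch deliberately does not reject overlong encodings
or surrogate code points.

```python
def decode_one(data: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode one character at data[pos]; return (code point, next position)."""
    lead = data[pos]
    if lead < 0o200:                     # ASCII byte: 0xx or 1xx
        return lead, pos + 1
    if 0o300 <= lead <= 0o337:           # rows 30x-33x: one continuation byte
        count, code = 1, lead & 0o37
    elif 0o340 <= lead <= 0o357:         # rows 34x-35x: two continuation bytes
        count, code = 2, lead & 0o17
    elif 0o360 <= lead <= 0o364:         # row 36x: three continuation bytes
        count, code = 3, lead & 0o07
    else:                                # lone 2xx byte, or 365 and above
        raise ValueError(f"invalid start byte {lead:#o}")
    for i in range(1, count + 1):
        if pos + i >= len(data) or data[pos + i] & 0o300 != 0o200:
            raise ValueError("missing continuation byte")
        # Each continuation byte 2xx appends its two octal digits xx.
        code = (code << 6) | (data[pos + i] & 0o77)
    return code, pos + count + 1
```

For example, the bytes \342\202\254 decode as start byte 342 (row 34x,
column 2, so 2xxxx) followed by the digit pairs 02 and 54, giving octal 20254,
which is U+20AC, the euro sign.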

Overlong Encodings:
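An overlong encoding spends more bytes than the code point needs, so the
decoded value has leading zeros; such sequences are invalid. They arise from
a start byte \300 or \301 followed by any continuation byte, from \340
followed by a continuation byte in the range \200-\237, or from \360 followed
by one in the range \200-\217. Since the start bytes \300 and \301 could only
ever produce overlong encodings of ASCII characters, they are invalid outright.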
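
Whether a sequence is overlong can be decided from its first two bytes alone.
The sketch below assumes the second byte has already been checked to be a
continuation byte (octal 2xx); the function name is_overlong is hypothetical.

```python
def is_overlong(lead: int, second: int) -> bool:
    """Return True if a sequence starting with these two bytes is overlong.

    Assumes `second` is a continuation byte (octal 2xx).
    """
    if lead in (0o300, 0o301):   # could only encode values below 0o200 (ASCII)
        return True
    if lead == 0o340:            # three-byte form of a value below 0o4000
        return second <= 0o237
    if lead == 0o360:            # four-byte form of a value below 0o200000
        return second <= 0o217
    return False
```

A common instance is \300\257, an overlong encoding of "/" (octal 57).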

WTF-8 / CESU-8:
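Many programs that accept or produce UTF-8 allow a character outside the
Basic Multilingual Plane to be encoded as its UTF-16 surrogate pair, with
each surrogate encoded as its own three-byte UTF-8 sequence. 0x10000 is
subtracted from the code point, the resulting twenty-bit value is split into
two 10-bit parts, the first part is added to 0xD800 and the second to 0xDC00,
and each result is encoded as a UTF-8 sequence beginning with \355.
Strictly speaking, this paired form is what CESU-8 specifies; WTF-8 uses the
same three-byte form only for unpaired surrogates.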
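
The surrogate-pair form can be sketched in a few lines of Python. This is an
illustration only; the name cesu8_encode is hypothetical, and the function
handles just the code points above U+FFFF that need a surrogate pair.

```python
def cesu8_encode(codepoint: int) -> bytes:
    """Encode a code point above U+FFFF as two three-byte sequences."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use a surrogate pair")
    value = codepoint - 0x10000            # twenty-bit value
    high = 0xD800 + (value >> 10)          # first ten bits
    low = 0xDC00 + (value & 0x3FF)         # last ten bits
    out = bytearray()
    for surrogate in (high, low):
        out.append(0o340 | (surrogate >> 12))          # always \355
        out.append(0o200 | ((surrogate >> 6) & 0o77))  # continuation byte
        out.append(0o200 | (surrogate & 0o77))         # continuation byte
    return bytes(out)
```

For U+1F600 the surrogates are 0xD83D and 0xDE00, so the result is the six
bytes \355\240\275\355\270\200 instead of the four-byte UTF-8 sequence
\360\237\230\200.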

UTF-8-CP-1252:
In some UTF-8 handling programs, lone continuation bytes without a start
byte, or start bytes without enough continuation bytes, are interpreted
instead as if the encoding were Windows code page 1252. Since text in
CP-1252 almost never forms valid UTF-8 sequences, free mixing of
these encodings is rarely problematic.
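
One way to get this behaviour in Python is a custom decoding error handler.
The sketch below is an assumption about how such a program might work, not a
description of any particular one; the handler name "cp1252-fallback" and the
function decode_mixed are hypothetical. The five byte values that CP-1252
leaves undefined come out as U+FFFD.

```python
import codecs


def cp1252_fallback(err):
    """Reinterpret bytes that are not valid UTF-8 as CP-1252."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # Return the replacement text and the position at which to resume.
    return bad.decode("cp1252", errors="replace"), err.end


codecs.register_error("cp1252-fallback", cp1252_fallback)


def decode_mixed(data: bytes) -> str:
    """Decode UTF-8, falling back to CP-1252 for invalid bytes."""
    return data.decode("utf-8", errors="cp1252-fallback")
```

With this handler, decode_mixed(b"caf\xe9 \xe2\x82\xac") reads the lone \351
as CP-1252 "é" and the valid sequence \342\202\254 as the euro sign.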

Note on invalid bytes:
The bytes from \365 up are not valid in current UTF-8, but in the earlier
standard the start bytes would have continued as shown below. The row label
plus the column offset gives the start byte, in octal:
              +0           +1           +2           +3
360       xxxxxx      1xxxxxx      2xxxxxx      3xxxxxx
364      4xxxxxx      5xxxxxx      6xxxxxx      7xxxxxx
370     xxxxxxxx    1xxxxxxxx    2xxxxxxxx    3xxxxxxxx
374   xxxxxxxxxx  1xxxxxxxxxx
This would have allowed any 31-bit value to be represented, since 2^31-1 is
octal 17777777777. Extending the scheme by the same rule, one could
theoretically use \376 as xxxxxxxxxxxx, followed by 6 continuation bytes
carrying 36 bits in total, and so support codes up to 2^36-1.
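
The bit counts can be checked by counting the leading 1 bits of the start
byte: n leading ones mean n-1 continuation bytes of 6 bits each, plus
whatever bits remain in the start byte itself. A small sketch, with the
hypothetical name sequence_shape, covering the earlier scheme and its
hypothetical \376 extension:

```python
def sequence_shape(lead: int) -> tuple[int, int]:
    """Return (continuation bytes, total payload bits) for a start byte."""
    if not 0o300 <= lead <= 0o376:
        raise ValueError("not a start byte")
    ones = 0
    mask = 0o200
    while lead & mask:               # count the leading 1 bits
        ones += 1
        mask >>= 1
    continuation = ones - 1
    lead_bits = max(7 - ones, 0)     # payload bits left in the start byte
    return continuation, lead_bits + 6 * continuation
```

Here sequence_shape(0o374) gives 5 continuation bytes and 31 bits, and
sequence_shape(0o376) gives 6 continuation bytes and 36 bits.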