A guide to UTF-8 and related encodings.

UTF-8 can be understood as consisting of three types of bytes: ASCII bytes, start bytes, and continuation bytes. Each ASCII byte (octal 000 to 177) has the same meaning as in ASCII. Each continuation byte has the octal form 2xx and appends the two octal digits xx (six bits) to the end of the character's code point. The start byte determines how many continuation bytes must follow. Start bytes lie in the octal 3xx range. The table below shows, in octal, the leading digits that each start byte contributes, with an x for every digit still to be supplied. The row label gives the first two digits of the start byte and the column gives its last digit. Each continuation byte fills two of the x's, so the number of continuation bytes is half the number of x's.

     0        1        2        3        4        5        6        7
30x  xx       1xx      2xx      3xx      4xx      5xx      6xx      7xx
31x  10xx     11xx     12xx     13xx     14xx     15xx     16xx     17xx
32x  20xx     21xx     22xx     23xx     24xx     25xx     26xx     27xx
33x  30xx     31xx     32xx     33xx     34xx     35xx     36xx     37xx
34x  xxxx     1xxxx    2xxxx    3xxxx    4xxxx    5xxxx    6xxxx    7xxxx
35x  10xxxx   11xxxx   12xxxx   13xxxx   14xxxx   15xxxx   16xxxx   17xxxx
36x  xxxxxx   1xxxxxx  2xxxxxx  3xxxxxx  4xxxxxx

Overlong Encodings: An overlong encoding is one whose code point would fit in a shorter sequence. It arises from the start byte \300 or \301 followed by any continuation byte, from \340 followed by a continuation byte in the range \200 to \237, or from \360 followed by one in the range \200 to \217. Each of these encodes a code point padded with unnecessary leading zero bits, which is invalid. Since the start bytes \300 and \301 could only be used to produce overlong encodings of ASCII characters, they are always invalid.

WTF-8 / CESU-8: Many programs that accept or produce UTF-8 allow a character outside the Basic Multilingual Plane to be encoded as its UTF-16 surrogate pair, with each surrogate encoded as its own three-byte UTF-8 sequence. First 0x10000 is subtracted from the code point. The resulting 20-bit value is split into two 10-bit parts, the first part is added to 0xD800 and the second to 0xDC00, and each result is then encoded as a UTF-8 sequence beginning with \355. This paired form is what CESU-8 specifies; WTF-8 instead permits only unpaired surrogates to be encoded this way.

UTF-8-CP-1252: In some programs that handle UTF-8, lone continuation bytes without a start byte, and start bytes without enough continuation bytes, are interpreted as Windows code page 1252 instead. Since non-ASCII text in CP-1252 almost never forms valid UTF-8 sequences, freely mixing the two encodings is rarely problematic.
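
The byte classes, the overlong rule, and the surrogate-pair arithmetic described above can be sketched in Python using octal literals. This is only an illustration under stated assumptions: the names decode_one and cesu8_encode_supplementary are hypothetical, the decoder is strict (it rejects surrogates and overlong forms, and does not implement the CP-1252 fallback), and it is not taken from any particular library.

```python
def decode_one(data, pos=0):
    """Decode one code point at data[pos]; return (code_point, next_pos).

    Hypothetical helper: a strict decoder that follows the octal view.
    """
    lead = data[pos]
    if lead < 0o200:                       # ASCII byte
        return lead, pos + 1
    if lead < 0o300:                       # 2xx with no start byte before it
        raise ValueError("lone continuation byte")
    if lead < 0o340:                       # \300-\337: one continuation byte
        count, value, minimum = 1, lead & 0o37, 0o200
    elif lead < 0o360:                     # \340-\357: two continuation bytes
        count, value, minimum = 2, lead & 0o17, 0o4000
    elif lead <= 0o364:                    # \360-\364: three continuation bytes
        count, value, minimum = 3, lead & 0o7, 0o200000
    else:                                  # \365 and up
        raise ValueError("invalid start byte")
    tail = data[pos + 1:pos + 1 + count]
    if len(tail) < count:
        raise ValueError("truncated sequence")
    for byte in tail:
        if not 0o200 <= byte <= 0o277:
            raise ValueError("expected continuation byte")
        value = (value << 6) | (byte & 0o77)   # append two octal digits
    if value < minimum:                    # would fit in a shorter sequence
        raise ValueError("overlong encoding")
    if 0o154000 <= value <= 0o157777:      # U+D800 to U+DFFF
        raise ValueError("surrogate code point")
    if value > 0o4177777:                  # above U+10FFFF
        raise ValueError("beyond U+10FFFF")
    return value, pos + 1 + count


def cesu8_encode_supplementary(code_point):
    """Encode a code point above U+FFFF as two \\355 sequences (CESU-8 style).

    Hypothetical helper illustrating the surrogate-pair arithmetic.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # first 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # last 10 bits
    out = bytearray()
    for unit in (high, low):
        out.append(0o340 | (unit >> 12))   # always \355 for surrogates
        out.append(0o200 | ((unit >> 6) & 0o77))
        out.append(0o200 | (unit & 0o77))
    return bytes(out)
```

For example, decode_one(b"\342\202\254") reads the start byte \342 as 2xxxx, fills in 02 and 54 from the two continuation bytes, and returns octal 20254 (U+20AC) together with the next position, 3.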

Note on invalid bytes: The bytes from \365 up are not valid in current UTF-8, but in the earlier standard the start bytes would have continued as follows (the row label plus the column number gives the start byte):

     0            1            2            3
360  xxxxxx       1xxxxxx      2xxxxxx      3xxxxxx
364  4xxxxxx      5xxxxxx      6xxxxxx      7xxxxxx
370  xxxxxxxx     1xxxxxxxx    2xxxxxxxx    3xxxxxxxx
374  xxxxxxxxxx   1xxxxxxxxxx

This would have allowed any 31-bit value to be represented, since 2^31-1 is octal 17777777777. Extending the pattern one step further, one could theoretically use \376 as xxxxxxxxxxxx, allowing 6 continuation bytes and supporting codes up to 2^36-1 (six continuation bytes of six bits each give 36 bits).
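
The capacity figures above follow from counting the leading 1 bits of the start byte. As a rough check, the hypothetical helper below (start_byte_capacity is an illustrative name, not part of any standard or library) computes the number of continuation bytes and the total number of payload bits for a start byte under the historical pattern, including the theoretical \376 extension.

```python
def start_byte_capacity(lead):
    """Return (continuation_bytes, payload_bits) for a start byte.

    Hypothetical helper: follows the historical pattern, so it also accepts
    \\365-\\375 and the theoretical \\376, which current UTF-8 rejects.
    """
    if not 0o300 <= lead <= 0o376:
        raise ValueError("not a start byte")
    ones = 0
    while lead & (0o200 >> ones):      # count leading 1 bits
        ones += 1
    continuation = ones - 1            # n leading ones: n-1 continuation bytes
    lead_bits = 7 - ones               # bits left after the terminating 0 bit
    return continuation, lead_bits + 6 * continuation


for lead in (0o364, 0o370, 0o375, 0o376):
    count, bits = start_byte_capacity(lead)
    print(f"\\{lead:o}: {count} continuation bytes, {bits} bits")
```

Under these assumptions \375 yields 5 continuation bytes and 31 bits, and \376 yields 6 continuation bytes and 36 bits.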