Saville: Unicode and Encodings

Here is a summary of all things Unicode:

Unicode

Character Reference and Code Tables

Planes

A plane is a contiguous group of 65,536 (= 2^16) code points
There are 17 planes, identified by the numbers 0 to 16
The Basic Multilingual Plane (BMP) is plane 0 (0000–FFFF)
Planes 1–16 are called “supplementary planes”
The code points in each plane have the hexadecimal values xx0000 to xxFFFF, where xx is a hex value from 00 to 10, signifying the plane to which the values belong
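The plane arithmetic above can be sketched in a few lines of Python. This is only an illustration; the helper name plane_of is made up for this note and is not part of any library.

```python
def plane_of(code_point: int) -> int:
    """Return the plane number (0-16) containing a Unicode code point.

    Hypothetical helper for illustration only.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # Each plane holds 0x10000 code points, so the plane number is
    # whatever remains after dropping the low 16 bits.
    return code_point >> 16


for ch in ("A", "€", "😀"):
    print(f"U+{ord(ch):04X} is in plane {plane_of(ord(ch))}")
```

Characters in plane 0 fit in four hex digits; anything in planes 1–16 needs five or six.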

UTF-8 Encoding

UTF-8 is a variable-width encoding that stores each code point in one to four bytes
The first 128 code points (U+0000–U+007F) are encoded as single bytes with the same values as ASCII
Code points above U+007F take between two and four bytes each
UTF-8 is not compatible with ISO-8859-1: the values 0x80–0xFF are single bytes in ISO-8859-1 but two-byte sequences in UTF-8
Encoding design: https://en.wikipedia.org/wiki/UTF-8#Description
Example: https://en.wikipedia.org/wiki/UTF-8#Examples
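To make the byte layout concrete, here is a minimal sketch of a UTF-8 encoder written by hand and checked against Python's built-in str.encode. The function name utf8_encode is an assumption for this note; in real code you would simply call str.encode("utf-8").

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point as UTF-8 (illustrative sketch only)."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode scalar value")
    if code_point < 0x80:
        # 1 byte: 0xxxxxxx (identical to ASCII)
        return bytes([code_point])
    if code_point < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


for ch in ("A", "é", "€", "😀"):
    encoded = utf8_encode(ord(ch))
    # The hand-rolled result should agree with the built-in codec.
    assert encoded == ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {encoded.hex(' ')} ({len(encoded)} bytes)")

# The same character is one byte in ISO-8859-1 but two in UTF-8.
print("é".encode("iso-8859-1").hex(), "vs", "é".encode("utf-8").hex())
```

The last line shows the ISO-8859-1 incompatibility: "é" is the single byte e9 there, but the two bytes c3 a9 in UTF-8.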

UTF-16

Encodes code points as one or two 16-bit code units
Code points in the BMP are encoded as single 16-bit code units that are numerically equal to the corresponding code points
Code points from the Supplementary Planes are encoded by pairs of 16-bit code units called surrogate pairs: https://en.wikipedia.org/wiki/UTF-16#Code_points_U.2B010000_to_U.2B10FFFF
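The surrogate-pair calculation can be sketched as follows. The function name utf16_code_units is invented for this note, and the sketch skips validation of the input range.

```python
def utf16_code_units(code_point: int) -> list:
    """Return the UTF-16 code units for one code point (illustrative sketch)."""
    if code_point < 0x10000:
        # BMP: the code unit equals the code point.
        return [code_point]
    # Supplementary planes: subtract 0x10000 to get a 20-bit offset,
    # then split it into two 10-bit halves.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)     # high (lead) surrogate
    low = 0xDC00 + (offset & 0x3FF)    # low (trail) surrogate
    return [high, low]


for ch in ("€", "😀"):
    units = utf16_code_units(ord(ch))
    print(f"U+{ord(ch):04X} ->", " ".join(f"{u:04X}" for u in units))
```

For U+1F600 this yields the pair D83D DE00, which matches the bytes produced by "😀".encode("utf-16-be").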

UTF-32

Byte Order Mark (BOM)

U+FEFF
If the decoder's byte order matches the encoder's, the decoder reads the value 0xFEFF. An opposite-endian decoder instead reads the bytes as U+FFFE, a noncharacter reserved for this purpose. That unexpected value is the hint to byte-swap the remaining values
In UTF-16, a BOM (U+FEFF) may be placed as the first character of a file or character stream
The UTF-8 representation of the BOM is the byte sequence 0xEF,0xBB,0xBF
The Unicode Standard neither requires nor recommends the use of the BOM for UTF-8
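A minimal sketch of BOM sniffing, using the BOM constants from Python's standard codecs module. The function name sniff_bom is made up for this note, and the sketch deliberately ignores UTF-32, whose little-endian BOM begins with the same two bytes as the UTF-16 little-endian one.

```python
import codecs


def sniff_bom(data: bytes):
    """Return (encoding name, BOM length), or (None, 0) if no BOM is found.

    Illustrative only: UTF-32 BOMs are not handled here.
    """
    if data.startswith(codecs.BOM_UTF8):       # EF BB BF
        return "utf-8", 3
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "utf-16-be", 2
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "utf-16-le", 2
    return None, 0


raw = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
encoding, skip = sniff_bom(raw)
print(encoding, raw[skip:].decode(encoding))

# Reading a little-endian BOM with the wrong byte order gives 0xFFFE.
print(hex(int.from_bytes(codecs.BOM_UTF16_LE, "big")))
```

When a BOM is expected, Python's "utf-16" and "utf-8-sig" codecs consume it automatically during decoding, so hand-rolled sniffing like this is mostly useful when the encoding is unknown in advance.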

HTML

URL Unicode Encoding

Compiled from:
