Unicode
- Unicode maps integers called code points (in the range 0 to 0x10FFFF) to characters
- The first 128 code points (hex values 00 to 7F) are the same as ASCII
- The next 128 code points (0x80–0xFF) are the same as in ISO-8859-1 (Latin-1)
- An encoding is a mapping between Unicode code points and sequences of bytes
- https://en.wikibooks.org/wiki/Unicode/Character_reference/0000-0FFF
- http://www.unicode.org/charts/
- ASCII: http://www.unicode.org/charts/PDF/U0000.pdf
- Latin-1: http://www.unicode.org/charts/PDF/U0080.pdf
- Combining Diacritical Marks: http://www.unicode.org/charts/PDF/U0300.pdf
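The relationship between characters and code points can be sketched in Python, whose `ord`/`chr` builtins expose code points directly (a minimal illustration, not tied to any particular chart above):

```python
import unicodedata

# ord() gives a character's code point; chr() goes the other way
assert ord('a') == 0x61   # 'a' is U+0061, the same value as its ASCII code
assert ord('å') == 0xE5   # 'å' is U+00E5, the same value as in ISO-8859-1
assert chr(0x61) == 'a'

# U+0301 is a combining diacritical mark: 'e' followed by U+0301
# normalizes (NFC) to the single precomposed character 'é' (U+00E9)
assert unicodedata.normalize('NFC', 'e\u0301') == '\u00e9'
```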
Planes
- A plane is a contiguous group of 65,536 (= 2^16) code points
- There are 17 planes, identified by the numbers 0 to 16
- The Basic Multilingual Plane (BMP) is plane 0 (0000–FFFF)
- Planes 1–16 are called “supplementary planes”
- The code points in each plane have the hexadecimal values xx0000 to xxFFFF, where xx is a hex value from 00 to 10, signifying the plane to which the values belong
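Because each plane spans 0x10000 code points, the plane number is just the code point shifted right by 16 bits. A small Python sketch (the helper name `plane` is mine, not standard):

```python
def plane(cp: int) -> int:
    """Return the Unicode plane (0-16) a code point belongs to."""
    return cp >> 16  # each plane covers 0x10000 code points

assert plane(0x0041) == 0     # 'A' is in the BMP (plane 0)
assert plane(0x1F600) == 1    # emoji such as U+1F600 are in plane 1
assert plane(0x10FFFF) == 16  # the last valid code point is in plane 16
```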
UTF-8
- UTF-8 is a variable-width encoding that stores each code point in one to four bytes
- The first 128 code points (U+0000–U+007F) are encoded as single bytes identical to their ASCII codes
- Code points above U+007F are encoded with two to four bytes each
- UTF-8 is not byte-compatible with ISO-8859-1: code points above U+007F become multi-byte sequences, not single bytes
- Encoding design: https://en.wikipedia.org/wiki/UTF-8#Description
- Example: https://en.wikipedia.org/wiki/UTF-8#Examples
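The variable width is easy to observe with Python's codecs (a sketch; the byte values match the examples on the linked page):

```python
# Byte length grows with the code point's magnitude
assert 'a'.encode('utf-8') == b'a'                  # U+0061: 1 byte, same as ASCII
assert '©'.encode('utf-8') == b'\xc2\xa9'           # U+00A9: 2 bytes
assert '≠'.encode('utf-8') == b'\xe2\x89\xa0'       # U+2260: 3 bytes
assert '😀'.encode('utf-8') == b'\xf0\x9f\x98\x80'  # U+1F600: 4 bytes

# Not byte-compatible with ISO-8859-1, which encodes '©' as one byte
assert '©'.encode('iso-8859-1') == b'\xa9'
```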
UTF-16
- Encodes code points as one or two 16-bit code units
- Code points in the BMP are encoded as single 16-bit code units that are numerically equal to the corresponding code points
- Code points from the Supplementary Planes are encoded by pairs of 16-bit code units called surrogate pairs: https://en.wikipedia.org/wiki/UTF-16#Code_points_U.2B010000_to_U.2B10FFFF
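The surrogate-pair arithmetic can be checked against Python's own UTF-16 codec (a sketch of the algorithm described on the linked page):

```python
import struct

# A BMP code point is one 16-bit unit equal to the code point itself
assert 'a'.encode('utf-16-be') == b'\x00\x61'

# For a supplementary-plane code point: subtract 0x10000, then split the
# remaining 20 bits into a high and a low 10-bit half
cp = 0x1F600                # 😀, plane 1
v = cp - 0x10000
high = 0xD800 + (v >> 10)   # high (lead) surrogate
low = 0xDC00 + (v & 0x3FF)  # low (trail) surrogate
assert (high, low) == (0xD83D, 0xDE00)
assert struct.pack('>HH', high, low) == '😀'.encode('utf-16-be')
```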
UTF-32
- Uses exactly 32 bits (4 bytes) per Unicode code point
- The UTF-32 form of a character is a direct representation of its code point
- Example: 00 00 00 61 is big-endian UTF-32 for code point U+0061, which is 'a'
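The same characters used above, sketched with Python's big-endian UTF-32 codec:

```python
# Each code point becomes exactly four bytes holding its value
assert 'a'.encode('utf-32-be') == b'\x00\x00\x00\x61'   # U+0061
assert '😀'.encode('utf-32-be') == b'\x00\x01\xf6\x00'  # U+1F600
```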
Byte order mark (BOM)
- The BOM is the code point U+FEFF
- If the endianness of the decoder matches that of the encoder, the decoder sees the value 0xFEFF; an opposite-endian decoder instead reads the noncharacter value U+FFFE, which is reserved for this purpose, and this incorrect result signals that the remaining values must be byte-swapped
- In UTF-16, a BOM (U+FEFF) may be placed as the first character of a file or character stream
- The UTF-8 representation of the BOM is the byte sequence 0xEF,0xBB,0xBF
- The Unicode Standard neither requires nor recommends the use of the BOM for UTF-8
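Python's codecs make the BOM behavior visible (a sketch: the `utf-16` codec writes a BOM matching the platform's endianness, while `utf-8-sig` writes and strips the UTF-8 BOM bytes):

```python
# The 'utf-16' codec prepends a BOM; its value reveals the byte order
encoded = '€'.encode('utf-16')
assert encoded[:2] in (b'\xff\xfe', b'\xfe\xff')  # LE or BE BOM

# 'utf-8-sig' writes the UTF-8 form of the BOM (EF BB BF) ...
assert 'a'.encode('utf-8-sig') == b'\xef\xbb\xbfa'
# ... and strips it again on decoding
assert b'\xef\xbb\xbfa'.decode('utf-8-sig') == 'a'
```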
Escape sequences
- HTML entity: &#229; (decimal) or &#xE5; (hex) (= å)
- UTF-16 code units (non-standard %u escape, as used by JavaScript's escape()): %uXXXX, e.g. %u00e9 -> é
- UTF-8 bytes (URL percent-encoding): %XX[%XX][%XX][%XX], e.g. %c2%a9 -> ©, %e2%89%a0 -> ≠
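The entity and percent-encoded forms above can be reproduced with Python's standard library (a sketch; note that `quote` emits uppercase hex while `unquote` accepts either case):

```python
import html
from urllib.parse import quote, unquote

# HTML entities, decimal and hex
assert html.unescape('&#229;') == 'å'
assert html.unescape('&#xE5;') == 'å'

# URL percent-encoding of the UTF-8 bytes
assert quote('©') == '%C2%A9'
assert quote('≠') == '%E2%89%A0'
assert unquote('%c2%a9') == '©'
```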
Links
- http://www.darkcoding.net/software/finally-understanding-unicode-and-utf-8/
- http://de.selfhtml.org/inter/unicode.htm
- https://en.wikipedia.org/wiki/Plane_%28Unicode%29
- https://en.wikipedia.org/wiki/UTF-8
- https://en.wikipedia.org/wiki/UTF-16
- https://en.wikipedia.org/wiki/UTF-32
- https://en.wikipedia.org/wiki/Byte_order_mark