UTF Encoding

The .NET framework can read ASCII, UTF-8, UTF-16 and UTF-32 streams without an explicitly specified encoding: a StreamReader detects UTF-8, UTF-16 and UTF-32 from the byte order mark (BOM) and otherwise assumes UTF-8, which also covers plain ASCII. Using only ASCII characters with UTF-8 encoding, or only direct characters with UTF-7 encoding, produces ASCII-compatible output. The other UTF encodings produce output that is not compatible with ASCII.
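The compatibility claim can be checked with a few lines of code. The following is a minimal sketch that uses Python's built-in codecs as a stand-in for the .NET encodings; the sample string and variable names are assumptions chosen for illustration.

```python
# Illustrative only: Python codecs stand in for the .NET encodings.
text = "Hello"  # ASCII-only sample, all letters are UTF-7 direct characters

ascii_bytes = text.encode("ascii")
utf7_bytes = text.encode("utf-7")
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")  # explicit byte order, no BOM
utf32_bytes = text.encode("utf-32-le")  # explicit byte order, no BOM

# UTF-7 and UTF-8 produce exactly the ASCII bytes for this input.
print("utf-7  matches ASCII:", utf7_bytes == ascii_bytes)
print("utf-8  matches ASCII:", utf8_bytes == ascii_bytes)

# UTF-16 and UTF-32 pad every code point with zero bytes, so they differ.
print("utf-16 matches ASCII:", utf16_bytes == ascii_bytes)
print("utf-32 matches ASCII:", utf32_bytes == ascii_bytes)
```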

Strings in the .NET framework are encoded in UTF-16.

The purpose of UTF is to represent Unicode code points (not characters!) as byte patterns. Code points range from U+0000 to U+10FFFF, so they need at most 21 bits. Each of the four encodings does this in a different way, with its own advantages (+) and disadvantages (-).
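To get a feeling for how differently the encodings treat the same code point, the sketch below counts the bytes each one needs. It uses Python's built-in codecs for illustration; the helper name encoded_sizes and the sample code points are assumptions, not part of any .NET API.

```python
def encoded_sizes(code_point):
    """Return the number of bytes each UTF encoding needs for one code point."""
    ch = chr(code_point)
    return {
        "utf-7": len(ch.encode("utf-7")),
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-be")),  # explicit byte order, no BOM
        "utf-32": len(ch.encode("utf-32-be")),  # explicit byte order, no BOM
    }

# One sample from ASCII, Latin-1, the rest of the BMP, and outside the BMP.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    print(f"U+{cp:04X}", encoded_sizes(cp))
```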

Encodings

UTF-7

Uses 7-bit code units with variable length (1 to 8 bytes per code point). It differentiates between direct characters (all alphanumeric characters and some symbols), optional direct characters (the remaining printable characters in U+0020 to U+007E except ‘~’, ‘\’, ‘+’ and space) and other characters. Direct characters are represented by their ASCII code; other characters are first encoded in UTF-16 (big endian) and then in modified base64, framed by ‘+’ and ‘-’. The ‘+’ character itself is encoded as ‘+-’.
 
– Not very space efficient: uses many bytes for non-ASCII characters.
– Encodes ‘+’ as ‘+-’.
+ Works on transmission channels that are only 7 bits wide.
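The two-step encoding of non-direct characters described above can be sketched as follows. This is a simplified illustration, not a full UTF-7 encoder: the helper utf7_shifted is a hypothetical name, and it always wraps the whole input in a single shifted sequence. Python's built-in "utf-7" codec is used to cross-check the result.

```python
import base64

def utf7_shifted(text):
    """Encode text as one UTF-7 shifted sequence: '+', modified base64, '-'."""
    utf16 = text.encode("utf-16-be")                     # step 1: UTF-16, big endian
    modified_b64 = base64.b64encode(utf16).rstrip(b"=")  # step 2: base64 without '=' padding
    return b"+" + modified_b64 + b"-"

euro = utf7_shifted("\u20ac")      # U+20AC EURO SIGN
print(euro)                        # b'+IKw-'
print(euro.decode("utf-7"))        # decodes back to the euro sign

# A literal '+' is escaped as '+-' so it cannot start a shifted sequence.
print("1+1".encode("utf-7"))       # b'1+-1'
```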
 

UTF-8

Uses 8-bit code units with variable length (1 to 4 bytes per code point). ASCII (U+0000 to U+007F) is represented as a single byte. Every byte of a multi-byte sequence has its MSB set to 1, so it cannot be mistaken for a 7-bit ASCII value. The highest bits of the first byte of a multi-byte sequence indicate the number of bytes in that sequence (110 for 2, 1110 for 3 and 11110 for 4 bytes); the following bytes all start with 10.
 
– Less space efficient than UTF-16 for U+0800 to U+FFFF, which need 3 bytes.
+ Compatible with ASCII.
+ Default encoding for XML documents. It does not need to be declared.
+ Standard byte-oriented sorting and searching algorithms can be used.
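The lead-byte patterns described above are enough to find out how long a sequence is without decoding it. The following is a minimal sketch; the function name utf8_sequence_length is an assumption made for this example.

```python
def utf8_sequence_length(first_byte):
    """Return the length of a UTF-8 sequence, judged by its first byte only."""
    if first_byte & 0b1000_0000 == 0b0000_0000:
        return 1  # 0xxxxxxx: plain ASCII
    if first_byte & 0b1110_0000 == 0b1100_0000:
        return 2  # 110xxxxx
    if first_byte & 0b1111_0000 == 0b1110_0000:
        return 3  # 1110xxxx
    if first_byte & 0b1111_1000 == 0b1111_0000:
        return 4  # 11110xxx
    # 10xxxxxx is a continuation byte and can never start a sequence.
    raise ValueError(f"not a valid UTF-8 lead byte: {first_byte:#04x}")

for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}", encoded.hex(" "), utf8_sequence_length(encoded[0]))
```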
 

UTF-16

Uses 16-bit code units with variable length (2 or 4 bytes per code point). The 65,536 code points of the BMP (Basic Multilingual Plane), minus the 2,048 special surrogate code points, are represented directly as 16-bit unsigned values. Non-BMP characters are represented as a surrogate pair of two 16-bit words.
 
– Non-BMP characters are treated as a special case.
– Byte order (big or little endian) complicates the protocol.
– Missing bytes can result in completely meaningless text.
+ Standard encoding for XML documents. It does not need to be declared.
+ U+0800 to U+FFFF need only 2 bytes.
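How a non-BMP code point is split into a surrogate pair can be sketched in a few lines. The helper name to_surrogate_pair is an assumption for this example; the arithmetic follows the standard UTF-16 definition and is cross-checked against Python's built-in codec.

```python
import struct

def to_surrogate_pair(code_point):
    """Split a non-BMP code point into a (high, low) UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points outside the BMP need a surrogate pair")
    offset = code_point - 0x10000        # 20 bits remain
    high = 0xD800 + (offset >> 10)       # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # lower 10 bits
    return high, low

high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")           # D83D DE00

# Cross-check: the built-in codec yields the same two 16-bit words.
print(struct.unpack(">2H", "\U0001F600".encode("utf-16-be")) == (high, low))
```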
 

UTF-32

Uses 32-bit code units with fixed length (always 4 bytes per code point).
– Least space efficient. Worst for ASCII-only text.
– Calculating the length in characters is not easier, because in Unicode one visible character can consist of more than one code point.
+ Truncation is easier.
+ Simplest transformation.
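Both the length caveat and the easy truncation can be illustrated with a combining character. This is a minimal sketch using Python's "utf-32-be" codec; the helper names count_code_points and truncate_code_points are assumptions for this example.

```python
def count_code_points(utf32_bytes):
    """In UTF-32 the number of code points is simply the byte count divided by 4."""
    return len(utf32_bytes) // 4

def truncate_code_points(utf32_bytes, max_code_points):
    """Cut at a multiple of 4 bytes; this never splits a code point."""
    return utf32_bytes[: 4 * max_code_points]

# 'e' followed by U+0301 COMBINING ACUTE ACCENT is displayed as one character.
text = "e\u0301"
data = text.encode("utf-32-be")

print(len(data))                   # 8 bytes
print(count_code_points(data))     # 2 code points, yet only one visible character

# Truncation keeps whole code points, but can still separate the accent from its base.
print(truncate_code_points(data, 1).decode("utf-32-be"))  # e
```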
