16-bit Unicode

What is 16-bit Unicode?

UTF-16 (16-bit Unicode Transformation Format), also referred to as 16-bit Unicode, is the encoding form used to represent all 1,112,064 possible characters of the Unicode character set, using sequences of one or two 16-bit code units.

Three alternative encoding schemes (UTF-16, UTF-16BE, and UTF-16LE) accompany the fundamental 16-bit code-unit form; they serialize the 16-bit code units into sequences of 8-bit bytes (octets), differing in byte order and in whether a byte order mark is used. At the outset, Unicode was devised as a pure 16-bit encoding intended to encompass all contemporary scripts.
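
As a minimal sketch of this in Python (the sample string is chosen purely for illustration), the three schemes can be compared directly: the "utf-16" codec prepends a byte order mark and uses the platform's native byte order, while "utf-16-be" and "utf-16-le" fix the byte order and omit the mark.

```python
# Sketch: the three UTF-16 encoding schemes serialize 16-bit code units to bytes.
text = "A\u00E9"  # 'A' (U+0041) and e-acute (U+00E9), each a single 16-bit code unit

print(text.encode("utf-16"))     # BOM first, then code units in native byte order
print(text.encode("utf-16-be"))  # b'\x00A\x00\xe9' -- big-endian, no BOM
print(text.encode("utf-16-le"))  # b'A\x00\xe9\x00' -- little-endian, no BOM
```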

Over time, however, it became clear that 16 bits were not sufficient for the user community, in part because of the addition of approximately 14,500 composite characters included for compatibility with existing character sets. This led to the creation of UTF-16. With UTF-16, about 60,000 characters can be represented as single 16-bit code units, and roughly one million additional characters can be reached through surrogate pairs.

Two ranges of Unicode code values are reserved for the high and low halves of these pairs: high surrogates fall between 0xD800 and 0xDBFF, and low surrogates fall between 0xDC00 and 0xDFFF. Characters that require surrogate pairs are comparatively rare, because the most commonly used characters are already encoded within the first 65,536 code points. These frequently used characters therefore need only a single UTF-16 code unit per code point, striking a good balance between ease of processing and storage efficiency. UTF-16 is one of the standard encoding forms defined by the Unicode Standard.
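
The surrogate mechanism itself is a small bit-level calculation. The Python sketch below shows one way to derive the pair for a supplementary code point; the helper name to_surrogate_pair and the example code point U+1F600 are purely illustrative.

```python
# Sketch: splitting a supplementary code point into a UTF-16 surrogate pair.
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into (high, low) surrogate code units."""
    assert 0x10000 <= code_point <= 0x10FFFF
    offset = code_point - 0x10000        # 20-bit offset into the supplementary range
    high = 0xD800 + (offset >> 10)       # top 10 bits    -> 0xD800..0xDBFF
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> 0xDC00..0xDFFF
    return high, low

high, low = to_surrogate_pair(0x1F600)
print(f"U+1F600 -> U+{high:04X} U+{low:04X}")   # U+1F600 -> U+D83D U+DE00
```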

Is Unicode a 16-bit encoding?

Unicode text can be stored in either an 8-bit or a 16-bit encoding form, depending on the kind of data being encoded. As a rule, each character in the 16-bit encoding form is two bytes wide. A code point in this form is usually written as U+hhhh, where hhhh is the character's hexadecimal code point. Most of the world's major languages can be encoded with this form, which yields in excess of 65,000 code values.
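
For example, the short Python sketch below (with arbitrarily chosen sample characters) prints each character's U+hhhh code point and confirms that characters from this 16-bit range occupy exactly one two-byte code unit in UTF-16.

```python
# Sketch: characters in the 16-bit range take a single code unit in UTF-16.
for ch in ("A", "\u00E9", "\u4E2D"):          # U+0041, U+00E9, U+4E2D
    data = ch.encode("utf-16-be")             # big-endian, no byte order mark
    print(f"U+{ord(ch):04X}: {len(data) // 2} code unit(s), bytes {data.hex()}")
# U+0041: 1 code unit(s), bytes 0041
# U+00E9: 1 code unit(s), bytes 00e9
# U+4E2D: 1 code unit(s), bytes 4e2d
```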

The Unicode Standard also provides a mechanism for encoding up to about one million additional characters. This extension mechanism represents each extended, or supplementary, character with a pair of code units: a high surrogate followed by a low surrogate. High surrogate values range from U+D800 to U+DBFF, and low surrogate values range from U+DC00 to U+DFFF.
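
Reading such a pair back is the reverse of the calculation shown earlier. The Python sketch below (the helper name from_surrogate_pair is illustrative) recombines a high and a low surrogate into a single supplementary code point and checks the result against Python's own UTF-16 encoder.

```python
# Sketch: combining a high/low surrogate pair back into one code point.
def from_surrogate_pair(high: int, low: int) -> int:
    assert 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

print(f"U+{from_surrogate_pair(0xD83D, 0xDE00):04X}")   # U+1F600
print("\U0001F600".encode("utf-16-be").hex())           # d83dde00 -- the same pair
```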

Can Unicode text be represented in more than one way?

Unicode text can be represented in several different forms, such as UTF-8, UTF-16, and UTF-32. All of them can represent the entirety of the Unicode repertoire; they differ only in the bit length of their code units. In addition, UTS #6: A Standard Compression Scheme for Unicode (SCSU) defines a compressed representation.
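
A short Python sketch (with an arbitrary sample string mixing basic and supplementary characters) makes the difference concrete: all three encodings round-trip the same text, but the byte lengths differ with the code unit size.

```python
# Sketch: one string, three Unicode encoding forms.
text = "A\u00E9\u4E2D\U0001F600"              # 4 characters, incl. one supplementary
for name in ("utf-8", "utf-16-be", "utf-32-be"):
    data = text.encode(name)
    assert data.decode(name) == text          # every form round-trips the same text
    print(f"{name:9s} {len(data):2d} bytes")  # 10, 10, and 16 bytes respectively
```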

What is a UTF?

The Unicode Transformation Format (UTF) is a character encoding format able to represent every Unicode code point. The most commonly used variety is UTF-8, a variable-length encoding built on 8-bit code units and designed to be backward compatible with ASCII. The Unicode Transformation Format is also sometimes called the Universal Transformation Format.
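
The ASCII compatibility can be checked in a couple of lines of Python; the sample text here is arbitrary, but any purely ASCII string behaves the same way.

```python
# Sketch: for ASCII-only text, UTF-8 bytes are identical to ASCII bytes.
text = "Hello, UTF!"
assert text.encode("utf-8") == text.encode("ascii")
print(text.encode("utf-8"))   # b'Hello, UTF!'
```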

Unicode is served by two families of encodings: the Unicode Transformation Format (UTF) encodings and the Universal Character Set (UCS) encodings. Both map the range of Unicode code points to sequences of coded values, and the number in each encoding's name indicates the size of those code values (bits for the UTF encodings, bytes for the UCS encodings). In this way, every character is uniquely identified by its assigned code point, regardless of which encoding is used.
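
As a final small illustration in Python (the characters are arbitrary), the code point is the stable identifier: converting a character to its code point and back recovers the same character, no matter which byte encoding is later applied.

```python
# Sketch: each character maps to exactly one code point, and back.
for ch in ("A", "\u00E9", "\U0001F600"):
    cp = ord(ch)              # character -> code point
    assert chr(cp) == ch      # code point -> character
    print(f"{ch!r} has code point U+{cp:04X}")
```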