What Are Paired Surrogates
This section provides a quick introduction to paired surrogates, which are pairs of 16-bit integers used to represent Unicode code points in the U+10000...U+10FFFF range.
The goal of UTF-16 encoding is to:
- Map Unicode code points in the range of U+0000...U+FFFF with 2 bytes (16 bits).
- Map Unicode code points in the range of U+10000...U+10FFFF with 4 bytes (32 bits).
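The two cases can be observed with a short Python sketch. It relies on Python's built-in "utf-16-be" codec (big-endian, no byte order mark); the two sample characters are arbitrary choices for illustration.

```python
# Encode one character from each range with UTF-16 (big-endian, no BOM).
bmp_char = "\u00e9"                 # U+00E9, inside U+0000...U+FFFF
supplementary_char = "\U0001F600"   # U+1F600, inside U+10000...U+10FFFF

bmp_bytes = bmp_char.encode("utf-16-be")
supplementary_bytes = supplementary_char.encode("utf-16-be")

# Show the byte count and the hexadecimal bytes for each character.
print(len(bmp_bytes), bmp_bytes.hex())
print(len(supplementary_bytes), supplementary_bytes.hex())
```

The first character takes 2 bytes and the second takes 4 bytes, matching the two goals listed above.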
The mapping for the U+0000...U+FFFF range is straightforward.
But the mapping for the U+10000...U+10FFFF range is tricky,
because we want the resulting 4-byte stream to be recognized as 1 character in the U+10000...U+10FFFF range
instead of 2 characters in the U+0000...U+FFFF range. This is achieved by using paired surrogates.
What Are Paired Surrogates?
Paired surrogates are pairs of two 16-bit unsigned integers in the surrogate area between 0xD800
and 0xDFFF. Since no Unicode characters are assigned to code points in the surrogate area,
paired surrogates can be easily recognized as 1 character in the U+10000...U+10FFFF range.
The UTF-16 specification requires that the first surrogate be in the high surrogate
area between 0xD800 and 0xDBFF and the second surrogate be in the low surrogate
area between 0xDC00 and 0xDFFF.
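To illustrate how these two areas let a reader of a UTF-16 stream tell 1 supplementary character from 2 ordinary ones, here is a minimal Python sketch. The function names is_high_surrogate, is_low_surrogate and count_characters are hypothetical helpers introduced for this example, and the input is assumed to be a list of 16-bit unsigned integers.

```python
def is_high_surrogate(unit):
    # High surrogate area: 0xD800 to 0xDBFF.
    return 0xD800 <= unit <= 0xDBFF

def is_low_surrogate(unit):
    # Low surrogate area: 0xDC00 to 0xDFFF.
    return 0xDC00 <= unit <= 0xDFFF

def count_characters(units):
    """Count characters in a list of 16-bit units.

    A high surrogate followed by a low surrogate counts as 1 character.
    Any other unit, including an unpaired surrogate, counts as 1 on its own.
    """
    count = 0
    i = 0
    while i < len(units):
        followed_by_low = i + 1 < len(units) and is_low_surrogate(units[i + 1])
        if is_high_surrogate(units[i]) and followed_by_low:
            i += 2  # consume both halves of the pair
        else:
            i += 1
        count += 1
    return count

# 3 units: 'A' (0x0041) followed by one surrogate pair.
print(count_characters([0x0041, 0xD83D, 0xDE00]))
```

Note that this sketch does not reject malformed input such as a lone low surrogate; a real decoder would need to decide how to handle that case.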
Based on my understanding of the specification, here is the algorithm to convert a Unicode code point in the
range of U+10000...U+10FFFF to a surrogate pair:
- Let U be the unsigned integer value of the given code point.
- Let U' = U - 0x10000. U' is less than or equal to 0xFFFFF and can now be
expressed as an unsigned 20-bit integer.
- Divide the 20 bits of U' into 2 blocks with 10 bits in each block as U' = 0byyyyyyyyyyxxxxxxxxxx, where the y bits are the high 10 bits and the x bits are the low 10 bits.
- Let S1 = 0xD800 + 0byyyyyyyyyy, or S1 = 0b110110yyyyyyyyyy. S1 is the first surrogate of the surrogate pair.
- Let S2 = 0xDC00 + 0bxxxxxxxxxx, or S2 = 0b110111xxxxxxxxxx. S2 is the second surrogate of the surrogate pair.
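The steps above can be sketched in Python as follows. The function name to_surrogate_pair is a hypothetical one chosen for this example; the variable names follow U, U', S1 and S2 from the list. The last lines compare the result with Python's built-in "utf-16-be" codec as a sanity check.

```python
def to_surrogate_pair(code_point):
    """Convert a code point in U+10000...U+10FFFF to (S1, S2)."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point must be in U+10000...U+10FFFF")
    u_prime = code_point - 0x10000   # U' fits in 20 bits
    high_bits = u_prime >> 10        # the upper 10 bits of U'
    low_bits = u_prime & 0x3FF       # the lower 10 bits of U'
    s1 = 0xD800 + high_bits          # first (high) surrogate
    s2 = 0xDC00 + low_bits           # second (low) surrogate
    return s1, s2

s1, s2 = to_surrogate_pair(0x1F600)
print(hex(s1), hex(s2))

# Sanity check against the built-in codec (big-endian, no BOM).
expected = chr(0x1F600).encode("utf-16-be")
print(s1.to_bytes(2, "big") + s2.to_bytes(2, "big") == expected)
```

The two boundary cases are a useful check: U+10000 should give the lowest pair (0xD800, 0xDC00) and U+10FFFF the highest pair (0xDBFF, 0xDFFF).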
Exercise: Write an algorithm to convert a surrogate pair back to a Unicode code point.