What Are Paired Surrogates?

This section provides a quick introduction to paired surrogates, which are pairs of 16-bit integers used to represent Unicode code points in the U+10000...U+10FFFF range.

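The goal of UTF-16 encoding is to map every Unicode code point to a stream of 16-bit integers: one 16-bit integer (2 bytes) for code points in the U+0000...U+FFFF range, and two 16-bit integers (4 bytes) for code points in the U+10000...U+10FFFF range.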

The mapping for the U+0000...U+FFFF range is straightforward: each code point is stored as is in a single 16-bit integer.

But the mapping for the U+10000...U+10FFFF range is tricky, because the resulting 4-byte stream must be recognizable as 1 character in the U+10000...U+10FFFF range instead of 2 characters in the U+0000...U+FFFF range. This is achieved by using paired surrogates.

Paired surrogates are pairs of two 16-bit unsigned integers, each taken from the surrogate area between 0xD800 and 0xDFFF. Since no Unicode characters are assigned to code points in the surrogate area, paired surrogates can be easily recognized as 1 character in the U+10000...U+10FFFF range.

The UTF-16 specification defines that the first surrogate must be in the high surrogate area between 0xD800 and 0xDBFF and the second surrogate in the low surrogate area between 0xDC00 and 0xDFFF.
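To make the two areas concrete, here is a minimal Java sketch that tests whether a 16-bit value falls in the high or low surrogate area. The class and method names are hypothetical and chosen only for illustration. Java's own Character.isHighSurrogate() and Character.isLowSurrogate() perform the same range checks on char values.

```java
// Hypothetical helper class, for illustration only.
final class SurrogateRanges {

    // High (first) surrogate area: 0xD800...0xDBFF
    static boolean isHighSurrogate(int unit) {
        return unit >= 0xD800 && unit <= 0xDBFF;
    }

    // Low (second) surrogate area: 0xDC00...0xDFFF
    static boolean isLowSurrogate(int unit) {
        return unit >= 0xDC00 && unit <= 0xDFFF;
    }

    // A valid pair is a high surrogate followed by a low surrogate.
    static boolean isSurrogatePair(int first, int second) {
        return isHighSurrogate(first) && isLowSurrogate(second);
    }

    public static void main(String[] args) {
        System.out.println(isSurrogatePair(0xD83D, 0xDE00)); // true
        System.out.println(isSurrogatePair(0x0041, 0xDE00)); // false
    }
}
```

Because the two areas do not overlap, a decoder can tell from a single 16-bit value whether it is the first half of a pair, the second half, or an ordinary character in the U+0000...U+FFFF range.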

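Based on my understanding of the specification, the algorithm to convert a Unicode code point in the U+10000...U+10FFFF range to a surrogate pair is as follows: subtract 0x10000 from the code point to get a 20-bit value in the 0x00000...0xFFFFF range; add the upper 10 bits of that value to 0xD800 to get the first (high) surrogate; add the lower 10 bits to 0xDC00 to get the second (low) surrogate.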
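As a minimal sketch in Java, the conversion subtracts 0x10000 from the code point, then splits the remaining 20-bit value into two 10-bit halves that are added to 0xD800 and 0xDC00. This is an illustration of the standard UTF-16 arithmetic, not necessarily the author's original listing; the class name SurrogatePairSketch and the method name toSurrogatePair are hypothetical.

```java
// Hypothetical class, for illustration only.
final class SurrogatePairSketch {

    // Converts a code point in U+10000...U+10FFFF to {high, low} surrogates.
    static int[] toSurrogatePair(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException(
                "Code point outside U+10000...U+10FFFF: " + codePoint);
        }
        int offset = codePoint - 0x10000;     // 20-bit value: 0x00000...0xFFFFF
        int high = 0xD800 + (offset >> 10);   // upper 10 bits -> 0xD800...0xDBFF
        int low = 0xDC00 + (offset & 0x3FF);  // lower 10 bits -> 0xDC00...0xDFFF
        return new int[] { high, low };
    }

    public static void main(String[] args) {
        int[] pair = toSurrogatePair(0x1F600);
        System.out.println(String.format("%04X %04X", pair[0], pair[1])); // D83D DE00
    }
}
```

For comparison, Java's Character.toChars(int) returns the same two values as char elements when given a code point in this range.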

Exercise: Write an algorithm to convert a surrogate pair back to a Unicode code point.

Last update: 2009.
