Unicode Tutorials - Herong's Tutorial Examples - v5.32, by Herong Yang
Character Set Encoding Maps - Unicode UTF-8
This section provides a tutorial example of analyzing and printing the character set encoding map for UTF-8 (Unicode Transformation Format - 8-bit), the most popular encoding for the Unicode character set.
Here is the output of my sample program, EncodingAnalyzer2.java, for UTF-8 encoding with Java SE 7:
C:\herong>java EncodingAnalyzer2 UTF-8
UTF-8 encoding:
00000000 > 00 - 0000007F > 7F
00000080 > C2 80 - 000000BF > C2 BF
000000C0 > C3 80 - 000000FF > C3 BF
00000100 > C4 80 - 0000013F > C4 BF
......
000007C0 > DF 80 - 000007FF > DF BF
00000800 > E0 A0 80 - 0000083F > E0 A0 BF
00000840 > E0 A1 80 - 0000087F > E0 A1 BF
00000880 > E0 A2 80 - 000008BF > E0 A2 BF
......
00000FC0 > E0 BF 80 - 00000FFF > E0 BF BF
00001000 > E1 80 80 - 0000103F > E1 80 BF
00001040 > E1 81 80 - 0000107F > E1 81 BF
00001080 > E1 82 80 - 000010BF > E1 82 BF
......
00001FC0 > E1 BF 80 - 00001FFF > E1 BF BF
00002000 > E2 80 80 - 0000203F > E2 80 BF
00002040 > E2 81 80 - 0000207F > E2 81 BF
00002080 > E2 82 80 - 000020BF > E2 82 BF
......
00002FC0 > E2 BF 80 - 00002FFF > E2 BF BF
00003000 > E3 80 80 - 0000303F > E3 80 BF
00003040 > E3 81 80 - 0000307F > E3 81 BF
00003080 > E3 82 80 - 000030BF > E3 82 BF
......
00003FC0 > E3 BF 80 - 00003FFF > E3 BF BF
00004000 > E4 80 80 - 0000403F > E4 80 BF
00004040 > E4 81 80 - 0000407F > E4 81 BF
00004080 > E4 82 80 - 000040BF > E4 82 BF
......
00004FC0 > E4 BF 80 - 00004FFF > E4 BF BF
00005000 > E5 80 80 - 0000503F > E5 80 BF
00005040 > E5 81 80 - 0000507F > E5 81 BF
00005080 > E5 82 80 - 000050BF > E5 82 BF
......
00005FC0 > E5 BF 80 - 00005FFF > E5 BF BF
00006000 > E6 80 80 - 0000603F > E6 80 BF
00006040 > E6 81 80 - 0000607F > E6 81 BF
00006080 > E6 82 80 - 000060BF > E6 82 BF
......
00006FC0 > E6 BF 80 - 00006FFF > E6 BF BF
00007000 > E7 80 80 - 0000703F > E7 80 BF
00007040 > E7 81 80 - 0000707F > E7 81 BF
00007080 > E7 82 80 - 000070BF > E7 82 BF
......
00007FC0 > E7 BF 80 - 00007FFF > E7 BF BF
00008000 > E8 80 80 - 0000803F > E8 80 BF
00008040 > E8 81 80 - 0000807F > E8 81 BF
00008080 > E8 82 80 - 000080BF > E8 82 BF
......
00008FC0 > E8 BF 80 - 00008FFF > E8 BF BF
00009000 > E9 80 80 - 0000903F > E9 80 BF
00009040 > E9 81 80 - 0000907F > E9 81 BF
00009080 > E9 82 80 - 000090BF > E9 82 BF
......
00009FC0 > E9 BF 80 - 00009FFF > E9 BF BF
0000A000 > EA 80 80 - 0000A03F > EA 80 BF
0000A040 > EA 81 80 - 0000A07F > EA 81 BF
0000A080 > EA 82 80 - 0000A0BF > EA 82 BF
......
0000AFC0 > EA BF 80 - 0000AFFF > EA BF BF
0000B000 > EB 80 80 - 0000B03F > EB 80 BF
0000B040 > EB 81 80 - 0000B07F > EB 81 BF
0000B080 > EB 82 80 - 0000B0BF > EB 82 BF
......
0000BFC0 > EB BF 80 - 0000BFFF > EB BF BF
0000C000 > EC 80 80 - 0000C03F > EC 80 BF
0000C040 > EC 81 80 - 0000C07F > EC 81 BF
0000C080 > EC 82 80 - 0000C0BF > EC 82 BF
......
0000CFC0 > EC BF 80 - 0000CFFF > EC BF BF
0000D000 > ED 80 80 - 0000D03F > ED 80 BF
0000D040 > ED 81 80 - 0000D07F > ED 81 BF
0000D080 > ED 82 80 - 0000D0BF > ED 82 BF
......
0000D7C0 > ED 9F 80 - 0000D7FF > ED 9F BF
0000D800 > 3F - 0000DFFF > 3F: Invalid range
0000E000 > EE 80 80 - 0000E03F > EE 80 BF
0000E040 > EE 81 80 - 0000E07F > EE 81 BF
0000E080 > EE 82 80 - 0000E0BF > EE 82 BF
......
0000EFC0 > EE BF 80 - 0000EFFF > EE BF BF
0000F000 > EF 80 80 - 0000F03F > EF 80 BF
0000F040 > EF 81 80 - 0000F07F > EF 81 BF
0000F080 > EF 82 80 - 0000F0BF > EF 82 BF
......
0000FFC0 > EF BF 80 - 0000FFFF > EF BF BF
00010000 > F0 90 80 80 - 0001003F > F0 90 80 BF
00010040 > F0 90 81 80 - 0001007F > F0 90 81 BF
00010080 > F0 90 82 80 - 000100BF > F0 90 82 BF
......
00020000 > F0 A0 80 80 - 0002003F > F0 A0 80 BF
00020040 > F0 A0 81 80 - 0002007F > F0 A0 81 BF
00020080 > F0 A0 82 80 - 000200BF > F0 A0 82 BF
......
0010FF40 > F4 8F BD 80 - 0010FF7F > F4 8F BD BF
0010FF80 > F4 8F BE 80 - 0010FFBF > F4 8F BE BF
0010FFC0 > F4 8F BF 80 - 0010FFFF > F4 8F BF BF

Code Point > Byte Sequence - Code Point > Byte Sequence
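Individual rows of this map can be spot-checked without the full analyzer. The following is only a minimal sketch, not the EncodingAnalyzer2.java program itself: the class name Utf8MapSketch and the helper toUtf8Hex() are hypothetical names introduced here for illustration. It encodes a handful of boundary code points with the JDK's built-in UTF-8 charset and prints them in the same "Code Point > Byte Sequence" style.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original tutorial code.
public class Utf8MapSketch {

    // Encodes a single code point with the JDK's UTF-8 charset and
    // returns the resulting bytes as upper-case hex separated by spaces.
    static String toUtf8Hex(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Boundary code points taken from the map above.
        int[] samples = {
            0x00, 0x7F, 0x80, 0x7FF, 0x800, 0xD7FF,
            0xD800, 0xE000, 0xFFFF, 0x10000, 0x10FFFF
        };
        for (int cp : samples) {
            System.out.println(String.format("%08X > %s", cp, toUtf8Hex(cp)));
        }
    }
}
```

A lone surrogate such as 0xD800 cannot be encoded, so String.getBytes() substitutes the charset's replacement byte, which for UTF-8 is 3F ('?'). This is consistent with the "Invalid range" row in the output.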
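The encoding map of UTF-8, which is the most popular encoding used for the Unicode character set, is complex. As the output shows, code points 0x0000 to 0x007F are encoded as 1 byte, 0x0080 to 0x07FF as 2 bytes, 0x0800 to 0xFFFF as 3 bytes, and 0x10000 to 0x10FFFF as 4 bytes. Code points 0xD800 to 0xDFFF, the surrogate range, are not valid: they have no UTF-8 byte sequence, and Java outputs the replacement byte 0x3F ('?') for them.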
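The byte-length ranges visible in the output above follow from the standard UTF-8 bit layout, in which each continuation byte carries 6 bits of the code point. The hand-written encoder below is only an illustrative sketch under that layout: the class name Utf8ManualSketch and its methods encode() and hex() are hypothetical names, and returning null for unencodable input is a choice made here, not behavior taken from the tutorial's program.

```java
// Hypothetical illustration class, not part of the original tutorial code.
public class Utf8ManualSketch {

    // Encodes one code point by hand using the UTF-8 bit layout.
    // Returns null for surrogates (0xD800-0xDFFF) and values outside
    // 0x0000-0x10FFFF, which have no UTF-8 byte sequence.
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            return null;
        }
        if (cp <= 0x7F) {            // 1 byte: 0xxxxxxx
            return new byte[] { (byte) cp };
        }
        if (cp <= 0x7FF) {           // 2 bytes: 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))
            };
        }
        if (cp <= 0xFFFF) {          // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))
            };
        }
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return new byte[] {
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))
        };
    }

    // Formats bytes as upper-case hex separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(encode(0x083F)));     // 3-byte row from the map
        System.out.println(hex(encode(0x10FF40)));   // 4-byte row from the map
        System.out.println(encode(0xD800) == null);  // surrogate: no encoding
    }
}
```

Because every continuation byte holds 6 bits, each row of the map covers 64 code points (for example 0x0800 to 0x083F), which is why the analyzer output advances in steps of 0x40 above 0x7F.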