Character Set Encoding Comparison

This section provides a tutorial example that compares some commonly used character set encodings by number of characters, byte sequence size, and US-ASCII compatibility.

Here is the output of my sample program, EncodingCounter2.java, for the US-ASCII encoding:

C:\herong>javac EncodingCounter2.java

C:\herong>java EncodingCounter2 US-ASCII
US-ASCII encoding:
00000000 > 00 - 0000007F > 7F = 128
00000080 > XX - 000FFFFF > XX = 1048448
Total characters = 1048576
Valid characters = 128
Invalid characters = 1048448

This tells us that, of the 1,048,576 code points tested (0x00000000 through 0x000FFFFF), the US-ASCII character set covers only the first 128.

Run EncodingCounter2.java again with ISO-8859-1 (Latin 1) as the argument, and you will get:

C:\herong>java EncodingCounter2 ISO-8859-1
ISO-8859-1 encoding:
00000000 > 00 - 000000FF > FF = 256
00000100 > XX - 000FFFFF > XX = 1048320
Total characters = 1048576
Valid characters = 256
Invalid characters = 1048320

This tells us that the ISO-8859-1 character set has only 256 characters.
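The counts above can be approximated with a few lines of Java. The following is a minimal sketch, not the EncodingCounter2.java source: the class name EncodableCountSketch and the method countEncodable are hypothetical names chosen for illustration. It assumes that "valid" means the JDK's CharsetEncoder.canEncode() accepts the code point, and it tests the same range (0x000000 through 0x0FFFFF) that the output above reports.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical sketch: counts how many code points in a range a charset can encode.
public class EncodableCountSketch {

    // Returns the number of code points in [0, lastCodePoint] that the
    // named charset's encoder reports as encodable.
    static int countEncodable(String charsetName, int lastCodePoint) {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        int valid = 0;
        for (int cp = 0; cp <= lastCodePoint; cp++) {
            // Code points above 0xFFFF become a surrogate pair (two Java chars).
            String s = new String(Character.toChars(cp));
            // Lone surrogates (0xD800-0xDFFF) are rejected by every encoder.
            if (encoder.canEncode(s)) {
                valid++;
            }
        }
        return valid;
    }

    public static void main(String[] args) {
        String name = args.length > 0 ? args[0] : "US-ASCII";
        int last = 0xFFFFF; // same upper bound as the sample output
        int valid = countEncodable(name, last);
        System.out.println(name + " encoding:");
        System.out.println("Total characters = " + (last + 1));
        System.out.println("Valid characters = " + valid);
        System.out.println("Invalid characters = " + (last + 1 - valid));
    }
}
```

Running it with US-ASCII or ISO-8859-1 as the argument should give the same valid and invalid counts as shown above. Results for other encodings depend on which charsets the installed JDK provides.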

The following table is based on the output of the EncodingCounter2.java program. It provides a brief comparison of some commonly used encodings. The map size is the number of valid characters among the 1,048,576 code points tested:

Encoding          Map   US-ASCII
Name             Size   Compatible   Notes

US-ASCII          128   Y            7-bit characters only
ISO-8859-1        256   Y            8-bit (single byte) characters
CP1252            251   Y            One byte output, with code points up to 0x2122
UTF-8         1046528   Y            1-4 bytes, complex algorithm
UTF-16BE      1046528   N            2-4 bytes, code point and surrogate pairs
UTF-16LE      1046528   N            2-4 bytes, reversing byte pair of UTF-16BE
UTF-16        1046528   N            4-6 bytes, same as UTF-16BE with leading BOM
UTF-32BE      1046528   N            4 bytes, code point
UTF-32LE      1046528   N            4 bytes, reversing byte sequence of UTF-32BE
UTF-32        1046528   N            4 bytes, same as UTF-32BE
GB2312           7573   Y            1-2 bytes, Chinese 1980 standard
GBK             24068   Y            1-2 bytes, Chinese 1993 standard
GB18030       1046528   Y            1-4 bytes, superset of GBK, 2000 standard
BIG5            13831   Y            1-2 bytes, traditional Chinese character set
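The "US-ASCII Compatible" column can be checked mechanically. The sketch below is an illustration under one stated assumption: an encoding counts as compatible when each of the 128 US-ASCII characters encodes to exactly one byte whose value equals its code point. The class name AsciiCompatSketch and the method isAsciiCompatible are hypothetical and do not come from the sample program.

```java
import java.nio.charset.Charset;

// Hypothetical sketch: tests whether a charset encodes the 128 US-ASCII
// characters as the same single bytes that US-ASCII itself uses.
public class AsciiCompatSketch {

    static boolean isAsciiCompatible(String charsetName) {
        Charset cs = Charset.forName(charsetName);
        for (int cp = 0; cp < 128; cp++) {
            byte[] bytes = String.valueOf((char) cp).getBytes(cs);
            // Compatible only if the output is one byte equal to the code point.
            if (bytes.length != 1 || bytes[0] != (byte) cp) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Only charsets that every Java platform is required to support.
        String[] names = {"US-ASCII", "ISO-8859-1", "UTF-8",
                          "UTF-16BE", "UTF-16LE", "UTF-16"};
        for (String name : names) {
            System.out.println(name + " "
                + (isAsciiCompatible(name) ? "Y" : "N"));
        }
    }
}
```

The list in main() is limited to the charsets that the Java platform guarantees. Names such as GBK or BIG5 can be passed to isAsciiCompatible() as well, but Charset.forName() throws an exception if the installed JDK does not include them.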
