Character Set Encoding Comparison

Unicode Tutorials - Herong's Tutorial Examples


This section provides a tutorial example that compares some commonly used character set encodings by number of characters, byte sequence size, and ASCII compatibility.

Here is the output of my sample program, EncodingCounter2.java, for US-ASCII encoding:

C:\herong>javac EncodingCounter2.java

C:\herong>java EncodingCounter2 US-ASCII
US-ASCII encoding:
00000000 > 00 - 0000007F > 7F = 128
00000080 > XX - 000FFFFF > XX = 1048448
Total characters = 1048576
Valid characters = 128
Invalid characters = 1048448

This tells us that the US-ASCII character set has only 128 characters.
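The source of EncodingCounter2.java is not listed in this section, so the following is only a minimal sketch of how such a count could be produced, not the program itself. It assumes that a character is "valid" when java.nio.charset.CharsetEncoder.canEncode() accepts it, and that surrogate code points (0xD800 - 0xDFFF) are counted as invalid. The class name EncodableCountSketch and the method countEncodable() are hypothetical names used for illustration:

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

class EncodableCountSketch {

    // Counts the code points in [first, last] that the named charset can encode.
    // Assumption: "valid" means CharsetEncoder.canEncode() returns true.
    static int countEncodable(String charsetName, int first, int last) {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        int valid = 0;
        for (int cp = first; cp <= last; cp++) {
            // Surrogate code points are not characters on their own; treat as invalid.
            if (cp >= 0xD800 && cp <= 0xDFFF) continue;
            String s = new String(Character.toChars(cp));
            if (encoder.canEncode(s)) valid++;
        }
        return valid;
    }

    public static void main(String[] args) {
        String name = args.length > 0 ? args[0] : "US-ASCII";
        int total = 0x100000; // code points 0x00000 through 0xFFFFF, as in the output above
        int valid = countEncodable(name, 0, total - 1);
        System.out.println(name + " encoding:");
        System.out.println("Total characters = " + total);
        System.out.println("Valid characters = " + valid);
        System.out.println("Invalid characters = " + (total - valid));
    }
}
```

Unlike the output above, this sketch prints only the totals and does not print the per-range byte sequences.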

Run EncodingCounter2 again with ISO-8859-1 (Latin 1) as the argument, and you will get:

C:\herong>java EncodingCounter2 ISO-8859-1
ISO-8859-1 encoding:
00000000 > 00 - 000000FF > FF = 256
00000100 > XX - 000FFFFF > XX = 1048320
Total characters = 1048576
Valid characters = 256
Invalid characters = 1048320

This tells us that the ISO-8859-1 character set has only 256 characters.

The following table is based on the output of the EncodingCounter2.java program. It provides a brief comparison of some commonly used encodings:

Encoding        Map    US-ASCII 
Name            Size   Compatible   Notes

US-ASCII         128   Y   7-bit characters only
ISO-8859-1       256   Y   8-bit (single byte) characters
CP1252           251   Y   One byte output, with code points up to 0x2122
UTF-8        1046528   Y   1-4 bytes, complex algorithm
UTF-16BE     1046528   N   2-4 bytes, code point and surrogate pairs
UTF-16LE     1046528   N   2-4 bytes, reversing byte pair of UTF-16BE
UTF-16       1046528   N   4-6 bytes, same as UTF-16BE with leading BOM
UTF-32BE     1046528   N   4 bytes, code point
UTF-32LE     1046528   N   4 bytes, reversing byte sequence of UTF-32BE
UTF-32       1046528   N   4 bytes, same as UTF-32BE
GB2312          7573   Y   1-2 bytes, Chinese 1980 standard
GBK            24068   Y   1-2 bytes, Chinese 1993 standard
GB18030      1046528   Y   1-4 bytes, superset of GBK, 2000 standard
BIG5           13831   Y   1-2 bytes, traditional Chinese character set
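
The byte sizes and the "US-ASCII Compatible" column in the table can be spot-checked with a few lines of Java. This is a rough sketch under stated assumptions, not part of the original program: "compatible" is taken to mean that each code point from 0x00 to 0x7F encodes to a single byte of the same value, and the class name EncodingPropertiesSketch and its two methods are hypothetical names used for illustration:

```java
import java.nio.charset.Charset;

class EncodingPropertiesSketch {

    // Number of bytes produced when one code point is encoded on its own.
    // Note: String.getBytes() substitutes a replacement byte sequence for
    // characters the charset cannot map, so only call this for mappable ones.
    static int encodedLength(String charsetName, int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(Charset.forName(charsetName)).length;
    }

    // Assumption: "US-ASCII compatible" means every code point 0x00..0x7F
    // encodes to exactly one byte whose value equals the code point.
    static boolean isAsciiCompatible(String charsetName) {
        Charset cs = Charset.forName(charsetName);
        for (int cp = 0; cp < 0x80; cp++) {
            byte[] b = String.valueOf((char) cp).getBytes(cs);
            if (b.length != 1 || b[0] != (byte) cp) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        String[] names = {"UTF-8", "UTF-16BE", "UTF-16", "UTF-32BE"};
        int[] samples = {0x41, 0xE9, 0x20AC, 0x10400};
        for (String name : names) {
            StringBuilder line = new StringBuilder(name);
            line.append(": ASCII compatible = ").append(isAsciiCompatible(name));
            for (int cp : samples) {
                line.append(String.format(", U+%04X = %d bytes", cp, encodedLength(name, cp)));
            }
            System.out.println(line);
        }
    }
}
```

With this check, "UTF-16" reports 4 bytes for U+0041 and 6 bytes for U+10400 because Java's UTF-16 encoder writes a 2-byte BOM before the character, which matches the "4-6 bytes" note in the table.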