What Is Character Encoding

Unicode Tutorials - Herong's Tutorial Examples

∟What Is Character Encoding

This section provides a quick introduction of Unicode character encodings and other local language encodings that are supported by Java.

Character Encoding: A map scheme between code points of a coded character set and sequences of bytes.

Coded Character Set: A character set in which each character has an assigned integral number.

Code Point: An integral number assigned to a character in a coded character set. As of Unicode 6.1, introduced in January, 2012, Unicode code point values have a range from 0x0000 to 0x10FFFF.

Unicode: A coded character set that contains all characters used in the written languages of the world and special symbols. As as Unicode 6.1, introduced in January, 2012, Unicode character set contains 110,181 characters.

The standard Unicode encoding is called UTF-32BE (Unicode Transformation Format - 32-bit Big Endian), which maps every Unicode character to a sequence of 4 bytes. For any given Unicode character, the UTF-32BE encoded byte sequence can be obtained by putting the character's code point integer number in the 4-byte binary format with the most significant byte listed first.

There are also other character encodings used on the Unicode character set, as described in previous chapters:

UTF-32BE - The standard Unicode character encoding as mentioned above.
UTF-32LE - Same as UTF-32BE, except that the least significant byte is listed first.
UTF-16BE - Every Unicode character is mapped to a sequence of 2 or 4 bytes with the most significant byte listed first.
UTF-16LE - Same as UTF-16BE, except that the least significant byte is listed first.
UTF-8 - Every Unicode character is mapped to a sequence of 1, 2, 3 or 4 byte.

Since Unicode character set is a super set of many local language character sets, many other character encodings can also be applied to different subsets of the Unicode character set. Here are some examples of local language character encodings:

ASCII - The standard encoding for the ASCII character set.
ISO-8859-1 - The ISO standard encoding for Latin character set.
GBK - A standard encoding for simplified Chinese character set.
Big5 - A standard encoding for traditional Chinese character set.

As of Java 11, released in July 2011, Java language can support the Unicode character set defined in Unicode 10.0. UTF-32, UTF-16, and UTF-8 encodings are fully supported in Java.

Java can also help to you to perform local language character encodings too. See the next tutorial for full list of encodings supported in Java 11.

Java offers the following built-in classes to support Unicode character set, local language character subsets, and their encodings:

java.nio.charset.Charset - Defined in the JDK document as "A named mapping between sequences of sixteen-bit Unicode code units and sequences of bytes. This class defines methods for creating decoders and encoders and for retrieving the various names associated with a charset. Instances of this class are immutable." The "Charset" class represents a particular character encoding defined a particular character set.
java.nio.charset.CharsetEncoder - Defined in the JDK document as "An engine that can transform a sequence of sixteen-bit Unicode characters into a sequence of bytes in a specific charset."
java.nio.charset.CharsetDecoder - Defined in the JDK document as "An engine that can transform a sequence of bytes in a specific charset into a sequence of sixteen-bit Unicode characters."