What Is Character Encoding

This section provides a quick introduction to Unicode character encodings and other local language encodings that are supported by Java.

Character Encoding: A mapping scheme between the code points of a coded character set and sequences of bytes.

Coded Character Set: A character set in which each character has an assigned integral number.

Code Point: An integral number assigned to a character in a coded character set. As of Unicode 6.1, introduced in January 2012, Unicode code point values range from 0x0000 to 0x10FFFF.
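To make the idea of a code point concrete, here is a minimal Java sketch that prints the code point of the first character in a string. The class name CodePointDemo and the helper firstCodePoint() are hypothetical names chosen for this illustration; only String.codePointAt(), Character.toChars() and Character.MAX_CODE_POINT come from the JDK.

```java
class CodePointDemo {
    // Hypothetical helper: formats the first code point of text as U+XXXX.
    static String firstCodePoint(String text) {
        int codePoint = text.codePointAt(0);
        return String.format("U+%04X", codePoint);
    }

    public static void main(String[] args) {
        System.out.println(firstCodePoint("A"));       // U+0041
        System.out.println(firstCodePoint("\u20AC"));  // U+20AC (euro sign)

        // A character above U+FFFF is stored as two Java chars,
        // but it still has a single code point.
        String supplementary = new String(Character.toChars(0x1F600));
        System.out.println(firstCodePoint(supplementary)); // U+1F600

        // The upper end of the Unicode code point range.
        System.out.println(Integer.toHexString(Character.MAX_CODE_POINT)); // 10ffff
    }
}
```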

Unicode: A coded character set that contains all characters used in the written languages of the world and special symbols. As of Unicode 6.1, introduced in January 2012, the Unicode character set contains 110,181 characters.

The most direct Unicode encoding is UTF-32BE (Unicode Transformation Format - 32-bit Big Endian), which maps every Unicode character to a sequence of 4 bytes. For any given Unicode character, the UTF-32BE byte sequence is obtained by writing the character's code point as a 4-byte binary integer with the most significant byte first.
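The UTF-32BE rule described above can be sketched by hand in a few lines of Java. This is only an illustration: the class Utf32BeSketch and its methods encodeUtf32Be() and toHex() are hypothetical names, and the comparison against the JDK assumes that the running Java installation provides a charset named "UTF-32BE", which is not one of the charsets every Java platform is required to support.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

class Utf32BeSketch {
    // Hypothetical helper: writes a code point as 4 bytes, most significant byte first.
    static byte[] encodeUtf32Be(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("Not a Unicode code point: " + codePoint);
        }
        return new byte[] {
            (byte) (codePoint >>> 24),
            (byte) (codePoint >>> 16),
            (byte) (codePoint >>> 8),
            (byte) codePoint
        };
    }

    // Hypothetical helper: formats bytes as space-separated hex pairs.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        int euro = 0x20AC;
        byte[] manual = encodeUtf32Be(euro);
        System.out.println(toHex(manual)); // 00 00 20 AC

        // Assumption: the "UTF-32BE" charset is available in this JDK.
        byte[] viaJdk = new String(Character.toChars(euro))
                .getBytes(Charset.forName("UTF-32BE"));
        System.out.println(Arrays.equals(manual, viaJdk));
    }
}
```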

Other character encodings are also used on the Unicode character set, as described in previous chapters: UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32 and UTF-32LE.

Since the Unicode character set is a superset of many local language character sets, many other character encodings can also be applied to different subsets of the Unicode character set. Examples of local language character encodings discussed in this book include US-ASCII, ISO-8859-1, CP1252 and GB2312.
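One way to see that a local language encoding covers only a subset of Unicode is to ask a charset encoder whether it can encode a given character. The sketch below uses CharsetEncoder.canEncode() from the JDK; the class name SubsetCheck and the helper canEncode() are hypothetical names used only for this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class SubsetCheck {
    // Hypothetical helper: true if every character of text is covered by the charset.
    static boolean canEncode(Charset charset, String text) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String eAcute = "\u00E9"; // Latin small letter e with acute
        String euro = "\u20AC";   // euro sign

        System.out.println(canEncode(StandardCharsets.US_ASCII, "A"));      // true
        System.out.println(canEncode(StandardCharsets.US_ASCII, eAcute));   // false
        System.out.println(canEncode(StandardCharsets.ISO_8859_1, eAcute)); // true
        System.out.println(canEncode(StandardCharsets.ISO_8859_1, euro));   // false
        System.out.println(canEncode(StandardCharsets.UTF_8, euro));        // true
    }
}
```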

As of Java 11, released in September 2018, the Java language supports the Unicode character set defined in Unicode 10.0. The UTF-32, UTF-16, and UTF-8 encodings are fully supported in Java.
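As a rough illustration of how the Unicode encodings differ, the following sketch counts the bytes produced for the same character by UTF-8 and UTF-16BE, and optionally by UTF-32BE. The class EncodedLengths and the helper encodedLength() are hypothetical names; the UTF-32BE line assumes that the running JDK provides a charset with that name.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodedLengths {
    // Hypothetical helper: number of bytes the charset produces for text.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String euro = "\u20AC";
        String supplementary = new String(Character.toChars(0x1F600));

        System.out.println(encodedLength("A", StandardCharsets.UTF_8));              // 1
        System.out.println(encodedLength(euro, StandardCharsets.UTF_8));             // 3
        System.out.println(encodedLength(supplementary, StandardCharsets.UTF_8));    // 4

        System.out.println(encodedLength("A", StandardCharsets.UTF_16BE));           // 2
        System.out.println(encodedLength(euro, StandardCharsets.UTF_16BE));          // 2
        System.out.println(encodedLength(supplementary, StandardCharsets.UTF_16BE)); // 4

        // Assumption: the "UTF-32BE" charset is available in this JDK.
        System.out.println(encodedLength(euro, Charset.forName("UTF-32BE")));        // 4
    }
}
```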

Java can also perform local language character encodings. See the next tutorial for the full list of encodings supported in Java 11.

Java offers built-in classes to support the Unicode character set, local language character subsets, and their encodings.
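As one possible illustration of such built-in support, the sketch below uses the JDK class java.nio.charset.Charset to encode a string to bytes and decode it back. The class name RoundTrip and the helper roundTrip() are hypothetical names introduced here, and this is not necessarily the set of classes the tutorial has in mind.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RoundTrip {
    // Hypothetical helper: encodes text with the charset, then decodes the bytes back.
    static String roundTrip(String text, Charset charset) {
        ByteBuffer bytes = charset.encode(text);
        CharBuffer chars = charset.decode(bytes);
        return chars.toString();
    }

    public static void main(String[] args) {
        String euro = "\u20AC";

        // UTF-8 covers all of Unicode, so the text survives the round trip.
        System.out.println(euro.equals(roundTrip(euro, StandardCharsets.UTF_8))); // true

        // US-ASCII cannot represent the euro sign. Charset.encode() substitutes
        // the charset's replacement byte, so the original character is lost.
        System.out.println(roundTrip(euro, StandardCharsets.US_ASCII)); // ?
    }
}
```

The second call shows why the choice of encoding matters: a character outside the subset covered by an encoding is silently replaced rather than preserved.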
