Unicode Tutorials - Herong's Tutorial Examples - v5.32, by Herong Yang
Character.toChars() - "char" Sequence of Code Point
This section provides a tutorial example on how to use the 'Character' class toChars() static method to convert Unicode code points to 'char' sequences, which correspond exactly to the byte sequences from the UTF-16BE encoding of those code points.
One interesting static method offered in the "Character" class is the "toChars(int codePoint)" method, which returns the "char" sequence for any valid Unicode code point. It returns 1 "char" if a BMP character is given, and 2 "char"s (a surrogate pair) if a supplementary character is given. If the given value is not a valid Unicode code point, it throws an IllegalArgumentException.
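To make the 1-or-2 "char" rule concrete, here is a minimal sketch that rebuilds the sequence by hand and compares it with "toChars()". It is not part of the original example: the class name "ToCharsSketch" and the helper "manualToChars()" are hypothetical names chosen for illustration, and the arithmetic is assumed to follow the standard UTF-16 surrogate pair rule.

```java
import java.util.Arrays;

class ToCharsSketch {

    // Hypothetical helper (not a JDK method): builds the UTF-16
    // "char" sequence of a code point by hand.
    static char[] manualToChars(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException(
                "Not a valid code point: " + Integer.toHexString(codePoint));
        }
        if (codePoint < 0x10000) {
            // BMP character: the code point value is the single "char"
            return new char[] { (char) codePoint };
        }
        // Supplementary character: split the 20-bit offset into
        // a high surrogate and a low surrogate
        int offset = codePoint - 0x10000;
        char high = (char) (0xD800 + (offset >>> 10));
        char low = (char) (0xDC00 + (offset & 0x3FF));
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        int[] samples = { 0x43, 0x2103, 0x1F132 };
        for (int codePoint : samples) {
            char[] viaApi = Character.toChars(codePoint);
            char[] byHand = manualToChars(codePoint);
            System.out.println(String.format(
                "U+%04X: %d char(s), same as manual: %b",
                codePoint, viaApi.length, Arrays.equals(viaApi, byHand)));
        }
    }
}
```

The manual version is only meant to show what "toChars()" is doing. In real code, calling "Character.toChars()" directly is the simpler choice.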
Here is a tutorial example on how to use "toChars()" and other related methods:
/* UnicodeCharacterToChars.java
 * Copyright (c) 2019 HerongYang.com. All Rights Reserved.
 */
import java.io.*;
import java.nio.*;
import java.nio.charset.*;
class UnicodeCharacterToChars {
   static int[] unicodeList = {0x43, 0x2103, 0x1F132, 0x1F1A0,
      0x20FFFF};
   static char hexDigit[] = {'0', '1', '2', '3', '4', '5', '6', '7',
      '8', '9', 'A', 'B', 'C', 'D', 'E', 'F'};
   public static void main(String[] arg) {
      try {
         for (int i=0; i<unicodeList.length; i++) {

            // Starting with the code point value
            int codePoint = unicodeList[i];

            // Dumping data in HEX numbers
            System.out.print("\n");
            System.out.print("\n Code point: "
               +intToHex(codePoint));

            // Getting Unicode character basic properties
            System.out.print("\n isDefined(): "
               +Character.isDefined(codePoint));
            System.out.print("\n getName(): "
               +Character.getName(codePoint));
            System.out.print("\n isBmpCodePoint(): "
               +Character.isBmpCodePoint(codePoint));
            System.out.print("\n isSupplementaryCodePoint(): "
               +Character.isSupplementaryCodePoint(codePoint));
            System.out.print("\n charCount(): "
               +Character.charCount(codePoint));

            // Getting surrogate char pair
            char charHigh = Character.highSurrogate(codePoint);
            char charLow = Character.lowSurrogate(codePoint);
            System.out.print("\n highSurrogate(): "
               +charToHex(charHigh));
            System.out.print("\n lowSurrogate(): "
               +charToHex(charLow));
            System.out.print("\n isSurrogatePair(): "
               +Character.isSurrogatePair(charHigh, charLow));

            // Getting char sequence
            char[] charSeq = Character.toChars(codePoint);
            System.out.print("\n toChars():");
            for (int j=0; j<charSeq.length; j++)
               System.out.print(" "+charToHex(charSeq[j]));

            // Getting UTF-16BE byte sequence
            int[] intArray = {codePoint};
            String charString = new String(intArray, 0, 1);
            byte[] utf16Seq = charString.getBytes("UTF-16BE");
            System.out.print("\n UTF-16BE byte sequence:");
            for (int j=0; j<utf16Seq.length; j++)
               System.out.print(" "+byteToHex(utf16Seq[j]));
         }
      } catch (Exception e) {
         System.out.print("\n"+e.toString());
      }
   }
   public static String byteToHex(byte b) {
      char[] a = { hexDigit[(b >> 4) & 0x0f], hexDigit[b & 0x0f] };
      return new String(a);
   }
   public static String charToHex(char c) {
      byte hi = (byte) (c >>> 8);
      byte lo = (byte) (c & 0xff);
      return byteToHex(hi) + byteToHex(lo);
   }
   public static String intToHex(int i) {
      char hi = (char) (i >>> 16);
      char lo = (char) (i & 0xffff);
      return charToHex(hi) + charToHex(lo);
   }
}
Compile and run it with Java 11:
C:\herong>javac UnicodeCharacterToChars.java
C:\herong>java UnicodeCharacterToChars
Code point: 00000043
isDefined(): true
getName(): LATIN CAPITAL LETTER C
isBmpCodePoint(): true
isSupplementaryCodePoint(): false
charCount(): 1
highSurrogate(): D7C0
lowSurrogate(): DC43
isSurrogatePair(): false
toChars(): 0043
UTF-16BE byte sequence: 00 43
Code point: 00002103
isDefined(): true
getName(): DEGREE CELSIUS
isBmpCodePoint(): true
isSupplementaryCodePoint(): false
charCount(): 1
highSurrogate(): D7C8
lowSurrogate(): DD03
isSurrogatePair(): false
toChars(): 2103
UTF-16BE byte sequence: 21 03
Code point: 0001F132
isDefined(): true
getName(): SQUARED LATIN CAPITAL LETTER C
isBmpCodePoint(): false
isSupplementaryCodePoint(): true
charCount(): 2
highSurrogate(): D83C
lowSurrogate(): DD32
isSurrogatePair(): true
toChars(): D83C DD32
UTF-16BE byte sequence: D8 3C DD 32
Code point: 0001F1A0
isDefined(): false
getName(): null
isBmpCodePoint(): false
isSupplementaryCodePoint(): true
charCount(): 2
highSurrogate(): D83C
lowSurrogate(): DDA0
isSurrogatePair(): true
toChars(): D83C DDA0
UTF-16BE byte sequence: D8 3C DD A0
Code point: 0020FFFF
isDefined(): false
java.lang.IllegalArgumentException
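The output confirms that:
Character.toChars() returns 1 "char" for a BMP character (U+0043, U+2103) and 2 "char"s for a supplementary character (U+1F132, U+1F1A0), matching the value from charCount().
For a supplementary character, the "char" sequence from toChars() is the same surrogate pair returned by highSurrogate() and lowSurrogate().
The "char" sequence from toChars() matches the UTF-16BE byte sequence: each "char" maps to 2 bytes, high byte first.
highSurrogate() and lowSurrogate() do not validate their input. For a BMP character they still return values, but those values do not form a surrogate pair, as isSurrogatePair() reports.
U+1F1A0 is a valid code point with no character assigned in the Unicode version supported by this Java release, so isDefined() returns false and getName() returns null. toChars() still returns its surrogate pair.
0x20FFFF is beyond the Unicode code point range (the maximum is U+10FFFF), so the next call after isDefined(), getName(), throws an IllegalArgumentException and the program stops.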
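The match between toChars() and UTF-16BE can also be checked in code rather than by reading the HEX dump. The following is a minimal sketch, not part of the original example: the class name "ToCharsVsUtf16Sketch" and the helpers "charsToBigEndianBytes()" and "matchesUtf16Be()" are hypothetical names. It also guards against invalid code points with "Character.isValidCodePoint()" before calling toChars(), so a value like 0x20FFFF is reported instead of ending the loop with an exception.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ToCharsVsUtf16Sketch {

    // Hypothetical helper: writes each "char" as 2 bytes, high byte first
    static byte[] charsToBigEndianBytes(char[] chars) {
        byte[] out = new byte[chars.length * 2];
        for (int i = 0; i < chars.length; i++) {
            out[2 * i] = (byte) (chars[i] >>> 8);
            out[2 * i + 1] = (byte) (chars[i] & 0xFF);
        }
        return out;
    }

    // Hypothetical helper: returns true if the toChars() sequence,
    // written as big-endian bytes, equals the UTF-16BE encoding.
    // Invalid code points are reported as false instead of throwing.
    static boolean matchesUtf16Be(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            return false;
        }
        byte[] fromChars =
            charsToBigEndianBytes(Character.toChars(codePoint));
        byte[] fromEncoder = new String(new int[] { codePoint }, 0, 1)
            .getBytes(StandardCharsets.UTF_16BE);
        return Arrays.equals(fromChars, fromEncoder);
    }

    public static void main(String[] args) {
        int[] samples = { 0x43, 0x2103, 0x1F132, 0x1F1A0, 0x20FFFF };
        for (int codePoint : samples) {
            System.out.println(Integer.toHexString(codePoint)
                + " valid: " + Character.isValidCodePoint(codePoint)
                + ", matches UTF-16BE: " + matchesUtf16Be(codePoint));
        }
    }
}
```

Using "StandardCharsets.UTF_16BE" instead of the "UTF-16BE" charset name avoids the checked UnsupportedEncodingException. Code points in the surrogate range (U+D800 to U+DFFF) are a separate edge case that this sketch does not cover.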