Unicode Tutorials - Herong's Tutorial Examples
"String" Class with Unicode Utility Methods
This section provides an introduction to "String" class methods added or modified since J2SE 5.0 to support Unicode character processing.
Since designers of J2SE 5.0 did not change the internal storage mechanism for the "String" class,
Unicode supplementary characters will be stored as surrogate "char" pairs in "String" objects.
In other words, a single supplementary character will take 2 storage positions in a "String" object.
If all characters in a "String" object are supplementary characters, the length of the "String" object
is twice the number of characters.
If a "String" object contains both BMP characters and supplementary characters, there is no 1-to-1
relation between Unicode character positions and "char" storage positions.
The n-th Unicode character may not be stored at the n-th or 2*n-th "char" position in a "String" object.
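The mismatch between character positions and "char" positions can be seen in a small sketch like the one below. The class name "SurrogateStorageDemo" and its helper methods are made up for this illustration. It also uses offsetByCodePoints(), a "String" method introduced in J2SE 5.0 that is not covered in this section, to find where the n-th code point starts.

```java
class SurrogateStorageDemo {
    // Builds a sample string from 3 code points:
    // U+0041 (BMP), U+1F600 (supplementary), U+0042 (BMP).
    static String sample() {
        return new StringBuilder()
            .appendCodePoint(0x41)
            .appendCodePoint(0x1F600)
            .appendCodePoint(0x42)
            .toString();
    }

    // Returns the "char" index where the n-th (0-based) code point starts.
    static int charIndexOfCodePoint(String s, int n) {
        return s.offsetByCodePoints(0, n);
    }

    public static void main(String[] args) {
        String s = sample();
        System.out.println(s.length());                       // 4 "char" values
        System.out.println(s.codePointCount(0, s.length()));  // 3 code points
        System.out.println(charIndexOfCodePoint(s, 2));       // 3, neither 2 nor 4
    }
}
```

The third character (0-based index 2) starts at "char" index 3, because only one of the two characters before it is a supplementary character.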
To help manage this inconvenience, the designers of J2SE 5.0 enhanced some existing methods and
added some new methods to the "String" class. Here are some examples:
- String(int[] codePoints, int offset, int count) constructor -
Allocates a new String that contains characters from a subarray of the Unicode code point array argument.
The offset argument is the index of the first code point of the subarray and the count argument
specifies the length of the subarray. The contents of the subarray are converted to chars;
subsequent modification of the int array does not affect the newly created string.
- String(char[] value) constructor -
Allocates a new String so that it represents the sequence of characters currently contained
in the character array argument. The contents of the character array are copied; subsequent modification
of the character array does not affect the newly created string.
- int length() -
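Returns the length of this string. The length is equal to the number of Unicode code units ("char" values) in the string,
which is not the number of Unicode characters if the string contains supplementary characters.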
- char charAt(int index) -
Returns the char value at the specified index. An index ranges from 0 to length() - 1.
The first char value of the sequence is at index 0, the next at index 1, and so on, as for array indexing.
If the char value specified by the index is a surrogate, the surrogate value is returned.
- int codePointAt(int index) -
Returns the character (Unicode code point) at the specified index.
The index refers to char values (Unicode code units) and ranges from 0 to length() - 1.
If the char value specified at the given index is in the high-surrogate range,
the following index is less than the length of this String,
and the char value at the following index is in the low-surrogate range,
then the supplementary code point corresponding to this surrogate pair is returned.
Otherwise, the char value at the given index is returned.
- int codePointCount(int beginIndex, int endIndex) -
Returns the number of Unicode code points in the specified text range of this String.
The text range begins at the specified beginIndex and extends to the char at index endIndex - 1.
Thus the length (in chars) of the text range is endIndex-beginIndex. Unpaired surrogates within
the text range count as one code point each.
- byte[] getBytes(Charset charset) -
Encodes this String into a sequence of bytes using the given charset, storing the result into a new byte array.
This method always replaces malformed-input and unmappable-character sequences with
this charset's default replacement byte array. The CharsetEncoder class should be used when more control
over the encoding process is required.
- int indexOf(int ch) -
Returns the index within this string of the first occurrence of the specified character.
If a character with value ch occurs in the character sequence represented by this String object,
then the index (in Unicode code units) of the first such occurrence is returned.
For values of ch in the range from 0 to 0xFFFF (inclusive), this is the smallest value k such that:
this.charAt(k) == ch, is true.
For other values of ch, it is the smallest value k such that:
this.codePointAt(k) == ch, is true.
In either case, if no such character occurs in this string, then -1 is returned.
- String substring(int beginIndex, int endIndex) -
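Returns a new string that is a substring of this string. The substring begins at the specified beginIndex
and extends to the char at index endIndex - 1. Thus the length (in chars) of the substring is endIndex-beginIndex.
Both indexes refer to "char" positions, so a substring boundary can fall between the two halves of a surrogate pair.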
- char[] toCharArray() - Converts this string to a new character array.
- static String valueOf(char[] data) -
Returns the string representation of the char array argument. The contents of the character array
are copied; subsequent modification of the character array does not affect the newly created string.
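
Several of the methods listed above can be tried together in a short test program. The following is only a sketch: the class name "StringCodePointDemo" and its helper methods are invented for this illustration, and U+1D11E (MUSICAL SYMBOL G CLEF) is an arbitrary choice of supplementary character. It also relies on Character.charCount() from the "Character" class and on StandardCharsets, which requires Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

class StringCodePointDemo {
    // U+1D11E is stored as the surrogate pair 0xD834 0xDD1E.
    static final int G_CLEF = 0x1D11E;

    // Builds a String with the String(int[], int, int) constructor.
    static String fromCodePoints(int... codePoints) {
        return new String(codePoints, 0, codePoints.length);
    }

    // Walks the string one code point at a time using codePointAt().
    static int[] toCodePoints(String s) {
        int[] result = new int[s.codePointCount(0, s.length())];
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            result[n++] = cp;
            i += Character.charCount(cp); // 1 for BMP, 2 for supplementary
        }
        return result;
    }

    // Number of bytes needed to encode the string in UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String s = fromCodePoints('a', G_CLEF, 'b');

        System.out.println(s.length());                            // 4
        System.out.println(s.codePointCount(0, s.length()));       // 3
        System.out.println(Integer.toHexString(s.charAt(1)));      // d834
        System.out.println(Integer.toHexString(s.codePointAt(1))); // 1d11e
        System.out.println(Integer.toHexString(s.codePointAt(2))); // dd1e
        System.out.println(s.indexOf(G_CLEF));                     // 1
        System.out.println(s.indexOf('b'));                        // 3
        System.out.println(s.toCharArray().length);                // 4
        System.out.println(utf8Length(s));                         // 6
    }
}
```

Note that codePointAt(2) returns the low surrogate 0xDD1E by itself, because index 2 points into the middle of the surrogate pair. Stepping through a string with Character.charCount(), as toCodePoints() does, avoids landing on such positions.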