"String" Class with Unicode Utility Methods

Unicode Tutorials - Herong's Tutorial Examples

This section provides an introduction to the "String" class methods that have been added or modified since J2SE 5.0 to support Unicode character processing.

Because the designers of J2SE 5.0 did not change the internal storage mechanism of the "String" class, Unicode supplementary characters are stored as surrogate "char" pairs in "String" objects. In other words, a single supplementary character takes 2 storage positions in a "String" object. If all characters in a "String" object are supplementary characters, the length of the "String" object is twice the number of characters.
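
The length doubling can be checked with a short program. This is only a minimal sketch: the class name "SurrogateLengthDemo", the helper "repeat" and the choice of U+1F600 as the supplementary character are assumptions made for illustration.

```java
class SurrogateLengthDemo {
    // Builds a string made of `count` copies of one code point.
    static String repeat(int codePoint, int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            sb.appendCodePoint(codePoint);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1F600 lies outside the BMP, so each copy is stored as a surrogate pair.
        String s = repeat(0x1F600, 3);
        System.out.println(s.length());                      // 6 "char" positions
        System.out.println(s.codePointCount(0, s.length())); // 3 code points
    }
}
```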

If a "String" object contains both BMP characters and supplementary characters, there is no 1-to-1 relation between Unicode character positions and "char" storage positions. The n-th Unicode character may not be stored at the n-th or 2*n-th "char" position in a "String" object.
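
The mismatch between character positions and "char" positions can be seen in a small sketch like the one below. The class name "MixedPositionDemo", the helper "charIndexOf" and the sample text are assumptions chosen for illustration. The helper relies on the "offsetByCodePoints" method of the "String" class, which is not described in this section.

```java
class MixedPositionDemo {
    // Returns the "char" index where the n-th code point (0-based) starts.
    static int charIndexOf(String s, int n) {
        return s.offsetByCodePoints(0, n);
    }

    public static void main(String[] args) {
        // 'A', U+1F600, 'B', U+1D11E, 'C': 5 code points stored in 7 chars.
        String s = new StringBuilder()
            .append('A').appendCodePoint(0x1F600)
            .append('B').appendCodePoint(0x1D11E)
            .append('C').toString();
        int count = s.codePointCount(0, s.length());
        for (int n = 0; n < count; n++) {
            int i = charIndexOf(s, n);
            System.out.printf("code point %d: U+%04X at char index %d%n",
                n, s.codePointAt(i), i);
        }
        // The code points start at char indexes 0, 1, 3, 4 and 6,
        // which are neither n nor 2*n for most values of n.
    }
}
```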

To help manage this inconvenience, designers of J2SE 5.0 enhanced some existing methods and added some new methods in the "String" class. Here are some examples:

String(int[] codePoints, int offset, int count) constructor - Allocates a new String that contains characters from a subarray of the Unicode code point array argument. The offset argument is the index of the first code point of the subarray and the count argument specifies the length of the subarray. The contents of the subarray are converted to chars; subsequent modification of the int array does not affect the newly created string.
String(char[] value) constructor - Allocates a new String so that it represents the sequence of characters currently contained in the character array argument. The contents of the character array are copied; subsequent modification of the character array does not affect the newly created string.
int length() - Returns the length of this string. The length is equal to the number of Unicode code units in the string.
char charAt(int index) - Returns the char value at the specified index. An index ranges from 0 to length() - 1. The first char value of the sequence is at index 0, the next at index 1, and so on, as for array indexing. If the char value specified by the index is a surrogate, the surrogate value is returned.
int codePointAt(int index) - Returns the character (Unicode code point) at the specified index. The index refers to char values (Unicode code units) and ranges from 0 to length() - 1. If the char value specified at the given index is in the high-surrogate range, the following index is less than the length of this String, and the char value at the following index is in the low-surrogate range, then the supplementary code point corresponding to this surrogate pair is returned. Otherwise, the char value at the given index is returned.
int codePointCount(int beginIndex, int endIndex) - Returns the number of Unicode code points in the specified text range of this String. The text range begins at the specified beginIndex and extends to the char at index endIndex - 1. Thus the length (in chars) of the text range is endIndex-beginIndex. Unpaired surrogates within the text range count as one code point each.
byte[] getBytes(Charset charset) - Encodes this String into a sequence of bytes using the given charset, storing the result into a new byte array. This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement byte array. The CharsetEncoder class should be used when more control over the encoding process is required.
int indexOf(int ch) - Returns the index within this string of the first occurrence of the specified character. If a character with value ch occurs in the character sequence represented by this String object, then the index (in Unicode code units) of the first such occurrence is returned. For values of ch in the range from 0 to 0xFFFF (inclusive), this is the smallest value k such that: this.charAt(k) == ch, is true. For other values of ch, it is the smallest value k such that: this.codePointAt(k) == ch, is true. In either case, if no such character occurs in this string, then -1 is returned.
String substring(int beginIndex, int endIndex) - Returns a new string that is a substring of this string. The substring begins at the specified beginIndex and extends to the character at index endIndex - 1. Thus the length of the substring is endIndex-beginIndex. Both indexes refer to "char" values (Unicode code units), so a substring boundary can fall in the middle of a surrogate pair.
char[] toCharArray() - Converts this string to a new character array.
static String valueOf(char[] data) - Returns the string representation of the char array argument. The contents of the character array are copied; subsequent modification of the character array does not affect the newly created string.
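
Several of the methods listed above can be tried together on one short string. The following is a minimal sketch, not a complete test of the API: the class name "StringUnicodeMethodsDemo", the helper "sample" and the sample text are assumptions made for illustration, and UTF-8 is used only as an example charset.

```java
import java.nio.charset.StandardCharsets;

class StringUnicodeMethodsDemo {
    // Builds the sample string from code points: 'H', 'i', U+1F600, '!'.
    static String sample() {
        int[] codePoints = {0x48, 0x69, 0x1F600, 0x21};
        return new String(codePoints, 0, codePoints.length);
    }

    public static void main(String[] args) {
        String s = sample();

        // length() counts "char" values; codePointCount() counts characters.
        System.out.println(s.length());                      // 5
        System.out.println(s.codePointCount(0, s.length())); // 4

        // charAt() returns a single surrogate; codePointAt() combines the pair.
        System.out.printf("%04X%n", (int) s.charAt(2));      // D83D
        System.out.printf("%04X%n", (int) s.charAt(3));      // DE00
        System.out.printf("%X%n", s.codePointAt(2));         // 1F600
        // Starting at the low surrogate, only that "char" value is returned.
        System.out.printf("%X%n", s.codePointAt(3));         // DE00

        // indexOf() accepts a code point and returns a "char" index.
        System.out.println(s.indexOf(0x1F600));              // 2
        System.out.println(s.indexOf('!'));                  // 4

        // toCharArray() exposes the surrogate pair as 2 separate elements.
        System.out.println(s.toCharArray().length);          // 5

        // In UTF-8 the supplementary character takes 4 bytes: 1 + 1 + 4 + 1.
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // 7
    }
}
```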