String.length() Is Not Number of Characters

Unicode Tutorials - Herong's Tutorial Examples

This section provides a tutorial example showing the difference between the length() and codePointCount() methods. The difference between charAt(int index) and codePointAt(int index) is also demonstrated.

Because Unicode characters are stored in "String" objects as a mix of single "char" elements and surrogate "char" element pairs, a "char" element index does not map directly to a Unicode character position. The two diverge as soon as the string contains a supplementary character, which occupies two "char" elements.

Here is a tutorial example to show you this problem:

/* UnicodeStringIndex.java
 * Copyright (c) 2019 HerongYang.com. All Rights Reserved.
 */
import java.io.*;
class UnicodeStringIndex {
   static int[] unicodeList = {0x43, 0x2103, 0x1F132, 0x1F1A0, 
      0x37, 0x0667, 0x2166, 0x3286, 0x4E03, 0x1F108};
   public static void main(String[] arg) {
      try {    

// Constructing a String from a list of code points
         int num = unicodeList.length;
         String str = new String(unicodeList, 0, num);

// String length and code point count
         System.out.print("\n # of Unicode characters: "+num);
         System.out.print("\n        codePointCount(): "
            +str.codePointCount(0,str.length()));
         System.out.print("\n                length(): "
            +str.length());

// String element at a BMP position
         System.out.print("\n               charAt(1): "
            +Integer.toHexString(str.charAt(1)));
         System.out.print("\n          codePointAt(1): "
            +Integer.toHexString(str.codePointAt(1)));

// String element at a high surrogate position
         char high = str.charAt(2);
         System.out.print("\n               charAt(2): "
            +Integer.toHexString(high));
         System.out.print("\n          codePointAt(2): "
            +Integer.toHexString(str.codePointAt(2)));

// String element at a low surrogate position
         char low = str.charAt(3);
         System.out.print("\n               charAt(3): "
            +Integer.toHexString(low));
         System.out.print("\n          codePointAt(3): "
            +Integer.toHexString(str.codePointAt(3)));

// validating the surrogate char pair
         int code = Character.toCodePoint(high, low);
         System.out.print("\n Character.toCodePoint(): "
            +Integer.toHexString(code));
      } catch (Exception e) {
         System.out.print("\n"+e.toString());
      }
   }
}

Compile and run it with Java 11:

C:\herong>javac UnicodeStringIndex.java

C:\herong>java UnicodeStringIndex
 # of Unicode characters: 10
        codePointCount(): 10
                length(): 13
               charAt(1): 2103
          codePointAt(1): 2103
               charAt(2): d83c
          codePointAt(2): 1f132
               charAt(3): dd32
          codePointAt(3): dd32
 Character.toCodePoint(): 1f132

The output confirms that:

codePointCount() returns the number of Unicode characters (code points) in the "String" object.
length() returns the number of "char" elements in the "String" object. length() is always greater than or equal to codePointCount(). In this example, the difference of 3 comes from the 3 supplementary characters, each stored as a surrogate pair.
charAt() always returns the "char" value at the given "char" index. It returns the high surrogate "char" if the given index points to the first "char" of a supplementary character - see charAt(2) in the output. It returns the low surrogate "char" if the given index points to the second "char" of a supplementary character - see charAt(3) in the output.
codePointAt() also takes a "char" index, not a character position. It returns the correct code point value if the given index points to a BMP character - see codePointAt(1). It returns the correct code point value if the given index points to the first "char" of a supplementary character - see codePointAt(2). It returns only the low surrogate "char" value, not the full code point, if the given index points to the second "char" of a supplementary character - see codePointAt(3).
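
Since both charAt() and codePointAt() take "char" indexes, one way to read the character at a given Unicode character position is to convert the position to a "char" index first. The following is a minimal sketch, not part of the original tutorial program. It assumes the standard String.offsetByCodePoints() and String.codePoints() methods, and the class name CodePointIndexSketch and its helper methods are hypothetical names chosen for illustration:

```java
/* CodePointIndexSketch.java
 * Illustrative sketch only; class and method names are hypothetical.
 */
public class CodePointIndexSketch {

   // Converts a Unicode character position into a "char" index
   static int charIndexOf(String s, int position) {
      return s.offsetByCodePoints(0, position);
   }

   // Returns the code point at a Unicode character position
   static int codePointAtPosition(String s, int position) {
      return s.codePointAt(charIndexOf(s, position));
   }

   // Returns all code points, one array element per Unicode character
   static int[] toCodePoints(String s) {
      return s.codePoints().toArray();
   }

   public static void main(String[] arg) {
      // First five code points from the tutorial example above
      int[] input = {0x43, 0x2103, 0x1F132, 0x1F1A0, 0x37};
      String str = new String(input, 0, input.length);

      System.out.println("length(): "+str.length());             // 6
      System.out.println("codePointCount(): "
         +str.codePointCount(0, str.length()));                  // 5

      // Character position 3 is U+1F1A0, which starts at "char" index 4
      int index = charIndexOf(str, 3);
      System.out.println("char index of position 3: "+index);    // 4

      // Using position 3 directly as a "char" index hits the low
      // surrogate of the previous character, U+1F132
      System.out.println("charAt(3): "
         +Integer.toHexString(str.charAt(3)));                   // dd32
      System.out.println("codePointAt(4): "
         +Integer.toHexString(codePointAtPosition(str, 3)));     // 1f1a0
   }
}
```

Note that offsetByCodePoints() has to scan the string from the starting index, so looking up positions one by one in a loop is slower than iterating once over codePoints().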