String Literals and Source Code Encoding

This section provides a tutorial example on how to represent non-ASCII characters as UTF-8 encoded byte sequences in String literals in Java source code.

In previous tutorials, we have learned how to represent non-ASCII characters in \uXXXX escape sequences as part of String literals in Java source code.

In this tutorial, we will learn how to represent non-ASCII characters in UTF-8 encoding byte sequences as part of String literals in Java source code.

Here is our test string, which contains two non-ASCII characters:

Delicious food U+1F60B takes time U+23F3

Where: 
   U+1F60B: FACE SAVOURING DELICIOUS FOOD
   U+23F3: HOURGLASS WITH FLOWING SAND

Our test string should be displayed like this, if you have the correct Unicode font installed on your computer.

Delicious food 😋 takes time ⏳

In our first test program, we will continue to use \uXXXX sequences in our source code. Note that the U+1F60B character lies outside the 16-bit range of a single \uXXXX sequence, so it needs to be encoded as the surrogate pair \uD83D\uDE0B based on the UTF-16 encoding rule.

/* UnicodeStringLiterals.java
 * Copyright (c) 2019 HerongYang.com. All Rights Reserved.
 */
class UnicodeStringLiterals {
   public static void main(String[] arg) {
      try {
         String str = "Delicious food \uD83D\uDE0B takes time \u23F3";
         System.out.print("\ncodePointCount(): "
            +str.codePointCount(0,str.length()));
         System.out.print("\n        length(): "
            +str.length());
         System.out.print("\n     String dump: ");
         printString(str);
      } catch (Exception e) {
         System.out.print("\n"+e.toString());
      }
   }
   public static void printString(String s) {
      char[] chars = s.toCharArray();
      for (char c : chars) {
         byte hi = (byte) (c >>> 8);
         byte lo = (byte) (c & 0xff);
         System.out.print(String.format("%02X%02X ", hi, lo));
      }
   }
}

Compile and run it with Java 11:

C:\herong>javac UnicodeStringLiterals.java

C:\herong>java UnicodeStringLiterals
codePointCount(): 29
        length(): 30
     String dump: 0044 0065 006C 0069 0063 0069 006F 0075 0073 0020 
                  0066 006F 006F 0064 0020 D83D DE0B 0020 0074 0061 
                  006B 0065 0073 0020 0074 0069 006D 0065 0020 23F3
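
The D83D DE0B pair in the dump, and the difference between length() and codePointCount(), both follow from the UTF-16 encoding rule. The following is a minimal sketch of that arithmetic, checked against the standard Character.toChars() method. The class name SurrogatePairSketch and the method toSurrogates() are hypothetical names used only for illustration; they are not part of the original tutorial.

```java
/* SurrogatePairSketch.java
 * Illustrative sketch only; class and method names are hypothetical.
 */
class SurrogatePairSketch {
   // Manually encodes a supplementary code point (U+10000..U+10FFFF)
   // as a UTF-16 surrogate pair.
   static char[] toSurrogates(int codePoint) {
      int v = codePoint - 0x10000;              // 20-bit offset
      char hi = (char) (0xD800 + (v >>> 10));   // upper 10 bits
      char lo = (char) (0xDC00 + (v & 0x3FF));  // lower 10 bits
      return new char[] {hi, lo};
   }
   public static void main(String[] arg) {
      char[] manual = toSurrogates(0x1F60B);
      char[] library = Character.toChars(0x1F60B);
      // %X does not accept char arguments, so cast to int first
      System.out.print(String.format("\n    Manual pair: %04X %04X",
         (int) manual[0], (int) manual[1]));
      System.out.print(String.format("\n   Library pair: %04X %04X",
         (int) library[0], (int) library[1]));
      // Number of char values each code point occupies in a String
      System.out.print("\ncharCount(U+1F60B): "
         +Character.charCount(0x1F60B));
      System.out.print("\n charCount(U+23F3): "
         +Character.charCount(0x23F3));
   }
}
```

Because U+1F60B occupies two char values while U+23F3 occupies one, the test string has 29 code points but a length() of 30.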

In our second test program, we will write the non-ASCII characters directly in the String literal and save the source file in UTF-8 encoding. This version is easier to read than the first program, because you can actually see the non-ASCII characters displayed in the source code.

/* UnicodeStringLiteralsUTF8.java
 * Copyright (c) 2019 HerongYang.com. All Rights Reserved.
 */
class UnicodeStringLiteralsUTF8 {
   public static void main(String[] arg) {
      try {
         String str = "Delicious food 😋 takes time ⏳";
         System.out.print("\ncodePointCount(): "
            +str.codePointCount(0,str.length()));
         System.out.print("\n        length(): "
            +str.length());
         System.out.print("\n     String dump: ");
         printString(str);
      } catch (Exception e) {
         System.out.print("\n"+e.toString());
      }
   }
   public static void printString(String s) {
      char[] chars = s.toCharArray();
      for (char c : chars) {
         byte hi = (byte) (c >>> 8);
         byte lo = (byte) (c & 0xff);
         System.out.print(String.format("%02X%02X ", hi, lo));
      }
   }
}

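This time, we need to make sure that UnicodeStringLiteralsUTF8.java is saved as a UTF-8 encoded file and compiled with the "-encoding UTF8" option. Without this option, javac in Java 11 decodes source files using the platform default encoding, which may not be UTF-8 on Windows: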

C:\herong>javac -encoding UTF8 UnicodeStringLiteralsUTF8.java

C:\herong>java UnicodeStringLiteralsUTF8
codePointCount(): 29
        length(): 30
     String dump: 0044 0065 006C 0069 0063 0069 006F 0075 0073 0020 
                  0066 006F 006F 0064 0020 D83D DE0B 0020 0074 0061 
                  006B 0065 0073 0020 0074 0069 006D 0065 0020 23F3

The output is identical to that of the first program. This confirms that the compiler decoded the UTF-8 byte sequences in the source file into the same String value as the \uXXXX escape sequences in the first program.
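
To see which byte sequences a UTF-8 encoded source file actually stores for the two non-ASCII characters, we can encode them with the standard String.getBytes() method. The following is only a sketch; the class name Utf8BytesSketch and the method hex() are hypothetical names introduced for illustration. The literals use \uXXXX sequences so that the sketch compiles the same way regardless of the source file encoding.

```java
/* Utf8BytesSketch.java
 * Illustrative sketch only; class and method names are hypothetical.
 */
import java.nio.charset.StandardCharsets;
class Utf8BytesSketch {
   // Returns the UTF-8 bytes of a string as space-separated hex pairs
   static String hex(String s) {
      StringBuilder sb = new StringBuilder();
      for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
         // Mask to 0..255 so negative bytes print as two hex digits
         sb.append(String.format("%02X ", b & 0xff));
      }
      return sb.toString().trim();
   }
   public static void main(String[] arg) {
      System.out.print("\nU+1F60B in UTF-8: "
         +hex("\uD83D\uDE0B"));   // F0 9F 98 8B
      System.out.print("\n U+23F3 in UTF-8: "
         +hex("\u23F3"));         // E2 8F B3
   }
}
```

U+1F60B takes 4 bytes in UTF-8 and U+23F3 takes 3 bytes. These are the bytes that javac must decode correctly, which is why the source file encoding has to match the "-encoding" option.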