Unicode Tutorials - Herong's Tutorial Examples - v5.32, by Herong Yang
String Literals and Source Code Encoding
This section provides tutorial example on how to represent non-ASCII characters in UTF-8 encoding byte sequences as part of String literals in the Java source code.
In previous tutorials, we have learned how to represent non-ASCII characters in \uXXXX escape sequences as part of String literals in Java source code.
In this tutorial, we will learn how to represent non-ASCII characters in UTF-8 encoding byte sequences as part of String literals in Java source code.
Here is our test string that contains 2 Non-ASCII characters:
Delicious food U+1F60B takes time U+23F3 Where: U+1F60B: FACE SAVOURING DELICIOUS FOOD U+23F3: HOURGLASS WITH FLOWING SAND
Our test string should be displayed like this, if you have the correct Unicode font installed on your computer.
Delicious food takes time
In our first test program, we will continue to use \uXXXX sequences in our source code. Note that U+1F60B character needs to be encoded as a surrogate pair of \uD83D\uDE0B based on the UTF-16 encoding rule.
/* UnicodeStringLiterals.java * Copyright (c) 2019 HerongYang.com. All Rights Reserved. */ class UnicodeStringLiterals { public static void main(String[] arg) { try { String str = "Delicious food \uD83D\uDE0B takes time \u23F3"; System.out.print("\ncodePointCount(): " +str.codePointCount(0,str.length())); System.out.print("\n length(): " +str.length()); System.out.print("\n String dump: "); printString(str); } catch (Exception e) { System.out.print("\n"+e.toString()); } } public static void printString(String s) { char[] chars = s.toCharArray(); for (char c : chars) { byte hi = (byte) (c >>> 8); byte lo = (byte) (c & 0xff); System.out.print(String.format("%02X%02X ", hi, lo)); } } }
Compile and run it with Java 11:
C:\herong>javac UnicodeStringLiterals.java C:\herong>java UnicodeStringLiterals codePointCount(): 29 length(): 30 String dump: 0044 0065 006C 0069 0063 0069 006F 0075 0073 0020 0066 006F 006F 0064 0020 D83D DE0B 0020 0074 0061 006B 0065 0073 0020 0074 0069 006D 0065 0020 23F3
In our second test program, we will continue to use UTF-8 encoding byte sequences in our source code. This program is definitely better than the first program, because you can actually see non-ASCII characters displayed in the source code.
/* UnicodeStringLiteralsUTF8.java * Copyright (c) 2019 HerongYang.com. All Rights Reserved. */ import java.io.*; class UnicodeStringLiteralsUTF8 { public static void main(String[] arg) { try { String str = "Delicious food 😋 takes time ⏳"; System.out.print("\ncodePointCount(): " +str.codePointCount(0,str.length())); System.out.print("\n length(): " +str.length()); System.out.print("\n String dump: "); printString(str); } catch (Exception e) { System.out.print("\n"+e.toString()); } } public static void printString(String s) { char[] chars = s.toCharArray(); for (char c : chars) { byte hi = (byte) (c >>> 8); byte lo = (byte) (c & 0xff); System.out.print(String.format("%02X%02X ", hi, lo)); } } }
This time, we need to make sure that UnicodeStringLiteralsUTF8.java is saved as a UTF-8 encoding file and compile with the "-encoding UTF8" option:
C:\herong>javac -encoding UTF8 UnicodeStringLiteralsUTF8.java C:\herong>java UnicodeStringLiteralsUTF8 codePointCount(): 29 length(): 30 String dump: 0044 0065 006C 0069 0063 0069 006F 0075 0073 0020 0066 006F 006F 0064 0020 D83D DE0B 0020 0074 0061 006B 0065 0073 0020 0074 0069 006D 0065 0020 23F3
The output is identical to the first program. This proves that we have properly represented non-ASCII characters in UTF-8 encoding byte sequences as part of String literals in the Java source code.
Table of Contents
ASCII Character Set and Encoding
GB2312 Character Set and Encoding
GB18030 Character Set and Encoding
JIS X0208 Character Set and Encodings
UTF-8 (Unicode Transformation Format - 8-Bit)
UTF-16, UTF-16BE and UTF-16LE Encodings
UTF-32, UTF-32BE and UTF-32LE Encodings
Python Language and Unicode Characters
►Java Language and Unicode Characters
Unicode Versions Supported in Java History
'int' and 'String' - Basic Data Types for Unicode
"Character" Class with Unicode Utility Methods
Character.toChars() - "char" Sequence of Code Point
Character.getNumericValue() - Numeric Value of Code Point
"String" Class with Unicode Utility Methods
String.length() Is Not Number of Characters
String.toCharArray() Returns the UTF-16BE Sequence
►String Literals and Source Code Encoding
Encoding Conversion Programs for Encoded Text Files
Using Notepad as a Unicode Text Editor
Using Microsoft Word as a Unicode Text Editor