Unicode Tutorials - Herong's Tutorial Examples


Copyright © 1995-2022 Herong Yang. All rights reserved.

This Unicode tutorial book is a collection of notes and sample code written by the author while he was learning Unicode. Topics include Character Sets and Encodings; GB2312/GB18030 Character Set and Encodings; JIS X0208 Character Set and Encodings; Unicode Character Set; Basic Multilingual Plane (BMP); Unicode Transformation Formats (UTF); Surrogates and Supplementary Characters; Unicode Character Blocks; Java Character Set and Encoding; Java Encoding Maps, Counts and Conversion. Updated in 2022 (Version v5.31) with minor changes.

Table of Contents

About This Book

Character Sets and Encodings

What Is Character Set

Commonly Used Character Sets and Encodings

ASCII Character Set and Encoding


Listing of ASCII Characters and Encoded Bytes

GB2312 Character Set and Encoding

GB2312 Character Set for Chinese Characters

GB2312 Encoding for GB2312 Character Set

Relation of GB2312 and Unicode

GB18030 Character Set and Encoding

History of GB Character Sets

GB18030 Encoding for GB18030 Character Set

JIS X0208 Character Set and Encodings

JIS X0208 Character Set for Japanese Characters

JIS X0208 Character Code Values

EUC-JP Encoding

ISO-2022-JP Encoding

Shift-JIS Encoding

Unicode Character Set

What Is Unicode

Examples of Unicode Characters

Unique Features of Unicode

Unicode Standard Releases

Code Point Blocks

Unicode 13.0 Character Samples

Unicode 8.0 Character Samples

Unicode 7.0 Character Samples

Unicode 6.0 Character Samples

Unicode 5.0 Character Samples

Unicode 4.0 Character Samples

UTF-8 (Unicode Transformation Format - 8-Bit)

UTF-8 Encoding

UTF-8 Encoding Algorithm

Features of UTF-8 Encoding

UTF-16, UTF-16BE and UTF-16LE Encodings

What Are Paired Surrogates

UTF-16 Encoding

UTF-16BE Encoding

UTF-16LE Encoding

UTF-32, UTF-32BE and UTF-32LE Encodings

UTF-32 Encoding

UTF-32BE Encoding

UTF-32LE Encoding

Java Language and Unicode Characters

Unicode Versions Supported in Java - History

'int' and 'String' - Basic Data Types for Unicode

"Character" Class with Unicode Utility Methods

Character.toChars() - "char" Sequence of Code Point

Character.getNumericValue() - Numeric Value of Code Point

"String" Class with Unicode Utility Methods

String.length() Is Not Number of Characters

String.toCharArray() Returns the UTF-16BE Sequence

String Literals and Source Code Encoding

Character Encoding in Java

What Is Character Encoding

List of Supported Character Encodings in Java

EncodingSampler.java - Testing encode() Methods

Examples of CP1252 and ISO-8859-1 Encodings

Examples of US-ASCII, UTF-8, UTF-16 and UTF-32 Encodings

Examples of GB18030 Encoding

Testing decode() Methods

Character Set Encoding Maps

Character Set Encoding Map Analyzer

Character Set Encoding Maps - US-ASCII and ISO-8859-1/Latin 1

Character Set Encoding Maps - CP1252/Windows-1252

Character Set Encoding Maps - Unicode UTF-8

Character Set Encoding Maps - Unicode UTF-16, UTF-16BE, UTF-16LE

Character Set Encoding Maps - Unicode UTF-32, UTF-32BE, UTF-32LE

Character Counter Program for Any Given Encoding

Character Set Encoding Comparison

Encoding Conversion Programs for Encoded Text Files

\uxxxx - Entering Unicode Data in Java Programs

HexWriter.java - Converting Encoded Byte Sequences to Hex Values

EncodingConverter.java - Encoding Conversion Sample Program

Viewing Encoded Text Files in Web Browsers

Unicode Signs in Different Encodings

Using Notepad as a Unicode Text Editor

What Is Notepad

Opening UTF-8 Text Files

Opening UTF-16BE Text Files

Opening UTF-16LE Text Files

Saving Files in UTF-8 Option

Byte Order Mark (BOM) - FEFF - EFBBBF

Saving Files in "Unicode Big Endian" Option

Saving Files in "Unicode" Option

Supported Save and Open File Formats

Using Microsoft Word as a Unicode Text Editor

What Is Microsoft Word

Opening UTF-8 Text Files

Opening UTF-16BE Text Files

Opening UTF-16LE Text Files

Saving Files in "Unicode (UTF-8)" Option

Saving Files in "Unicode (Big-Endian)" Option

Saving Files in Unicode Option

Supported Save and Open File Formats

Using Microsoft Excel as a Unicode Text Editor

What Is Microsoft Excel

Opening UTF-8 Text Files

Opening UTF-16BE Text Files

Opening UTF-16LE Text Files

Saving UTF-8 Text Files

Saving Files in "Unicode Text (*.txt)" Option

Opening UTF-16 Text Files

Supported Save and Open File Formats

Unicode Fonts

What Is a Font

What Is a Unicode Font

Downloading and Installing GNU Unifont

Windows Tool "Character Map"

Unicode Code Point Blocks: 0000 - 0FFF

0000: C0 Controls and Basic Latin

0080: C1 Controls and Latin-1 Supplement

0100: Latin Extended-A

0180: Latin Extended-B

0250: IPA Extensions

02B0: Spacing Modifier Letters

0300: Combining Diacritical Marks

0370: Greek and Coptic

0400: Cyrillic

0500: Cyrillic Supplement

0530: Armenian

0590: Hebrew

0600: Arabic

0700: Syriac

0750: Arabic Supplement

0780: Thaana

07C0: N'Ko

0800: Samaritan

0840: Mandaic

0860: Syriac Supplement

08A0: Arabic Extended-A

0900: Devanagari

0980: Bengali

0A00: Gurmukhi

0A80: Gujarati

0B00: Oriya

0B80: Tamil

0C00: Telugu

0C80: Kannada

0D00: Malayalam

0D80: Sinhala

0E00: Thai

0E80: Lao

0F00: Tibetan

Unicode Code Point Blocks: 1000 - FFFF

1000: Myanmar

10A0: Georgian

1100: Hangul Jamo

1200: Ethiopic

1380: Ethiopic Supplement

13A0: Cherokee

1400: Unified Canadian Aboriginal Syllabics

1680: Ogham

16A0: Runic

1700: Tagalog

1720: Hanunoo

1740: Buhid

1760: Tagbanwa

1780: Khmer

1800: Mongolian

18B0: Unified Canadian Aboriginal Syllabics Extended

1900: Limbu

1950: Tai Le

1980: New Tai Lue

19E0: Khmer Symbols

1A00: Buginese

1A20: Tai Tham

1AB0: Combining Diacritical Marks Extended

1B00: Balinese

1B80: Sundanese

1BC0: Batak

1C00: Lepcha

1C50: Ol Chiki

1C80: Cyrillic Extended-C

1C90: Georgian Extended

1CC0: Sundanese Supplement

1CD0: Vedic Extensions

1D00: Phonetic Extensions

1D80: Phonetic Extensions Supplement

1DC0: Combining Diacritical Marks Supplement

1E00: Latin Extended Additional

1F00: Greek Extended

2000: General Punctuation

2070: Superscripts and Subscripts

20A0: Currency Symbols

20D0: Combining Diacritical Marks for Symbols

2100: Letterlike Symbols

2150: Number Forms

2190: Arrows

2200: Mathematical Operators

2300: Miscellaneous Technical

2400: Control Pictures

2440: Optical Character Recognition

2460: Enclosed Alphanumerics

2500: Box Drawing

2580: Block Elements

25A0: Geometric Shapes

2600: Miscellaneous Symbols

2700: Dingbats

27C0: Miscellaneous Mathematical Symbols-A

27F0: Supplemental Arrows-A

2800: Braille Patterns

2900: Supplemental Arrows-B

2980: Miscellaneous Mathematical Symbols-B

2A00: Supplemental Mathematical Operators

2B00: Miscellaneous Symbols and Arrows

2C00: Glagolitic

2C60: Latin Extended-C

2C80: Coptic

2D00: Georgian Supplement

2D30: Tifinagh

2D80: Ethiopic Extended

2DE0: Cyrillic Extended-A

2E00: Supplemental Punctuation

2E80: CJK Radicals Supplement

2F00: Kangxi Radicals

2FF0: Ideographic Description Characters

3000: CJK Symbols and Punctuation

3040: Hiragana

30A0: Katakana

3100: Bopomofo

3130: Hangul Compatibility Jamo

3190: Kanbun

31A0: Bopomofo Extended

31C0: CJK Strokes

31F0: Katakana Phonetic Extensions

3200: Enclosed CJK Letters and Months

3300: CJK Compatibility

3400: CJK Unified Ideographs Extension A

4DC0: Yijing Hexagram Symbols

4E00: CJK Unified Ideographs

A000: Yi Syllables

A490: Yi Radicals

A4D0: Lisu

A500: Vai

A640: Cyrillic Extended-B

A6A0: Bamum

A700: Modifier Tone Letters

A720: Latin Extended-D

A800: Syloti Nagri

A830: Common Indic Number Forms

A840: Phags-pa

A880: Saurashtra

A8E0: Devanagari Extended

A900: Kayah Li

A930: Rejang

A960: Hangul Jamo Extended-A

A980: Javanese

A9E0: Myanmar Extended-B

AA00: Cham

AA60: Myanmar Extended-A

AA80: Tai Viet

AAE0: Meetei Mayek Extensions

AB00: Ethiopic Extended-A

AB30: Latin Extended-E

AB70: Cherokee Supplement

ABC0: Meetei Mayek

AC00: Hangul Syllables

D7B0: Hangul Jamo Extended-B

D800: High Surrogates

DB80: High Private Use Surrogates

DC00: Low Surrogates

E000: Private Use Area

F900: CJK Compatibility Ideographs

FB00: Alphabetic Presentation Forms

FB50: Arabic Presentation Forms-A

FE00: Variation Selectors

FE10: Vertical Forms

FE20: Combining Half Marks

FE30: CJK Compatibility Forms

FE50: Small Form Variants

FE70: Arabic Presentation Forms-B

FF00: Halfwidth and Fullwidth Forms

FFF0: Specials

Unicode Code Point Blocks: 10000 - 11FFF

10000: Linear B Syllabary

10080: Linear B Ideograms

10100: Aegean Numbers

10140: Ancient Greek Numbers

10190: Ancient Symbols

101D0: Phaistos Disc

10280: Lycian

102A0: Carian

102E0: Coptic Epact Numbers

10300: Old Italic

10330: Gothic

10350: Old Permic

10380: Ugaritic

103A0: Old Persian

10400: Deseret

10450: Shavian

10480: Osmanya

104B0: Osage

10500: Elbasan

10530: Caucasian Albanian

10600: Linear A

10800: Cypriot Syllabary

10840: Imperial Aramaic

10860: Palmyrene

10880: Nabataean

108E0: Hatran

10900: Phoenician

10920: Lydian

10980: Meroitic Hieroglyphs

109A0: Meroitic Cursive

10A00: Kharoshthi

10A60: Old South Arabian

10A80: Old North Arabian

10AC0: Manichaean

10B00: Avestan

10B40: Inscriptional Parthian

10B60: Inscriptional Pahlavi

10B80: Psalter Pahlavi

10C00: Old Turkic

10C80: Old Hungarian

10D00: Hanifi Rohingya

10E60: Rumi Numeral Symbols

10E80: Yezidi

10F00: Old Sogdian

10F30: Sogdian

10FB0: Chorasmian

10FE0: Elymaic

11000: Brahmi

11080: Kaithi

110D0: Sora Sompeng

11100: Chakma

11150: Mahajani

11180: Sharada

111E0: Sinhala Archaic Numbers

11200: Khojki

11280: Multani

112B0: Khudawadi

11300: Grantha

11400: Newa

11480: Tirhuta

11580: Siddham

11600: Modi

11660: Mongolian Supplement

11680: Takri

11700: Ahom

11800: Dogra

118A0: Warang Citi

11900: Dives Akuru

119A0: Nandinagari

11A00: Zanabazar Square

11A50: Soyombo

11AC0: Pau Cin Hau

11C00: Bhaiksuki

11C70: Marchen

11D00: Masaram Gondi

11D60: Gunjala Gondi

11EE0: Makasar

11FB0: Lisu Supplement

11FC0: Tamil Supplement

Unicode Code Point Blocks: 12000 - 10FFFF

12000: Cuneiform

12400: Cuneiform Numbers and Punctuation

12480: Early Dynastic Cuneiform

13000: Egyptian Hieroglyphs

13430: Egyptian Hieroglyph Format Controls

14400: Anatolian Hieroglyphs

16800: Bamum Supplement

16A40: Mro

16AD0: Bassa Vah

16B00: Pahawh Hmong

16E40: Medefaidrin

16F00: Miao

16FE0: Ideographic Symbols and Punctuation

17000: Tangut

18800: Tangut Components

18B00: Khitan Small Script

18D00: Tangut Supplement

1B000: Kana Supplement

1B100: Kana Extended-A

1B130: Small Kana Extension

1B170: Nushu

1BC00: Duployan

1BCA0: Shorthand Format Controls

1D000: Byzantine Musical Symbols

1D100: Musical Symbols

1D200: Ancient Greek Musical Notation

1D2E0: Mayan Numerals

1D300: Tai Xuan Jing Symbols

1D360: Counting Rod Numerals

1D400: Mathematical Alphanumeric Symbols

1D800: Sutton SignWriting

1E000: Glagolitic Supplement

1E100: Nyiakeng Puachue Hmong

1E2C0: Wancho

1E800: Mende Kikakui

1E900: Adlam

1EC70: Indic Siyaq Numbers

1ED00: Ottoman Siyaq Numbers

1EE00: Arabic Mathematical Alphabetic Symbols

1F000: Mahjong Tiles

1F030: Domino Tiles

1F0A0: Playing Cards

1F100: Enclosed Alphanumeric Supplement

1F200: Enclosed Ideographic Supplement

1F300: Miscellaneous Symbols and Pictographs

1F600: Emoticons

1F650: Ornamental Dingbats

1F680: Transport and Map Symbols

1F700: Alchemical Symbols

1F780: Geometric Shapes Extended

1F800: Supplemental Arrows-C

1F900: Supplemental Symbols and Pictographs

1FA00: Chess Symbols

1FA70: Symbols and Pictographs Extended-A

1FB00: Symbols for Legacy Computing

20000: CJK Unified Ideographs Extension B

2A700: CJK Unified Ideographs Extension C

2B740: CJK Unified Ideographs Extension D

2B820: CJK Unified Ideographs Extension E

2CEB0: CJK Unified Ideographs Extension F

2F800: CJK Compatibility Ideographs Supplement

30000: CJK Unified Ideographs Extension G

E0000: Tags

E0100: Variation Selectors Supplement

F0000: Supplementary Private Use Area-A

100000: Supplementary Private Use Area-B

Archived Tutorials

Archived: EncodingSampler.java - BMP Character Encoding



Keywords: Unicode, Universal, Character, Encoding, Tutorial, Book