Unicode Tutorials - Herong's Tutorial Examples

http://www.herongyang.com/Unicode

Copyright © 2015 by Dr. Herong Yang. All rights reserved.

This free Unicode tutorial book is a collection of notes and sample code written by the author while he was learning Unicode himself. It is intended as a tutorial guide for beginners. Topics include ASCII, BMP, character sets, encoding, decoding, GB, GB18030, GB2312, GBK, ISO-8859, Java, JDK, JIS, surrogates, UTF, and Unicode.

Table of Contents

About This Book

Character Sets and Encodings

What Is Character Set?

Commonly Used Character Sets and Encodings

ASCII Character Set and Encoding

What Is ASCII?

Listing of ASCII Characters and Encoded Bytes

GB2312 Character Set and Encoding

GB2312 Character Set for Chinese Characters

GB2312 Encoding for GB2312 Character Set

Relation of GB2312 and Unicode

GB18030 Character Set and Encoding

History of GB Character Sets

GB18030 Encoding for GB18030 Character Set

JIS X0208 Character Set and Encodings

JIS X0208 Character Set for Japanese Characters

JIS X0208 Character Code Values

EUC-JP Encoding

ISO-2022-JP Encoding

Shift-JIS Encoding

Unicode Character Set

What Is Unicode?

Examples of Unicode Characters

Unique Features of Unicode

Unicode Standard Releases

Code Point Blocks

Unicode 8.0 Character Samples

Unicode 7.0 Character Samples

Unicode 6.0 Character Samples

Unicode 5.0 Character Samples

Unicode 4.0 Character Samples

UTF-8 (Unicode Transformation Format - 8-Bit)

UTF-8 Encoding

UTF-8 Encoding Algorithm

Features of UTF-8 Encoding

UTF-16, UTF-16BE and UTF-16LE Encodings

What Are Paired Surrogates?

UTF-16 Encoding

UTF-16BE Encoding

UTF-16LE Encoding

UTF-32, UTF-32BE and UTF-32LE Encodings

UTF-32 Encoding

UTF-32BE Encoding

UTF-32LE Encoding

Java Language and Unicode Characters

Unicode Versions Supported in Java-History

'int' and 'String' - Basic Data Types for Unicode

"Character" Class with Unicode Utility Methods

Character.toChars() - "char" Sequence of Code Point

Character.getNumericValue() - Numeric Value of Code Point

"String" Class with Unicode Utility Methods

String.length() Is Not Number of Characters

String.toCharArray() Returns the UTF-16BE Sequence

Character Encoding in Java

What Is Character Encoding?

List of Supported Character Encodings in Java

EncodingSampler.java - Testing encode() Methods

Examples of CP1252 and ISO-8859-1 Encodings

Examples of US-ASCII, UTF-8, UTF-16 and UTF-32 Encodings

Examples of GB18030 Encoding

Testing decode() Methods

Character Set Encoding Maps

Character Set Encoding Map Analyzer

Character Set Encoding Maps - US-ASCII and ISO-8859-1/Latin 1

Character Set Encoding Maps - CP1252/Windows-1252

Character Set Encoding Maps - Unicode UTF-8

Character Set Encoding Maps - Unicode UTF-16, UTF-16BE, UTF-16LE

Character Set Encoding Maps - Unicode UTF-32, UTF-32BE, UTF-32LE

Character Counter Program for Any Given Encoding

Character Set Encoding Comparison

Encoding Conversion Programs for Encoded Text Files

\uxxxx - Entering Unicode Data in Java Programs

HexWriter.java - Converting Encoded Byte Sequences to Hex Values

EncodingConverter.java - Encoding Conversion Sample Program

Viewing Encoded Text Files in Web Browsers

Unicode Signs in Different Encodings

Using Notepad as a Unicode Text Editor

What Is Notepad?

Opening UTF-8 Text Files

Opening UTF-16BE Text Files

Opening UTF-16LE Text Files

Saving Files in UTF-8 Option

Byte Order Mark (BOM) - FEFF - EFBBBF

Saving Files in "Unicode Big Endian" Option

Saving Files in "Unicode" Option

Supported Save and Open File Formats

Using Microsoft Word as a Unicode Text Editor

What Is Microsoft Word?

Opening UTF-8 Text Files

Opening UTF-16BE Text Files

Opening UTF-16LE Text Files

Saving Files in "Unicode (UTF-8)" Option

Saving Files in "Unicode (Big-Endian)" Option

Saving Files in Unicode Option

Supported Save and Open File Formats

Using Microsoft Excel as a Unicode Text Editor

What Is Microsoft Excel?

Opening UTF-8 Text Files

Opening UTF-16BE Text Files

Opening UTF-16LE Text Files

Saving UTF-8 Text Files

Saving Files in "Unicode Text (*.txt)" Option

Opening UTF-16 Text Files

Supported Save and Open File Formats

Unicode Fonts

What Is a Font?

What Is a Unicode Font?

Downloading and Installing GNU Unifont

Windows Tool "Character Map"

Unicode Code Point Blocks - Code Charts

U0000: C0 Controls and Basic Latin

U0080: C1 Controls and Latin-1 Supplement

U0100: Latin Extended-A

U0180: Latin Extended-B

U0250: IPA Extensions

U02B0: Spacing Modifier Letters

U0300: Combining Diacritical Marks

U0370: Greek and Coptic

U0400: Cyrillic

U0500: Cyrillic Supplement

U0530: Armenian

U0590: Hebrew

U0600: Arabic

U0700: Syriac

U0750: Arabic Supplement

U0780: Thaana

U07C0: NKo

U0800: Samaritan

U0840: Mandaic

U08A0: Arabic Extended-A

U0900: Devanagari

U0980: Bengali

U0A00: Gurmukhi

U0A80: Gujarati

U0B00: Oriya

U0B80: Tamil

U0C00: Telugu

U0C80: Kannada

U0D00: Malayalam

U0D80: Sinhala

U0E00: Thai

U0E80: Lao

U0F00: Tibetan

U1000: Myanmar

U10A0: Georgian

U1100: Hangul Jamo

U1200: Ethiopic

U1380: Ethiopic Supplement

U13A0: Cherokee

U1400: Unified Canadian Aboriginal Syllabics

U1680: Ogham

U16A0: Runic

U1700: Tagalog

U1720: Hanunoo

U1740: Buhid

U1760: Tagbanwa

U1780: Khmer

U1800: Mongolian

U18B0: Unified Canadian Aboriginal Syllabics Extended

U1900: Limbu

U1950: Tai Le

U1980: New Tai Lue

U19E0: Khmer Symbols

U1A00: Buginese

U1A20: Tai Tham

U1B00: Balinese

U1B80: Sundanese

U1BC0: Batak

U1C00: Lepcha

U1C50: Ol Chiki

U1CC0: Sundanese Supplement

U1CD0: Vedic Extensions

U1D00: Phonetic Extensions

U1D80: Phonetic Extensions Supplement

U1DC0: Combining Diacritical Marks Supplement

U1E00: Latin Extended Additional

U1F00: Greek Extended

U2000: General Punctuation

U2070: Superscripts and Subscripts

U20A0: Currency Symbols

U20D0: Combining Diacritical Marks for Symbols

U2100: Letterlike Symbols

U2150: Number Forms

U2190: Arrows

U2200: Mathematical Operators

U2300: Miscellaneous Technical

U2400: Control Pictures

U2440: Optical Character Recognition

U2460: Enclosed Alphanumerics

U2500: Box Drawing

U2580: Block Elements

U25A0: Geometric Shapes

U2600: Miscellaneous Symbols

U2700: Dingbats

U27C0: Miscellaneous Mathematical Symbols-A

U27F0: Supplemental Arrows-A

U2800: Braille Patterns

U2900: Supplemental Arrows-B

U2980: Miscellaneous Mathematical Symbols-B

U2A00: Supplemental Mathematical Operators

U2B00: Miscellaneous Symbols and Arrows

U2C00: Glagolitic

U2C60: Latin Extended-C

U2C80: Coptic

U2D00: Georgian Supplement

U2D30: Tifinagh

U2D80: Ethiopic Extended

U2DE0: Cyrillic Extended-A

U2E00: Supplemental Punctuation

U2E80: CJK Radicals Supplement

U2F00: Kangxi Radicals

U2FF0: Ideographic Description Characters

U3000: CJK Symbols and Punctuation

U3040: Hiragana

U30A0: Katakana

U3100: Bopomofo

U3130: Hangul Compatibility Jamo

U3190: Kanbun

U31A0: Bopomofo Extended

U31C0: CJK Strokes

U31F0: Katakana Phonetic Extensions

U3200: Enclosed CJK Letters and Months

U3300: CJK Compatibility

U3400: CJK Unified Ideographs Extension A

U4DC0: Yijing Hexagram Symbols

U4E00: CJK Unified Ideographs

UA000: Yi Syllables

UA490: Yi Radicals

UA4D0: Lisu

UA500: Vai

UA640: Cyrillic Extended-B

UA6A0: Bamum

UA700: Modifier Tone Letters

UA720: Latin Extended-D

UA800: Syloti Nagri

UA830: Common Indic Number Forms

UA840: Phags-pa

UA880: Saurashtra

UA8E0: Devanagari Extended

UA900: Kayah Li

UA930: Rejang

UA960: Hangul Jamo Extended-A

UA980: Javanese

UAA00: Cham

UAA60: Myanmar Extended-A

UAA80: Tai Viet

UAAE0: Meetei Mayek Extensions

UAB00: Ethiopic Extended-A

UABC0: Meetei Mayek

UAC00: Hangul Syllables

UD7B0: Hangul Jamo Extended-B

UD800: High Surrogates

UDB80: High Private Use Surrogates

UDC00: Low Surrogates

UE000: Private Use Area

UF900: CJK Compatibility Ideographs

UFB00: Alphabetic Presentation Forms

UFB50: Arabic Presentation Forms-A

UFE00: Variation Selectors

UFE10: Vertical Forms

UFE20: Combining Half Marks

UFE30: CJK Compatibility Forms

UFE50: Small Form Variants

UFE70: Arabic Presentation Forms-B

UFF00: Halfwidth and Fullwidth Forms

UFFF0: Specials

U10000: Linear B Syllabary

U10080: Linear B Ideograms

U10100: Aegean Numbers

U10140: Ancient Greek Numbers

U10190: Ancient Symbols

U101D0: Phaistos Disc

U10280: Lycian

U102A0: Carian

U10300: Old Italic

U10330: Gothic

U10380: Ugaritic

U103A0: Old Persian

U10400: Deseret

U10450: Shavian

U10480: Osmanya

U10800: Cypriot Syllabary

U10840: Imperial Aramaic

U10900: Phoenician

U10920: Lydian

U10980: Meroitic Hieroglyphs

U109A0: Meroitic Cursive

U10A00: Kharoshthi

U10A60: Old South Arabian

U10B00: Avestan

U10B40: Inscriptional Parthian

U10B60: Inscriptional Pahlavi

U10C00: Old Turkic

U10E60: Rumi Numeral Symbols

U11000: Brahmi

U11080: Kaithi

U110D0: Sora Sompeng

U11100: Chakma

U11180: Sharada

U11680: Takri

U12000: Cuneiform

U12400: Cuneiform Numbers and Punctuation

U13000: Egyptian Hieroglyphs

U16800: Bamum Supplement

U16F00: Miao

U1B000: Kana Supplement

U1D000: Byzantine Musical Symbols

U1D100: Musical Symbols

U1D200: Ancient Greek Musical Notation

U1D300: Tai Xuan Jing Symbols

U1D360: Counting Rod Numerals

U1D400: Mathematical Alphanumeric Symbols

U1EE00: Arabic Mathematical Alphabetic Symbols

U1F000: Mahjong Tiles

U1F030: Domino Tiles

U1F0A0: Playing Cards

U1F100: Enclosed Alphanumeric Supplement

U1F200: Enclosed Ideographic Supplement

U1F300: Miscellaneous Symbols and Pictographs

U1F600: Emoticons

U1F680: Transport and Map Symbols

U1F700: Alchemical Symbols

U20000: CJK Unified Ideographs Extension B

U2A700: CJK Unified Ideographs Extension C

U2B740: CJK Unified Ideographs Extension D

U2F800: CJK Compatibility Ideographs Supplement

UE0000: Tags

UE0100: Variation Selectors Supplement

UF0000: Supplementary Private Use Area-A

U100000: Supplementary Private Use Area-B

Outdated Tutorials

Outdated: EncodingSampler.java - BMP Character Encoding

References

PDF Printing Version
