Document Character Set and Encoding

HTML Tutorials - Herong's Tutorial Examples


This section describes HTML document character sets and encodings. Any character set and encoding can be used as long as the browser can extract HTML tags defined as Unicode code points from the character stream representing the HTML document.

If you are creating an HTML document with non-English text content, you need to decide which character set and encoding to use for the non-English text in your HTML document.

Since HTML tags are mixed with the text content in a single document, the character set and encoding you select for an HTML document must meet the following requirement.

After decoding your HTML document into a character stream in memory, the browser must be able to extract HTML tags out of that character stream. This is done by scanning characters in the HTML document character stream as integers and trying to match them with Unicode code points of HTML tags defined in the HTML specification.

For example, the "<p>" tag is defined in the HTML specification as the Unicode code points U+003C, U+0070, and U+003E. If an HTML document has a "<p>" tag in it, the character set and encoding used for that document must ensure that when the browser decodes the document into memory, the resulting character stream contains the character sequence 0x3C, 0x70, and 0x3E in hexadecimal, or 60, 112, and 62 in decimal.
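
As a rough illustration of the requirement above, the following Python sketch decodes the bytes of a small document and looks for the code point sequence U+003C, U+0070, U+003E. The helper name contains_p_tag and the sample text are made up for this example, and a real browser uses a full HTML tokenizer rather than a simple search.

```python
# Code points of the "<p>" tag: U+003C, U+0070, U+003E.
P_TAG_CODE_POINTS = [0x3C, 0x70, 0x3E]


def contains_p_tag(raw_bytes: bytes, encoding: str) -> bool:
    """Decode raw_bytes and report whether the "<p>" code points appear in sequence."""
    code_points = [ord(ch) for ch in raw_bytes.decode(encoding)]
    n = len(P_TAG_CODE_POINTS)
    return any(code_points[i:i + n] == P_TAG_CODE_POINTS
               for i in range(len(code_points) - n + 1))


# Hypothetical sample document, stored as UTF-8 bytes.
sample = "<p>Grüße</p>".encode("utf-8")

print([ord(ch) for ch in "<p>"])        # [60, 112, 62]
print(contains_p_tag(sample, "utf-8"))  # True
```

The search works on decoded code points, not on raw bytes, so the same check applies to any encoding the decoder supports.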

Since all HTML tags are defined with characters in the ASCII code point range, you should use a character set that is a superset of ASCII characters to create your HTML documents to meet the above requirement.
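
One way to test a candidate encoding against this recommendation is sketched below in Python. It reads "superset of ASCII" at the byte level, which is an assumption of this example: an encoding passes if each of the 128 ASCII characters is stored as its own byte value. The function name is_ascii_superset is hypothetical, and cp037 (an EBCDIC code page) is included only as a counterexample.

```python
def is_ascii_superset(encoding: str) -> bool:
    """Return True if the 128 ASCII characters encode to their own byte values."""
    ascii_text = "".join(chr(i) for i in range(128))
    try:
        return ascii_text.encode(encoding) == bytes(range(128))
    except UnicodeEncodeError:
        # Some ASCII character has no representation in this encoding.
        return False


for name in ["utf-8", "iso-8859-1", "big5", "euc-kr", "koi8-r", "cp037"]:
    print(name, is_ascii_superset(name))
```

In this sketch, every encoding except cp037 passes the check, because cp037 stores "<", "p", and ">" at byte values different from their ASCII ones.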

If there are multiple encodings available for the selected character set, you can use any encoding to store your HTML document, as long as browsers support that encoding.
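
To make this concrete, here is a minimal Python sketch that stores the same Unicode text in two encodings of the Unicode character set, UTF-8 and UTF-16 (little-endian). The sample text is chosen only for illustration. The stored bytes differ, but both decode to the same sequence of code points.

```python
text = "<p>中文</p>"

utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")

# The byte sequences differ in content and length.
print(len(utf8_bytes), utf8_bytes.hex(" "))
print(len(utf16_bytes), utf16_bytes.hex(" "))

# Both decode back to the same character stream.
print(utf8_bytes.decode("utf-8") == utf16_bytes.decode("utf-16-le"))  # True
```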

Once you have selected the character set and encoding, you should also provide the encoding name through a "metadata" element in your HTML document like this:

<meta charset="encoding_name">
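
The Python sketch below shows roughly how a program could use that element: it searches the first 1024 bytes of the raw document for a charset declaration and then decodes the document with the declared encoding. The regular expression and the function name sniff_charset are simplifications invented for this example; browsers follow a more elaborate detection algorithm defined in the HTML specification.

```python
import re

# Simplified pattern for <meta charset="..."> (an assumption of this sketch).
META_CHARSET = re.compile(rb'<meta\s+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)',
                          re.IGNORECASE)


def sniff_charset(raw_bytes: bytes, default: str = "utf-8") -> str:
    """Return the encoding name declared near the start of the document, or a default."""
    match = META_CHARSET.search(raw_bytes[:1024])
    return match.group(1).decode("ascii").lower() if match else default


# Hypothetical document stored in Big5.
html = '<html><head><meta charset="big5"></head><body><p>中文</p></body></html>'
raw = html.encode("big5")

encoding = sniff_charset(raw)
print(encoding)                      # big5
print(raw.decode(encoding) == html)  # True
```

Searching the raw bytes for the declaration before the encoding is known only works because the tag itself is stored as plain ASCII bytes, which is one practical reason to pick an ASCII-compatible encoding.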

The most commonly recommended HTML document encoding is UTF-8, which is an encoding of the Unicode character set. Unicode aims to cover the characters of all written human languages.
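
As a small Python illustration, text that mixes several scripts can be stored in UTF-8, while a single-language encoding rejects it. The sample string and the helper name can_encode are made up for this sketch.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of text can be stored in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Sample text mixing several scripts (chosen only for illustration).
mixed = "English, Français, Русский, 中文, 日本語, 한국어"

for name in ["utf-8", "iso-8859-1", "koi8-r"]:
    print(name, can_encode(mixed, name))
```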

Other frequently used HTML document encodings are:

big5          Chinese Traditional (Big5)
euc-kr        Korean (EUC)
iso-8859-1    Western Alphabet
iso-8859-2    Central European Alphabet (ISO)
iso-8859-3    Latin 3 Alphabet (ISO)
iso-8859-4    Baltic Alphabet (ISO)
iso-8859-5    Cyrillic Alphabet (ISO)
iso-8859-6    Arabic Alphabet (ISO)
iso-8859-7    Greek Alphabet (ISO)
iso-8859-8    Hebrew Alphabet (ISO)
koi8-r        Cyrillic Alphabet (KOI8-R)
shift-jis     Japanese (Shift-JIS)
euc-jp        Japanese (EUC)
windows-1258  Vietnamese Alphabet (Windows)
windows-874   Thai (Windows)
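
These encodings assign different characters to the same byte values, which is why the declared encoding name matters. The Python sketch below decodes one arbitrarily chosen byte, 0xE4, with four of the encodings listed above.

```python
raw = bytes([0xE4])

# The same stored byte becomes a different character in each encoding.
for name in ["iso-8859-1", "iso-8859-5", "iso-8859-7", "koi8-r"]:
    ch = raw.decode(name)
    print(f"{name:12} U+{ord(ch):04X} {ch}")
```

If a document is decoded with the wrong encoding from this list, the tags still come out correctly because they use ASCII byte values, but the non-English text turns into unrelated characters.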