Basic Rules of Using Non-ASCII Characters in HTML Documents

This section describes the basic rules for managing non-ASCII character strings at each step, so that localized text can be written in PHP script string literals and displayed correctly in the browser window.

As you saw in the previous chapters, PHP scripts in a Web-based application always run behind a Web server. They are expected to generate HTML documents and pass them back to the Web server. There are four main ways non-ASCII characters can get into the HTML document through PHP scripts: a) entered as string literals; b) received from HTTP requests; c) retrieved from files; d) retrieved from a database.

In this chapter, we will concentrate on how to include non-ASCII characters in PHP scripts as string literals. Here are the steps involved in this scenario:

A1. Key Sequences from keyboard
      |
      |- Text editor
      v
A2. PHP File
      |
      |- PHP CGI engine
      v
A3. HTML Document

Based on my experience, here are some basic rules related to those steps:

1. You must decide which character encoding to use in your PHP script file. For most languages, you have two options: a) use an encoding specific to that language; b) use a Unicode encoding. For example, you can use either GB2312 (a simplified Chinese encoding) or UTF-8 (a Unicode encoding) for Chinese characters. My suggestion used to be "a". But today, I am suggesting "b", because a Unicode encoding can represent the characters of practically every language in the same file.

2. From step "A1" to "A2", you need to select a good text editor that supports the encoding you have chosen. The end goal of this step is simple: characters in string literals must be stored in the PHP file using the chosen encoding. Don't underestimate the difficulty of this step. It can be very frustrating, because most computer keyboards support alphabetic letters only. You may have to use language-specific input software to translate key sequences into language-specific characters. The editor may also hold characters in memory in one encoding and offer you a different encoding when saving the file to disk, so check which encoding is actually written.

3. The string data type in PHP is defined as a sequence of bytes, as in the C language. This is different from the Java language, where the string data type is defined as a sequence of Unicode characters. String literals in PHP are also taken as sequences of bytes. This is a nice feature. It allows us to enter non-ASCII characters in almost any encoding.

4. The core PHP built-in string functions assume that strings are sequences of bytes. For example, strlen() returns the number of bytes in the given string, not the number of characters of a specific language. To manage strings as sequences of characters, we need to use the Multibyte String functions, mb_*(), provided by the "mbstring" extension.

5. From step "A2" to "A3", HTML documents are generated from the PHP script mainly through the print() statement. The print() statement copies every byte from the specified string to the HTML document unchanged. This guarantees that non-ASCII characters in any encoding will be copied correctly to the HTML document. Again, this is different from JSP pages, where strings are converted into a byte stream based on a specified encoding if you are using character-based output streams.

6. If you do want to convert from one encoding to another while the output is printed, you can use mb_output_handler as the callback function on the output buffer: ob_start("mb_output_handler"). The handler converts the buffered output from the mbstring internal encoding to the mbstring HTTP output encoding.
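
Rules 1 and 3 above can be illustrated with a small sketch. It assumes the "mbstring" extension is loaded, and it writes the two Chinese characters U+4E2D and U+6587 with hex escapes so that the script file itself contains only ASCII bytes. The variable names are only for illustration.

```php
<?php
// Assumption: the mbstring extension is loaded (needed for mb_convert_encoding).

// Two Chinese characters, U+4E2D U+6587, written as UTF-8 bytes.
$utf8 = "\xE4\xB8\xAD\xE6\x96\x87";

// Re-encode the same text as EUC-CN, the byte encoding normally used for GB2312.
$gb2312 = mb_convert_encoding($utf8, 'EUC-CN', 'UTF-8');

print bin2hex($utf8) . "\n";    // e4b8ade69687 (3 bytes per character)
print bin2hex($gb2312) . "\n";  // d6d0cec4     (2 bytes per character)

// A PHP string is just bytes: index 0 is the first byte, not the first character.
print ord($utf8[0]) . "\n";     // 228, which is 0xE4
```

The same text produces different byte sequences in the two encodings, and PHP stores whichever bytes the file contains without interpreting them.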
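
For rule 2, one way to check what the editor actually wrote to disk is to test whether the file's bytes form valid UTF-8. The sketch below assumes UTF-8 is the chosen encoding and that the "mbstring" extension is loaded. The helper is_saved_as_utf8() is a hypothetical name, and the temporary file only stands in for a real script file.

```php
<?php
// Assumption: the mbstring extension is loaded (needed for mb_check_encoding).

// Hypothetical helper: true if the bytes stored in the file form valid UTF-8.
function is_saved_as_utf8(string $path): bool
{
    $bytes = file_get_contents($path);
    return $bytes !== false && mb_check_encoding($bytes, 'UTF-8');
}

// Simulate two saves of the same text by writing raw bytes to a temporary file.
$path = tempnam(sys_get_temp_dir(), 'enc');

file_put_contents($path, "\xD6\xD0\xCE\xC4");          // saved as GB2312
$savedAsGb2312 = is_saved_as_utf8($path);

file_put_contents($path, "\xE4\xB8\xAD\xE6\x96\x87");  // saved as UTF-8
$savedAsUtf8 = is_saved_as_utf8($path);

unlink($path);

var_dump($savedAsGb2312);  // bool(false)
var_dump($savedAsUtf8);    // bool(true)
```

This kind of check can only reject files that are not valid UTF-8. A file that contains ASCII characters only passes it regardless of the encoding selected in the editor.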
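
The difference described in rule 4 between byte-based and character-based functions can be seen in a short sketch, again assuming the "mbstring" extension is loaded and the text is UTF-8:

```php
<?php
// Assumption: the mbstring extension is loaded (needed for mb_strlen, mb_substr).

// Two Chinese characters, U+4E2D U+6587, written as UTF-8 bytes.
$text = "\xE4\xB8\xAD\xE6\x96\x87";

print strlen($text) . "\n";              // 6, the number of bytes
print mb_strlen($text, 'UTF-8') . "\n";  // 2, the number of characters

// substr() cuts at a byte position and can split a character in the middle.
$firstByte = substr($text, 0, 1);
// mb_substr() cuts at a character position in the given encoding.
$firstChar = mb_substr($text, 0, 1, 'UTF-8');

print bin2hex($firstByte) . "\n";  // e4
print bin2hex($firstChar) . "\n";  // e4b8ad
```

Passing the encoding name explicitly to the mb_*() functions avoids depending on the configured internal encoding.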
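
Rule 5 can be checked by capturing what print() writes and comparing it with the original string. This is a minimal sketch that assumes no output handler or other output conversion is configured in php.ini:

```php
<?php
// The two characters U+4E2D U+6587 as GB2312 bytes. Any other byte sequence
// would work just as well.
$gb2312 = "\xD6\xD0\xCE\xC4";

ob_start();                // capture the output instead of sending it
print $gb2312;
$output = ob_get_clean();  // fetch the captured bytes and stop buffering

var_dump($output === $gb2312);  // bool(true): the bytes are unchanged
print bin2hex($output) . "\n";  // d6d0cec4
```

Because nothing is converted on the way out, the encoding declared to the browser must match the encoding actually used in the script file.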
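
The conversion mentioned in rule 6 might be set up as sketched below. The choice of UTF-8 for the script and ISO-8859-1 for the HTML document is only an assumption for illustration, and the "mbstring" extension must be loaded. No expected output is shown, because the bytes actually sent depend on the server environment and its configuration.

```php
<?php
// Assumption: the mbstring extension is loaded.
// Assumption: string literals in this script are UTF-8, and the HTML
// document should be delivered in ISO-8859-1.
mb_internal_encoding('UTF-8');   // encoding of the bytes the script prints
mb_http_output('ISO-8859-1');    // encoding wanted for the HTML document

ob_start('mb_output_handler');   // buffer the output and pass it to the handler

// "Café" with the last letter written as the UTF-8 bytes C3 A9.
print "<p>Caf\xC3\xA9</p>\n";

ob_end_flush();                  // run the handler and send the result
```

Setting both encodings explicitly keeps the example independent of the defaults in php.ini.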
