PHP Tutorials - Herong's Tutorial Examples - 5.10, by Dr. Herong Yang
Basic Rules of Using Non-ASCII Characters in HTML Documents
This section describes basic rules on how non-ASCII character strings should be managed at different steps to ensure that localized text strings can be used in PHP script string literals and displayed correctly in the browser window.
As you can see from the previous chapters, when PHP scripts are involved in a Web-based application, they are always used behind a Web server. PHP scripts are expected to generate HTML documents and pass them back to the Web server. There are about four ways non-ASCII characters can get into the HTML document through PHP scripts: a) Enter them as string literals; b) Receive them from HTTP requests; c) Retrieve them from files; d) Retrieve them from a database.
In this chapter, we will concentrate on how to include non-ASCII characters in PHP scripts as string literals. Here are the steps involved in this scenario:
A1. Key Sequences from keyboard
       |
       |- Text editor
       v
A2. PHP File
       |
       |- PHP CGI engine
       v
A3. HTML Document
Based on my experience, here are some basic rules related to those steps:
1. You must decide on the character encoding schema to be used in your PHP script file. For most languages, you have two options: a) use an encoding schema specific to that language; b) use a Unicode schema. For example, you can use either GB2312 (a simplified Chinese character schema) or UTF-8 (a Unicode character schema) for Chinese characters. My suggestion used to be "a". But today, I am suggesting "b", because a Unicode schema can support the characters of all languages.
2. From step "A1" to "A2", you need to select a good text editor that supports the encoding schema you have decided on. The end goal of this step is simple: characters in string literals must be stored in the PHP file using the chosen encoding schema. Don't underestimate the difficulty of this step. It could be very frustrating, because most computer keyboards support alphabetic letters only. You may have to use some language-specific input software to translate alphabetic letters into language-specific characters. The editor may also store characters in memory in one encoding schema and offer you a different encoding schema when saving files to the hard disk.
3. The string data type is defined as a sequence of bytes in PHP, as in the C language. This is different from the Java language, where the string data type is defined as a sequence of Unicode characters (stored as UTF-16 code units). String literals in PHP are also taken as sequences of bytes. This is a nice feature. It allows us to enter non-ASCII characters in almost any encoding schema.
4. PHP's standard built-in string functions assume that strings are sequences of bytes. For example, strlen() returns the number of bytes in the given string, not the number of characters. To manage strings as sequences of characters, we need to use the Multibyte String (mbstring) functions, mb_*().
5. From step "A2" to "A3", HTML documents are generated from the PHP script mainly through the print() function. The print() function copies every byte of the specified string to the HTML document unchanged. This guarantees that any non-ASCII characters encoded in any encoding schema will be copied correctly to the HTML document. Again, this is different from JSP pages, where strings are converted into a byte stream based on a specified encoding schema if you are using character-based output stream functions.
6. If you do want to convert from one encoding schema to another during the print() function call, you can use mb_output_handler as the callback function on the output buffer: ob_start("mb_output_handler"). The target encoding of the conversion can be set with mb_http_output().
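Rules 3 to 5 can be illustrated with a minimal sketch. It assumes the mbstring extension is enabled and PHP 7.0 or later (for the "\u{...}" escape syntax, which avoids depending on how the editor saves the file). The variable names are illustrative only and do not come from the tutorial.

```php
<?php
// Two Chinese characters, written with Unicode escapes so the bytes
// stored in the string are UTF-8 regardless of the editor's settings.
$text = "\u{4F60}\u{597D}";

// Rules 3 and 4: strlen() counts bytes, mb_strlen() counts characters.
$byteCount = strlen($text);              // each character takes 3 bytes in UTF-8
$charCount = mb_strlen($text, 'UTF-8');  // number of characters

// bin2hex() exposes the raw bytes held in the string.
$hex = bin2hex($text);

// Rule 5: print() copies the bytes as they are. Capture the output
// in a buffer to compare it with the original string.
ob_start();
print($text);
$captured = ob_get_clean();

print("bytes=$byteCount chars=$charCount hex=$hex\n");
print($captured === $text ? "output is byte-identical\n" : "output differs\n");
```

Under these assumptions, the byte count is 6 and the character count is 2, and the captured output is byte-for-byte identical to the original string because no conversion handler is installed on the output buffer.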