Chinese Web Sites Using PHP - v2.24, by Herong Yang
Root Cause of Corrupted Chinese Text
This section provides a tutorial example to demonstrate the root cause of corrupted Chinese text: an incorrect 8-bit encoding is used to decode the original Chinese text.
Based on my experience, the root cause of corrupted Chinese text can be divided into the following 3 encoding processing mistakes:
1. The original Chinese text is encoded in UTF-8, but it gets decoded as one of the 8-bit encodings.
2. The original Chinese text is encoded in Unicode (UTF-16BE), but it gets decoded as one of the 8-bit encodings.
3. The original Chinese text is encoded in GB18030 (GBK), but it gets decoded as one of the 8-bit encodings.
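To see why a wrong 8-bit decoding corrupts the text, it may help to look at a single character first. The following is only a minimal sketch, assuming the iconv and mbstring extensions are enabled; the variable names are chosen for illustration. It takes the 3 UTF-8 bytes of one Chinese character (U+7B80) and decodes them as ISO-8859-1, so each byte is treated as a separate character:

```php
<?php
# One Chinese character, U+7B80, encoded in UTF-8 as 3 bytes
$utf8 = hex2bin('e7ae80');

# The mistake: decode the same bytes as ISO-8859-1, one character per byte
$wrong = iconv("ISO-8859-1", "UTF-8", $utf8);

print("Bytes in original: ".strlen($utf8)."\n");
print("Characters if decoded as UTF-8: ".mb_strlen($utf8, "UTF-8")."\n");
print("Characters if decoded as ISO-8859-1: ".mb_strlen($wrong, "UTF-8")."\n");
print("Corrupted text re-encoded in UTF-8: ".bin2hex($wrong)."\n");
```

Running this sketch should report 3 bytes, 1 character when decoded correctly as UTF-8, and 3 characters when decoded as ISO-8859-1. The original bytes are not changed; only their interpretation is wrong.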
There are many 8-bit encodings. Examples include ISO-8859-1, CP437 and CP852, which are used in the script below.
With 3 possible encodings of the original text and many possible wrong decodings, the corrupted Chinese text can appear in a large number of variations. The following PHP script shows 12 possible corrupted outputs from a single Chinese text:
<?php
#- Chinese-Corrupted-Encoding.php
#- Copyright (c) 2005 HerongYang.com. All Rights Reserved.
# Original in Unicode (UTF-16BE)
$original = hex2bin('7b804f534e2d65877f519875');
corrupted($original, "UTF-16BE");
# Original in UTF-8
$original = hex2bin('e7ae80e4bd93e4b8ade69687e7bd91e9a1b5');
corrupted($original, "UTF-8");
# Original in GB18030 (GBK)
$original = hex2bin('bcf2cce5d6d0cec4cdf8d2b3');
corrupted($original, "GB18030");
function corrupted($original, $encoding) {
   # Note: "$\n" prints a literal "$" before each newline
   print("\nOriginal encoding: ".$encoding."$\n");
   print(" Text: ".iconv($encoding, "UTF-8", $original)."$\n");
   print(" Binary: ".bin2hex($original)."$\n");
   # Decode the same bytes with the wrong encodings
   print(" Corrupted as:\n");
   print(" ISO-8859-1: "
      .iconv("ISO-8859-1", "UTF-8//IGNORE", $original)."$\n");
   print(" CP437: ".iconv("CP437", "UTF-8//IGNORE", $original)."$\n");
   print(" CP852: ".iconv("CP852", "UTF-8//IGNORE", $original)."$\n");
   print(" CP932: ".iconv("CP932", "UTF-8//IGNORE", $original)."$\n");
}
?>
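If you run this test PHP script on a system that supports Chinese characters in UTF-8 encoding, you should see, for each of the 3 original encodings, the original text, its binary value in hexadecimal, and 4 corrupted versions of the text.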
See the next tutorials for suggestions on how to avoid and recover from corrupted Chinese text.
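As a small preview, here is a hedged sketch (not necessarily the approach used in the next tutorials) of why recovery is sometimes possible. It assumes the wrong decoding was ISO-8859-1 and that the iconv extension is enabled; the variable names are for illustration only. Because that decoding maps every byte to exactly one character, converting the corrupted text back to ISO-8859-1 should reproduce the original bytes:

```php
<?php
# Original Chinese text in UTF-8, same bytes as in the script above
$original = hex2bin('e7ae80e4bd93e4b8ade69687e7bd91e9a1b5');

# Corruption: the UTF-8 bytes are decoded as ISO-8859-1
$corrupted = iconv("ISO-8859-1", "UTF-8", $original);

# Reverse step: encode the corrupted characters back to single bytes
$restored = iconv("UTF-8", "ISO-8859-1", $corrupted);

print("Restored bytes: ".bin2hex($restored)."\n");
print(($restored === $original) ? "Match\n" : "No match\n");
```

This only works when the wrong decoding kept every byte. Decodings that drop or replace bytes, for example through the //IGNORE option, may not be reversible this way.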