Building Chinese Web Sites using PHP
Dr. Herong Yang, Version 2.11

mb_strlen() - Counting Multibyte Characters

This section describes how to count multibyte characters using the php_mbstring.dll module.

Once you have configured PHP to use the php_mbstring.dll module, you are ready to use multibyte string functions to manipulate Chinese strings as sequences of characters instead of bytes.
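Before calling any mb_* function, it can help to confirm that the extension is actually loaded and to see which encoding it assumes by default. The following is a minimal sketch; the variable names are illustrative only, while extension_loaded() and mb_internal_encoding() are standard PHP functions.

```php
<?php
// Check whether the mbstring extension is available in this PHP build.
$mbstring_loaded = extension_loaded('mbstring');

if ($mbstring_loaded) {
    // Called with no argument, mb_internal_encoding() returns the
    // encoding that mb_* functions assume when none is given.
    $default_encoding = mb_internal_encoding();
    print("mbstring is loaded; internal encoding: $default_encoding\n");
} else {
    print("mbstring is not loaded; check the extension setting in php.ini\n");
}
```

If the second message appears, the mb_strlen() calls in the script below would fail with an undefined function error.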

Here is a simple example PHP script that uses mb_strlen() to count the Chinese characters in a string:

<?php #Count-UTF-8.php
# Copyright (c) 2007 by Dr. Herong Yang, http://www.herongyang.com/
#
  $help_simplified = '这是一份非常简单的说明书…';
  $help_traditional = '這是一份非常簡單的說明書…';
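  # GB18030 and Big5 bytes cannot be shown in UTF-8 text, so they are
  # built here by conversion (assumed to be the same two sentences).
  $help_gb18030 = mb_convert_encoding($help_simplified, 'GB18030', 'UTF-8');
  $help_big5 = mb_convert_encoding($help_traditional, 'BIG-5', 'UTF-8');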
  
  print('<html>');
  print('<meta http-equiv="Content-Type"'.
    ' content="text/html; charset=utf-8"/>');
  print('<body>');

# Showing UTF-8 characters
  print('<b>Chinese string in UTF-8 in PHP</b><br/>');
  print('UTF-8 simplified characters: '.$help_simplified.'<br/>');
  print('UTF-8 traditional characters: '.$help_traditional.'<br/>');

# Trying to show GB18030 characters
  print('<b>GB18030 string included in a UTF-8 page</b><br/>');
  print('GB18030 characters: '.$help_gb18030.'<br/>');

# Trying to show Big5 characters
  print('<b>Big5 string included in a UTF-8 page</b><br/>');
  print('Big5 characters: '.$help_big5.'<br/>');

# Counting UTF-8 characters
  print('<b>Count UTF-8 characters in strings:</b><br/>');
  print('UTF-8 simplified characters: '
    .mb_strlen($help_simplified).'<br/>');
  print('UTF-8 traditional characters: '
    .mb_strlen($help_traditional).'<br/>');
  print('GB18030 characters: '.mb_strlen($help_gb18030).'<br/>');
  print('Big5 characters: '.mb_strlen($help_big5).'<br/>');

# Counting bytes
  print('<b>Count bytes in strings:</b><br/>');
  print('UTF-8 simplified characters: '
    .strlen($help_simplified).'<br/>');
  print('UTF-8 traditional characters: '
    .strlen($help_traditional).'<br/>');
  print('GB18030 characters: '.strlen($help_gb18030).'<br/>');
  print('Big5 characters: '.strlen($help_big5).'<br/>');

  print('</body>');
  print('</html>');
?>

Here is the Web page generated from this PHP script:
[Figure: Counting UTF-8 Characters and Bytes]

If you look at the Web page carefully, you will see:

  • mb_strlen() counted Chinese characters correctly on UTF-8 encoded strings: 13 characters in both the simplified and the traditional character strings. The '…' at the end of both strings is a single special character.
  • mb_strlen() counted Chinese characters incorrectly on both the GB18030 and the Big5 encoded strings. This is understandable, because mb_strlen() assumes UTF-8 encoding based on the PHP configuration settings in php.ini. For mb_strlen() to count the GB18030 string correctly, you need to change the setting to mbstring.internal_encoding = GB18030.
  • Comparing with the byte counts returned by strlen(), we know that 1 Chinese character is mapped to 3 bytes in UTF-8 encoding: 13 characters vs. 39 bytes. This is only true in this test. Some Chinese characters need to be mapped to 4 bytes in UTF-8 encoding.
  • Looking at the byte counts of the GB18030 and Big5 strings, we know that 1 Chinese character is mapped to 2 bytes in GB18030 and Big5 encodings. Again, this is only true in this test. Some GB18030 characters are mapped to 4 bytes.
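The observations above can also be checked without editing php.ini, because mb_strlen() accepts the encoding as an optional second argument. The sketch below assumes a source file saved as UTF-8 and an mbstring build that recognizes the 'GB18030' and 'BIG-5' encoding names; the sample strings and variable names are illustrative, and the GB18030 and Big5 byte strings are produced with mb_convert_encoding() rather than typed in directly.

```php
<?php
// Sample sentences, stored as UTF-8 because the file itself is UTF-8.
$simplified  = '这是一份非常简单的说明书…';
$traditional = '這是一份非常簡單的說明書…';

// Re-encode the same text into the two legacy encodings.
$gb18030 = mb_convert_encoding($simplified, 'GB18030', 'UTF-8');
$big5    = mb_convert_encoding($traditional, 'BIG-5', 'UTF-8');

// Count characters, telling mb_strlen() which encoding each string uses.
$count_utf8    = mb_strlen($simplified, 'UTF-8');
$count_gb18030 = mb_strlen($gb18030, 'GB18030');
$count_big5    = mb_strlen($big5, 'BIG-5');

// Count raw bytes; strlen() ignores encodings entirely.
$bytes_utf8    = strlen($simplified);
$bytes_gb18030 = strlen($gb18030);
$bytes_big5    = strlen($big5);

// U+20000 (a CJK Extension B character) written as its 4 UTF-8 bytes.
$rare       = "\xF0\xA0\x80\x80";
$rare_chars = mb_strlen($rare, 'UTF-8');
$rare_bytes = strlen($rare);

print("UTF-8:   $count_utf8 characters, $bytes_utf8 bytes\n");
print("GB18030: $count_gb18030 characters, $bytes_gb18030 bytes\n");
print("Big5:    $count_big5 characters, $bytes_big5 bytes\n");
print("U+20000: $rare_chars character, $rare_bytes bytes in UTF-8\n");
```

Passing the encoding per call keeps the character counts correct even when a single script handles strings in several encodings, which a global mbstring.internal_encoding setting cannot do.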

Dr. Herong Yang, updated in 2007