Unicode Tutorials - Herong's Tutorial Examples - v5.30, by Dr. Herong Yang
Byte Order Mark (BOM) - FEFF - EFBBBF
This section provides a brief introduction to the Byte Order Mark (BOM) character, U+FEFF, which serves as a Unicode signature when prepended to a character stream. The U+FEFF character becomes the 3-byte sequence EFBBBF when encoded in UTF-8.
What Is BOM (Byte Order Mark)? BOM is the informal name of the special Unicode character U+FEFF "ZERO WIDTH NO-BREAK SPACE" when it is prepended to a stream of Unicode characters as a "signature". This signature tells the receiver of the stream to expect Unicode characters and indicates the serialization order of the encoding octets.
When the BOM character, U+FEFF, is serialized in UTF-8 encoding, it becomes the octet sequence EF BB BF (\xEF\xBB\xBF).
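One way to check this byte sequence is to encode U+FEFF directly. The following is a minimal Python sketch, not part of the original tutorial; the helper `hex_bytes` and the variable names are illustrative assumptions. It also encodes the same character in UTF-16 with both byte orders, to show why the signature reveals the serialization order.

```python
import codecs

BOM_CHAR = "\ufeff"  # U+FEFF, ZERO WIDTH NO-BREAK SPACE


def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex (illustrative helper)."""
    return " ".join(f"{b:02X}" for b in data)


# UTF-8 has only one serialization, so the BOM is always EF BB BF.
utf8_bom = BOM_CHAR.encode("utf-8")

# UTF-16 has two possible byte orders; the BOM bytes differ between them.
utf16_be_bom = BOM_CHAR.encode("utf-16-be")
utf16_le_bom = BOM_CHAR.encode("utf-16-le")

print(hex_bytes(utf8_bom))      # EF BB BF
print(hex_bytes(utf16_be_bom))  # FE FF
print(hex_bytes(utf16_le_bom))  # FF FE

# The standard library exposes the same UTF-8 signature as a constant.
assert utf8_bom == codecs.BOM_UTF8
```

A receiver that sees FF FE at the start of a UTF-16 stream can conclude that the octets are in little-endian order, which is the "byte order" part of the name.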
As you can see from the previous tutorial, Notepad prepends U+FEFF to the text and converts it to EFBBBF when saving the text in UTF-8 encoding. This is why I was getting 3 extra bytes, EFBBBF, at the beginning of the saved UTF-8 text file.
With the introduction of the BOM character, we now need to be ready to support two variations of the UTF-8 text file format: files that start with the BOM (EFBBBF) and files that do not.
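A reader that accepts UTF-8 text with or without the signature might look like the sketch below. It is only an illustration under stated assumptions: the function name `decode_utf8_text` is hypothetical, and it relies on Python's standard "utf-8-sig" codec, which skips a leading EF BB BF if one is present and otherwise behaves like plain "utf-8".

```python
import codecs


def decode_utf8_text(data: bytes) -> str:
    """Decode UTF-8 bytes, dropping a leading BOM if one is present.

    Hypothetical helper for illustration; not from the original tutorial.
    """
    return data.decode("utf-8-sig")


# Two sample inputs: the same text with and without the signature.
with_bom = codecs.BOM_UTF8 + "Hello".encode("utf-8")
without_bom = "Hello".encode("utf-8")

print(repr(decode_utf8_text(with_bom)))
print(repr(decode_utf8_text(without_bom)))

# By contrast, plain "utf-8" keeps the BOM as a U+FEFF character.
print(repr(with_bom.decode("utf-8")))
```

Decoding with plain "utf-8" leaves U+FEFF at the start of the string, which is usually invisible when displayed but still affects comparisons and parsing.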
Read RFC 3629, "UTF-8, a transformation format of ISO 10646", November 2003 at http://tools.ietf.org/html/rfc3629 for more information.
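Note that RFC 3629 does not recommend the BOM unconditionally. Section 6 of the RFC says a protocol should forbid U+FEFF as a signature where the text is already known to be UTF-8 or where the protocol has its own way of identifying the encoding, and should not forbid it where no such identification mechanism exists.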