Converting GB2312 to UTF-8

The 'Herong's Tutorial Notes on GB2312 Character Set' tutorial book was cited in a Sun Java forum article in 2005. Note that my Geocities site has since moved to herongyang.com.

Subject: Re: Java Forums - Converting GB2312 to UTF-8
Date: Aug 11, 2005
Source: http://forum.java.sun.com/thread.jspa?threadID=639403
Author: horinius

This problem of yours is very interesting.

I'm no conversion expert, but after some investigations, I think the
problem comes from the word's GB2312 code itself (or Windows or
font?). Let me explain things first.

I've added an instruction to print out the number of characters in
your strLine:
...
System.out.print(strLine.length() + "\n");
bw.write(strLine);
...

Now, if I use your word, F25B, it gives 2 characters! If I use another
word, E0A2, it gives 1 character *as expected*. That's why I think the
problem comes from your word.
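
The length check described above can be reproduced in isolation. The
following is a minimal sketch, not the code from the thread: the class
name Gb2312LengthDemo and the helper decodePair are hypothetical, and it
assumes the Java runtime provides the "GB2312" charset. It decodes a
two-byte sequence and prints how many characters come out; the post
reports 1 for E0A2 and 2 for F25B.

```java
import java.nio.charset.Charset;

// Hypothetical demo class; the names are illustrative, not from the thread.
class Gb2312LengthDemo {

    // Decodes two bytes as GB2312. This String constructor replaces
    // malformed input with a replacement character instead of throwing.
    static String decodePair(int first, int second) {
        byte[] bytes = { (byte) first, (byte) second };
        return new String(bytes, Charset.forName("GB2312"));
    }

    public static void main(String[] args) {
        // A code whose second byte is inside A1-FE.
        System.out.println("E0A2 -> " + decodePair(0xE0, 0xA2).length());
        // The problem code from the thread; second byte 5B is outside A1-FE.
        System.out.println("F25B -> " + decodePair(0xF2, 0x5B).length());
    }
}
```

A length greater than 1 for a two-byte input suggests the decoder did
not accept the pair as a single GB2312 character.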

Then I turned to Unicode's database and conversion:
http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=9A15

As you can see, it seems that your word can't be mapped to GB2312
(but it is mapped to Big5). So there are 2 possibilities:
1. The Unicode consortium forgot to add this word to its database
2. This word doesn't exist in GB2312.

After some searches, I came across this website on GB2312 <-> Unicode
(which is very well done. Good job, Herong!):

http://www.geocities.com/herong_yang/gb2312/

When I looked up characters in the F2XX range:
http://www.geocities.com/herong_yang/gb2312/bihua_4.html

it is clear that your word, F25B, isn't defined!

Moreover, as far as I can see from the pattern, I think the GB2312
encoding scheme is as follows: a GB2312 code is an XX YY pair, and the
possible values for YY are A1 to FE. (Am I correct?)

Now, your word's code is F2 5B, but 5B isn't within the A1-FE range, so
the code isn't a valid GB2312 code and so I think that word is simply
not encoded in GB2312.
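
The range argument above can be written as a small check. This is a
sketch under stated assumptions: the class Gb2312RangeCheck and the
method isInGb2312Range are hypothetical names, and the test only covers
the byte ranges of the EUC-CN form of GB2312, where both bytes fall in
A1 to FE. A pair that passes is not necessarily an assigned character.

```java
// Hypothetical helper; not part of the JDK or of the thread's code.
class Gb2312RangeCheck {

    // True if a single byte value lies in the A1-FE range.
    private static boolean inRange(int b) {
        return b >= 0xA1 && b <= 0xFE;
    }

    // True if both bytes of an XX YY pair lie in A1-FE.
    // This checks the structure only, not whether the code is assigned.
    static boolean isInGb2312Range(int first, int second) {
        return inRange(first) && inRange(second);
    }

    public static void main(String[] args) {
        System.out.println("E0A2 in range: " + isInGb2312Range(0xE0, 0xA2));
        System.out.println("F25B in range: " + isInGb2312Range(0xF2, 0x5B));
    }
}
```

Note that a second byte of 5B does fall inside the trail-byte range of
GBK (40 to 7E and 80 to FE), the superset used by Windows code page 936,
which may be relevant to the suspicion about Windows that follows.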

But the problem might be Windows' because Microsoft never does things
according to standards and M$ might "invent" a GB2312 code for
non-GB2312-defined characters.

This is very annoying but I don't know what to do besides writing a
complaint letter to Microsoft (but it would most likely be ignored).
