The Asian character sets that we support include Chinese,
Japanese, Korean, and Thai. These can be complicated. For
example, the Chinese sets must allow for thousands of
different characters. See Section 9.1.14.7.1, “The cp932
Character Set”, for
additional information about the cp932
and
sjis
character sets.
For answers to some common questions and problems relating support for Asian character sets in MySQL, see Section A.11, “MySQL 5.5 FAQ — MySQL Chinese, Japanese, and Korean Character Sets”.
big5
(Big5 Traditional Chinese)
collations:
big5_bin
big5_chinese_ci
(default)
cp932
(SJIS for Windows Japanese)
collations:
cp932_bin
cp932_japanese_ci
(default)
eucjpms
(UJIS for Windows Japanese)
collations:
eucjpms_bin
eucjpms_japanese_ci
(default)
euckr
(EUC-KR Korean) collations:
euckr_bin
euckr_korean_ci
(default)
gb2312
(GB2312 Simplified Chinese)
collations:
gb2312_bin
gb2312_chinese_ci
(default)
gbk
(GBK Simplified Chinese)
collations:
gbk_bin
gbk_chinese_ci
(default)
sjis
(Shift-JIS Japanese) collations:
sjis_bin
sjis_japanese_ci
(default)
tis620
(TIS620 Thai) collations:
tis620_bin
tis620_thai_ci
(default)
ujis
(EUC-JP Japanese) collations:
ujis_bin
ujis_japanese_ci
(default)
The big5_chinese_ci
collation sorts on
number of strokes.
For additional information about Asian collations in MySQL, see Collation-Charts.Org (big5, cp932, eucjpms, euckr, gb2312, gbk, sjis, tis620, ujis).
Why is cp932
needed?
In MySQL, the sjis
character set
corresponds to the Shift_JIS
character
set defined by IANA, which supports JIS X0201 and JIS X0208
characters. (See
http://www.iana.org/assignments/character-sets.)
However, the meaning of “SHIFT JIS” as a
descriptive term has become very vague and it often includes
the extensions to Shift_JIS
that are
defined by various vendors.
For example, “SHIFT JIS” used in Japanese
Windows environments is a Microsoft extension of
Shift_JIS
and its exact name is
Microsoft Windows Codepage : 932
or
cp932
. In addition to the characters
supported by Shift_JIS
,
cp932
supports extension characters such
as NEC special characters, NEC selected — IBM extended
characters, and IBM extended characters.
Many Japanese users have experienced problems using these extension characters. These problems stem from the following factors:
MySQL automatically converts character sets.
Character sets are converted via Unicode
(ucs2
).
The sjis
character set does not
support the conversion of these extension characters.
There are several conversion rules from so-called “SHIFT JIS” to Unicode, and some characters are converted to Unicode differently depending on the conversion rule. MySQL supports only one of these rules (described later).
The MySQL cp932
character set is designed
to solve these problems.
Because MySQL supports character set conversion, it is
important to separate IANA Shift_JIS
and
cp932
into two different character sets
because they provide different conversion rules.
How does cp932
differ from sjis
?
The cp932
character set differs from
sjis
in the following ways:
cp932
supports NEC special
characters, NEC selected — IBM extended
characters, and IBM selected characters.
Some cp932
characters have two
different code points, both of which convert to the same
Unicode code point. When converting from Unicode back to
cp932
, one of the code points must be
selected. For this “round trip conversion,”
the rule recommended by Microsoft is used. (See
http://support.microsoft.com/kb/170559/EN-US/.)
The conversion rule works like this:
If the character is in both JIS X 0208 and NEC special characters, use the code point of JIS X 0208.
If the character is in both NEC special characters and IBM selected characters, use the code point of NEC special characters.
If the character is in both IBM selected characters and NEC selected — IBM extended characters, use the code point of IBM extended characters.
The table shown at
http://www.microsoft.com/globaldev/reference/dbcs/932.htm
provides information about the Unicode values of
cp932
characters. For
cp932
table entries with characters
under which a four-digit number appears, the number
represents the corresponding Unicode
(ucs2
) encoding. For table entries
with an underlined two-digit value appears, there is a
range of cp932
character values that
begin with those two digits. Clicking such a table entry
takes you to a page that displays the Unicode value for
each of the cp932
characters that
begin with those digits.
The following links are of special interest. They correspond to the encodings for the following sets of characters:
NEC special characters:
http://www.microsoft.com/globaldev/reference/dbcs/932/932_87.htm
NEC selected — IBM extended characters:
http://www.microsoft.com/globaldev/reference/dbcs/932/932_ED.htm http://www.microsoft.com/globaldev/reference/dbcs/932/932_EE.htm
IBM selected characters:
http://www.microsoft.com/globaldev/reference/dbcs/932/932_FA.htm http://www.microsoft.com/globaldev/reference/dbcs/932/932_FB.htm http://www.microsoft.com/globaldev/reference/dbcs/932/932_FC.htm
cp932
supports conversion of
user-defined characters in combination with
eucjpms
, and solves the problems with
sjis
/ujis
conversion. For details, please refer to
http://www.opengroup.or.jp/jvc/cde/sjis-euc-e.html.
For some characters, conversion to and from
ucs2
is different for
sjis
and cp932
. The
following tables illustrate these differences.
Conversion to ucs2
:
sjis /cp932
Value |
sjis ->
ucs2 Conversion |
cp932 ->
ucs2 Conversion |
5C | 005C | 005C |
7E | 007E | 007E |
815C | 2015 | 2015 |
815F | 005C | FF3C |
8160 | 301C | FF5E |
8161 | 2016 | 2225 |
817C | 2212 | FF0D |
8191 | 00A2 | FFE0 |
8192 | 00A3 | FFE1 |
81CA | 00AC | FFE2 |
Conversion from ucs2
:
ucs2 value |
ucs2 ->
sjis Conversion |
ucs2 ->
cp932 Conversion |
005C | 815F | 5C |
007E | 7E | 7E |
00A2 | 8191 | 3F |
00A3 | 8192 | 3F |
00AC | 81CA | 3F |
2015 | 815C | 815C |
2016 | 8161 | 3F |
2212 | 817C | 3F |
2225 | 3F | 8161 |
301C | 8160 | 3F |
FF0D | 3F | 817C |
FF3C | 3F | 815F |
FF5E | 3F | 8160 |
FFE0 | 3F | 8191 |
FFE1 | 3F | 8192 |
FFE2 | 3F | 81CA |
Users of any Japanese character sets should be aware that
using
--character-set-client-handshake
(or
--skip-character-set-client-handshake
)
has an important effect. See
Section 5.1.2, “Server Command Options”.
User Comments
As of MySQL 4.1.14,
Please notice that for Traditional Chinese (BIG5), collation 'big5_chinese_ci' uses stroke count of the characters on ordering; while in Simplified Chinese (GB2312), collation 'gb2312_chinese_ci' uses Pinyin of the characters on ordering.
Add your own comment.