MySQL 5.4 has two Unicode character sets:
ucs2
, the UCS-2 encoding of the Unicode
character set using 16 bits per character
utf8
, a UTF-8 encoding of the Unicode
character set using one to three bytes per character
You can store text in about 650 languages using these character sets. This section lists the collations available for each Unicode character set. For general information about the character sets, see Section 9.1.10, “Unicode Support”.
A similar set of collations is available for each Unicode
character set. These are shown in the following list, where
xxx
represents the character set
name. For example,
represents the Danish collations, the specific names of which
are xxx
_danish_ciucs2_danish_ci
and
utf8_danish_ci
.
xxx
_bin
xxx
_czech_ci
xxx
_danish_ci
xxx
_esperanto_ci
xxx
_estonian_ci
(default)
xxx
_general_ci
xxx
_hungarian_ci
xxx
_icelandic_ci
xxx
_latvian_ci
xxx
_lithuanian_ci
xxx
_persian_ci
xxx
_polish_ci
xxx
_roman_ci
xxx
_romanian_ci
xxx
_slovak_ci
xxx
_slovenian_ci
xxx
_spanish2_ci
xxx
_spanish_ci
xxx
_swedish_ci
xxx
_turkish_ci
xxx
_unicode_ci
MySQL implements the
collations according to the Unicode Collation Algorithm (UCA)
described at
http://www.unicode.org/reports/tr10/. The
collation uses the version-4.0.0 UCA weight keys:
http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt.
Currently, the
xxx
_unicode_ci
collations have only partial support for the Unicode Collation
Algorithm. Some characters are not supported yet. Also,
combining marks are not fully supported. This affects
primarily Vietnamese, Yoruba, and some smaller languages such
as Navajo. The following discussion uses
xxx
_unicode_ciutf8_unicode_ci
for concreteness.
For any Unicode character set, operations performed using the
_general_ci
collation are faster than those
for the _unicode_ci
collation. For example,
comparisons for the utf8_general_ci
collation are faster, but slightly less correct, than
comparisons for utf8_unicode_ci
. The reason
for this is that utf8_unicode_ci
supports
mappings such as expansions; that is, when one character
compares as equal to combinations of other characters. For
example, in German and some other languages
“ß
” is equal to
“ss
”.
utf8_unicode_ci
also supports contractions
and ignorable characters. utf8_general_ci
is a legacy collation that does not support expansions,
contractions, or ignorable characters. It can make only
one-to-one comparisons between characters.
To further illustrate, the following equalities hold in both
utf8_general_ci
and
utf8_unicode_ci
(for the effect this has in
comparisons or when doing searches, see
Section 9.1.7.7, “Examples of the Effect of Collation”):
Ä = A Ö = O Ü = U
A difference between the collations is that this is true for
utf8_general_ci
:
ß = s
Whereas this is true for utf8_unicode_ci
:
ß = ss
MySQL implements language-specific collations for the
utf8
character set only if the ordering
with utf8_unicode_ci
does not work well for
a language. For example, utf8_unicode_ci
works fine for German and French, so there is no need to
create special utf8
collations for these
two languages.
utf8_general_ci
also is satisfactory for
both German and French, except that
“ß
” is equal to
“s
”, and not to
“ss
”. If this is acceptable
for your application, then you should use
utf8_general_ci
because it is faster.
Otherwise, use utf8_unicode_ci
because it
is more accurate.
utf8_swedish_ci
, like other
utf8
language-specific collations, is
derived from utf8_unicode_ci
with
additional language rules. For example, in Swedish, the
following relationship holds, which is not something expected
by a German or French speaker:
Ü = Y < Ö
The
and
xxx
_spanish_ci
collations correspond to modern Spanish and traditional
Spanish, respectively. In both collations,
“xxx
_spanish2_ciñ
” (n-tilde) is a separate
letter between “n
” and
“o
”. In addition, for
traditional Spanish, “ch
” is a
separate letter between “c
”
and “d
”, and
“ll
” is a separate letter
between “l
” and
“m
”
In the
collations, xxx
_roman_ciI
and J
compare as equal, and U
and
V
compare as equal.
For additional information about Unicode collations in MySQL, see Collation-Charts.Org (utf8).
User Comments
Note for Hebrew speakers: in utf8_general_ci, dots (Niqqud symbols) are treated as seperate characters, so
a. If you have dotted words in your table, they won't be ordered correctly.
b. You can have words with the same letters and different dots in a unique index column.
On the other hand, in utf8_unicode_ci, dots are igonred, so:
a. The order will be correct;
b. Words with the same letters and different dots will be regarded as equal, and you won't be able to have them in a unique index column.
I still didn't find a collation that treats both issues correctly.
Add your own comment.