MySQL :: MySQL 5.4 Reference Manual :: 9.1.13.1 Unicode Character Sets

MySQL 5.4 Reference Manual :: 9 Internationalization and Localization :: 9.1 Character Set Support :: 9.1.13 Character Sets and Collations That MySQL Supports :: 9.1.13.1 Unicode Character Sets

« 9.1.13 Character Sets and Collations That MySQL Supports

9.1.13.2 West European Character Sets »

Section Navigation [Toggle]

9.1.13.1. Unicode Character Sets

MySQL 5.4 has two Unicode character sets:

ucs2, the UCS-2 encoding of the Unicode character set using 16 bits per character
utf8, a UTF-8 encoding of the Unicode character set using one to three bytes per character

You can store text in about 650 languages using these character sets. This section lists the collations available for each Unicode character set. For general information about the character sets, see Section 9.1.10, “Unicode Support”.

A similar set of collations is available for each Unicode character set. These are shown in the following list, where xxx represents the character set name. For example, xxx_danish_ci represents the Danish collations, the specific names of which are ucs2_danish_ci and utf8_danish_ci.

xxx_bin
xxx_czech_ci
xxx_danish_ci
xxx_esperanto_ci
xxx_estonian_ci
xxx_general_ci (default)
xxx_hungarian_ci
xxx_icelandic_ci
xxx_latvian_ci
xxx_lithuanian_ci
xxx_persian_ci
xxx_polish_ci
xxx_roman_ci
xxx_romanian_ci
xxx_slovak_ci
xxx_slovenian_ci
xxx_spanish2_ci
xxx_spanish_ci
xxx_swedish_ci
xxx_turkish_ci
xxx_unicode_ci

MySQL implements the xxx_unicode_ci collations according to the Unicode Collation Algorithm (UCA) described at http://www.unicode.org/reports/tr10/. The collation uses the version-4.0.0 UCA weight keys: http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt. Currently, the xxx_unicode_ci collations have only partial support for the Unicode Collation Algorithm. Some characters are not supported yet. Also, combining marks are not fully supported. This affects primarily Vietnamese, Yoruba, and some smaller languages such as Navajo. The following discussion uses utf8_unicode_ci for concreteness.

For any Unicode character set, operations performed using the _general_ci collation are faster than those for the _unicode_ci collation. For example, comparisons for the utf8_general_ci collation are faster, but slightly less correct, than comparisons for utf8_unicode_ci. The reason for this is that utf8_unicode_ci supports mappings such as expansions; that is, when one character compares as equal to combinations of other characters. For example, in German and some other languages “ß” is equal to “ss”. utf8_unicode_ci also supports contractions and ignorable characters. utf8_general_ci is a legacy collation that does not support expansions, contractions, or ignorable characters. It can make only one-to-one comparisons between characters.

To further illustrate, the following equalities hold in both utf8_general_ci and utf8_unicode_ci (for the effect this has in comparisons or when doing searches, see Section 9.1.7.7, “Examples of the Effect of Collation”):

Ä = A
Ö = O
Ü = U

A difference between the collations is that this is true for utf8_general_ci:

ß = s

Whereas this is true for utf8_unicode_ci:

ß = ss

MySQL implements language-specific collations for the utf8 character set only if the ordering with utf8_unicode_ci does not work well for a language. For example, utf8_unicode_ci works fine for German and French, so there is no need to create special utf8 collations for these two languages.

utf8_general_ci also is satisfactory for both German and French, except that “ß” is equal to “s”, and not to “ss”. If this is acceptable for your application, then you should use utf8_general_ci because it is faster. Otherwise, use utf8_unicode_ci because it is more accurate.

utf8_swedish_ci, like other utf8 language-specific collations, is derived from utf8_unicode_ci with additional language rules. For example, in Swedish, the following relationship holds, which is not something expected by a German or French speaker:

Ü = Y < Ö

The xxx_spanish_ci and xxx_spanish2_ci collations correspond to modern Spanish and traditional Spanish, respectively. In both collations, “ñ” (n-tilde) is a separate letter between “n” and “o”. In addition, for traditional Spanish, “ch” is a separate letter between “c” and “d”, and “ll” is a separate letter between “l” and “m”

In the xxx_roman_ci collations, I and J compare as equal, and U and V compare as equal.

For additional information about Unicode collations in MySQL, see Collation-Charts.Org (utf8).

Previous / Next / Up / Table of Contents

User Comments

Posted by Erel Segal on December 9 2006 7:42pm

[Delete] [Edit]

Note for Hebrew speakers: in utf8_general_ci, dots (Niqqud symbols) are treated as seperate characters, so
a. If you have dotted words in your table, they won't be ordered correctly.
b. You can have words with the same letters and different dots in a unique index column.

On the other hand, in utf8_unicode_ci, dots are igonred, so:
a. The order will be correct;
b. Words with the same letters and different dots will be regarded as equal, and you won't be able to have them in a unique index column.

I still didn't find a collation that treats both issues correctly.

Add your own comment.

Top / Previous / Next / Up / Table of Contents