MySQL implements several types of collations:
Simple collations for 8-bit character sets
This kind of collation is implemented using an array of 256
weights that defines a one-to-one mapping from character codes
to weights. latin1_swedish_ci
is an example.
It is a case-insensitive collation, so the uppercase and
lowercase versions of a character have the same weights and they
compare as equal.
mysql>SET NAMES 'latin1' COLLATE 'latin1_swedish_ci';
Query OK, 0 rows affected (0.00 sec) mysql>SELECT 'a' = 'A';
+-----------+ | 'a' = 'A' | +-----------+ | 1 | +-----------+ 1 row in set (0.00 sec)
Complex collations for 8-bit character sets
This kind of collation is implemented using functions in a C source file that define how to order characters, as described in Section 9.4, “Adding a New Character Set”.
Collations for non-Unicode multi-byte character sets
For this type of collation, 8-bit (single-byte) and multi-byte
characters are handled differently. For 8-bit characters,
character codes map to weights in case-insensitive fashion. (For
example, the single-byte characters 'a'
and
'A'
both have a weight of
0x41
.) For multi-byte characters, there are
two types of relationship between character codes and weights:
Weights equal character codes.
sjis_japanese_ci
is an example of this
kind of collation. The multi-byte character
'ぢ'
has a character code of
0x82C0
, and the weight is also
0x82C0
.
Character codes map one-to-one to weights, but a code is not
necessarily equal to the weight.
gbk_chinese_ci
is an example of this kind
of collation. The multi-byte character
'膰'
has a character code of
0x81B0
but a weight of
0xC286
.
Collations for Unicode multi-byte character sets
Some of these collations are based on the Unicode Collation Algorithm (UCA), others are not.
Non-UCA collations have a one-to-one mapping from character code
to weight. In MySQL, such collations are case insensitive and
accent insensitive. utf8_general_ci
is an
example: 'a'
, 'A'
,
'À'
, and 'á'
each have
different character codes but all have a weight of
0x0041
and compare as equal.
mysql>SET NAMES 'utf8' COLLATE 'utf8_general_ci';
Query OK, 0 rows affected (0.00 sec) mysql>SELECT 'a' = 'A', 'a' = 'À', 'a' = 'á';
+-----------+-----------+-----------+ | 'a' = 'A' | 'a' = 'À' | 'a' = 'á' | +-----------+-----------+-----------+ | 1 | 1 | 1 | +-----------+-----------+-----------+ 1 row in set (0.06 sec)
UCA-based collations in MySQL have these properties:
If a character has weights, each weight uses 2 bytes (16 bits)
A character may have zero weights (or an empty weight). In this case, the character is ignorable. Example: "U+0000 NULL" does not have a weight and is ignorable.
A character may have one weight. Example:
'a'
has a weight of
0x0E33
.
A character may have many weights. This is an expansion.
Example: The German letter 'ß'
(SZ
ligature, or SHARP S) has a weight of
0x0FEA0FEA
.
Many characters may have one weight. This is a contraction.
Example: 'ch'
is a single letter in Czech
and has a weight of 0x0EE2
.
A many-characters-to-many-weights mapping is also possible (this is contraction with expansion), but is not supported by MySQL.
Miscellaneous collations
There are also a few collations that do not fall into any of the previous categories.
User Comments
Add your own comment.