MySQL :: MySQL 5.1 with Maria Reference Manual

MySQL 5.1 with Maria Reference Manual :: 9 Internationalization and Localization :: 9.1 Character Set Support :: 9.1.10 Unicode Support

« 9.1.9.3 SHOW Statements and INFORMATION_SCHEMA

9.1.11 UTF-8 for Metadata »

Section Navigation [Toggle]

9.1.10. Unicode Support

MySQL 5.1 supports two character sets for storing Unicode data:

ucs2, the UCS-2 encoding of the Unicode character set using 16 bits per character
utf8, a UTF-8 encoding of the Unicode character set using one to three bytes per character

These two character sets support the characters from the Basic Multilingual Plane (BMP) of Unicode Version 3.0. BMP characters have these characteristics:

Their code values are between 0 and 65535 (or U+0000 .. U+FFFF)
They can be encoded with a fixed 16-bit word, as in ucs2
They can be encoded with 8, 16, or 24 bits, as in utf8
They are sufficient for almost all characters in major languages

The ucs2 and utf8 character sets do not support supplementary characters that lie outside the BMP.

A similar set of collations is available for each Unicode character set. For example, each has a Danish collation, the names of which are ucs2_danish_ci and utf8_danish_ci. All Unicode collations are listed at Section 9.1.13.1, “Unicode Character Sets”.

In UCS-2, every character is represented by a two-byte Unicode code with the most significant byte first. For example: LATIN CAPITAL LETTER A has the code 0x0041 and it is stored as a two-byte sequence: 0x00 0x41. CYRILLIC SMALL LETTER YERU (Unicode 0x044B) is stored as a two-byte sequence: 0x04 0x4B. For Unicode characters and their codes, please refer to the Unicode Home Page.

The MySQL implementation of UCS-2 stores characters in big-endian byte order and does not use a byte order mark (BOM) at the beginning of UCS-2 values. Other database systems might use little-endian byte order or a BOM, in which case, conversion of UCS-2 values will need to be performed when transferring data between those systems and MySQL.

UTF-8 (Unicode Transformation Format with 8-bit units) is an alternative way to store Unicode data. It is implemented according to RFC 3629. RFC 3629 describes encoding sequences that take from one to four bytes. Currently, MySQL support for UTF-8 does not include four-byte sequences. (An older standard for UTF-8 encoding is given by RFC 2279, which describes UTF-8 sequences that take from one to six bytes. RFC 3629 renders RFC 2279 obsolete; for this reason, sequences with five and six bytes are no longer used.)

The idea of UTF-8 is that various Unicode characters are encoded using byte sequences of different lengths:

Basic Latin letters, digits, and punctuation signs use one byte.
Most European and Middle East script letters fit into a two-byte sequence: extended Latin letters (with tilde, macron, acute, grave and other accents), Cyrillic, Greek, Armenian, Hebrew, Arabic, Syriac, and others.
Korean, Chinese, and Japanese ideographs use three-byte sequences.

MySQL uses no BOM for UTF-8 values.

Tip: To save space with UTF-8, use VARCHAR instead of CHAR. Otherwise, MySQL must reserve three bytes for each character in a CHAR CHARACTER SET utf8 column because that is the maximum possible length. For example, MySQL must reserve 30 bytes for a CHAR(10) CHARACTER SET utf8 column.

UCS-2 cannot be used as a client character set, which means that SET NAMES 'ucs2' does not work. (See Section 9.1.4, “Connection Character Sets and Collations”.)

Client applications that need to communicate with the server using Unicode should set the client character set accordingly; for example, by issuing a SET NAMES 'utf8' statement. ucs2 cannot be used as a client character set, which means that it does not work for SET NAMES or SET CHARACTER SET. (See Section 9.1.4, “Connection Character Sets and Collations”.)

Previous / Next / Up / Table of Contents

User Comments

Posted by Haakon Meland Eriksen on January 24 2006 9:04am

[Delete] [Edit]

Connect with the same characterset as your data to display correctly. This example connects to the MySQL-server using UTF-8:

mysql --default-character-set=utf8 -uyour_username -p -h your_databasehost.your_domain.com your_database

If you get into trouble from a PHP-based web application, check the characterset configurations of these components:

1) the MySQL database
2) php.ini
3) httpd.conf
4) your server

Posted by lorenz pressler on May 2 2006 12:46pm

[Delete] [Edit]

if you get data via php from your mysql-db (everything utf-8)
but still get '?' for some special characters in your browser
(<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />),
try this:

after mysql_connect() , and mysql_select_db() add this lines:
mysql_query("SET NAMES utf8");

worked for me.
i tried first with the utf8_encode, but this only worked for äüöéè...
and so on, but not for kyrillic and other chars.

Posted by Eliram on August 6 2008 5:45pm

[Delete] [Edit]

I had a problem submitting unicode data from ASP pages to the MySQL server while everything was set to utf8 .

It turns out the problem was that my ODBC driver was version 3.5.1 and that's what caused the problem. Installing version 5.1 solved the problem.

http://dev.mysql.com/downloads/connector/odbc/

Posted by David Busby on November 17 2008 11:07am

[Delete] [Edit]

As of mySQL 5.x you can use the init_connect commands to force UTF-8 compliance from any client connection.

I have blogged about this here: http://www.saiweb.co.uk/mysql/mysql-forcing-utf-8-compliance-for-all-connections

Removing the need to use SET NAME in your PHP/ASP/Ruby/C++ code.

Add your own comment.

Top / Previous / Next / Up / Table of Contents