Mysql Utf8

Originally published at https://pjambet.github.io/

tl;dr; MySQL regular UTF8 encoding sucks, you should use UTF8mb4 (or even better, Postgres !)

Intro

This is the details title (click to expand) This is the content (hidden until clicked).

In the past few months I had to struggle with MySQL and its crappy UTF8 support, so I thought I would sum up my findings here.

If you’ve seen stuff like

Mysql::Error: Specified key was too long; max key length is 767 bytes

or

Mysql2::Error: Incorrect string value: '\xF0\x9F\x8F\xA0' for column 'body' at row 1 ...

then maybe this article can help you.

Technical details !

Basically, in order to store the whole range of UTF8 characters, each chararacter has to be 4 bytes wide. But MySQL, with its default UTF8 encoding will only allocate 3 bytes per character. This becomes a problem when users try to insert character such as emojis in your database. That might sound like an edge case, but if your app is an API where users can post content through their phones, then it’s really likely some people will add emojis.

If you want a comprehensive guide for migrating your mysql database from utf8 to utf8mb4, this article contains everything you need.

How to use it in Rails and Django

Using utf8mb4 is quite straight forward

Django example :

in your new project :

$> django-admin.py startproject test

Simply add the charset option in the DATABASES setting :

settings.py

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'test',
        'USER': 'test_user'
        'OPTIONS': {'charset': 'utf8mb4'},
        'PASSWORD': '',
        'HOST': '',
        'PORT': '',
    }
}

And you’re good to go !

Caveat

There is still a problem with that configuration. Now the maximum length that MySQL can index is 191 instead of 255. That won’t be a problem if you apply this new limit on your models, but some libraries, such as celery try to index some column with a 255 max length.

So if you have djcelery in your INSTALLED_APPS, running your migrations (python manage.py migrate) will fail with a nice :

django.db.utils.DatabaseError: (1071, 'Specified key was too long; max key length is 767 bytes')

The only solution I found so far is to only run migrations on an app per app basis.

python manage.py migrate myapp
python manage.py migrate anotherapp

and for the app that fails, such as djcelery, first run

python manage.py sqall djcelery

Which should give you an output similar to :

BEGIN;
CREATE TABLE `celery_taskmeta` (
    `id` integer AUTO_INCREMENT NOT NULL PRIMARY KEY,
    `task_id` varchar(255) NOT NULL UNIQUE,
    `status` varchar(50) NOT NULL,
    `result` longtext,
    `date_done` datetime NOT NULL,
    `traceback` longtext,
    `hidden` bool NOT NULL,
    `meta` longtext
)
;

# other CREATE TABLES ...

CREATE INDEX `djcelery_taskstate_c91f1bf` ON `djcelery_taskstate` (`hidden`);
COMMIT;

Then copy this in your favorite editor (vim, of course !), and specify the CHARACTER SET as utf8 for each table. This should not cause any issues, as all the data that is gonna be inserted in those tables is controlled by celery itself, and there should not be any emojis are other 4 bytes wide characters.

Then your new file should look like :

BEGIN;
CREATE TABLE `celery_taskmeta` (
    `id` integer AUTO_INCREMENT NOT NULL PRIMARY KEY,
    `task_id` varchar(255) NOT NULL UNIQUE,
    `status` varchar(50) NOT NULL,
    `result` longtext,
    `date_done` datetime NOT NULL,
    `traceback` longtext,
    `hidden` bool NOT NULL,
    `meta` longtext
) CHARACTER SET=utf8 # Note the CHARACTER SET explictly set
;

# ... and the rest

PS : I’ve opened an issue on github, and I am still waiting for more information about this problem.

Rails example :

Note that utf8mb4 support in rails is really recent, and is only supported since rails 4.0.0

in your new rails project :

$> rails new test -d mysql

database.yml

# MySQL.  Versions 4.1 and 5.0 are recommended.
#
# Install the MYSQL driver
#   gem install mysql2
#
# Ensure the MySQL gem is defined in your Gemfile
#   gem 'mysql2'
#
# And be sure to use new-style password hashing:
#   http://dev.mysql.com/doc/refman/5.0/en/old-client.html
development:
    adapter: mysql2
    encoding: utf8mb4
    database: test_development
    pool: 5
    username: root
    password:
    socket: /tmp/mysql.sock

# Warning: The database defined as "test" will be erased and
# re-generated from your development database when you run "rake".
# Do not set this db to the same as development or production.
test:
    adapter: mysql2
    encoding: utf8mb4
    database: test_test
    pool: 5
    username: root
    password:
    socket: /tmp/mysql.sock

production:
    adapter: mysql2
    encoding: utf8mb4
    database: test_production
    pool: 5
    username: root
    password:
    socket: /tmp/mysql.sock

Use postgresql if you can !

This won’t really help you if you’re stuck with MySQL. In my case, I had to use AWS and had to use MySQL. Even if it’s possible to use postgresql on AWS, using RDS has many advantages, such as the ease of setup and configuration.

So if you have the choice, use postgresql ! I haven’t seen any advantages of MySQL over postgresql.

Postgres UTF8 encoding works like a charm (and it should be the same for every RDBMS), that means no encoding, collation or index length headaches. It just works !

And of course if you start using posgresql, you’ll discover a bunch of amazing features such as array columns, hstore and many other really awesome stuff !

dev  sql  mysql 

See also