Wednesday, August 25, 2010

dude, where's my umlaut?

While setting up a new rocket for our company hosting service I was forced to revisit the old problem of latin1 mysql data from an old project displaying weird after the move to the new server.
Well it turns out that as always, it was the desired behavior...
The old content was inserted form the CakePHP app without setting the encoding to utf8 in database.php. This let Cake decide that the proper encoding to communicate to the server will be latin1, the server's default global "character set server". So the data was actually latin1 even if all the other parameters were utf8. Even my database with latin1 content was utf8 collated as well as all it's tables.
Now what I wanted was to have utf8 inserted from now on. So I needed to set the encoding. But doing this was causing the text to display all sorts of weird symbols. So the task as hand was really to convert latin1 data to utf8.
Now my luck was that all that data was latin1 so I had consistency. There are articles on the net dealing with mixed encoded data or even double encoded. I weep for anyone who has to fix that.
My solution was easy and by the book. Dump the content while telling MySQL that it's encoding is latin1:

mysqldump -u userName -p --default-character-set=latin1 dbName > dbName.sql
edit the sql file with VI and change:
/*!40101 SET NAMES latin1 */;
to
/*!40101 SET NAMES utf8 */;
drop all the tables from the database and import the sql back:
mysql -u -p --default-character-set=utf8 dbName < dbName.sql

The database is fairly small with the sql being only 2MB so it worked fast. Well the box is quad with 8 gigs of RAM so probably that helped too.
As always when working with magic, remember that backup is the first step to success ;)