I am a bad human being. I confess that 7 years ago, when I was first putting together the Online Talent Manager personality testing system, I (I’m so ashamed to say this) used ISO-8859-15 encoding in the MySQL database! The first 6 months were fantastic, the application grew by leaps and bounds and the amount of data collected was impressive.
If you are a system’s administrator, you know exactly what has happened to me in the last 7 years. A series of ‘little’ encoding problems with half-hearted fixes and a series of failed conversion schemes to ‘fix’ the problem. Every new table, every new feature, every new anything became a two faze process:
- Build it.
- Play whack-a-mole with badly encoded characters.
And it wasn’t that the characters were badly encoded, it was just that I had to fit those characters into the bizarro scheme that had crudded up my applications over the years. I thought many times about fixing it, really fixing it, and moving everything to UTF-8, but if you have seen any other posts on the internet about this topic, you know that there are horror stories enough to make a grown administrator quake in his sneakers.
But now, I HAVE DONE IT. And it wasn’t really that painful or weird. I’m assuming you are using MySQL and Apache on a flavor of Linux. Here are the steps.
1) Create a dump of your current database:
mysqldump -u youruser -pyourpass yourdatabasename > yourdatabase.sql
2) Shut down your web application and go through it removing ISO-8859 or any other encoding schemes from your pages and replacing everything with UTF-8. If you are using some sort of templating system or CMS, this will be a breeze for you.
3) Now let’s clean up that .sql file with the iconv application. Make sure you have it installed ( man iconv ). Here is the formula to take your original dump and clean out all of the latin and replace it with gorgeous clean UTF8:
iconv –from-code LATIN1 –to-code UTF-8 yourdatabase.sql > utfclean_yourdatabase.sql
4) Open up your utfclean_yourdatabase.sql and replace all instances of “latin1″ with “utf8″. I used vim on my 350 mb sql file and it took about 3 minutes:
5) Edit the mysql configuration file, in my Ubuntu installation, it was located at /etc/mysql/my.cnf. Add the following line:
6) Make sure apache is talking utf8 by default. Again, for me that was located in the /etc/apache2/conf.d/charset file. I simply removed the comment from the following line:
7) Restart Apache /etc/init.d/apache2 restart
8) Restart MySQL /etc/init.d/mysql restart
9) Import your cleaned data back into the database:
mysql -u yourname -pyourpassword yourdatabase < utfclean_yourdatabase.sql
10) Your connections are all talking utf8 now, your database is utf8 now, and now it is just a case of cleaning up your own code to remove any hacks you might have installed. That took about 2 hours for me and mostly it was just commenting out calls to encoding routines.
11) Enjoy your freedom.