I’ve been using WordPress to power my blog ever since I started wasting time with it, and it’s been pretty good to me so far. However, there was something that bothered me: despite serving UTF-8 to the browser, the actual database table collation being used was latin1_swedish_ci. Now the way WordPress was getting away with this was by passing 8-bit UTF-8 text to the database server and telling it that it was 8-bit Windows Latin 1. This is very bad, as sorting and searching wouldn’t behave properly.
Now with WordPress 2.3, all that’s in the past. WordPress now seems to be able to do the right thing with text encodings. It’s just too bad the upgrade script can’t clean up the rot left from previous versions. After running the upgrade script, I found that every piece of Japanese text, every typographical quote, every accented character, in fact everything outside 7-bit ASCII, was horribly mangled. Now I had a number of options for going forward:
- Go back to a previous version of WordPress
- Leave it and hope no-one minds.
- Delete every post that got mangled.
- Manually fix every affected post.
- Come up with a l33t way to solve it without manual effort.
Now the first option would have been easy. I had a complete backup (like you should, too), and I could have rolled it back in a matter of minutes. But I like to be on the curve, and I like to have all the newest features, even if I never use them. Also, having the latest security updates is nice.
The second option wouldn’t fly, because even if the readers wouldn’t mind, I’d mind. The third option would probably mean deleting every post, since I’m in the habit of using typographical quotes, non-breaking spaces and dashes (as opposed to hyphens). Deleting all my posts after an upgrade would defeat the purpose of keeping a blog. The fourth option would be excessively time-consuming, and I’d have to play fill the blanks, which may not even be possible if important things were mangled.
So the only way to fix it would be to call on my inner geek. I had a quick look at the database contents in phpMyAdmin, and had a look at the database code in WordPress. I noted that the new tables created by WordPress 2.3 had the collation utf8_general_ci while the upgraded tables had the collation latin1_swedish_ci, and WordPress was asking MySQL to communicate in UTF-8. Armed with this, I downloaded a UTF-8 SQL dump of the database.
The rest of it was actually reasonably simple: I opened the SQL dump in TextWrangler, which correctly identified it as being UTF-8 with no “byte order mark”. I then found all references to the latin1 character set and replaced them with utf8. That would fix the issue with the upgraded tables having the wrong collation, but not the corrupted data.
So here’s the trick: you need to convert the UTF-8 representation of what was in in the database back to its old binary representation, and then interpret that as UTF-8. I tried to save the file as ISO Latin 1 (ISO 8859-1), but TextWrangler complained about unmappable characters. It turns out that MySQL’s latin1 is actually Windows Latin 1 (code page 1252). So I saved the file in this encoding, and then told TextWrangler to reinterpret it as UTF-8. It all went smoothly, and I had my data back! I could play the SQL dump back on the server, and everything is as it should be.
So what are the morals of the story?
- Always keep backups – particularly when you plan to do something drastic like an upgrade. Even though I didn’t actually need the backup this time, it was comforting to know it was there.
- Don’t trust upgrade/migration scripts – always check the result to ensure it’s actually what you want.
- Store data in appropriate formats – hacks will always come back and bite you. I shouldn’t have been using WordPress when I knew it was doing the wrong thing with my data.
- When you’re writing a migration script, try to ensure that it actually works! Then you don’t risk infuriating and/or losing your users.