Earthli now uses the UTF8 encoding 🍾

Published by marco on

Earthli started out in 1997 on an Apache 1.2.x server running on a Windows desktop in my office cubicle in New York. I set it up at the time with PHP 2.x and MySQL 3.x.

History

The default encoding at the time was to use the latin1 character set, which mapped to Windows-1252 and roughly corresponds to ANSI, which is somewhat standardized, but is platform-dependent for some characters (hence the moniker).

Over the years, I migrated to PHP 3.x, then 4.x. The migration to 5.x was a lot of work because they changed the entire reference/copy semantics. From there, though the upgrade to 7.2.x was relatively easy. The jump to 8.x shouldn’t pose too many issues, either.[1]

The move through MySql versions to the current 5.7.33 and Apache versions to the current 2.4.29, all running on Ubuntu 18.04 was more or less without incident.

Fixing the Encoding

One last vestige of that original setup was the encoding. I’d opened an issue called Add support for UTF8 as the default encoding almost 8 years ago, but never found the time or inspiration to actually make the change.

I did end up making some workarounds for missing characters, but it was always less work to just add another one. See the section on “proprietary markup” below.

You have to be careful when changing encodings, making the sure that the tool that makes the conversion is aware of both the source and target encodings. You have to be dead-certain of the source encoding or the conversion will not succeed. You’ll end up with garbled characters.

The earthli database isn’t large by any enterprise standards, but it does contain almost 27k pictures and almost 4k articles[2], some of them quite long. There is a lot of text, created with various browsers and tools over almost a quarter of a century.

A search online for tips on converting to UTF-8 turned up a lot of useless advice for older versions or based on an inadequate understanding of encodings or the available tools.

Luckily, my database dump was already in UTF-8, but it contained SQL commands to use the latin1 collation instead of utf8 or, to be more precise, utf8mb4 (which is the version of UTF8 that supports 4-byte characters, e.g. 🧐). With utf8, you can store anything you want in MySql other than emojis. It’s unclear why they have two formats, but MySql has always had its idiosyncrasies.[3]

Converting earthli’s proprietary markup

The most fiddly bit I had to deal with was actually optional and involved converting some of the ad-hoc character combinations that I’d added to my markup language in order to use characters not supported in the latin1 encoding (e.g. I would write Slavoj Žižek as “Slavoj (Z-)i(z-)ek). I used sed to replace these character combinations. Obviously, no-one else uses these, but for the sake of posterity—and to serve as an example for other replacements—here they are:

sed \
-e 's@(a_)@ā@g' \
-e 's@(c,)@ç@g' \
-e 's@(C,)@Ç@g' \
-e 's@(c-)@č@g' \
-e 's@(C-)@Č@g' \
-e 's@(g-)@ğ@g' \
-e 's@(i-)@ı@g' \
-e 's@(I.)@İ@g' \
-e 's@(l-)@ł@g' \
-e 's@(L-)@Ł@g' \
-e 's@(s,)@ş@g' \
-e 's@(S,)@Ş@g' \
-e 's@(s-)@š@g' \
-e 's@(S-)@Š@g' \
-e 's@(u-)@ū@g' \
-e 's@(z-)@ž@g' \
-e 's@(Z-)@Ž@g' \
-e "s@(a\\\')@á@g" \
-e "s@(A\\\')@Á@g" \
-e "s@(C\\\')@Ć@g" \
-e "s@(n\\\')@ń@g" \
-e "s@(N\\\')@Ń@g" \
-e "s@(o\\\')@ő@g" \
-e "s@(O\\\')@Ő@g" \
earthli.sql > earthli_utf8.sql

I used @ as the separator character and had to escape the backslash twice (once for sed and once for bash). Also, you have to use a different output file because sed truncates the output before it does anything. If you use the same file, then you just end up with an empty file. Neat.

It’s not super-efficient, but it was done in a few seconds.

A bonus to doing these replacements for me is that a full-text search for “Zizek” or “Žižek” now finds all articles where I mention the Slovenian philosopher. That didn’t work before because MySql was indexing “(Z-)i(z-)ek” instead.

Working with the dump file

If you need to open the dump file, be aware that the lines are very long. vim does a good job of searching and editing and jumping to locations (e.g. +normal 15G25| jumps to line 15, column 25. nano can also find text (⌘ + W) pretty well and quickly. Both edit the text without a problem, once you’ve found the location you’re interested in.

Desktop editors (e.g. Visual Studio Code or Sublime Text) and differs (e.g. BeyondCompare) were mostly overwhelmed by both the file size and the line lengths.

Luckily, I only ended up needing to make one edit to avoid an error creating an index because the UTF-8 encoding considered “bugin” and “bügin” to be equivalent.

Commands

I made most of the following changes from the command line, but made one change using PHPMyAdmin.

Here’s what I ended up doing:

Dump the current database. MySQL dumps to UTF-8 by default and converts all text.
```
mysqldump --user=earthli -p --add-drop-table earthli > earthli.sql
```
Verify that the dump file is in UTF8 format. If it’s not, then you can use iconv to change the encoding (example from Wikipedia):
```
iconv -f iso-8859-1 -t utf-8 <infile> -o <outfile>
```
Search/replace the character set for each table with the following command:
```
sed -e 's@CHARSET=latin1@CHARSET=utf8mb4@g' earthli.sql > earthli_utf8.sql
```
Use PHPMyAdmin to change the default encoding for the database to
utf8mb4 in the Operations pane for the database.

Import the database.

cat earthli_utf8.sql | mysql --user=earthli -p earthli

In PHP and the configuration, I made the following changes:

Call mysqli_set_charset ($this->_connection, ‘utf8mb4’); after opening the connection to the database
Change the encoding in all generated pages by including the tag < meta charset=“utf-8”>
Change the default charset in the Apache config files php_value default_charset UTF-8 (it’s possible that this is already the default by now)

Conclusion

It took a bunch of research and preparation and nerves to dump, globally modify, and re-import a database that contains the last quarter-century of my writing. In the end, though, it wasn’t even that much work and it went smoothly. As always with encodings, it serves you well to understand exactly what you’re doing—it often saves a lot of steps.

And, now, because I can: ✊🏼.

[1] The zip functions will be removed in favor the object-based API.↩

[2] 2.6k of which are published … I write a lot of drafts that I never end up publishing. Some of them are quite long, as well and serve as notes for myself.↩

[3] It’s probably a legacy thing or a desire to provide an option that uses one less byte if you know you’re not going to want emojis?↩