Encodings, Act N

You know what? I am having fun with encodings again!

Today is the big day: the latest development cycle of my project is finished and will be deployed. So I made a schedule of what needs to be done before deployment. Running the tests is up first. I was being more thorough than usual, cleaning up things that usually don’t get cleaned up. I had noticed before that some tests were failing. It didn’t seem like a big deal, because that happens a lot and I have not really been keeping the tests up to date. After all, it’s just data samples, and if more than 50% pass, that’s fine with me. Yeah, sure. Stupid.

Half an hour ago, my heart kinda stopped when I noticed that the failing tests all contained “Umlaute”. Not my first time. Of course I started checking whether it was the shell, or whether I had broken my code, or … I knew that the database actually contained the “Umlaute” correctly, because that was one of the checks I had remembered to do. So I tried to fix my bash (which only half works somehow; don’t ask me what’s wrong, and I am already an hour behind my self-imposed schedule right now!), to no avail.

Next up: panic. Check the database – again. Umlaute there. Not good. More panic. Must deploy today. Must deploy.

Then: find out how to encode Java strings … which doesn’t quite help either, because the shell still displays funny symbols and … finally it hit me: the last database was UTF-8 encoded. What if this one wasn’t? UTF-8 isn’t the only encoding that supports weird German letters. Strike. The database is ISO-8859-1 encoded, and with my new-found Java string conversion skills I could easily confirm that this had indeed been the problem: I had put all the “Umlaute” into my test code as Unicode characters, which of course didn’t fit too well with the database.
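For the record, here is roughly what such a mismatch looks like in Java. This is a minimal sketch, not my actual test code: the class name `UmlautMismatch` and its helper methods are made up for illustration, and I am assuming the bytes on one side are UTF-8 while the other side reads them as ISO-8859-1. The Umlaute are written as `\u` escapes so that the encoding of the source file itself cannot interfere.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: shows what happens when the writer and the
// reader of some bytes disagree about the encoding.
public class UmlautMismatch {

    // Encode as UTF-8, then (wrongly) decode the same bytes as ISO-8859-1.
    static String utf8ReadAsLatin1(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    // Encode and decode with the same charset: the text survives.
    static String latin1RoundTrip(String text) {
        byte[] latin1Bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1Bytes, StandardCharsets.ISO_8859_1);
    }

    // Render bytes as space-separated hex so the shell cannot mangle them.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String umlaut = "\u00e4"; // a-umlaut

        // The same character is two bytes in UTF-8 but one in ISO-8859-1.
        System.out.println(hex(umlaut.getBytes(StandardCharsets.UTF_8)));      // C3 A4
        System.out.println(hex(umlaut.getBytes(StandardCharsets.ISO_8859_1))); // E4

        // Mismatched decoding turns one character into two wrong ones.
        System.out.println(utf8ReadAsLatin1(umlaut).equals("\u00c3\u00a4"));   // true

        // Matching encodings on both sides keep the text intact.
        System.out.println(latin1RoundTrip(umlaut).equals(umlaut));            // true
    }
}
```

Comparing hex bytes instead of printing the strings is deliberate: a shell with its own encoding problems is exactly the kind of witness you cannot trust here.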

The Law of Encodings: encodings are never handled smartly and if there are more than two parts involved they will use different encodings by default!