File encoding detection improvements — Hello, charlock_holmes!

By Edouard on November 23, 2011

Today I released an update to WebTranslateIt to improve the file encoding detection.

Until today, our strategy was to fall back on the character encoding detector rchardet whenever we couldn’t reliably determine the character encoding of a file. The thing is, rchardet was an unmaintained Ruby library, so I forked it and maintained it myself.

It was all jolly good, except that rchardet contained many bugs. It often detected the wrong encoding, and in some rare cases this led to WebTranslateIt file imports stalling. File encoding detection is so complex that these bugs were very hard to fix.

Yesterday I stumbled upon charlock_holmes, a character encoding detector made by the fine people at GitHub. It’s actually a Ruby wrapper for ICU, a set of C/C++ and Java libraries providing Unicode and Globalization support for software applications. ICU is well maintained and widely used (notably by Google and Apple), and it works very well.
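To give a rough idea of what such a strategy can look like, here is a minimal sketch. It is not WebTranslateIt's actual code: the `guess_encoding` helper and the `BOMS` table are hypothetical names chosen for illustration. It assumes the cheap, reliable checks (byte order mark, valid UTF-8) run first, and that charlock_holmes is only consulted when those fail. The gem is loaded optionally, so the sketch still runs without it.

```ruby
# Hypothetical sketch of "detect only when we can't tell reliably".
begin
  require 'charlock_holmes' # optional gem wrapping ICU's charset detector
rescue LoadError
  # Gem not installed: only the cheap checks below are available.
end

# Byte order marks, longest first so UTF-32 is not mistaken for UTF-16.
BOMS = {
  "\xFF\xFE\x00\x00".b => 'UTF-32LE',
  "\x00\x00\xFE\xFF".b => 'UTF-32BE',
  "\xEF\xBB\xBF".b     => 'UTF-8',
  "\xFF\xFE".b         => 'UTF-16LE',
  "\xFE\xFF".b         => 'UTF-16BE'
}.freeze

# Returns an encoding name, or nil when nothing could be determined.
def guess_encoding(bytes)
  data = bytes.b # binary copy, so the checks compare raw bytes

  # 1. A byte order mark is the most reliable signal.
  BOMS.each { |bom, name| return name if data.start_with?(bom) }

  # 2. Bytes that form valid UTF-8 (including plain ASCII) are taken as UTF-8.
  return 'UTF-8' if data.dup.force_encoding(Encoding::UTF_8).valid_encoding?

  # 3. Otherwise fall back to statistical detection, if the gem is present.
  if defined?(CharlockHolmes)
    detection = CharlockHolmes::EncodingDetector.detect(data)
    return detection[:encoding] if detection && detection[:encoding]
  end

  nil
end
```

Calling `guess_encoding(File.binread(path))` on an uploaded file would then return a name such as `'UTF-8'`, or `nil` when the bytes carry no byte order mark, are not valid UTF-8, and the detector is unavailable or reports no encoding. The detector also reports a `:confidence` score, which a real importer might check before trusting the guess.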

WebTranslateIt’s file encoding detection now relies on a well-maintained, widely used library, and file imports should no longer stall because of encoding detection issues.

I hope you will find this improvement useful. Thank you for using WebTranslateIt.