File encoding detection improvements
By Edouard on June 10, 2011
Today I released an update to Web Translate It to improve the file encoding detection.
Internally, Web Translate It uses UTF-8, but we have to accept files in any encoding. This is needed because, while most files are UTF-8 encoded, Java .properties files are ISO 8859-1 encoded and Apple .strings files are sometimes UTF-16LE encoded. We also support uploading text files, which can have any encoding.
Under the hood, Web Translate It detects your file’s encoding, saves it to the database and converts your strings to UTF-8 before importing them. When you download your translated file, Web Translate It converts your strings back to your file’s original encoding.
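To make the round trip concrete, here is a minimal sketch in Python. It is not Web Translate It’s actual code: the function names `to_utf8` and `from_utf8` are hypothetical, and the detected encoding is assumed to be already known and stored next to the file.

```python
def to_utf8(raw: bytes, encoding: str) -> bytes:
    """Import step: decode the uploaded bytes, then re-encode them as UTF-8."""
    return raw.decode(encoding).encode("utf-8")


def from_utf8(stored: bytes, encoding: str) -> bytes:
    """Export step: turn the stored UTF-8 bytes back into the original encoding."""
    return stored.decode("utf-8").encode(encoding)


# A Java .properties style line, encoded in ISO 8859-1 (illustrative data).
uploaded = "title=Café\n".encode("iso-8859-1")
stored = to_utf8(uploaded, "iso-8859-1")       # what would be kept in the database
downloaded = from_utf8(stored, "iso-8859-1")   # what the user gets back
assert downloaded == uploaded
```

Saving the detected encoding is what makes the download step possible: once the strings are stored as UTF-8, nothing in the bytes themselves records what the original encoding was.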
In some rare cases, file imports were failing or strings were imported incorrectly because Web Translate It wasn’t able to recognise your file’s encoding.
To remedy this issue, I improved the file encoding detection strategy. Here’s what Web Translate It does now:
If your file contains a BOM (byte order mark), the BOM tells us which Unicode encoding your file uses: UTF-8, UTF-16LE, UTF-16BE, UTF-32, etc.
If your file doesn’t contain a BOM, it is either encoded in something other than a UTF encoding, or it is a UTF-8 file without a BOM. So we scan the content of your file and look for a hint. For instance, if the header of your Gettext .po file contains
"Content-Type: text/plain; charset=UTF-8\n"
then we assume your file is UTF-8 encoded.
If we can’t find any indication of the encoding of your file, a character detection algorithm is used (we use rchardet). rchardet takes a sequence of bytes from your file (of unknown encoding) and attempts to determine the encoding, so that we can read the text and import it to the database.
Finally, there is only so much we can do. If rchardet can’t reliably detect your file’s encoding, our fallback strategy is to assume you’re using UTF-8.
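The strategy above can be sketched in Python as a chain of checks. This is an illustration under assumptions, not the code Web Translate It runs: the names `detect_encoding`, `BOMS`, `CHARSET_HINT` and the `guess` parameter are hypothetical, and the statistical detector is passed in as a plain function rather than tied to a particular library.

```python
import codecs
import re
from typing import Callable, Optional

# Longest BOMs first: the UTF-32LE BOM starts with the same two bytes
# as the UTF-16LE BOM, so the order of the checks matters.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Matches hints such as the Gettext header "charset=UTF-8".
# Searching raw bytes only works for ASCII-compatible encodings.
CHARSET_HINT = re.compile(rb"charset=([A-Za-z0-9._-]+)")


def detect_encoding(
    raw: bytes,
    guess: Optional[Callable[[bytes], Optional[str]]] = None,
) -> str:
    # 1. A BOM identifies the Unicode encoding directly.
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name

    # 2. Look for an explicit hint inside the file.
    match = CHARSET_HINT.search(raw)
    if match:
        name = match.group(1).decode("ascii").lower()
        try:
            codecs.lookup(name)  # ignore placeholders such as "CHARSET"
            return name
        except LookupError:
            pass

    # 3. Ask a statistical detector, if one was supplied (assumption:
    #    it returns an encoding name, or None when it is not confident).
    if guess is not None:
        guessed = guess(raw)
        if guessed:
            return guessed.lower()

    # 4. Fall back to UTF-8.
    return "utf-8"
```

In Python, the third-party chardet package could play the role of the detector, for example `guess=lambda raw: chardet.detect(raw)["encoding"]`. The detector is deliberately consulted last, since a BOM or an explicit charset declaration is more trustworthy than a statistical guess.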
Detecting character encodings is tricky. If you’re having character encoding problems when uploading a file to Web Translate It, please don’t hesitate to open a support ticket and we’ll work with you to import your file correctly.