Improving Web Translate It’s locales

By Edouard on 16 April 2010

When I started working on Web Translate It, I thought handling locales would be a piece of cake. Wrong I was.

Web Translate It’s locale builder knows in which countries a language is spoken and which scripts are used for each language, and it helps you create locales that make sense. It also knows the plural rules for most languages, which helps when translating sentences with plural forms.

In the past two weeks I made two micro-improvements to the locale builder that are worth mentioning: hyphened locale codes and default scripts.

Hyphened locale codes

Until two weeks ago, locales using subtags looked like this:

en_US
en_GB
fr_FR
pt_PT
pt_PT_Latn
ar_Arab
zh_Hans
...

These locale codes don’t comply with the IETF language tag specification, which says that subtags should be separated by hyphens instead of underscores:

en-US
en-GB
fr-FR
pt-PT
pt-PT-Latn
ar-Arab
zh-Hans
...
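
As a minimal sketch of the conversion between the two lists above, the underscore form can be rewritten into the hyphen form with a plain character substitution. The helper name hyphenate_locale is hypothetical and is not part of Web Translate It’s code or API; it only illustrates the idea.

```ruby
# Hypothetical helper (not Web Translate It's actual code): rewrites an
# underscore-separated locale code such as en_US into the hyphen-separated
# form en-US. Codes that already use hyphens are returned unchanged.
def hyphenate_locale(code)
  code.to_s.strip.tr("_", "-")
end

%w[en_US pt_PT_Latn zh_Hans fr-FR].each do |code|
  puts "#{code} -> #{hyphenate_locale(code)}"
end
```

Because the substitution leaves hyphened codes untouched, the same helper can be applied to input in either style.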

The API endpoints have been updated and now recognise both hyphened and underscored locale codes, so this is not a big deal unless you use Ruby YAML files with locale subtags, in which case the root key has changed.

So, instead of getting:

zh_CN:
  contact: 联系

You will get:

zh-CN:
  contact: 联系

Of course, nothing has changed if you don’t use subtags, such as the country subtag, in your locales. Hyphens really are the preferred syntax, and Ruby on Rails also uses hyphened locale subtags.
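
If you keep local copies of older YAML files with underscored root keys, something like the following sketch could bring them in line with the new format. It assumes the file has the usual Rails layout (one locale code as each top-level key), and the helper name hyphenate_root_keys is made up for this example.

```ruby
require "yaml"

# Hypothetical helper: returns the YAML document with underscores in its
# top-level (locale) keys replaced by hyphens. Nested keys are left as-is.
def hyphenate_root_keys(yaml_text)
  data = YAML.load(yaml_text)
  renamed = data.each_with_object({}) do |(locale, translations), result|
    result[locale.to_s.tr("_", "-")] = translations
  end
  YAML.dump(renamed)
end

old_file = "zh_CN:\n  contact: 联系\n"
puts hyphenate_root_keys(old_file)
```

Only the root key is renamed here, so translation keys that legitimately contain underscores are not affected.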

Default scripts

A language can be written in different scripts. For some languages, the script depends on the country where the language is spoken. For example, Chinese is written in Simplified Han in China, but in Traditional Han in Taiwan.

Another example: Kashmiri, spoken in India and Pakistan, can be written in Arabic, Devanagari or Latin.

So the big question when you just select “Chinese” or “Kashmiri” in Web Translate It is: which one do you mean?

Knowing the script is important for Web Translate It, because it relies heavily on the script you use for two important features: the translation memory and handling right-to-left scripts in the translation interface.

To fix this, I set up a default, invisible script for every locale. When you choose Chinese (zh), it assumes you want zh-Hans (Chinese, Simplified Han). When you choose zh-TW (Chinese, Taiwan), it assumes you want zh-TW-Hant (Chinese, Taiwan, Traditional Han). If you use ar (Arabic), it assumes you mean ar-Arab (Arabic, Arabic script), and so on.
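
The lookup described above could be sketched roughly as follows. The table contains only the three examples given in this post, and both DEFAULT_SCRIPTS and with_default_script are hypothetical names; the real data and code behind Web Translate It are not shown here. The sketch tries the full locale first, then falls back to the language alone, and keeps the subtag order used in this post (script last).

```ruby
# Illustrative defaults only, taken from the examples in this post.
DEFAULT_SCRIPTS = {
  "zh"    => "Hans",
  "zh-TW" => "Hant",
  "ar"    => "Arab"
}.freeze

# A script subtag is four letters, first one capitalised (e.g. Hans, Latn).
SCRIPT_SUBTAG = /\A[A-Z][a-z]{3}\z/

# Hypothetical helper: appends the default script to a locale code unless
# the code already carries a script or no default is known for it.
def with_default_script(locale)
  subtags = locale.split("-")
  return locale if subtags.any? { |subtag| subtag.match?(SCRIPT_SUBTAG) }

  script = DEFAULT_SCRIPTS[locale] || DEFAULT_SCRIPTS[subtags.first]
  script ? "#{locale}-#{script}" : locale
end

%w[zh zh-TW ar zh-Hans].each do |locale|
  puts "#{locale} -> #{with_default_script(locale)}"
end
```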

These default scripts also allow us to “fold” locales by language and script to serve more relevant suggestions. For example, if you choose fr-FR (French, France), it folds the locale and looks up fr-Latn, which extends the results to locales such as fr-CA, fr-BE, etc. This is now used when searching for suggestions in the global translation memory.
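
To make the folding idea concrete, here is a small sketch under the same assumptions as the rest of this post: the subtag order is language, then optional country, then optional script, and the default script is passed in by the caller. The name fold_locale is hypothetical and this is not the actual implementation.

```ruby
# Hypothetical helper: folds a locale down to "language-Script", dropping
# the country. If the locale carries no script subtag, the caller-supplied
# default script is used instead.
def fold_locale(locale, default_script)
  subtags = locale.split("-")
  language = subtags.first
  script = subtags.drop(1).find { |subtag| subtag.match?(/\A[A-Z][a-z]{3}\z/) }
  "#{language}-#{script || default_script}"
end

# Locales that fold to the same key can share translation memory suggestions.
locales = %w[fr-FR fr-CA fr-BE]
folded = locales.group_by { |locale| fold_locale(locale, "Latn") }
puts folded.inspect
```

Grouping by the folded key is one simple way to find every locale whose suggestions are relevant to the one being translated.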