New word and character count system using Unicode’s text segmentation standard

By Edouard on October 19, 2022

We’ve released an update to our word and character count system. The previous implementation invoked the Unix command wc as an external process, which was slow and applied the same counting rules to every language.

The new word count system is language dependent, more accurate, and 16 times faster than our previous implementation, according to our benchmarks. The speedup also makes file imports faster, since every segment created or modified during an import has its words counted.

Our new word counter implements the word and character segmentation rules described in Unicode Standard Annex #29 (UAX #29, formerly Unicode Technical Report #29). Segmentation is now language dependent: it uses regexes for most languages and dictionaries for CJK languages (Chinese, Japanese and Korean), which do not separate words with spaces.
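To give a rough idea of the two strategies, here is a minimal Python sketch. It is not our production code: the regex is a heavily simplified stand-in for the UAX #29 word rules, the function names are made up for this post, and the "dictionary" is a toy set of words used with greedy longest-match lookup, which is only one possible way to segment with a dictionary.

```python
import re

# Simplified stand-in for the UAX #29 word rules (an assumption for this sketch):
# a word is a run of letters/digits, optionally joined by an apostrophe or a
# period between two such runs ("don't", "3.14").
WORD_RE = re.compile(r"[^\W_]+(?:['’.][^\W_]+)*")


def count_words_regex(text):
    """Count words in space-delimited languages with a single regex pass."""
    return len(WORD_RE.findall(text))


def count_words_dictionary(text, dictionary):
    """Count words by greedy longest-match lookup in a toy dictionary.

    Characters that are not letters or digits (punctuation, spaces) are
    skipped. A character that starts no known word counts as one word.
    """
    max_len = max((len(word) for word in dictionary), default=1)
    count = 0
    i = 0
    while i < len(text):
        if not text[i].isalnum():
            i += 1
            continue
        # Try the longest candidate first, falling back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            if size == 1 or text[i:i + size] in dictionary:
                count += 1
                i += size
                break
    return count


if __name__ == "__main__":
    print(count_words_regex("Don't panic, it's 3.14!"))
    toy_dictionary = {"我", "喜欢", "学习", "中文"}  # hypothetical entries
    print(count_words_dictionary("我喜欢学习中文。", toy_dictionary))
```

A real dictionary for Chinese or Japanese contains many thousands of entries, and a production segmenter has to handle ambiguous splits, so treat the second function purely as an illustration of why a regex alone is not enough for these languages.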

We also implemented new rules for counting. These rules are described in our new documentation section about word and character counting.

In short, HTML tags do not count as translatable words or characters, but the values of translatable attributes inside those tags do count.
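The tag rule can be sketched in Python with the standard library's html.parser module. This is an illustration under stated assumptions, not our actual implementation: the set of attributes treated as translatable (alt, title, placeholder) is a guess made for the example, the class and function names are hypothetical, and the word regex is a simplification of the real segmentation rules.

```python
import re
from html.parser import HTMLParser

# Assumption for this sketch: which attributes hold translatable text.
TRANSLATABLE_ATTRS = {"alt", "title", "placeholder"}

# Simplified word pattern, not the full UAX #29 rule set.
WORD_RE = re.compile(r"[^\W_]+(?:['’.][^\W_]+)*")


class TranslatableTextCollector(HTMLParser):
    """Collect text nodes and translatable attribute values, ignoring tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names are never counted; only the values of
        # attributes listed in TRANSLATABLE_ATTRS are kept.
        for name, value in attrs:
            if name in TRANSLATABLE_ATTRS and value:
                self.chunks.append(value)

    def handle_data(self, data):
        self.chunks.append(data)


def count_translatable_words(html):
    """Return the number of translatable words in an HTML fragment."""
    collector = TranslatableTextCollector()
    collector.feed(html)
    collector.close()  # flush any buffered text
    return sum(len(WORD_RE.findall(chunk)) for chunk in collector.chunks)


if __name__ == "__main__":
    sample = '<p>Hello <img src="cat.png" alt="A black cat"> world</p>'
    print(count_translatable_words(sample))
```

In the sample, "Hello" and "world" come from text nodes and "A black cat" comes from the alt attribute, while the tag names and the src value are left out of the count.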