Fixing up the jobs queue and releasing wti 2.5.0
By Edouard on November 15, 2021
Last week we released an important update to our queuing system. It makes updating your language files work faster and more consistently.
How updating language files works on WebTranslateIt
When a translation is made from the web interface, we don’t directly update the project’s language files. Updating files is a very resource-intensive job (it requires thousands of SQL queries), and it would take too long to regenerate a file at the moment someone requests to download it. So instead, files are pushed to a queue and then updated asynchronously.
We have had this system in place since 2009 and it has served us very well. We now have a small army of workers (60 workers working off jobs about files and 20 workers working off jobs to fetch translation suggestions, etc.) and they have worked off over 400 million jobs since the system’s inception. Not too bad!
Hitting the limits of the system
In the past few months, this system has shown its limits under heavy load, so much so that it became our first point of failure. Failure means that files were sometimes not being updated in a timely fashion. We tried adding more workers, but it made no difference.
The problem was due to a handful of large projects hosting very large files (containing over 20,000 segments per file) being translated and generating hundreds of long-running jobs each minute.
Updating files for such projects is slow because it involves pulling thousands of segments from the database and generating very large files. It can take over 3 minutes to update such a file, compared to a few seconds for a 4,000-segment file. So things quickly get out of hand.
To put things into perspective, 1 translation and 1 proofread on one of these files create two 3-minute-long jobs. Imagine a team of translators working on these files and you have a backlog of tens of thousands of jobs in the queue. And since they are working on the same file, these are actually duplicate jobs! As these large projects get translated, these jobs pile up and all our workers are busy working off the large jobs. No workers are available to work off other projects’ jobs, so those pile up as well. How can we fix that?
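To get a feel for how quickly such a backlog grows, here is a rough back-of-the-envelope sketch in Python. The 60 file workers and the 3-minute job duration come from the figures above. The arrival rate of 200 jobs per minute is only an assumed stand-in for "hundreds", and the function name is made up for illustration.

```python
def backlog_after(minutes, workers, job_minutes, arrivals_per_minute):
    """Return how many jobs are waiting after `minutes` of sustained load."""
    # Each worker completes 1 / job_minutes jobs per minute.
    capacity_per_minute = workers / job_minutes
    # The backlog only grows when jobs arrive faster than they are completed.
    growth_per_minute = max(0.0, arrivals_per_minute - capacity_per_minute)
    return growth_per_minute * minutes


# 60 workers and 3-minute jobs are from the post;
# 200 arrivals per minute is an assumption.
print(backlog_after(60, workers=60, job_minutes=3, arrivals_per_minute=200))
# prints 10800.0
```

Under these assumptions the workers can only complete 20 large jobs per minute, so adding a few more workers barely dents a queue that grows by 180 jobs every minute.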
How we fixed the queue
First, we focussed on making these jobs finish quicker. We improved the code and these 3-minute jobs now finish in about 45 seconds. Not a bad start. We deployed the changes and monitored the queue. We quickly noticed that while there was definitely an improvement, we were still experiencing the same issue of jobs piling up under heavy load. So, what else could we do to completely fix the issue?
We thought: “Could we rate-limit background jobs for these large projects? We could allow each file to be updated once per minute. Large projects wouldn’t even notice the difference and it would actually work better. And also: couldn’t we do it not only for large projects, but for everyone?”
So that’s the new Jobs architecture that we’re currently using. It works great so far. Whenever we need to update a file due to a translation or status change the system flags it as “to update” with an attribute on the File object. Then, each minute, a cron job runs, checks which files need to be updated and then pushes them to the queue. This way, a file is updated at most once per minute.
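The flag-then-sweep idea can be sketched in a few lines of Python. This is only an illustration of the pattern, not our production code: the LanguageFile class, the needs_update attribute and the function names are all hypothetical, and an in-memory deque stands in for the real jobs queue.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class LanguageFile:
    name: str
    needs_update: bool = False  # hypothetical name for the "to update" flag


def mark_for_update(file):
    """Called on every translation or status change; cheap and idempotent."""
    file.needs_update = True


def sweep(files, queue):
    """Meant to run once per minute (e.g. from cron): enqueue each flagged file once."""
    for file in files:
        if file.needs_update:
            # Clear the flag before enqueuing so that later edits flag the file again.
            file.needs_update = False
            queue.append(file.name)


files = [LanguageFile("en.yml"), LanguageFile("fr.yml")]
queue = deque()

# 500 edits to the same file within one minute...
for _ in range(500):
    mark_for_update(files[1])

# ...still produce a single job when the sweep runs.
sweep(files, queue)
print(list(queue))  # prints ['fr.yml']
```

The point of the design is that the number of jobs now depends on how many files changed in the last minute, not on how many edits were made to them.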
We immediately noticed that the load on our jobs server is now lower, the queue is almost always empty and there is never more than 1 minute of delay after a file update. This makes the system work more consistently for everyone.
wti 2.5.0
This update brings another sweet new feature to files. Since we now know when the file was last updated (by a translation or a status change) and we also know when the file was last generated, we can now tell if the file we’re currently serving is stale or not.
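In other words, the staleness check boils down to comparing two timestamps. A minimal sketch in Python, where the function and parameter names are assumptions made for illustration and not the actual names used by WebTranslateIt:

```python
from datetime import datetime, timedelta


def is_stale(last_updated_at, last_generated_at):
    """A served file is stale if its content changed after it was last generated."""
    return last_updated_at > last_generated_at


generated = datetime(2021, 11, 15, 12, 0, 0)

# A translation was saved 30 seconds after the file was generated: stale.
print(is_stale(generated + timedelta(seconds=30), generated))  # prints True

# The last change happened before the file was generated: fresh.
print(is_stale(generated - timedelta(seconds=30), generated))  # prints False
```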
We’ve updated our Project API to show if the files that are ready to be served are stale or fresh.
And we also updated our synchronization tool wti to make use of that feature.
If the files are fresh at the time you request a wti pull, then no problem: you can download them right away. If they are stale, then the system lets you know by displaying a tiny * right next to the language file name. And with this new system, you should never have to wait longer than 1 minute.
We’ll continue to refine this system further and have other updates planned to our wti gem. With that, thanks for reading!