Downtime post-mortem: Tuesday 1st December
By Edouard on December 1, 2009
Yesterday was the busiest day in Web Translate It’s history. We far exceeded our previous records for visits and page views. Many new visitors have been trying Web Translate It’s demo, which resulted in a fairly high, although sustainable, server load.
Around 5AM GMT the cron job that backs up our database started. This backup usually completes in less than 5 minutes. Because of the high server load we were experiencing, it was still running more than 30 minutes later, when another resource-hungry task started. With these heavy tasks running at the same time, our web server was overwhelmed and the service became unresponsive.
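One common way to keep two heavy scheduled tasks from piling up like this is to have each of them take a shared lock before starting, and skip the run if the lock is already held. The sketch below is illustrative only and is not Web Translate It's actual setup: it assumes a Unix-like server, and the names `try_acquire`, `run_exclusively` and the lock path are hypothetical.

```python
import fcntl


def try_acquire(lock_path):
    """Return an open file holding an exclusive lock, or None if it is already held."""
    handle = open(lock_path, "w")
    try:
        # LOCK_NB makes flock fail immediately instead of waiting for the lock.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle


def run_exclusively(lock_path, task):
    """Run task() only if no other heavy task holds the lock. Return True if it ran."""
    handle = try_acquire(lock_path)
    if handle is None:
        return False  # another heavy task is still running, so skip this run
    try:
        task()
        return True
    finally:
        fcntl.flock(handle, fcntl.LOCK_UN)
        handle.close()
```

With a guard like this, a task that starts while the backup is still running would return `False` and could be retried at its next scheduled time instead of adding to the load.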
Around 9AM GMT I noticed the service was not responding and rebooted the server, which immediately restored Web Translate It’s service.
I am really sorry about this downtime. This is the worst downtime in Web Translate It’s history. I strive to deliver high-quality, highly available software to my customers, and I failed to do so this morning.
I will take the following actions:
I will set up the monitoring system to send e-mail and SMS alerts when the server load is extremely high. At the moment I am only notified when the server goes down completely.
I will migrate Web Translate It to a beefier server. I was planning this migration for after the Christmas holidays, but I will do my best to order a new server and migrate the entire service before the beginning of next week.
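To illustrate the first action, a load-based alert can be as simple as comparing the load average against the number of CPU cores. This is a minimal sketch, not the monitoring system mentioned above: the threshold of 2.0 per core and the function names are assumptions chosen for the example, and the actual e-mail or SMS delivery is left out.

```python
import os


def load_per_core(load_1min, cores):
    """Normalise the 1-minute load average by the number of CPU cores."""
    return load_1min / max(cores, 1)


def should_alert(load_1min, cores, threshold=2.0):
    """Return True when the per-core load reaches the (assumed) alert threshold."""
    return load_per_core(load_1min, cores) >= threshold


if __name__ == "__main__":
    # os.getloadavg() is available on Unix-like systems only.
    load_1min, _, _ = os.getloadavg()
    cores = os.cpu_count() or 1
    if should_alert(load_1min, cores):
        # A real check would send an e-mail or SMS here.
        print(f"High load: {load_1min:.2f} on {cores} core(s)")
```

Run from cron every minute or so, a check like this would raise the alarm while the server is struggling, rather than only once it has stopped responding.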
Again, I am very sorry for any and all problems this has caused and hope that you will give us a chance to re-earn your trust and continued business.