We recently experienced a major outage, which affected all members. This incident report outlines what caused the outage, the steps we took to get Airtasker back up and the actions we’re taking to help prevent a similar outage recurring.
We apologise to everyone affected by this outage.
On 2nd April 2017, we experienced a 4-hour outage starting at 07:30 UTC. It took 28 minutes to identify the cause within our CloudFormation templates and 5 minutes to make the change we thought was required. It then took a further 180 minutes to identify the error that was stopping our earlier fix from being applied, 10 minutes to apply the correct fix, and 15 minutes to deploy it, allowing Airtasker to come back online.
The direct cause was a package install that began failing when a new server was being initialised into our load balancer. Specifically, the `apt-get install newrelic-infra -y` call was failing. This halted the creation of every new server. As our load balancer attempted to rotate instances, it was unable to spin up new ones, which eventually led to a situation where there were no instances left. At that point (07:30 UTC) we went down. It also stopped us from simply creating and adding instances manually, as they were unable to complete their initialisation.
Each time our load balancer adds a new EC2 instance to the stack or replaces an existing EC2 instance, it uses CloudFormation templates. One of those templates called for the New Relic Infrastructure package to be installed. This install started failing sometime after 04:00 UTC, which set off a chain reaction in which all of our instances disappeared (for various standard operational reasons) over the course of a few hours and could not be replaced.
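To illustrate the failure mode, here is a minimal sketch of the kind of boot-time provisioning script a CloudFormation template might pass to a new instance as user data. It is an illustrative assumption, not our actual template, but it shows why one failing package install can stop an instance from ever joining the load balancer.

```bash
#!/bin/bash
# Illustrative boot-time provisioning script (an assumption, not our actual
# template). CloudFormation runs this as EC2 user data when launching a
# new instance.
set -e   # abort on the first failing command

# Refresh the package index and install the New Relic Infrastructure agent
# from an external apt repository. If that repository or package is
# unreachable, this command fails...
apt-get update
apt-get install newrelic-infra -y   # <- the call that began failing

# ...and because of `set -e`, nothing after this point runs: the instance
# never completes initialisation and never joins the load balancer.
```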
Once our engineers identified the issue, they attempted to remove the offending call from within the CloudFormation template. Unfortunately, our naming conventions led to confusion about exactly which template needed to be updated, and this confusion was the primary reason for the length of the outage.
Our AMI needs to be both immutable and free of external dependencies. Each time a new instance is stood up, there should be no external requirement that can fail (as happened here).
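As a rough sketch of that principle, the external install moves out of the boot-time template and into the image build step, so an instance launches with everything it needs already baked in. The tooling shown (a shell provisioning script run by an image builder such as Packer during CI) is an assumption for illustration, not a description of our actual pipeline.

```bash
#!/bin/bash
# Illustrative AMI build-time provisioning script (an assumption, not our
# actual pipeline). Run once by the image builder (e.g. Packer) in CI, so
# the resulting AMI already contains every package an instance needs.
set -e

apt-get update
apt-get install newrelic-infra -y   # a network failure here breaks an
                                    # image build, not a production launch

# Boot-time user data then contains no package downloads or other external
# calls; services baked into the image simply start at instance launch.
```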
We also need to rename and better document the exact nature of our CloudFormation templates. Had their names accurately reflected their purpose, we would have resolved the outage in approximately 1.5 hours rather than 4.
A dedicated DevOps task group will plan an approach to hardening our deployment and operational activities to root out any additional external dependencies. This same group will also refactor our templates to ensure their simplicity and robustness moving forward.
We had wrongly assumed that our instances were immutable once built from the AMI produced in our Continuous Integration stage. This was not the case. We also held incorrect assumptions about which CloudFormation template did what. Both of these facts led to an outage, and neither the outage nor its length is acceptable.
We are committed to improving our technology and operational processes to prevent future outages. We appreciate your patience and we apologise for any inconvenience. Thank you for your continued support.