Impact
EU (Non-Premium) was down for 7 minutes, interrupting candidates taking exams. The Cirrus platform recovered automatically, and 6 minutes later service was fully restored.
Root Cause
Our autoscaling did not correctly handle a pattern of large waves of load arriving in quick succession, which overloaded the two remaining API containers (“the server”).
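To make the failure mode concrete, the toy simulation below shows how a purely reactive autoscaler that releases capacity between bursts is left with too few containers when the next wave arrives within the scale-up delay. It is a minimal sketch: all numbers, names, and scaling rules are assumptions for illustration, not the actual Cirrus scaling configuration.

```python
# Illustrative only: a toy model of a purely reactive autoscaler facing
# successive waves of load. All values below are assumptions for the sketch,
# not the real Cirrus configuration.

WAVES = [100, 0, 100, 0, 100]   # requests per minute, arriving in bursts
CAPACITY_PER_CONTAINER = 20     # assumed requests per minute one container can serve
SCALE_UP_DELAY = 2              # minutes before newly requested containers take traffic
MIN_CONTAINERS = 2              # the "two remaining API containers" baseline

def simulate(waves):
    containers = MIN_CONTAINERS
    pending = []                # (ready_at_minute, extra_containers)
    for minute, load in enumerate(waves):
        # Containers requested earlier only become available after the delay.
        containers += sum(n for ready, n in pending if ready == minute)
        pending = [(ready, n) for ready, n in pending if ready > minute]

        capacity = containers * CAPACITY_PER_CONTAINER
        status = "OK" if load <= capacity else "OVERLOADED"
        print(f"minute {minute}: load={load:>3} capacity={capacity:>3} "
              f"containers={containers} -> {status}")

        if load > capacity:
            # Reactive scale-up: request enough containers for the current load,
            # but they only help SCALE_UP_DELAY minutes later.
            needed = -(-load // CAPACITY_PER_CONTAINER) - containers
            pending.append((minute + SCALE_UP_DELAY, needed))
        elif load == 0:
            # Aggressive scale-down between waves: this is what leaves too few
            # containers when the next wave arrives.
            containers = MIN_CONTAINERS
            pending = []

simulate(WAVES)
```

Running the sketch, every wave finds only the baseline containers online, because capacity requested during one wave is released before the next wave hits.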
Resolution
Autoscaling recovered automatically after the outage once the load returned in a single wave, and the Cirrus platform's service was fully restored.
Preventative Measures
- DONE 2022-2023: For all production environments, all [1] critical services are redundant, autoscaling, and self-healing. This was validated through realistic load tests.
- DONE: Keep EU (Non-Premium) manually scaled up until the improved autoscaling [SYSOPS-852] was successfully tested and released.
- DONE: Extend the load test with a successive-wave scenario, among other things to test SYSOPS-852.
- DONE: Increased vigilance in load-test analysis regarding container versus instance scaling.
- DONE [31 October 2023 Release]: Improve autoscaling so it scales more robustly when load arrives in successive waves (see the illustrative sketch at the end of this section) - SYSOPS-852
- DONE: External communication during the incident and the follow-up was deemed satisfactory by impacted customers.
- DONE: Internal communication: revised CSOP110_Cirrus_Incident_Management_Procedure v1.7 (use Skype as it has better notifications than Slack; join as a listener even when focused on investigating).
- DONE: Timely handling of the wave of external ticket communication, through help from non-service-desk staff and a prepared incident response.
- Under Consideration: Improve error communication so candidates know where to go in case of issues - CR-15112
- Under Consideration: A waiting room for candidates during high usage peaks, when Cirrus needs a minute or two to scale up - CR-22056
[1] Including databases only for EU Premium.
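For the improved autoscaling referenced under SYSOPS-852, the sketch below illustrates one generic technique for coping with successive waves: a scale-down stabilization window that keeps the capacity built for the previous wave available for the next one. This is a hypothetical Python illustration of the general idea, not the change shipped in the 31 October 2023 release; all names and numbers are assumed.

```python
# Minimal sketch of a scale-down stabilization window, one common way to make
# autoscaling robust to successive waves. Illustrative only; the actual
# implementation in Cirrus may differ.

CAPACITY_PER_CONTAINER = 20   # assumed requests per minute per container
MIN_CONTAINERS = 2
STABILIZATION_MINUTES = 10    # keep capacity this long after the last peak

def desired_containers(load_history, current_containers):
    """Scale up immediately on the current load; scale down only when load
    has stayed low for the whole stabilization window."""
    current_load = load_history[-1]
    needed_now = max(MIN_CONTAINERS, -(-current_load // CAPACITY_PER_CONTAINER))
    if needed_now > current_containers:
        return needed_now                         # scale up right away

    window = load_history[-STABILIZATION_MINUTES:]
    needed_recently = max(
        MIN_CONTAINERS,
        -(-max(window) // CAPACITY_PER_CONTAINER),
    )
    # Never drop below what the recent peak required, so capacity built for
    # one wave is still there when the next wave arrives.
    return max(needed_now, needed_recently)

# Example: a burst, a quiet minute, another burst.
history = []
containers = MIN_CONTAINERS
for load in [100, 0, 100, 0, 0]:
    history.append(load)
    containers = desired_containers(history, containers)
    print(f"load={load:>3} -> containers={containers}")
```

In this sketch the container count stays at the level required by the first wave throughout the quiet minute, so the second wave is served without the scale-up lag that caused the outage.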