Impact
EU (Non-Premium) was down for 7 minutes, interrupting candidates taking exams. The Cirrus platform recovered automatically, and 6 minutes later service was fully restored.
Root Cause
Our autoscaling did not correctly handle a pattern of large waves of load arriving in quick succession, which overloaded the two remaining API containers (“the server”).
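To make the failure mode concrete, the toy simulation below shows how a purely reactive autoscaler that releases capacity between bursts is left with too few containers when the next wave arrives within the scale-up delay. It is a minimal sketch: all numbers, names, and scaling rules are assumptions for illustration, not the actual Cirrus scaling configuration.

```python
# Illustrative only: a toy model of a purely reactive autoscaler facing
# successive waves of load. All values below are assumptions for the sketch,
# not the real Cirrus configuration.

WAVES = [100, 0, 100, 0, 100]   # requests per minute, arriving in bursts
CAPACITY_PER_CONTAINER = 20     # assumed requests per minute one container can serve
SCALE_UP_DELAY = 2              # minutes before newly requested containers take traffic
MIN_CONTAINERS = 2              # the "two remaining API containers" baseline

def simulate(waves):
    containers = MIN_CONTAINERS
    pending = []                # (ready_at_minute, extra_containers)
    for minute, load in enumerate(waves):
        # Containers requested earlier only become available after the delay.
        containers += sum(n for ready, n in pending if ready == minute)
        pending = [(ready, n) for ready, n in pending if ready > minute]

        capacity = containers * CAPACITY_PER_CONTAINER
        status = "OK" if load <= capacity else "OVERLOADED"
        print(f"minute {minute}: load={load:>3} capacity={capacity:>3} "
              f"containers={containers} -> {status}")

        if load > capacity:
            # Reactive scale-up: request enough containers for the current load,
            # but they only help SCALE_UP_DELAY minutes later.
            needed = -(-load // CAPACITY_PER_CONTAINER) - containers
            pending.append((minute + SCALE_UP_DELAY, needed))
        elif load == 0:
            # Aggressive scale-down between waves: this is what leaves too few
            # containers when the next wave arrives.
            containers = MIN_CONTAINERS
            pending = []

simulate(WAVES)
```

Running the sketch, every wave finds only the baseline containers online, because capacity requested during one wave is released before the next wave hits.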
Resolution
Autoscaling recovered automatically after the outage once the load returned in a single wave, and the Cirrus platform's service was fully restored.
Preventative Measures
- DONE 2022-2023: For all production environments, all [1] critical services are redundant, autoscaling, and self-healing. This was validated through realistic load tests.
- DONE: Keep EU (Non-Premium) manually scaled up until the improved autoscaling [SYSOPS-852] was successfully tested and released.
- DONE: Extend the load test with a successive-wave scenario, among other things to test SYSOPS-852.
- DONE: Increased vigilance in load-test analysis regarding container versus instance scaling.
- DONE [31 October 2023 Release]: Improve autoscaling so it scales more robustly when load arrives in successive waves (see the illustrative sketch at the end of this section) - SYSOPS-852
- DONE: External communication during the incident and the follow-up was deemed satisfactory by impacted customers.
- DONE: Internal communication: revised CSOP110_Cirrus_Incident_Management_Procedure v1.7 (use Skype as it has better notifications than Slack; join as a listener even when focused on investigating).
- DONE: Timely handling of the wave of external ticket communication, through help from non-service-desk staff and a prepared incident response.
- Under Consideration: Improve error communication so candidates know where to go in case of issues - CR-15112
- Under Consideration: A waiting room for candidates during high usage peaks, when Cirrus needs a minute or two to scale up - CR-22056
[1] Including databases only for EU Premium.
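For the improved autoscaling referenced under SYSOPS-852, the sketch below illustrates one generic technique for coping with successive waves: a scale-down stabilization window that keeps the capacity built for the previous wave available for the next one. This is a hypothetical Python illustration of the general idea, not the change shipped in the 31 October 2023 release; all names and numbers are assumed.

```python
# Minimal sketch of a scale-down stabilization window, one common way to make
# autoscaling robust to successive waves. Illustrative only; the actual
# implementation in Cirrus may differ.

CAPACITY_PER_CONTAINER = 20   # assumed requests per minute per container
MIN_CONTAINERS = 2
STABILIZATION_MINUTES = 10    # keep capacity this long after the last peak

def desired_containers(load_history, current_containers):
    """Scale up immediately on the current load; scale down only when load
    has stayed low for the whole stabilization window."""
    current_load = load_history[-1]
    needed_now = max(MIN_CONTAINERS, -(-current_load // CAPACITY_PER_CONTAINER))
    if needed_now > current_containers:
        return needed_now                         # scale up right away

    window = load_history[-STABILIZATION_MINUTES:]
    needed_recently = max(
        MIN_CONTAINERS,
        -(-max(window) // CAPACITY_PER_CONTAINER),
    )
    # Never drop below what the recent peak required, so capacity built for
    # one wave is still there when the next wave arrives.
    return max(needed_now, needed_recently)

# Example: a burst, a quiet minute, another burst.
history = []
containers = MIN_CONTAINERS
for load in [100, 0, 100, 0, 0]:
    history.append(load)
    containers = desired_containers(history, containers)
    print(f"load={load:>3} -> containers={containers}")
```

In this sketch the container count stays at the level required by the first wave throughout the quiet minute, so the second wave is served without the scale-up lag that caused the outage.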