[EU Non-Premium] Service Interruption
Incident Report for Cirrus Assessment
Postmortem

Impact

EU (Non-Premium) was down for 7 minutes, interrupting candidates who were taking exams. The Cirrus platform recovered automatically, and service was fully restored 6 minutes later.

Root Cause

Our autoscaling did not correctly handle a pattern of large waves of load arriving in quick succession, which overloaded the two remaining API containers (“the server”).
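The failure mode described above can be illustrated with a toy model. This is a minimal sketch, not a description of the Cirrus autoscaler or of the change made in SYSOPS-852: the function name `simulate`, the capacity figures, the start-up delay, and the `scale_in_cooldown` parameter are all assumptions chosen for illustration. It assumes that containers are removed as soon as load drops and that new containers need a few ticks to start, so a second wave arrives while only the minimum number of containers is running.

```python
import math


def simulate(load_per_tick, capacity=100, min_containers=2, max_containers=10,
             scale_out_delay=3, scale_in_cooldown=0):
    """Toy autoscaler (hypothetical). Returns the ticks at which load exceeded capacity."""
    containers = min_containers
    pending = []    # (ready_tick, target_count) for scale-outs still starting up
    low_ticks = 0   # consecutive ticks during which fewer containers would suffice
    overloaded = []
    for tick, load in enumerate(load_per_tick):
        # Containers requested earlier come online once their start-up delay has passed.
        ready = [target for ready_tick, target in pending if ready_tick <= tick]
        pending = [(r, t) for r, t in pending if r > tick]
        if ready:
            containers = max(containers, max(ready))
        if load > containers * capacity:
            overloaded.append(tick)
        needed = min(max_containers, max(min_containers, math.ceil(load / capacity)))
        if needed > containers:
            low_ticks = 0
            pending.append((tick + scale_out_delay, needed))
        elif needed < containers:
            # Scale in only after the load has stayed low for longer than the cooldown.
            low_ticks += 1
            if low_ticks > scale_in_cooldown:
                containers = needed
                low_ticks = 0
        else:
            low_ticks = 0
    return overloaded


# Two large waves separated by a short quiet period (made-up numbers).
waves = [800] * 5 + [50] * 5 + [800] * 5
print(simulate(waves))                       # immediate scale-in between waves
print(simulate(waves, scale_in_cooldown=6))  # capacity is held through the gap
```

In this toy model, immediate scale-in leaves the second wave hitting the minimum container count, while holding capacity through the gap avoids the second overload. Whether the real fix works this way is not stated in this report.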

Resolution

After the outage, the load returned in a single wave, which the autoscaling handled correctly. The Cirrus platform's service was fully restored without manual intervention.

Preventative Measures

  • DONE 2022-2023: For all production environments, all [1] critical services were made redundant, autoscaling, and self-healing. This was validated through realistic load tests.
  • DONE: Kept EU (Non-Premium) manually scaled up until the improved autoscaling [SYSOPS-852] was successfully tested and released.
  • DONE: Extended the load test with a successive-wave scenario, among other things to test SYSOPS-852.
  • DONE: Increased vigilance in load test analysis on container vs instance scaling
  • DONE [31 October 2023 Release]: Improved autoscaling (scales more robustly when load arrives in successive waves) - SYSOPS-852
  • DONE: External communication during the incident and the follow-up were deemed satisfactory by impacted customers
  • DONE: Internal communication: Revised CSOP110_Cirrus_Incident_Management_Procedure v1.7 (use Skype as it has better notifications than Slack, join as listener even when focused on investigating)
  • DONE: Timely handling of the wave of external ticket communication, with help from non-service-desk staff and a prepared incident response
  • Under Consideration: Improve error communication so candidates know where to go in case of issues - CR-15112
  • Under Consideration: Waiting room for candidates during high usage peaks, when Cirrus needs a minute or two to scale up - CR-22056

[1] Including databases only for EU Premium.
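The successive-wave load test scenario mentioned in the measures above could be described as a per-tick load profile. The following is a minimal sketch under stated assumptions: the helper name `successive_waves`, its parameters, and the numbers are hypothetical and are not taken from the actual Cirrus load tests.

```python
def successive_waves(peak_users, wave_ticks, gap_ticks, waves, idle_users=0):
    """Hypothetical helper: per-tick target user counts for repeated load waves.

    Produces `waves` plateaus of `peak_users`, each lasting `wave_ticks` ticks,
    separated by quiet gaps of `gap_ticks` ticks at `idle_users`.
    """
    profile = []
    for i in range(waves):
        profile.extend([peak_users] * wave_ticks)
        if i < waves - 1:
            # Quiet period between waves; no trailing gap after the last wave.
            profile.extend([idle_users] * gap_ticks)
    return profile


# Example: two waves of 800 users, two ticks each, with a three-tick lull at 50 users.
print(successive_waves(800, wave_ticks=2, gap_ticks=3, waves=2, idle_users=50))
```

A load test tool would then drive the number of simulated candidates from this profile. Varying `gap_ticks` exercises how the platform behaves when the next wave arrives sooner or later after the previous one.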

Posted Nov 03, 2023 - 14:02 CET

Resolved
EU (Non-Premium) Cirrus platform is completely stable. As a precaution we keep the platform scaled up.

As part of our continuous improvement and SLA, our Post Mortem report will be published here within two weeks.

NOTE: Premium customers were not affected.
Posted Oct 26, 2023 - 12:20 CEST
Monitoring
Performance is back to normal. We are still investigating and monitoring.
Posted Oct 26, 2023 - 11:55 CEST
Update
We are continuing to investigate this issue.
Posted Oct 26, 2023 - 11:53 CEST
Investigating
We’re experiencing a service outage. Our engineering and operations team is currently working to restore the service. We apologize for any inconvenience. We’ll update you within 5 minutes.
Posted Oct 26, 2023 - 11:45 CEST
This incident affected: EU Assessment Management (Dublin) (Scheduling (EU), Marking (EU), Authoring - Library/Assessments (EU), Administration (EU), Auxiliary - Services (EU)) and EU Candidate Delivery (Dublin) (Candidate Delivery incl Proctoring API (EU), Invigilation (EU)).