Singapore brief service disruption [SG Only]

Incident Report for Cirrus Assessment

Postmortem

Impact

On December 8th the Cirrus platform in the Singapore region had a brief service interruption, which caused significant additional stress for all candidates in one customer's exam. Fortunately, all candidates were able to complete their exam after reconnecting, and the impact was further reduced because the customer granted them extra time through the platform.

At 03:02 AM UTC (11:02 AM SGT) the Cirrus platform in the Singapore region experienced a one-minute downtime from which it automatically recovered. Many candidates still in the exam had to reconnect through Proctorio to re-enter it. Due to the load from reconnecting candidates, the platform's performance was degraded for a further three minutes, until 03:06 UTC.

Root Cause

Due to a “design” bug, candidates clicked the Close button an average of thirteen times after receiving an invigilator message that was sent to all candidates in a large exam. Processing all the resulting requests briefly overwhelmed the Cirrus servers.
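The preventative measures in this report mention buttons that do not disable after being clicked. As an illustration only, here is a minimal sketch of one way to guard against that class of bug: a wrapper that lets the first click through and ignores the rest. The names `fireOnce` and `closeMessage` are hypothetical, and this is not the fix released under CR-25281.

```typescript
// Hypothetical sketch: names and structure are illustrative, not Cirrus code.

type Handler = () => void;

// Wraps a handler so that only the first invocation runs it.
// Later invocations (e.g. repeated clicks on the same Close button) are
// counted but otherwise ignored, so they generate no additional requests.
function fireOnce(handler: Handler): { invoke: Handler; ignored: () => number } {
  let fired = false;
  let ignoredCount = 0;
  return {
    invoke: () => {
      if (fired) {
        ignoredCount += 1;
        return;
      }
      fired = true;
      handler();
    },
    ignored: () => ignoredCount,
  };
}

// Simulate one candidate clicking Close thirteen times,
// the average number of clicks given in this report.
let requestsSent = 0;
const closeMessage = fireOnce(() => {
  requestsSent += 1; // stands in for the request that dismisses the message
});
for (let click = 0; click < 13; click++) {
  closeMessage.invoke();
}
console.log(requestsSent, closeMessage.ignored()); // prints: 1 12
```

With a guard like this, thirteen clicks produce one request instead of thirteen. In a real UI the button would normally also be visibly disabled, so the candidate can see that the click was registered.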

Resolution

The system immediately reprovisioned servers and restored service within one minute. It then automatically scaled up to handle the reconnecting candidates.

Preventative Measures

  • [DONE] Manually increased capacity further for the upcoming exams so the platform can handle such an additional spike in load before the bug is fixed.
  • [DONE] Briefed the customer that if the bulk message is not essential to their exam flow, avoiding it during the exam window would help minimise any remaining risk.
  • [DONE] Improved QA procedures to report a bug as Critical if a button does not disable after it is clicked.
  • [DONE] Improved product development procedures so that code review and related checks also cover unchanged original functionality when a feature turns it into a bulk action or enables mass usage.
  • [DONE, Released Jan 20th, 2026] Improved invigilator message handling, including a UX/UI fix [CR-25281]
Posted Dec 18, 2025 - 14:52 CET

Resolved

On December 8th at 03:02 AM UTC (11:02 AM SGT) the Cirrus platform experienced a one-minute downtime from which it automatically recovered. Due to the load from reconnecting candidates, the platform's performance was degraded for a further three minutes, until 03:06 UTC.
While investigating the cause of the spike in load, we have decided to manually scale up the SG platform further so it can readily handle such a spike during upcoming exams (i.e. December 9th).
Posted Dec 08, 2025 - 03:00 CET