Open API System Outage - Commercial, HIPAA, Private and Gov Cloud
Incident Report for TrackVia
Postmortem

Incident Report for Red Hat 3scale

Postmortem

On May 24th of 2023 all the 3scale SaaS APIs suffered a major outage for 26 minutes.

Timeline:

  • May 24, 2023 09:03 UTC: rollout of a new version of the control plane that the 3scale SaaS uses to distribute its ingress layer configuration is completed.
  • May 24, 2023 09:04 UTC: alerts start firing for all 3scale SaaS APIs. Statistics show all requests to all APIs failing. Given the widespread outage the rollback procedure is started some minutes later to bring the control plane back to its previous version.
  • May 24, 2023 09:30 UTC - the rollback procedure is completed and service is recovered for all APIs.

Root Cause:

The root cause has been identified as being related to the rollout strategy used to deploy the new versions of the control plane.

Preventative Actions:

Mitigations include modifying this strategy appropriately and adding some control mechanisms to make the process more robust and avoid this from happening in the future.

Posted May 24, 2023 - 08:51 MDT

Resolved
On May 24th of 2023 all the 3scale SaaS APIs suffered a major outage for 26 minutes affecting TrackVia OpenAPI processing.
Posted May 24, 2023 - 03:03 MDT