Zephyr Scale is inaccessible via UI
Incident Report for Zephyr Scale
Postmortem

The cause of this incident is the same as that of the previous two incidents with the same name.

A database cluster node started misbehaving: it accepted connections, but no queries would run, as if the user did not have permission to run them. This made the application inaccessible for some clients. Others may not have noticed the issue, because their requests were routed to a different database node.

During the investigation, we could not find the root cause. We are working with AWS to identify it, because we configure only the cluster as a whole and do not control individual nodes; database node replicas are managed by AWS. This investigation is still in progress.

In the meantime, we have identified the event that triggers this state, although the event itself is unrelated to the database. We are changing our deployment pipelines to stop these events from occurring and so prevent new incidents. In addition, monitoring and alerts specific to this scenario have been put in place.
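
This report does not describe how the new monitoring is implemented. As a purely illustrative sketch, a per-node check for this failure mode would need to run a real query rather than only open a connection, since the faulty node still accepted connections. The names below (`ProbeResult`, `probe_node`, the `connect` callable) are hypothetical and assume a Python DB-API style driver; they are not taken from the Zephyr Scale codebase.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ProbeResult:
    node: str
    connected: bool   # the node accepted the connection
    query_ok: bool    # the node actually executed a query
    error: str = ""

    @property
    def healthy(self) -> bool:
        # A node that connects but cannot run queries is NOT healthy.
        return self.connected and self.query_ok


def probe_node(
    node: str,
    connect: Callable[[str], Any],  # hypothetical: returns a DB-API connection
    query: str = "SELECT 1",
) -> ProbeResult:
    """Connect to a single node and run a trivial query against it."""
    try:
        conn = connect(node)
    except Exception as exc:
        return ProbeResult(node, connected=False, query_ok=False, error=str(exc))
    try:
        cur = conn.cursor()
        cur.execute(query)
        cur.fetchone()
        return ProbeResult(node, connected=True, query_ok=True)
    except Exception as exc:
        # Connection succeeded but the query failed: the state seen in this incident.
        return ProbeResult(node, connected=True, query_ok=False, error=str(exc))
    finally:
        conn.close()
```

Probing each node directly, rather than going through a shared cluster endpoint, matters here because a single bad node can otherwise be masked by requests that happen to land on healthy ones.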

We work hard to deliver the best experience to our customers, and we apologise for any inconvenience this may have caused.

Posted Mar 06, 2023 - 15:13 UTC

Resolved
This incident has been resolved.
Posted Mar 02, 2023 - 12:00 UTC