
Service Outage Postmortem: April 28

April 28, 2023


On April 28, 2023, Semaphore experienced an incident that impacted its cloud CI/CD services. The incident started at 22:23 UTC and lasted until 01:43 UTC the following day (3 hours 20 minutes). During that time, job triggering on Semaphore was severely delayed, causing a partial loss of service for many users.

Our on-call SRE team was alerted by automated monitoring systems and initiated an investigation into the issue. 

The root cause was determined to be low-performing database queries that, under specific conditions, caused the production database CPU usage to spike to 100%. The problem was resolved by implementing a fix and cleaning up the job queue.

Timeline (all times in UTC)

  • 22:23: Incident start
  • 22:34: On-call SRE team alerted by automated monitoring systems
  • 23:05: Problem identified as database performance issue; issue escalated to additional engineers
  • 00:29: Root cause found: low-performing queries causing production database CPU usage to spike to 100% under…


