On April 28th, 2023, Semaphore experienced an incident that impacted its cloud CI/CD services. The incident started at 22:23 UTC and lasted until 01:43 UTC on April 29th, a total of 3 hours and 20 minutes. During that time, job triggering on Semaphore was severely delayed, leading to a partial loss of service for many users.
Our on-call SRE team was alerted by automated monitoring systems and initiated an investigation into the issue.
The root cause was determined to be poorly performing database queries that, under specific conditions, drove the production database CPU usage to 100%. The problem was resolved by implementing a fix and cleaning up the jobs queue.
Timeline (all times in UTC)
- 22:23: Incident start
- 22:34: On-call SRE team alerted by automated monitoring systems
- 23:05: Problem identified as database performance issue; issue escalated to additional engineers
- 00:29: Root cause found: low-performing queries causing production database CPU usage to spike to 100% under…