Recent repo outages

Gemfury had two unplanned outages on Tuesday, April 20, and Monday, May 3. Not only were these disruptive service outages, but they were also failures in preparation and response around events that we could have anticipated and/or prevented. We owe an explanation and an apology.

Tuesday, April 20

Around 17:33 UTC, customers began to see increased response times. By 17:40 UTC, a significant number of requests were beginning to time out or fail with a 5XX status response.

Investigation showed that at 17:33 UTC, our repository processes began to be overwhelmed by the volume of incoming requests; active connections and their associated state accumulated until the processes hit the container memory limit. The resulting page swapping led to further slowdown and request pile-up, until the repo processes were killed and restarted by our container platform.

This sequence of events repeated itself after each restart and was exacerbated by our platform provider (PaaS) throttling restarts with exponential backoff. As more repo processes were held back awaiting restart, the remaining running processes were overwhelmed even faster, until all processes had entered the restart grace period. In the meantime, customers were served 503 (Service Unavailable) responses.
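
For readers unfamiliar with the term, exponential backoff means the platform waits longer before each successive restart of a crashing process, roughly as in the sketch below. The base delay and cap are made-up values for illustration, not our provider's actual settings.

    # Rough illustration of restart throttling with exponential backoff.
    # The base delay and cap are made-up values, not our provider's settings.
    def restart_delay(consecutive_crashes, base=10, cap=300):
        """Seconds a crashed process waits before its next restart attempt."""
        return min(base * (2 ** consecutive_crashes), cap)

    for n in range(6):
        print(f"crash #{n}: wait {restart_delay(n)}s before restarting")

Each crash roughly doubles the wait, which is why a fleet of crashing processes spends progressively more of its time out of rotation.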

Resolution

Restarting and scaling up the service resolved the issue. We have not received a clear answer from our platform provider as to why our repository service failed to autoscale in response to the request pile-up, or how we can automatically handle similar anomalies in the future.
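
To make that concrete, the kind of automatic handling we have in mind is sketched below: an external watchdog that probes the repo service and requests extra capacity before memory pressure turns into a restart loop. This is only an illustration; the health URL, thresholds, and scale_out() helper are hypothetical placeholders rather than our platform provider's actual API.

    # Illustrative watchdog sketch -- the endpoint, thresholds, and scale_out()
    # helper are hypothetical placeholders, not a real platform API.
    import time
    import urllib.error
    import urllib.request

    HEALTH_URL = "https://repo.example.internal/health"  # hypothetical health endpoint
    LATENCY_LIMIT = 2.0    # seconds before a probe counts as degraded
    FAILURE_LIMIT = 3      # consecutive bad probes before requesting capacity

    def probe(url, timeout=5.0):
        """Return the response time in seconds, or None on timeout/5XX."""
        start = time.monotonic()
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.status >= 500:
                    return None
        except (urllib.error.URLError, OSError):
            return None
        return time.monotonic() - start

    def scale_out():
        """Placeholder: ask the platform for more repo processes."""
        print("requesting additional repo capacity")

    def watch(interval=30):
        bad_probes = 0
        while True:
            elapsed = probe(HEALTH_URL)
            if elapsed is None or elapsed > LATENCY_LIMIT:
                bad_probes += 1
            else:
                bad_probes = 0
            if bad_probes >= FAILURE_LIMIT:
                scale_out()
                bad_probes = 0
            time.sleep(interval)

    if __name__ == "__main__":
        watch()

The point is to react to slowdowns and 5XX responses directly, rather than waiting for memory exhaustion and the restart cycle described above.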

Monday, May 3

At 12:00 UTC, our main certificate for *.fury.io domains expired, leading to invalid-certificate failures for all requests to the dashboard, repo, git, and other endpoints.

Resolution

Generating and replacing the certificate resolved the issue.

Incident management & response

A significant failure during these incidents was our substandard preparation and response. Not having well-tested tools and processes in place kept us from anticipating these issues, being notified promptly, and resolving them quickly.

We have since upgraded our incident management software and implemented a process to escalate unanswered issues to other members of the team. We’ve also upgraded our uptime monitor to perform a certificate check with a preemptive expiration warning.
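
For illustration, a preemptive expiration check of the kind our monitor now performs can be quite small. The sketch below uses Python's standard library; the hostname and the three-week warning window are example values rather than our exact configuration.

    # Sketch of a preemptive certificate-expiration check. The hostname and
    # warning window are examples, not our monitor's exact configuration.
    import datetime
    import socket
    import ssl

    def days_until_expiry(hostname, port=443):
        """Connect over TLS and return the number of days before the cert expires."""
        context = ssl.create_default_context()
        with socket.create_connection((hostname, port), timeout=10) as sock:
            with context.wrap_socket(sock, server_hostname=hostname) as tls:
                cert = tls.getpeercert()
        not_after = datetime.datetime.strptime(cert["notAfter"], "%b %d %H:%M:%S %Y %Z")
        return (not_after - datetime.datetime.utcnow()).days

    if __name__ == "__main__":
        remaining = days_until_expiry("repo.fury.io")
        if remaining < 21:  # warn three weeks ahead of expiration
            print(f"certificate expires in {remaining} days -- renew now")
        else:
            print(f"certificate OK, {remaining} days remaining")

Run on a schedule, a check like this turns an expiring certificate from an outage into a routine renewal task.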

We’re sorry

During long periods of stability, you’ve come to trust Gemfury as a core part of your development infrastructure, and when our service stability and incident response are subpar, it undermines that trust. We apologize for this and for how it may have affected your work. We will do better.