GitHub Availability Report: January 2025
In January, we experienced three incidents that resulted in degraded performance across GitHub services.
January 09 1:26 UTC (lasting 31 minutes)
On January 9, 2025, between 01:26 UTC and 01:56 UTC, GitHub experienced widespread disruption to many services, with users receiving 500 responses when trying to access various functionality. This was due to a deployment that introduced a query which saturated a primary database server. The error rate averaged 6% and peaked at 6.85% of update requests.
We mitigated the incident by identifying the source of the problematic query and rolling back the deployment. Our internal tooling and dashboards surfaced the data that helped us quickly identify the query. It took a total of 14 minutes from the time we engaged to the time we found the errant query.
However, we are investing in tooling to detect problematic queries before deployment, both to prevent issues like this one and to reduce our time to detection and mitigation in the future.
January 13 23:35 UTC (lasting 49 minutes)
Between 23:35 UTC on January 13, 2025 and 00:24 UTC on January 14, 2025, all Git operations were unavailable due to a configuration change related to traffic routing and testing that caused our internal load balancer to drop requests between services that Git relies upon.
We mitigated the incident by rolling back the configuration change.
We are improving our monitoring and deployment practices to reduce our time to detection and automated mitigation for issues like this in the future.
January 30 14:22 UTC (lasting 26 minutes)
On January 30, 2025, between 14:22 UTC and 14:48 UTC, web requests to github.com experienced failures (at peak the error rate was 44%), with the average successful request taking over three seconds to complete.
This outage was caused by a hardware failure in the caching layer that supports rate limiting. In addition, the impact was prolonged due to a lack of automated failover for the caching layer. A manual failover of the primary to trusted hardware was performed following recovery to ensure that the issue would not reoccur under similar circumstances.
As a result of this incident, we will be moving to a high availability cache configuration and adding resilience to cache failures at this layer so that requests can still be handled if similar circumstances arise in the future.
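To illustrate what resilience to cache failures can look like at a rate-limiting layer, here is a minimal sketch of a limiter that fails open when its cache is unreachable. This is not GitHub's implementation: the `FailOpenRateLimiter` class, the `CacheUnavailable` exception, and the in-memory `InMemoryCounter` stand-in for a real cache client are all hypothetical names introduced for this example, and the fixed-window algorithm is an assumption chosen for brevity.

```python
import time
from typing import Callable, Dict


class CacheUnavailable(Exception):
    """Raised by the (hypothetical) cache client when the backend is unreachable."""


class InMemoryCounter:
    """Stand-in for a real cache client; set healthy=False to simulate an outage."""

    def __init__(self) -> None:
        self.counts: Dict[str, int] = {}
        self.healthy = True

    def incr(self, key: str) -> int:
        if not self.healthy:
            raise CacheUnavailable(key)
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key]


class FailOpenRateLimiter:
    """Fixed-window rate limiter that keeps serving requests if the cache is down."""

    def __init__(
        self,
        incr: Callable[[str], int],
        limit: int,
        window_s: int = 60,
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._incr = incr          # increments a counter and returns its new value
        self._limit = limit        # maximum requests allowed per window
        self._window_s = window_s  # window length in seconds
        self._clock = clock        # injectable clock, useful for testing

    def allow(self, client_id: str) -> bool:
        # All requests in the same time window share one counter key.
        window = int(self._clock() // self._window_s)
        key = f"rl:{client_id}:{window}"
        try:
            count = self._incr(key)
        except CacheUnavailable:
            # Cache failure: allow the request instead of returning an error.
            return True
        return count <= self._limit


cache = InMemoryCounter()
limiter = FailOpenRateLimiter(cache.incr, limit=2, clock=lambda: 0.0)
results = [limiter.allow("user-1") for _ in range(3)]  # third call exceeds the limit
cache.healthy = False                                  # simulate the cache going down
during_outage = limiter.allow("user-1")                # request is still handled
```

Failing open trades rate-limit enforcement for availability while the cache is down. Whether that trade-off is acceptable depends on what the limiter protects, so a real system might instead fall back to a local, more conservative limit.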
Please follow our status page for real-time updates on status changes and post-incident recaps. To learn more about what we’re working on, check out the GitHub Engineering Blog.
