Multiple services are affected, service degradation

high | GitHub | Actions | Mar 5, 2026 16:35 | Duration: 2h 55m
api, configuration, capacity, routing
Configuration Error, Network / Routing, Capacity Issue, API Issue

Summary

On Mar 5, 2026, between 16:24 UTC and 19:30 UTC, Actions was degraded. During this time, 95% of workflow runs failed to start within 5 minutes, with an average delay of 30 minutes, and 10% of workflow runs failed with an infrastructure error. The cause was a set of Redis infrastructure updates being rolled out to production to improve resiliency. These updates introduced incorrect configuration changes into the Redis load balancer, causing internal traffic to be routed to an incorrect host and leading to two incidents.

Impact

major

Timeline

Mar 5, 2026 16:35

[investigating] We are investigating reports of degraded performance for Actions

via statuspage
+6m
Mar 5, 2026 16:41

[investigating] Actions is experiencing degraded availability. We are continuing to investigate.

via statuspage
+6m
Mar 5, 2026 16:47

[investigating] Webhooks is experiencing degraded availability. We are continuing to investigate.

via statuspage
+6m
Mar 5, 2026 16:52

[investigating] We are observing delays in queuing Actions workflow runs. We’re still investigating the causes of these delays.

via statuspage
+33m
Mar 5, 2026 17:25

[investigating] We have applied mitigations for connection failures across backend resources and we are observing a recovery in queueing Actions workflow runs.

via statuspage
+22m
Mar 5, 2026 17:48

[investigating] We are back to queueing Actions workflow runs at nominal rates and we are monitoring the clearing of queued runs during the incident.

via statuspage
+27m
Mar 5, 2026 18:15

[investigating] The queue of requested Actions jobs continues to make progress. Job delays are now approximately 6 minutes and continuing to decrease.

via statuspage
+43m
Mar 5, 2026 18:59

[investigating] Actions is now fully recovered.

via statuspage
+6m
Mar 5, 2026 19:05

[investigating] Actions is operating normally.

via statuspage
+12m
Mar 5, 2026 19:17

[investigating] Webhooks is operating normally.

via statuspage
+13m
Mar 5, 2026 19:30

[resolved] This incident has been resolved. Thank you for your patience and understanding as we addressed this issue. A detailed root cause analysis will be shared as soon as it is available.

via statuspage
+0m
Mar 5, 2026 19:30

[resolved] On Mar 5, 2026, between 16:24 UTC and 19:30 UTC, Actions was degraded. During this time, 95% of workflow runs failed to start within 5 minutes, with an average delay of 30 minutes, and 10% of workflow runs failed with an infrastructure error. The cause was a set of Redis infrastructure updates being rolled out to production to improve our resiliency. These updates introduced incorrect configuration changes into our Redis load balancer, causing internal traffic to be routed to an incorrect host and leading to two incidents.

We mitigated this incident by correcting the misconfigured load balancer. Actions jobs were running successfully starting at 17:24 UTC. The remaining time until we closed the incident was spent working through the queue of jobs.

We immediately rolled back the updates that were a contributing factor and have frozen all changes in this area until we have completed follow-up work from this incident. We are working to improve our automation to ensure incorrect configuration changes cannot propagate through our infrastructure. We are also working on improved alerting to catch misconfigured load balancers before they cause an incident. Additionally, we are updating the Redis client configuration in Actions to improve resiliency to brief cache interruptions.

via statuspage
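
The report does not say how the Redis client configuration in Actions will change. As a rough illustration of what "resiliency to brief cache interruptions" can mean in practice, the sketch below retries a cache read a few times with exponential backoff and then falls back to the source of truth. Everything here is hypothetical: `CacheUnavailable`, `read_through`, the injected `cache_get` and `load_from_source` callables, and the retry defaults are assumptions for illustration, not GitHub's implementation or any Redis library's API.

```python
import time
from typing import Callable, Optional


class CacheUnavailable(Exception):
    """Hypothetical error raised when the cache cannot serve a read."""


def read_through(
    key: str,
    cache_get: Callable[[str], Optional[str]],
    load_from_source: Callable[[str], str],
    retries: int = 3,
    base_delay: float = 0.05,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Read from the cache, tolerating short outages, else use the source."""
    for attempt in range(retries):
        try:
            value = cache_get(key)
            if value is not None:
                return value
            # A clean miss is not an outage: skip retries, go to the source.
            break
        except CacheUnavailable:
            # Back off 0.05s, 0.1s, ... between attempts; no sleep after the last.
            if attempt < retries - 1:
                sleep(base_delay * (2 ** attempt))
    return load_from_source(key)
```

Passing `sleep` as a parameter keeps the function testable without real delays. Whether falling back to the source is safe depends on how much load that source can absorb while the cache is unreachable.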

Lessons Learned

GitHub has experienced 39 incidents in the past year. This frequency suggests systemic reliability challenges that may warrant additional monitoring.

📊 Incidents related to api, configuration, capacity, routing have occurred 201 times across all providers in the past year. This is one of the most common failure categories in cloud infrastructure.

💡 This incident is categorized as: Configuration Error, Network / Routing, Capacity Issue, API Issue. Consider implementing preventive measures specific to this failure category.
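
One preventive measure for this failure category is a check that runs before a load balancer change is rolled out. The sketch below compares a proposed mapping of backend pools to hosts against an inventory of hosts known to belong to each pool, and reports anything that would route traffic to an unexpected host. The data shapes, pool names, and the `find_misrouted_backends` function are assumptions made for illustration; the incident report does not describe the real configuration format or tooling.

```python
from typing import Dict, List, Set


def find_misrouted_backends(
    proposed: Dict[str, List[str]],
    inventory: Dict[str, Set[str]],
) -> List[str]:
    """Return human-readable problems found in a proposed pool-to-hosts map."""
    problems: List[str] = []
    for pool, hosts in sorted(proposed.items()):
        allowed = inventory.get(pool)
        if allowed is None:
            problems.append(f"{pool}: unknown pool")
            continue
        if not hosts:
            problems.append(f"{pool}: no backends configured")
        for host in hosts:
            if host not in allowed:
                problems.append(f"{pool}: unexpected backend {host}")
    return problems


if __name__ == "__main__":
    # Hypothetical inventory and proposed change, for illustration only.
    inventory = {"actions-cache": {"redis-a1", "redis-a2"}}
    proposed = {"actions-cache": ["redis-a1", "redis-b7"]}
    for problem in find_misrouted_backends(proposed, inventory):
        print(problem)
```

A rollout pipeline could refuse to proceed whenever the returned list is non-empty, so a misrouted backend is caught before it receives traffic.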