Root Cause Analysis - Multiple Distributed Cloud Customers experiencing traffic processing issues
Report Date: 2025-07-02
Incident Date: 2025-06-30
On 2025-06-30 at 06:08 UTC, the F5 Distributed Cloud Support team began responding to multiple monitoring alerts and customer reports of difficulty accessing various websites and applications. Upon investigation, the team identified issues with endpoint configurations within the Regional Edge (RE) network, which disrupted traffic processing. Customers were provided with a manual mitigation via K000152246 as an immediate workaround while troubleshooting continued.
The investigation found that the issue began after scheduled maintenance to upgrade the F5 Distributed Cloud Global Controller software to the latest version. During troubleshooting, the F5 Distributed Cloud team restarted the Configuration Propagator service. Following the restart, synchronization for the affected tenants resumed successfully, and service was fully restored for all customers by 19:10 UTC on 2025-06-30.
F5 will maintain ongoing monitoring of the environment to ensure the incident is fully resolved and to promptly identify and address any potential impacts or unrelated issues that may arise.
WHAT HAPPENED?
INCIDENT DETAILS
Item | Details |
---|---|
Start time of Service Event | 2025-06-30 06:08 UTC |
Mitigation available | 2025-06-30 09:12 UTC |
Conclusion of Service Event | 2025-06-30 19:10 UTC |
Event Mitigation time | 3 hours 4 minutes |
Event duration | 13 hours 2 minutes |
Impact | Several Distributed Cloud customers experienced service disruptions affecting their production environments deployed on the Distributed Cloud platform. |
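Root cause | During planned Global Controller maintenance, an upgraded software dependency caused memory contention in the Configuration Management Service. The Configuration Propagator service then propagated an incomplete vhost list, causing service degradation and customer impact. |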
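
The mitigation time and event duration in the table above follow directly from the listed timestamps. The short Python sketch below reproduces that arithmetic; the helper names `parse_utc` and `format_delta` are illustrative only and are not part of any F5 tooling.

```python
from datetime import datetime, timezone

FMT = "%Y-%m-%d %H:%M"


def parse_utc(stamp: str) -> datetime:
    """Parse a 'YYYY-MM-DD HH:MM' timestamp and mark it as UTC."""
    return datetime.strptime(stamp, FMT).replace(tzinfo=timezone.utc)


def format_delta(start: datetime, end: datetime) -> str:
    """Render the elapsed time between two timestamps as hours and minutes."""
    total_minutes = int((end - start).total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} hours {minutes} minutes"


# Timestamps taken from the incident details table.
event_start = parse_utc("2025-06-30 06:08")
mitigation_available = parse_utc("2025-06-30 09:12")
event_end = parse_utc("2025-06-30 19:10")

print(format_delta(event_start, mitigation_available))  # 3 hours 4 minutes
print(format_delta(event_start, event_end))             # 13 hours 2 minutes
```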
TIMELINE OF EVENTS
Date | Time (UTC) | Action |
---|---|---|
2025-06-30 | 04:00 | Start of planned upgrades to Global Controller |
2025-06-30 | 06:08 | Start time of Service Incident |
2025-06-30 | 07:00 | F5 worked on the mitigation strategy |
2025-06-30 | 07:30 | F5 worked on reproducing the issue in staging environment |
2025-06-30 | 09:12 | Customer self-mitigation steps were made available via document created on AskF5 |
2025-06-30 | 09:15 | F5 worked on creating and testing an automation script |
2025-06-30 | 09:30 | While service was down, F5 worked on restoring service for our customers using both automated and manual workarounds on the console. |
2025-06-30 | 12:15 | F5 continued to recover tenant configurations for our customers while working in parallel to resolve the underlying issue |
2025-06-30 | 18:50 | Configuration Propagator Service was restarted |
2025-06-30 | 19:10 | Resolution of Service Event |
IS THE SERVICE EVENT FULLY RESOLVED?
Yes, Distributed Cloud Traffic processing has been restored for all impacted customers. All Distributed Cloud Console core services are operational. The F5 Distributed Cloud team continues to monitor the situation to ensure service stability.
ROOT CAUSE
1. Detailed Analysis:
· Global Controller (GC) Configuration Management Service Memory Pressure: The Configuration Management Service experienced memory contention, leading to degraded performance and a failure to provide the complete set of configuration data to our Configuration Propagator service. The root cause of the memory contention was related to an upgraded software dependency within the Configuration Management Service. F5 performs rigorous testing in its test and staging environments, and this memory contention issue was not observed within those environments during testing.
· Partial synchronization by Configuration Propagator Service: The Configuration Propagator service proceeded with propagation despite receiving an incomplete virtual host (vhost) list from the Configuration Management Service.
2. Impact:
· Partial synchronization during maintenance caused services to be instantiated with an incomplete list of vhosts, resulting in a temporary service disruption.
We will be taking several measures to prevent this service event from recurring and to ensure that we are better prepared to react to and recover from similar scenarios more quickly.
Preventive Measures:
· F5 has identified the source of the memory contention in the Global Controller Configuration Management service and will be addressing it via service change maintenance.
· Adjust resource allocation for Global Controller Configuration Management service during maintenance to prevent memory pressure or performance degradation.
· Conduct stress testing to simulate high memory scenarios and optimize Global Controller Configuration Management service’s handling of requests under load.
Monitoring Enhancements:
· Implement closer monitoring of resource utilization for critical components like Global Controller Configuration Management service, especially during planned maintenance events, to proactively address issues.
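As an illustration of the monitoring enhancement above, the following is a minimal Python sketch of a memory utilization check that applies a stricter alert threshold during planned maintenance. It is not F5's monitoring implementation: the `MemorySample` type, the `memory_alerts` function, the service name string, and the 85% and 70% thresholds are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class MemorySample:
    """One memory reading for a service (hypothetical structure)."""
    service: str
    used_bytes: int
    limit_bytes: int


def memory_alerts(
    samples: Iterable[MemorySample],
    in_maintenance: bool,
    normal_threshold: float = 0.85,       # assumed value, for illustration
    maintenance_threshold: float = 0.70,  # assumed stricter value
) -> List[str]:
    """Return an alert message for every sample at or above the active threshold."""
    threshold = maintenance_threshold if in_maintenance else normal_threshold
    alerts = []
    for sample in samples:
        utilization = sample.used_bytes / sample.limit_bytes
        if utilization >= threshold:
            alerts.append(
                f"{sample.service}: memory at {utilization:.0%} "
                f"(threshold {threshold:.0%})"
            )
    return alerts


readings = [MemorySample("config-management", used_bytes=750, limit_bytes=1000)]
print(memory_alerts(readings, in_maintenance=False))  # no alert at the normal threshold
print(memory_alerts(readings, in_maintenance=True))   # alert at the stricter threshold
```

Lowering the threshold only for the maintenance window is one way to surface memory pressure earlier during upgrades without adding alert noise during normal operation.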
Fail-Safe Mechanisms:
· Improve Configuration Management service’s retry logic to ensure it verifies the completeness of configurations before proceeding with propagation.
· Introduce safeguards against incomplete configuration synchronization scenarios.
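The fail-safe measures above can be illustrated with a minimal Python sketch of a completeness check that gates propagation. This report does not describe how the services exchange or validate configuration data, so everything here is hypothetical: the `ConfigSnapshot` type, the idea of an expected-vhost manifest, and the function names are assumptions made for the example.

```python
import time
from dataclasses import dataclass
from typing import Callable, FrozenSet, Optional


@dataclass(frozen=True)
class ConfigSnapshot:
    """Hypothetical configuration payload handed to the propagator."""
    expected_vhosts: FrozenSet[str]  # vhosts the source of truth says must exist
    vhosts: FrozenSet[str]           # vhosts actually delivered in this payload


def is_complete(snapshot: ConfigSnapshot) -> bool:
    """A snapshot is complete when every expected vhost was delivered."""
    return snapshot.expected_vhosts <= snapshot.vhosts


def fetch_complete_snapshot(
    fetch: Callable[[], ConfigSnapshot],
    max_attempts: int = 3,
    delay_seconds: float = 0.0,
) -> Optional[ConfigSnapshot]:
    """Retry the fetch until a complete snapshot arrives or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        snapshot = fetch()
        if is_complete(snapshot):
            return snapshot
        if attempt < max_attempts:
            time.sleep(delay_seconds)
    return None


def propagate_if_complete(
    fetch: Callable[[], ConfigSnapshot],
    propagate: Callable[[ConfigSnapshot], None],
    max_attempts: int = 3,
) -> bool:
    """Propagate only a verified-complete snapshot; otherwise change nothing."""
    snapshot = fetch_complete_snapshot(fetch, max_attempts)
    if snapshot is None:
        return False  # the last known-good configuration stays in place
    propagate(snapshot)
    return True
```

In this sketch, an incomplete vhost list never reaches `propagate`, so the previously synchronized configuration would remain in service until a complete snapshot is available.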
F5® understands how important reliability of the Distributed Cloud Platform is for customers, and specifically how the F5 Distributed Cloud Global Controller is critical to your services. F5 will ensure the recommended changes in this document are canonized into our operational Methods of Procedure (MoP) moving forward. We are grateful you have chosen to partner with F5® for critical service delivery and are committed to evolving our platform and tooling to better anticipate and mitigate disruptions to Distributed Cloud Platform services.
APPENDICES
F5 Glossary
https://www.f5.com/services/resources/glossary