Root Cause Analysis for CDN Service impact affecting traffic processing
Report Date: 2025-06-28
Incident Date(s): 2025-06-16
On 2025-06-16, at approximately 08:43 UTC, the F5 Distributed Cloud team identified a traffic processing issue within the Content Delivery Network (CDN), accompanied by increased latency for customer requests.
A detailed investigation revealed that production CDN nodes encountered an error that resulted in a loss of connectivity between the CDN global controller and the CDN edge nodes, triggering a series of unexpected reboots and causing 5xx errors during traffic processing. Because the issue was transient, the CDN nodes began to recover automatically without manual intervention.
As more nodes came back online, the 5xx errors ceased, although reduced processing capacity led to temporarily elevated latency. By 12:05 UTC, all nodes had fully recovered, and the CDN service was restored to normal functionality.
INCIDENT DETAILS
| Start time of Service Event | 2025-06-16 08:43 UTC |
|---|---|
| Conclusion of Service Event | 2025-06-16 12:05 UTC |
| Event Duration | 3 hours, 22 minutes |
| Impact | Distributed Cloud customers using the CDN service experienced 5xx errors and increased latency for traffic processing. |

| Date | Time (UTC) | Action |
|---|---|---|
| 2025-06-16 | 08:57 | Customer reported to the SOC that the CDN service was returning 5xx errors and exhibiting increased latency for traffic processing. |
| 2025-06-16 | 09:46 | SOC escalated the case to Engineering to investigate the root cause and share findings with the customer. |
| 2025-06-16 | 12:05 | CDN service was fully restored and operating normally, with no elevated latency. |
IS THE SERVICE EVENT FULLY RESOLVED?
Yes, the issue is resolved, and the CDN service is fully operational.
ROOT CAUSE
The incident occurred because the CDN edge nodes lost connectivity to the CDN global controller while applying configuration updates, causing the configuration application transaction to fail. As a result, the edge nodes initiated a re-initialization process, during which they are temporarily unavailable for traffic processing. The loss of connectivity between the CDN edge nodes and the CDN global controller was ultimately caused by the failure of the ingress service in the global controller to complete the SSL handshake. This failure occurred because the SSL session cache was full, which prevented successful mTLS communication between the global controller and the edge nodes.
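For illustration only, the sketch below models the failure path described above from the edge node's side: an agent opens an mTLS connection to the controller's ingress to apply a configuration update, and a handshake failure (such as one caused by an exhausted SSL session cache on the ingress) aborts the transaction and triggers re-initialization. This is not F5's implementation; all names, addresses, and certificate paths are hypothetical.

```go
// Hypothetical sketch of the failure path: an edge-node agent applies a
// configuration update over mTLS. If the TLS handshake fails, the
// transaction is aborted and the node falls back to re-initialization,
// temporarily taking it out of traffic processing.
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"log"
	"os"
)

func applyConfigUpdate(controllerAddr string, cert tls.Certificate, caPool *x509.CertPool) error {
	cfg := &tls.Config{
		Certificates: []tls.Certificate{cert}, // client certificate for mTLS
		RootCAs:      caPool,                  // trust anchor for the controller
		MinVersion:   tls.VersionTLS12,
	}
	// The handshake happens during Dial; a handshake the ingress cannot
	// complete (e.g. SSL session cache exhausted) surfaces here as an error.
	conn, err := tls.Dial("tcp", controllerAddr, cfg)
	if err != nil {
		return fmt.Errorf("config transaction aborted, handshake with controller failed: %w", err)
	}
	defer conn.Close()
	// ... exchange the configuration payload over conn ...
	return nil
}

func main() {
	cert, err := tls.LoadX509KeyPair("edge-node.crt", "edge-node.key") // hypothetical paths
	if err != nil {
		log.Fatal(err)
	}
	caPEM, err := os.ReadFile("controller-ca.crt") // hypothetical path
	if err != nil {
		log.Fatal(err)
	}
	caPool := x509.NewCertPool()
	if !caPool.AppendCertsFromPEM(caPEM) {
		log.Fatal("failed to parse controller CA certificate")
	}

	if err := applyConfigUpdate("gc.example.internal:443", cert, caPool); err != nil {
		// In the incident, this failure path led the edge node to
		// re-initialize and temporarily stop processing traffic.
		log.Printf("falling back to node re-initialization: %v", err)
	}
}
```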
The issue was resolved through automated recovery mechanisms, requiring no manual intervention.
We will be taking several measures to prevent this service event from recurring and to ensure that we are better prepared to react to and recover from similar scenarios more quickly.
First, the F5 Distributed Cloud team upgraded the CDN Edge Nodes, which will help prevent similar events in the future.
Second, the F5 Distributed Cloud team deployed a hotfix on the CDN Global Controller for better SSL session cache management.
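The details of the hotfix are not described in this report. As a minimal sketch of one common approach to session cache management, the example below shows a fixed-capacity cache that evicts the least recently used session instead of rejecting new handshakes when the cache is full; all types, names, and sizes are illustrative assumptions.

```go
// Minimal sketch (not the actual hotfix): a bounded session store that
// evicts the least recently used entry rather than failing new sessions
// when capacity is reached.
package main

import (
	"container/list"
	"fmt"
	"sync"
)

// sessionCache is a fixed-capacity, LRU-evicting store keyed by session ID.
type sessionCache struct {
	mu       sync.Mutex
	capacity int
	order    *list.List               // most recently used at the front
	entries  map[string]*list.Element // session ID -> element in order
}

type cacheEntry struct {
	id    string
	state []byte // opaque serialized session state
}

func newSessionCache(capacity int) *sessionCache {
	return &sessionCache{
		capacity: capacity,
		order:    list.New(),
		entries:  make(map[string]*list.Element),
	}
}

// Put stores a session, evicting the least recently used entry when full
// rather than refusing the new session (the failure mode seen in this event).
func (c *sessionCache) Put(id string, state []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if el, ok := c.entries[id]; ok {
		c.order.MoveToFront(el)
		el.Value.(*cacheEntry).state = state
		return
	}
	if c.order.Len() >= c.capacity {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.entries, oldest.Value.(*cacheEntry).id)
	}
	c.entries[id] = c.order.PushFront(&cacheEntry{id: id, state: state})
}

func main() {
	cache := newSessionCache(2)
	cache.Put("edge-node-1", []byte("session-a"))
	cache.Put("edge-node-2", []byte("session-b"))
	cache.Put("edge-node-3", []byte("session-c")) // evicts edge-node-1 instead of failing
	fmt.Println("cached sessions:", len(cache.entries))
}
```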
Lastly, the F5 Distributed Cloud team is also enhancing the existing monitoring of SSL session failure logs for better detection.
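As a rough sketch of the kind of log-based detection described above, the example below counts TLS handshake failure patterns in a stream of ingress log lines and raises an alert past a threshold. The log format, match strings, and threshold are assumptions for illustration, not the actual monitoring pipeline.

```go
// Sketch of log-based detection: scan ingress log lines for handshake
// failure patterns and alert when the count exceeds a threshold.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	const threshold = 5 // assumed alert threshold for handshake failures
	failures := 0

	scanner := bufio.NewScanner(os.Stdin) // e.g. piped ingress logs
	for scanner.Scan() {
		line := scanner.Text()
		// Hypothetical log patterns for a failed handshake.
		if strings.Contains(line, "ssl handshake failed") ||
			strings.Contains(line, "session cache full") {
			failures++
		}
	}
	if err := scanner.Err(); err != nil {
		fmt.Fprintln(os.Stderr, "error reading logs:", err)
		os.Exit(1)
	}

	if failures > threshold {
		fmt.Printf("ALERT: %d TLS handshake failures observed\n", failures)
	} else {
		fmt.Printf("OK: %d TLS handshake failures observed\n", failures)
	}
}
```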
F5® understands how important reliability of the Distributed Cloud Platform is for customers, and specifically how critical the F5® Distributed Cloud Services / CONTENT DELIVERY NETWORK is to your services. F5 will ensure the recommended changes in this document are codified in our operational Methods of Procedure (MoP) moving forward. We are grateful you have chosen to partner with F5® for critical service delivery and are committed to evolving our platform and tooling to better anticipate and mitigate disruptions to Distributed Cloud Platform services.
APPENDICES
F5 Glossary