F5 Distributed Cloud - Service Degradation - Global CDN nodes are in degraded state - INC-20250616-343

Incident Report for F5

Postmortem

F5® Distributed Cloud Services – Content Delivery Network

Root Cause Analysis for CDN Service impact affecting traffic processing

Report Date: 2025-06-28 Incident Date(s): 2025-06-16

EVENT SUMMARY

On 2025-06-16, at approximately 08:43 UTC, the F5 Distributed Cloud team identified a traffic processing issue within the Content Delivery Network (CDN), accompanied by increased latency for customer requests.

A detailed investigation revealed that production CDN nodes encountered an error, resulting in loss of connectivity between our CDN global controller and the CDN edge nodes, triggering a series of unexpected reboots. This resulted in 5xx errors during traffic processing. However, due to the transient nature of the issue, the CDN nodes began to recover automatically without technical intervention.

As more nodes came back online, the 5xx errors ceased, although reduced processing capacity temporarily increased latency. By 12:05 UTC, all nodes had fully recovered, and the CDN service was restored to normal functionality.

WHAT HAPPENED?

INCIDENT DETAILS

Start time of Service Event: 2025-06-16 08:43 UTC
Conclusion of Service Event: 2025-06-16 12:05 UTC
Event Duration: 3 hours, 22 minutes
Impact: Distributed Cloud customers using CDN service experienced 5xx errors and increased latency for traffic processing.

TIMELINE OF EVENTS

Date Time (UTC) Action
2025-06-16 08:57 A customer reported to the SOC that the CDN service was returning 5xx errors and showing increased latency for traffic processing.
2025-06-16 09:46 The SOC escalated the case to engineering to investigate the root cause and share it with the customer.
2025-06-16 12:05 The CDN service was fully restored and operating normally, without elevated latency.
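
The intervals implied by the incident details and timeline can be checked with a short calculation. The sketch below uses only the timestamps stated in this report; the function and constant names are illustrative and not part of any F5 tooling.

```python
from datetime import datetime, timezone

FMT = "%Y-%m-%d %H:%M"

# Timestamps (UTC) copied from the incident details and timeline above.
EVENT_START = "2025-06-16 08:43"
CUSTOMER_REPORT = "2025-06-16 08:57"
ESCALATION = "2025-06-16 09:46"
EVENT_END = "2025-06-16 12:05"


def utc(ts: str) -> datetime:
    """Parse a 'YYYY-MM-DD HH:MM' string as a UTC datetime."""
    return datetime.strptime(ts, FMT).replace(tzinfo=timezone.utc)


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed from start to end."""
    return int((utc(end) - utc(start)).total_seconds() // 60)


def format_duration(minutes: int) -> str:
    """Render a minute count in the report's 'H hours, M minutes' style."""
    hours, mins = divmod(minutes, 60)
    return f"{hours} hours, {mins} minutes"


print(format_duration(minutes_between(EVENT_START, EVENT_END)))    # 3 hours, 22 minutes
print(minutes_between(EVENT_START, CUSTOMER_REPORT))               # 14
print(minutes_between(CUSTOMER_REPORT, ESCALATION))                # 49
print(minutes_between(ESCALATION, EVENT_END))                      # 139
```

The three partial intervals (14, 49, and 139 minutes) sum to the 202 minutes of the stated event duration.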

IS THE SERVICE EVENT FULLY RESOLVED?

Yes, the issue is resolved, and CDN service is fully operational.

ROOT CAUSE

The incident occurred because the CDN edge nodes lost connectivity to the CDN global controller while applying configuration updates, causing the configuration application transaction to fail. As a result, the edge nodes initiated a re-initialization process which requires the edge nodes to temporarily be unavailable for traffic processing. The loss of connectivity between the CDN edge nodes and CDN global controller was ultimately caused by the failure of the ingress service in the global controller to complete the SSL handshake. This failure was due to the SSL session cache being full, which prevented successful mTLS communication between the global controller and the edge nodes.

RESOLUTION AND NEXT STEPS

RESOLUTION

The issue was resolved through automated recovery mechanisms, requiring no manual intervention.

NEXT STEPS: FUTURE EVENT PREVENTION

We are taking several measures to prevent this service event from recurring and to ensure that we can react to and recover from similar scenarios more quickly.

First, the F5 Distributed Cloud team upgraded the CDN edge nodes, which will help prevent such events in the future.

Second, the F5 Distributed Cloud team deployed a hotfix on the CDN global controller to improve SSL session cache management.

Lastly, the F5 Distributed Cloud team is enhancing the existing monitoring of SSL session failure logs for better detection.
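
As a rough illustration of what log-based detection of SSL session failures could look like, the sketch below counts handshake-failure log lines in a sliding time window and signals when the count reaches a threshold. Everything here is assumed for the example: the log marker string, the class name `HandshakeFailureMonitor`, and the window and threshold values are hypothetical and do not reflect F5's actual log format or monitoring stack.

```python
from collections import deque

# Hypothetical substring identifying a failed handshake in a log line.
FAILURE_MARKER = "ssl handshake failed"


class HandshakeFailureMonitor:
    """Sliding-window counter over handshake-failure log lines.

    Assumes log lines arrive with non-decreasing timestamps (in seconds).
    """

    def __init__(self, window_seconds: float, threshold: int) -> None:
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._failures: deque[float] = deque()

    def observe(self, timestamp: float, line: str) -> bool:
        """Record one log line; return True if an alert should fire."""
        if FAILURE_MARKER in line.lower():
            self._failures.append(timestamp)
        # Discard failures that have aged out of the window.
        cutoff = timestamp - self.window_seconds
        while self._failures and self._failures[0] <= cutoff:
            self._failures.popleft()
        return len(self._failures) >= self.threshold


monitor = HandshakeFailureMonitor(window_seconds=60, threshold=3)
events = [
    (0, "SSL handshake failed: session cache full"),
    (10, "SSL handshake failed: session cache full"),
    (20, "request completed"),
    (30, "SSL handshake failed: session cache full"),
]
for ts, line in events:
    if monitor.observe(ts, line):
        print(f"alert at t={ts}s")  # fires at t=30s for this sample
```

A windowed count rather than a running total keeps a handful of isolated failures from raising an alert, while a burst like the one in the sample still crosses the threshold quickly.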

CLOSING

F5® understands how important the reliability of the Distributed Cloud Platform is for customers, and specifically how critical the F5® Distributed Cloud Services Content Delivery Network is to your services. F5 will ensure the recommended changes in this document are incorporated into our operational Methods of Procedure (MoP) moving forward. We are grateful you have chosen to partner with F5® for critical service delivery and are committed to evolving our platform and tooling to better anticipate and mitigate disruptions to Distributed Cloud Platform services.

APPENDICES

F5 Glossary

https://www.f5.com/services/resources/glossary

Posted Jun 28, 2025 - 07:47 UTC

Resolved

The F5 Distributed Cloud team has confirmed that the issue with the Distributed Cloud CDN service has been resolved and that the service is operating normally, without elevated latency. All other services remain fully operational. If you have any questions or concerns, please feel free to reach out to our support team.
Posted Jun 17, 2025 - 06:30 UTC

Monitoring

The F5 Distributed Cloud team confirms that the CDN service has been fully restored and is operating normally, without elevated latency. We are actively monitoring the system to ensure ongoing stability and optimal performance. If you experience any issues, please don’t hesitate to contact our support team for assistance.
Posted Jun 16, 2025 - 12:08 UTC

Identified

The F5 Distributed Cloud team is actively working on the recovery of affected CDN nodes. Traffic disruption is no longer expected, although users may continue to experience higher latency during this period. We will share further updates on the progress as they become available. Please feel free to reach out to our support team for prompt assistance.
Posted Jun 16, 2025 - 10:43 UTC

Update

The F5 Distributed Cloud CDN services are experiencing service degradation that is impacting our CDN capabilities. Customers may experience intermittent traffic disruptions and increased latency as traffic is being rerouted to healthy CDN nodes during this period. We will continue to monitor the situation closely and share timely updates as additional information becomes available.
Posted Jun 16, 2025 - 09:45 UTC

Investigating

This advisory is to inform you that we are currently investigating reports of service degradation affecting CDN nodes.

We understand this may be affecting your operations and we are committed to providing transparent and timely updates as more information becomes available. While we gather more information, we recommend monitoring the progress on this site (https://www.f5cloudstatus.com/) for the latest update.

Investigation Status:
• We have identified potential service impact at 09:13 AM UTC;
• Our incident response team has been fully mobilized;
• Initial investigation and impact assessment efforts are underway.
Next Steps:
• A detailed incident notification will be provided within 30 minutes;
• Our teams are working to determine the root cause of the incident;
• We will share mitigation steps as soon as they become available.

We appreciate your patience as we work to resolve this situation. If you are experiencing a critical business impact, please contact our support team through your established channels.

Thank you for your continued support and trust in F5.
Posted Jun 16, 2025 - 09:15 UTC
This incident affected: Services (CDN Control Plane).