Delay in config propagation
Incident Report for F5 Distributed Cloud
Postmortem

F5® Distributed Cloud Services – Load Balancer

Root Cause Analysis for Delay in configuration propagation to Load Balancers

Report Date: 2024-04-08

Incident Date(s): 2024-04-01 – 2024-04-02

EVENT SUMMARY

On 2024-04-02 at 05:43 UTC, the F5 Distributed Cloud support team received the initial customer report of a delay in change propagation to load balancers. Upon receiving the report, the team initiated an investigation and determined that the issue was limited to the configuration propagation of new load balancers and modifications to existing ones. There was no impact on traffic processing.

Detailed analysis confirmed that the issue began on 2024-04-01 at 18:30 UTC, when the Distributed Cloud platform experienced an influx of anomalous traffic that affected load balancer configuration propagation from the Global Controller and Regional Edges.

To address the issue, the F5 Distributed Cloud team deployed a hotfix to the Global Controller and Regional Edges. Post-fix validation confirmed the propagation delay was resolved on 2024-04-02 at 19:50 UTC.

The service degradation lasted 1 day, 1 hour, and 20 minutes.

WHAT HAPPENED?

INCIDENT DETAILS

Start time of Service Event: 2024-04-01 18:30 UTC
Conclusion of Service Event: 2024-04-02 19:50 UTC
Event duration: 1 day, 1 hour, and 20 minutes
Impact: Customers encountered delays in the application of load balancer configurations when creating a new configuration or modifying an existing one.
Root cause: The F5 Distributed Cloud Load Balancer service experienced a partial degradation due to a high volume of anomalous requests in network traffic coming from non-trusted systems.

TIMELINE OF EVENTS

DATE        TIME (UTC)  ACTION
2024-04-01  18:30       The F5 Distributed Cloud Team observed configuration propagation delays.
2024-04-02  06:01       The F5 Distributed Cloud Team started receiving reports from customers of delays in executing configuration changes on the load balancer.
2024-04-02  06:10       The F5 Distributed Cloud Team sought additional information from the customer to understand the nature of the issue.
2024-04-02  07:43       The F5 Distributed Cloud Team, after gathering the required information, escalated the issue to the internal engineering team.
2024-04-02  08:23       The F5 Distributed Cloud Team acknowledged the existence of a configuration push issue at the backend and commenced an investigation into the root cause.
2024-04-02  11:15       The F5 Distributed Cloud Team identified a high configuration queue within the system, which adversely affected the propagation of configurations updated by customers on the console.
2024-04-02  11:50       The F5 Distributed Cloud Team discovered an issue with the Regional Edges and Global Controller and began deploying a hotfix to address it.
2024-04-02  17:00       The F5 Distributed Cloud Team completed the hotfix deployment to the Global Controller and continued updating the Regional Edges.
2024-04-02  19:50       The F5 Distributed Cloud Team completed all hotfix deployment tasks for the Regional Edge (RE) clusters, and traffic was observed to be processing normally. The team closely monitored the situation to ensure that all load balancer propagation issues had been resolved. End of service event.
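
The elapsed time at each milestone can be checked directly from the timeline timestamps. The following is a minimal sketch for readers who want to verify the stated event duration; the milestone labels and the names `milestones`, `elapsed_since_start`, and `describe` are illustrative choices and are not part of any F5 tooling. The timestamps themselves are copied from the timeline above.

```python
from datetime import datetime, timedelta

# Timestamps (UTC) copied from the timeline; the short labels are
# paraphrases chosen for this sketch.
FMT = "%Y-%m-%d %H:%M"
milestones = [
    ("propagation delays begin", "2024-04-01 18:30"),
    ("customer reports received", "2024-04-02 06:01"),
    ("escalated to engineering", "2024-04-02 07:43"),
    ("high configuration queue identified", "2024-04-02 11:15"),
    ("hotfix deployment begins", "2024-04-02 11:50"),
    ("Global Controller hotfix complete", "2024-04-02 17:00"),
    ("Regional Edge hotfix complete", "2024-04-02 19:50"),
]


def elapsed_since_start(events):
    """Return (label, timedelta since the first event) for every event."""
    start = datetime.strptime(events[0][1], FMT)
    return [(label, datetime.strptime(ts, FMT) - start) for label, ts in events]


def describe(delta):
    """Format a timedelta as days, hours, and minutes."""
    total_minutes = int(delta.total_seconds()) // 60
    days, rem = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(rem, 60)
    return f"{days}d {hours}h {minutes}m"


offsets = elapsed_since_start(milestones)
for label, delta in offsets:
    print(f"{describe(delta):>10}  {label}")
```

The final row corresponds to the end of the service event, so its offset is the total event duration reported in the incident details.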

IS THE SERVICE EVENT FULLY RESOLVED?

Yes, the service degradation is resolved, and the load balancer service is fully operational.

ROOT CAUSE

To stop the large, well-crafted, distributed, and fully randomized influx of anomalous network traffic, multiple countermeasures were applied to the Distributed Cloud platform. These countermeasures inadvertently triggered congestion in the configuration path, which delayed load balancer configuration propagation.

RESOLUTION AND NEXT STEPS

RESOLUTION

The F5 Distributed Cloud team deployed a hotfix to the Global Controller and Regional Edges, which restored normal load balancer configuration propagation.

NEXT STEPS: FUTURE EVENT PREVENTION

We will be taking several measures to prevent this service event from reoccurring and to ensure that we are better prepared to react to and recover from similar scenarios more quickly.

  • A remediation has already been implemented: the script was updated, which significantly reduced the time it takes to process mitigation configuration.
  • In parallel, we are working to limit the number of countermeasures required to stop anomalous requests in network traffic.

CLOSING

F5® understands how important reliability of the Distributed Cloud Platform is for customers. F5 will ensure the recommended changes in this document are canonized into our operational Methods of Procedure (MoP) moving forward. We are grateful you have chosen to partner with F5® for critical service delivery and are committed to evolving our platform and tooling to better anticipate and mitigate disruptions to Distributed Cloud Platform services.

APPENDICES

F5 Glossary

https://www.f5.com/services/resources/glossary

Posted Apr 10, 2024 - 13:53 UTC

Resolved
The F5 Distributed Cloud team validated and confirmed that config propagation is restored and no further errors are observed. All other services remain fully operational. This incident has been resolved.
Posted Apr 03, 2024 - 07:44 UTC
Monitoring
The hotfix deployment tasks have been completed for all Regional Edge (RE) clusters, and traffic is observed to be processing normally. The Distributed Cloud team is closely monitoring to ensure the Load Balancer propagation issues are fixed.
Posted Apr 02, 2024 - 20:15 UTC
Update
The Distributed Cloud team continues its efforts to address the delay in config propagation. The team has completed the hotfix deployment to the Global Controller and continues updating the Regional Edges. All other services remain operational. The updated ETA is 20:00 UTC. More updates to follow.
Posted Apr 02, 2024 - 17:00 UTC
Update
Distributed Cloud customers are experiencing delays in config propagation when creating a new Load Balancer or modifying an existing one. All other services remain fully operational. The F5 team continues working on hotfix deployment activities, ETA 18:00 UTC. More updates to follow.
Posted Apr 02, 2024 - 13:46 UTC
Update
We have identified an issue on the Regional Edges and Global Controller. A hotfix will be implemented to resolve the issue.
Posted Apr 02, 2024 - 11:53 UTC
Identified
We have noticed that our system has a high config queue, which is impacting config propagation. Configuration changes that customers have made on the console may not be propagated on time.
Posted Apr 02, 2024 - 11:15 UTC
This incident affected: Customer Support, Docs and WebSite (Software Distribution).