Report Date: 2024-04-03
Incident Date(s): 2024-03-28 – 2024-03-29
On 2024-03-28 at 13:17 UTC, the F5 Distributed Cloud support team received the initial customer report of frequent 503 errors when accessing websites and web applications. Upon receiving the report, the F5 Distributed Cloud team began its investigation and determined that the service event was isolated to the Tokyo Regional Edge. Subsequently, a few more customer reports of 503 errors on websites and web applications were received.
Detailed analysis confirmed that the service event started at 11:00 UTC on 2024-03-28, after new nodes were added to the Tokyo Regional Edge as part of a scheduled platform maintenance.
To mitigate the impact, the F5 Distributed Cloud team disabled Anycast (the network addressing and routing method) on the Tokyo Regional Edge, which automatically rerouted network traffic to other Regional Edges.
Further investigation revealed that the configuration used to direct network traffic to the origin server was not installed correctly on one of the nodes when it was added.
To resolve the service event, the F5 Distributed Cloud team removed the problematic node from the Tokyo Regional Edge and reapplied the configuration. Node functionality was then verified, and traffic was re-enabled on the Tokyo Regional Edge on 2024-03-29 at 03:20 UTC.
The total duration of the service event was 16 hours and 20 minutes.
| Event Summary | Details |
|---|---|
| Start time of Service Event | 2024-03-28 11:00 UTC |
| Conclusion of Service Event | 2024-03-29 03:20 UTC |
| Event duration | 16 hours and 20 minutes |
| Impact | Customers experienced frequent 503 errors on load balancers, impacting the accessibility of their websites and web applications. |
| Root cause | A network configuration for a newly added node on the Tokyo Regional Edge failed to deploy correctly, impacting traffic through the load balancers. |
| DATE | TIME (UTC) | ACTION |
|---|---|---|
| 2024-03-28 | 11:00 | During the planned maintenance window, new resources were added to the Tokyo RE. |
| 2024-03-29 | 00:50 | A testing script reported multiple failures on one of our test load balancers, and the issue was escalated to the Distributed Cloud team. |
| 2024-03-29 | 01:20 | The Distributed Cloud team disabled Anycast on the Tokyo RE to avoid further impact. |
| 2024-03-29 | 01:35 | The Distributed Cloud team identified that the configuration required to reach the public origin server had failed to deploy. |
| 2024-03-29 | 03:20 | The Distributed Cloud team fixed the issue on the problematic node and, after verification, re-enabled traffic on the Tokyo Regional Edge. |
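For illustration only: the failures at 00:50 were surfaced by a synthetic testing script. The sketch below shows the general shape of such a check, assuming a hypothetical probe endpoint and escalation threshold; it is not the actual F5 test harness.

```python
# Illustrative sketch only -- not the actual F5 testing tooling.
# Probes a load balancer and escalates when 5xx responses exceed a threshold.
from urllib import request, error

PROBE_URL = "https://test-lb.example.com/health"  # hypothetical test endpoint
FAILURE_THRESHOLD = 3                             # hypothetical escalation policy

def probe_once(url: str) -> int:
    """Return the HTTP status code for a single probe of the load balancer."""
    try:
        with request.urlopen(url, timeout=5) as resp:
            return resp.status
    except error.HTTPError as exc:
        # 503s arrive here: urllib raises HTTPError for error status codes.
        return exc.code

def should_escalate(statuses: list[int], threshold: int = FAILURE_THRESHOLD) -> bool:
    """Escalate when the count of 5xx responses in recent probes meets the threshold."""
    return sum(1 for s in statuses if s >= 500) >= threshold
```

A scheduler would call `probe_once` periodically and pass the recent statuses to `should_escalate`, paging the on-call team once the threshold is met.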
Yes, the Tokyo Regional Edge is fully operational and serves network traffic.
The root cause of the 503 errors on the load balancers has been identified as a malfunction in the code base deployed as part of a scheduled maintenance on the Tokyo Regional Edge. Because of this, the newly added nodes were failing to egress the Anycast VIP traffic, causing an outage.
The F5 Distributed Cloud team disabled Anycast on the Tokyo Regional Edge, which automatically shifted traffic to other Regional Edges. The problematic node was then removed from production and the required configuration was reapplied to it. The node was tested for functionality and added back to production. Anycast was then re-enabled on the Tokyo Regional Edge, which resumed serving traffic normally.
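The mitigation works because anycast advertises the same VIP from multiple sites, and traffic follows the lowest-cost path among the sites still advertising it. The toy model below illustrates that failover behavior with hypothetical edge names and path costs; it is a conceptual sketch, not F5's routing implementation.

```python
# Toy model of anycast failover: withdrawing one site's advertisement
# shifts traffic to the next-closest site still advertising the VIP.
def select_edge(edges: dict[str, int], advertising: set[str]) -> str:
    """Pick the lowest-cost edge that is still advertising the VIP."""
    candidates = {name: cost for name, cost in edges.items() if name in advertising}
    return min(candidates, key=candidates.get)

# Hypothetical path costs from a client's vantage point.
EDGES = {"tokyo": 10, "osaka": 20, "singapore": 35}
```

With all three edges advertising, `select_edge(EDGES, {"tokyo", "osaka", "singapore"})` returns `"tokyo"`; withdrawing Tokyo's advertisement (as the team did during mitigation) makes the next-closest edge take over, with no client-side change required.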
We will take several measures to prevent this service event from recurring and to ensure that we are better prepared to respond to and recover from similar scenarios more quickly.
F5® understands how important the reliability of the Distributed Cloud Platform is to customers, and specifically how critical the load balancers are to your services. F5 will ensure the recommended changes in this document are incorporated into our operational Methods of Procedure (MoP) going forward. We are grateful you have chosen to partner with F5® for critical service delivery, and we are committed to evolving our platform and tooling to better anticipate and mitigate disruptions to Distributed Cloud Platform services.