LB on Tokyo RE intermittently experiencing high error rates
Incident Report for F5 Distributed Cloud
Postmortem

F5® Distributed Cloud Services – Load Balancer

Root Cause Analysis for 503 Errors in Load Balancers on the Tokyo Regional Edge

Report Date: 2024-04-03

Incident Date(s): 2024-03-28 – 2024-03-29

EVENT SUMMARY

On 2024-03-28 at 13:17 UTC, the F5 Distributed Cloud support team received the initial customer report of frequent 503 errors when accessing websites and web applications. Upon receiving the report, the F5 Distributed Cloud team began investigating and determined that the service event was isolated to the Tokyo Regional Edge. Several additional customer reports of 503 errors on websites and web applications were subsequently received.

Detailed analysis confirmed that the service event started at 2024-03-28 11:00 UTC, after new nodes were added to the Tokyo Regional Edge as part of scheduled platform maintenance.

To mitigate the impact, the F5 Distributed Cloud team disabled Anycast routing on the Tokyo Regional Edge, which automatically rerouted network traffic to other Regional Edges.

Further investigation revealed that the configuration used for directing network traffic to the origin servers was not installed correctly on one of the nodes when it was added.
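
For illustration only, a minimal sketch of the kind of post-provisioning check that could catch this condition, assuming a hypothetical per-node status endpoint that reports a digest of the installed origin-routing configuration (the endpoint, node addresses, and field names below are assumptions, not part of the F5 Distributed Cloud API):

```python
# Hypothetical post-provisioning check: confirm every new node reports the
# intended origin-routing configuration before it is allowed to take traffic.
# The status endpoint, port, and JSON field are illustrative assumptions.
import hashlib
import json
import urllib.request

EXPECTED_CONFIG_PATH = "origin-routing.json"   # configuration intended for the rollout
NEW_NODES = ["10.0.12.21", "10.0.12.22"]       # nodes added during the maintenance window


def expected_digest(path):
    """Digest of the configuration we intended to install on every new node."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def node_digest(node_ip):
    """Digest the node actually reports (hypothetical status endpoint)."""
    url = f"http://{node_ip}:8080/status/origin-routing"
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return json.load(resp).get("config_sha256")
    except OSError:
        return None   # node unreachable or status endpoint not serving yet


def main():
    want = expected_digest(EXPECTED_CONFIG_PATH)
    for node in NEW_NODES:
        got = node_digest(node)
        status = "OK" if got == want else "CONFIG MISSING OR STALE"
        print(f"{node}: {status}")


if __name__ == "__main__":
    main()
```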

To resolve the service event, the F5 Distributed Cloud team removed the problematic node from the Tokyo Regional Edge and reapplied the configuration. Node functionality was then verified, and traffic was re-enabled on the Tokyo Regional Edge on 2024-03-29 at 03:20 UTC.

The total duration of the service event was 16 hours and 20 minutes.

WHAT HAPPENED?

INCIDENT DETAILS
Start time of service event: 2024-03-28 11:00 UTC
Conclusion of service event: 2024-03-29 03:20 UTC
Event duration: 16 hours and 20 minutes
Impact: Customers experienced frequent 503 errors on load balancers, impacting the accessibility of their websites and web applications.
Root cause: A network configuration for a newly added node on the Tokyo Regional Edge was not applied correctly, impacting traffic served via the load balancers.
TIMELINE OF EVENTS
DATE TIME (UTC) ACTION
2024-03-28 11:00 During the planned maintenance window, new resources were added to the Tokyo RE.
2024-03-29 00:50 A testing script reported multiple failures on one of our test load balancers, and the issue was escalated to the Distributed Cloud team.
2024-03-29 01:20 The Distributed Cloud team disabled Anycast on the Tokyo RE to limit customer impact.
2024-03-29 01:35 The Distributed Cloud team identified that the configuration required to reach the public origin servers had failed to deploy on one of the new nodes.
2024-03-29 03:20 The Distributed Cloud team fixed the issue on the problematic node and re-enabled traffic on the Tokyo Regional Edge after verification.

IS THE SERVICE EVENT FULLY RESOLVED?

Yes, the Tokyo Regional Edge is fully operational and serves network traffic.

ROOT CAUSE

The root cause of the 503 errors on the load balancers has been identified as a malfunction in the code base deployed as part of scheduled maintenance on the Tokyo Regional Edge. Because of this malfunction, the origin-routing configuration was not installed correctly on one of the newly added nodes, and that node failed to egress Anycast VIP traffic, causing the outage.

RESOLUTION AND NEXT STEPS

RESOLUTION

The F5 Distributed Cloud team disabled Anycast on the Tokyo Regional Edge, which automatically shifted traffic to other Regional Edges. The problematic node was then removed from production and the required configuration was reapplied to it. The node was tested for functionality and added back into production, and Anycast was re-enabled on the Tokyo Regional Edge so that it could serve traffic normally.
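
The verification step can be illustrated with a minimal sketch: before the repaired node rejoins the Regional Edge, send a streak of test requests through it and require every response to be non-5xx. The node address, test hostname, path, and threshold below are hypothetical, not F5 tooling.

```python
# Hypothetical pre-re-enable check: send test requests through the repaired
# node and require a clean streak of non-5xx responses before it rejoins
# the Regional Edge. Address, hostname, path, and threshold are assumptions.
import http.client

NODE_ADDR = "10.0.12.22"             # repaired node, addressed directly
TEST_HOST = "lb-canary.example.com"  # synthetic/test load balancer hostname
REQUIRED_OK_STREAK = 20              # consecutive successes required


def probe_once():
    """Send one request through the node; return HTTP status, or 0 if unreachable."""
    conn = http.client.HTTPConnection(NODE_ADDR, 80, timeout=5)
    try:
        conn.request("GET", "/healthz", headers={"Host": TEST_HOST})
        return conn.getresponse().status
    except OSError:
        return 0
    finally:
        conn.close()


def node_is_healthy():
    """Require a clean streak of non-5xx responses before the node rejoins rotation."""
    for _ in range(REQUIRED_OK_STREAK):
        status = probe_once()
        if status == 0 or status >= 500:   # e.g. 503 means origin egress is still broken
            print(f"probe failed (status {status}); keeping node out of rotation")
            return False
    print(f"{REQUIRED_OK_STREAK} consecutive successful probes; node can rejoin rotation")
    return True


if __name__ == "__main__":
    node_is_healthy()
```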

NEXT STEPS: FUTURE EVENT PREVENTION

We will be taking several measures to prevent this service event from recurring and to ensure that we are better prepared to respond to and recover from similar scenarios more quickly.

  • The F5 Distributed Cloud team has introduced procedural improvements to better validate configuration functionality when adding new resources to Regional Edges.
  • An improved monitoring and alerting system will be implemented for earlier detection of traffic impact on load balancers; a minimal illustration of such a probe is sketched after this list.
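
As a rough illustration of the second item (not F5's monitoring implementation): a synthetic probe that polls a test load-balancer endpoint and alerts when the share of 503 responses in a sliding window crosses a threshold. The URL, window size, interval, and threshold are assumptions.

```python
# Hypothetical synthetic monitor: poll a test LB endpoint and alert when the
# share of 503 responses in a sliding window crosses a threshold. The URL,
# window size, interval, and threshold are illustrative assumptions.
import collections
import time
import urllib.error
import urllib.request

PROBE_URL = "https://lb-canary.example.com/healthz"  # synthetic test endpoint
WINDOW = 60          # number of recent probes considered
INTERVAL_SEC = 10    # seconds between probes
ALERT_RATIO = 0.05   # alert if more than 5% of recent probes return 503


def probe(url):
    """Return the HTTP status of one probe (0 when unreachable)."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code
    except OSError:
        return 0


def main():
    recent = collections.deque(maxlen=WINDOW)
    while True:
        recent.append(probe(PROBE_URL))
        ratio = recent.count(503) / len(recent)
        if ratio > ALERT_RATIO:
            # In practice this would page on-call or open an incident.
            print(f"ALERT: 503 ratio {ratio:.0%} over last {len(recent)} probes")
        time.sleep(INTERVAL_SEC)


if __name__ == "__main__":
    main()
```

A signal of this kind, scoped per Regional Edge, would be aimed at surfacing an elevated 503 rate closer to the 11:00 UTC start of such an event rather than at the 00:50 UTC escalation seen in this timeline.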

CLOSING

F5® understands how important the reliability of the Distributed Cloud Platform is for customers, and specifically how critical the load balancers are to your services. F5 will ensure that the recommended changes in this document are codified into our operational Methods of Procedure (MoP) going forward. We are grateful that you have chosen to partner with F5® for critical service delivery, and we are committed to evolving our platform and tooling to better anticipate and mitigate disruptions to Distributed Cloud Platform services.

APPENDICES

F5 Glossary

https://www.f5.com/services/resources/glossary

Posted Apr 03, 2024 - 13:12 UTC

Resolved
We detected elevated 503 errors for customers using load balancers on the Tokyo RE. At 01:20 UTC (10:20 JST), we promptly rerouted traffic to nearby regions as a temporary workaround. Further investigation revealed that a node in the Tokyo DC was having difficulty connecting to customer origin servers, resulting in 503 errors for some client requests.

We swiftly identified and removed the problematic node and resolved the issue on it. The Tokyo RE is now fully operational and serving traffic as of 03:20 UTC (12:20 JST).

Please note that customer edge connections to Tokyo RE should not have been impacted by this incident.
Posted Mar 28, 2024 - 23:00 UTC