F5XC traffic processing degradation in Seattle, Ashburn and London PoPs.
Incident Report for F5 Distributed Cloud
Postmortem

F5® Distributed Cloud Services – Web Application Firewall 

Root Cause Analysis for Traffic Degradation in Seattle, Ashburn and London Points of Presence (PoPs)

Report Date: 2024-04-10 

Incident Date(s): 2024-04-01 

EVENT SUMMARY

On 2024-04-01 at around 18:30 UTC, the F5 Distributed Cloud team noticed an influx of anomalous requests in network traffic coming from a non-trusted system. This degraded network traffic processing performance. Customers subsequently reported 502 errors while accessing websites and web applications.

Analysis of the incoming traffic showed that it affected the application layer, which resulted in 502 errors for HTTP/HTTPS traffic. The F5 Distributed Cloud team also determined that the impact was isolated to the Seattle, Ashburn and London Points of Presence (PoPs).

The F5 Distributed Cloud team applied countermeasures to stop the surge in network traffic, which improved performance and restored the service at 21:45 UTC.

The F5 Distributed Cloud support team continued to monitor the platform for stability, and no new reports of failures were received.

The total duration of the service event was 3 hours and 15 minutes. 

WHAT HAPPENED?

INCIDENT DETAILS 
Start time of Service Event  2024-04-01 18:30 UTC 
Conclusion of Service Event  2024-04-01 21:45 UTC 
Event duration  3 hours and 15 minutes 
Impact  Customers experienced 502 errors on a subset of traffic routed through the Seattle, Ashburn and London PoPs, which affected website/web application accessibility.
Root cause  F5 Distributed Cloud WAF service experienced performance degradation due to an influx of anomalous requests in network traffic coming from non-trusted systems. 
TIMELINE OF EVENTS 
Date Time UTC Action
2024-04-01  18:30  F5 Distributed Cloud team noticed an influx of anomalous traffic from non-trusted systems. 
2024-04-01  19:07  Multiple automatic mitigations were applied, but due to the nature of the traffic the auto-mitigations had minimal effect.  
2024-04-01  19:40  Received the initial customer report of 502 errors while accessing websites/web applications. A few other customers subsequently reported 502 errors.
2024-04-01  21:00  Additional countermeasures were applied by the F5 Distributed Cloud team.
2024-04-01  21:45  The Distributed Cloud platform stabilized, and services were confirmed to be restored. End of service event. 
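
The stated event duration can be cross-checked against the timeline. The following is a minimal sketch, not part of F5's tooling: the timestamps are copied from the table above, while the short event labels and the helper names (parse_utc, minutes_between, format_duration) are assumptions introduced only for illustration.

```python
from datetime import datetime, timezone

# Timestamps copied from the timeline above (all UTC).
# The short labels are paraphrases added for this illustration.
TIMELINE = [
    ("2024-04-01 18:30", "Anomalous traffic noticed"),
    ("2024-04-01 19:07", "Automatic mitigations applied"),
    ("2024-04-01 19:40", "First customer report of 502 errors"),
    ("2024-04-01 21:00", "Additional countermeasures applied"),
    ("2024-04-01 21:45", "Services confirmed restored"),
]


def parse_utc(stamp):
    """Parse a 'YYYY-MM-DD HH:MM' string as a timezone-aware UTC datetime."""
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M").replace(tzinfo=timezone.utc)


def minutes_between(start, end):
    """Return the whole minutes elapsed from start to end."""
    return int((parse_utc(end) - parse_utc(start)).total_seconds() // 60)


def format_duration(minutes):
    """Render a minute count in the report's 'H hours and M minutes' style."""
    hours, mins = divmod(minutes, 60)
    return f"{hours} hours and {mins} minutes"


# Total duration: first timeline entry to last timeline entry.
total = minutes_between(TIMELINE[0][0], TIMELINE[-1][0])
print("Event duration:", format_duration(total))

# Minutes elapsed since the start of the event for each entry.
for stamp, label in TIMELINE:
    print(f"+{minutes_between(TIMELINE[0][0], stamp):>3} min  {label}")
```

The first and last entries are 195 minutes apart, which matches the reported duration of 3 hours and 15 minutes.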

IS THE SERVICE EVENT FULLY RESOLVED? 

Yes, the Distributed Cloud WAF service is fully operational and processing traffic normally. 

ROOT CAUSE 

The incoming anomalous requests interfered with normal traffic processing, which resulted in 502 errors for a small percentage of requests.

RESOLUTION AND NEXT STEPS

RESOLUTION

The F5 Distributed Cloud team applied technical measures to mitigate the anomalous traffic and restore normal traffic processing capabilities.

NEXT STEPS: FUTURE EVENT PREVENTION

We have taken a few measures to prevent this service event from recurring and to ensure that we are better prepared to react to, and recover more quickly from, similar scenarios.

  • A known logging issue in one of our F5 Distributed Cloud components has been fixed, and the component has been redeployed. This will make our auto-mitigation process more efficient.
  • An improved monitoring system has been implemented for better detection of incoming request anomalies and errors. 

CLOSING

F5® understands how important the reliability of the Distributed Cloud Platform is for customers. F5 will ensure the recommended changes in this document are incorporated into our operational Methods of Procedure (MoP) moving forward. We are grateful you have chosen to partner with F5® for critical service delivery and are committed to evolving our platform and tooling to better anticipate and mitigate disruptions to Distributed Cloud Platform services.

APPENDICES

F5 Glossary

https://www.f5.com/services/resources/glossary

Posted Apr 10, 2024 - 13:46 UTC

Resolved
The F5 Distributed Cloud team noticed an influx of anomalous traffic on the F5XC Platform, impacting some of the F5 traffic processing capabilities. As a result, some customers may have experienced 502 errors when sending traffic requests to the Seattle (primary) and Ashburn/London (secondary) PoPs.

The F5 Distributed Cloud team successfully restored services to normal operations on 2024-04-01 at 21:45 UTC. The full root cause will be pursued through Problem Management, per our process.
Posted Apr 01, 2024 - 21:00 UTC