F5XC Management Plane disruption

Incident Report for F5

Postmortem

Incident started: 2022-11-23 05:25 UTC

Resolution started: 2022-11-23 05:27 UTC

Incident resolved: 2022-11-23 05:40 UTC

Summary:

The F5XC SaaS management plane runs in AWS and was disrupted when some of our controller services ran out of memory. After detailed analysis, we concluded that the cause was noisy neighbors running on the same physical nodes as the F5XC SaaS management services. There was no impact on the data plane.

Root cause:

The F5XC SaaS Controller is deployed in AWS. Due to capacity constraints (AWS states that it is addressing the capacity issues on an expedited basis), a few nodes running F5XC services did not receive the resources they required (CPU, memory, etc.) because other workloads were scheduled on the same nodes. This caused the services to fail.

Incident flow:

At 05:25 UTC, the F5XC monitoring system alerted that services running in the SaaS management plane had started crashing. SRE engineers identified a few AWS instances as overloaded and drained those nodes so that they stopped processing traffic. At 05:27 UTC, all services running on the affected nodes were evacuated and rescheduled on other nodes. As a precautionary measure, more AWS resources were added to the SaaS management plane, and all services were restored by 05:40 UTC. The F5XC Console was completely inaccessible for 5 minutes, while intermittent disruption lasted for 15 minutes.
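The intervals in this timeline can be checked directly from the timestamps given in this report. The following is a minimal sketch; the helper names `parse_utc` and `minutes_between` are illustrative and not part of any F5XC tooling.

```python
from datetime import datetime, timezone

FMT = "%Y-%m-%d %H:%M"


def parse_utc(stamp: str) -> datetime:
    """Parse a 'YYYY-MM-DD HH:MM' timestamp and mark it as UTC."""
    return datetime.strptime(stamp, FMT).replace(tzinfo=timezone.utc)


def minutes_between(start: datetime, end: datetime) -> int:
    """Whole minutes elapsed from start to end."""
    return int((end - start).total_seconds() // 60)


# Timestamps as stated in this report.
incident_started = parse_utc("2022-11-23 05:25")
resolution_started = parse_utc("2022-11-23 05:27")
incident_resolved = parse_utc("2022-11-23 05:40")

time_to_mitigation = minutes_between(incident_started, resolution_started)
total_duration = minutes_between(incident_started, incident_resolved)

print(f"Alert to start of resolution: {time_to_mitigation} min")
print(f"Alert to full restoration: {total_duration} min")
```

The second figure matches the 15 minutes of intermittent disruption reported above. The 5 minutes of complete Console inaccessibility cannot be derived from these three timestamps alone.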

Conclusion:

While we are still investigating the root cause, our current assessment points to an oversubscribed AWS node as the cause of the outage.

Corrective measures:

The F5XC SaaS Management SRE team has added additional resource alerts that will provide early warning going forward. In addition, the SRE team will work with development engineering to put processes in place to identify services that are likely to consume more resources and to proactively add capacity for them.
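As an illustration of what an early-warning resource alert might look like, here is a minimal sketch of a two-level memory threshold check. It is not the alerting that the SRE team deployed: the `NodeSample` type, the function names, and the 80% and 90% thresholds are all assumptions chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical thresholds; real values would be tuned per service.
WARN_RATIO = 0.80
CRIT_RATIO = 0.90


@dataclass
class NodeSample:
    """One memory reading for a node (illustrative type)."""
    node: str
    memory_used_bytes: int
    memory_total_bytes: int


def classify(sample: NodeSample) -> str:
    """Return 'ok', 'warning', or 'critical' based on memory utilisation."""
    if sample.memory_total_bytes <= 0:
        raise ValueError("memory_total_bytes must be positive")
    ratio = sample.memory_used_bytes / sample.memory_total_bytes
    if ratio >= CRIT_RATIO:
        return "critical"
    if ratio >= WARN_RATIO:
        return "warning"
    return "ok"


def early_warnings(samples: list[NodeSample]) -> list[tuple[str, str]]:
    """List (node, level) pairs for every node that is not 'ok'."""
    results = []
    for sample in samples:
        level = classify(sample)
        if level != "ok":
            results.append((sample.node, level))
    return results


if __name__ == "__main__":
    gib = 1024 ** 3
    readings = [
        NodeSample("node-a", 50 * gib, 100 * gib),
        NodeSample("node-b", 85 * gib, 100 * gib),
        NodeSample("node-c", 95 * gib, 100 * gib),
    ]
    for node, level in early_warnings(readings):
        print(f"{node}: {level}")
```

The point of the warning level is to flag a node before it reaches the state that triggers out-of-memory failures, leaving time to drain it or add capacity.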

Posted Nov 23, 2022 - 10:46 UTC

Resolved

Posted Nov 23, 2022 - 05:30 UTC