Anycast VIP Outage
Incident Report for F5 Distributed Cloud
Postmortem

Synopsis

The default route became unavailable within our global backbone, causing an outage of the VoltMesh and VoltStack services that depend on it. BGP customers receiving a full view were not impacted.

Timeline and description of the incident:

Nov 18 03:49:26 AM UTC The default route becomes unavailable

Nov 18 04:07:43 AM UTC The default route becomes available
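
The length of the outage follows directly from the two timestamps above. A minimal sketch of that arithmetic, assuming the year 2020 from the posting date of this report (the helper name parse_utc is illustrative only):

```python
from datetime import datetime, timezone

def parse_utc(stamp: str) -> datetime:
    """Parse a 'YYYY-MM-DD HH:MM:SS' string as a UTC timestamp."""
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)

# Timestamps taken from the timeline; the year is assumed from the report date.
route_lost = parse_utc("2020-11-18 03:49:26")
route_restored = parse_utc("2020-11-18 04:07:43")

outage = route_restored - route_lost
minutes, seconds = divmod(int(outage.total_seconds()), 60)
print(f"Default route unavailable for {minutes} min {seconds} s")
# prints: Default route unavailable for 18 min 17 s
```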

Analysis and impact of the incident:

On Nov. 18th at 03:49:26 AM UTC, we suffered an unforeseen network incident that normally should not have caused any impact. We traced the root cause to a configuration change put in place in Sept 2020. That change made the default route unavailable, which impacted services that depend on Volterra’s anycast IP addresses.

On Sep. 2nd, 2020, we suffered a global routing issue caused by an abnormal peak in CPU usage on our route reflectors, itself a result of ongoing problems with CenturyLink’s global network. At that time, to reduce the impact of the outage, we applied a configuration change that modified the behaviour and propagation of the default route within our backbone. We solved that problem by redesigning the route reflector infrastructure. As you may have noticed, over the past few weeks we carried out maintenance to complete this new implementation.

However, the configuration fix was still in place. During maintenance on the CenturyLink network on Nov 18th, this fix made the default route unavailable and impacted the following services:

  • BGP transit customers with Default route only
  • VoltMesh (Cloud Protect)
  • VoltStack
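
Why customers with a default route only were affected while full-view customers were not can be illustrated with longest-prefix matching: a router with more specific routes still has a match after 0.0.0.0/0 disappears, while a default-only router has nothing left. The sketch below is an illustration under stated assumptions, not a model of our backbone; the prefix 203.0.113.0/24 is a documentation range, and the next-hop label and function names are hypothetical.

```python
import ipaddress

DEFAULT_ROUTE = ipaddress.ip_network("0.0.0.0/0")

def lookup(table, destination):
    """Return the next hop of the most specific matching route, or None."""
    addr = ipaddress.ip_address(destination)
    matches = [(net, hop) for net, hop in table.items() if addr in net]
    if not matches:
        return None  # no route: traffic to this destination is dropped
    # Longest-prefix match: the route with the largest prefix length wins.
    return max(matches, key=lambda match: match[0].prefixlen)[1]

def withdraw_default(table):
    """Return a copy of the table without the default route."""
    return {net: hop for net, hop in table.items() if net != DEFAULT_ROUTE}

# Hypothetical routing tables for the two kinds of BGP customer.
default_only = {DEFAULT_ROUTE: "backbone"}
full_view = {
    DEFAULT_ROUTE: "backbone",
    ipaddress.ip_network("203.0.113.0/24"): "backbone",
}

destination = "203.0.113.10"
print(lookup(withdraw_default(default_only), destination))  # no route left
print(lookup(withdraw_default(full_view), destination))     # specific route still matches
```

With the default route withdrawn, the default-only table returns no next hop, while the full-view table still resolves the destination through its more specific prefix.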

During the incident, our Network team identified this routing configuration issue and applied changes on the backbone to restore the default route advertisement to customers. The outage was resolved at 04:07:43 AM UTC and service returned to normal. We continued the investigation after the emergency fix, found that the root cause was a combination of multiple factors, and have applied a permanent fix for this issue.

Posted Nov 18, 2020 - 14:53 UTC

Resolved
We are investigating the root cause of a 16-minute outage of our Global Anycast VIPs. At 03:49 UTC, we observed problems with our network partner, CenturyLink, which exposed a misconfiguration in our routing infrastructure. We are investigating this issue in more detail and will publish a postmortem report in the next 36 to 48 hrs. We apologize for this downtime event.
Posted Nov 18, 2020 - 12:00 UTC