CDN DNS Service Outage
Incident Report for F5 Distributed Cloud
Postmortem

Incident Started:2022-11-21 09:11 UTC 

Resolution started: 2022-11-21 09:14 UTC 

Incident Resolved:2022-11-21 09:16 UTC 

Summary: 

F5XC CDN DNS service was misconfigured during a rollout of stability improvement for DNS traffic monitoring. This change caused the DNS service to go down and resulted in DNS resolution failure for CDN Distribution domains. 

Root cause: 

The CDN DNS service is running as part of the CDN F5XC SaaS Controller. It is the authoritative DNS nameserver for CDN service domains. This service resolves CDN service domains to the geographically closest CDN site.  

Last week we observed stability issues in DNS traffic routing, as a mitigation we decided to rollout the patch. During this rollout, due to a misconfiguration, the service pods failed to come up. Our system detected misconfiguration and rollbacked the configuration. DNS service was down for 5 minutes. 

Incident flow: 

At 09:10 UTC, the upgrade rollout started. When the rollout was in progress, at 09:11 UTC, it was observed that DNS service instances were crashing. System recognized the DNS service instance crashes. The service logs indicated the misconfiguration. At 09:14 UTC, SRE started automated rollback of the configuration. At 09:16 UTC, the DNS service was restored. 

Conclusion 

CDN DNS service misconfiguration brought down the DNS service and hence caused DNS resolution outage. 

Corrective measures 

SRE will improve the redundancy for the CDN DNS service by introducing Blue/Green Deployments instead of single deployment with rolling update. This will mitigate any accidental mistake or crashes in new version deployments.

Posted Nov 21, 2022 - 10:59 UTC

Resolved
F5XC CDN DNS service was misconfigured during a rollout of stability improvement for DNS traffic monitoring. This change caused the DNS service to go down and resulted in DNS resolution failure for CDN Distribution domains.
Posted Nov 21, 2022 - 08:00 UTC