Incident Started: 2022-11-21 09:11 UTC
Resolution Started: 2022-11-21 09:14 UTC
Incident Resolved: 2022-11-21 09:16 UTC
Summary:
The F5XC CDN DNS service was misconfigured during the rollout of a stability improvement for DNS traffic monitoring. The change caused the DNS service to go down, which resulted in DNS resolution failures for CDN Distribution domains.
Root cause:
The CDN DNS service runs as part of the CDN F5XC SaaS Controller. It is the authoritative DNS nameserver for CDN service domains and resolves each CDN service domain to the geographically closest CDN site.
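To illustrate what "geographically closest" resolution means, here is a minimal sketch in Python. It is not the actual F5XC implementation: the site names, coordinates, and the use of great-circle (haversine) distance are all assumptions made for illustration, and this report does not describe how the service determines a client's location.

```python
import math

# Hypothetical CDN sites: name -> (latitude, longitude) in degrees.
# These names and coordinates are illustrative assumptions only.
SITES = {
    "site-frankfurt": (50.11, 8.68),
    "site-new-york": (40.71, -74.01),
    "site-singapore": (1.35, 103.82),
}

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def closest_site(client_location, sites=SITES):
    """Return the name of the site nearest to the client's (lat, lon)."""
    return min(sites, key=lambda name: haversine_km(client_location, sites[name]))
```

For example, a client located near Paris, (48.86, 2.35), would be answered with the hypothetical site-frankfurt entry. When the authoritative nameserver itself is down, as in this incident, no such answer is returned at all, which is why resolution failed for all CDN Distribution domains.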
Last week we observed stability issues in DNS traffic routing, and we decided to roll out a patch as a mitigation. During this rollout, a misconfiguration prevented the service pods from coming up. Our system detected the misconfiguration and rolled back the configuration. The DNS service was down for 5 minutes.
Incident flow:
At 09:10 UTC, the upgrade rollout started. While the rollout was in progress, at 09:11 UTC, DNS service instances were observed to be crashing. The system recognized the crashes, and the service logs indicated the misconfiguration. At 09:14 UTC, SRE started an automated rollback of the configuration. At 09:16 UTC, the DNS service was restored.
Conclusion:
The CDN DNS service misconfiguration brought down the DNS service and caused a DNS resolution outage.
Corrective measures:
SRE will improve the redundancy of the CDN DNS service by introducing blue/green deployments in place of a single deployment with rolling updates. This will mitigate the impact of accidental mistakes or crashes in new version deployments.
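The idea behind a blue/green cutover can be sketched as follows. This is a simplified model under stated assumptions, not the planned F5XC tooling: the Deployment and BlueGreenSwitch names and the is_healthy callback are hypothetical. The point it shows is that the new ("green") deployment must pass a health check before it receives traffic, so a misconfigured version that fails to start never replaces the serving ("blue") one.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Deployment:
    """Hypothetical description of one full copy of the DNS service."""
    name: str
    config_version: str


@dataclass
class BlueGreenSwitch:
    """Tracks which deployment is live and gates cutover on a health check."""
    live: Deployment
    history: List[str] = field(default_factory=list)

    def cut_over(self, candidate: Deployment,
                 is_healthy: Callable[[Deployment], bool]) -> bool:
        # The candidate runs alongside the live deployment; traffic moves
        # only if the candidate proves healthy first.
        if not is_healthy(candidate):
            self.history.append(f"rejected {candidate.name}")
            return False
        self.history.append(f"promoted {candidate.name}")
        self.live = candidate
        return True


# Usage: a candidate whose pods fail to start is rejected,
# and the current deployment keeps serving.
blue = Deployment("blue", "v1")
green = Deployment("green", "v2-misconfigured")
switch = BlueGreenSwitch(live=blue)
promoted = switch.cut_over(green, is_healthy=lambda d: False)
```

In this model, a failed candidate costs nothing in availability because the live deployment is never touched. With a rolling update of a single deployment, by contrast, serving instances are replaced as the rollout proceeds.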