New Load Balancer deployments exhibiting name resolution failures
Incident Report for F5 Distributed Cloud
Postmortem

F5® Distributed Cloud Services – Load Balancer

Root Cause Analysis for New Load Balancer deployments exhibiting name resolution failures

Report Date: 2024-04-23

Incident Date(s): 2024-04-11

EVENT SUMMARY

On 2024-04-11 at 12:45 UTC, the F5 Distributed Cloud support team detected through internal monitoring that some load balancers were failing to resolve hostnames. The ensuing investigation showed that the issue was confined to load balancers that were newly created or updated during the incident window; existing load balancers were unaffected.

Further examination pinpointed the onset of the hostname resolution failure to the moment when the number of DNS records per zone reached a set threshold, which in turn disrupted the creation of DNS A records for load balancers.

To address the issue, the F5 Distributed Cloud team raised the threshold and restarted the DNS A record creation process on 2024-04-11 at 19:04 UTC. The affected load balancers then recovered one after another, and at 19:40 UTC the team confirmed that the hostname resolution failures had ceased.

The service disruption lasted for 6 hours and 55 minutes. Throughout this period, no customer reports regarding the issue were received.

WHAT HAPPENED?

INCIDENT DETAILS
Start time of Service Event: 2024-04-11 12:45 UTC
Conclusion of Service Event: 2024-04-11 19:40 UTC
Event duration: 6 hours 55 minutes
Impact: Newly created or recently updated load balancers exhibited hostname resolution failures. Customers might have experienced failures while accessing web applications.
Root cause: The DNS records-per-zone threshold was exhausted, which blocked the creation of DNS A records and resulted in hostname resolution failures on load balancers.
TIMELINE OF EVENTS
DATE TIME (UTC) ACTION
2024-04-11 12:45 Internal monitoring detected hostname resolution issue on load balancers.
2024-04-11 13:09 F5 Distributed Cloud team started to investigate and identified that new and/or recently modified load balancers were affected.
2024-04-11 17:43 It was identified that DNS A records were not being created because the DNS records-per-zone threshold had been exhausted.
2024-04-11 19:04 F5 Distributed Cloud team increased the threshold and reinitiated the DNS A record creation.
2024-04-11 19:40 The F5 Distributed Cloud team validated and confirmed that the name resolution failures on load balancers had been resolved and that no further issues were observed. End of service event.
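
The intervals between the timeline entries can be checked with a short calculation. The sketch below is illustrative only: the timestamps are copied from the timeline above, while the phase labels and function names are assumptions introduced for this example.

```python
from datetime import datetime

# Timestamps (UTC) copied from the timeline above.
# The short phase labels are illustrative, not taken from the report.
TIMELINE = [
    ("detected", "2024-04-11 12:45"),
    ("investigation started", "2024-04-11 13:09"),
    ("cause identified", "2024-04-11 17:43"),
    ("threshold increased", "2024-04-11 19:04"),
    ("resolution confirmed", "2024-04-11 19:40"),
]


def minutes_between(start, end):
    """Return the whole minutes elapsed between two 'YYYY-MM-DD HH:MM' strings."""
    fmt = "%Y-%m-%d %H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def phase_durations(timeline):
    """Return (from_label, to_label, minutes) for each consecutive pair of entries."""
    return [
        (a[0], b[0], minutes_between(a[1], b[1]))
        for a, b in zip(timeline, timeline[1:])
    ]


def total_minutes(timeline):
    """Return the minutes between the first and last timeline entries."""
    return minutes_between(timeline[0][1], timeline[-1][1])
```

With the timestamps above, the phases last 24, 274, 81, and 36 minutes, for a total of 415 minutes (6 hours 55 minutes), which matches the stated event duration.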

IS THE SERVICE EVENT FULLY RESOLVED?

Yes, the hostname resolution issue with load balancers is resolved.

ROOT CAUSE

When a load balancer of any type is created, a virtual host DNS object is created along with it. An internal service monitors the creation and deletion of these objects and attempts to create a corresponding DNS A record on the DNS infrastructure. This A record points to the IP address configured on the load balancer. Over time, the number of records reached the DNS records-per-zone limit, which prevented further DNS A records from being created and broke name resolution for the affected load balancers. Because no alert specifically tracked this limit, its exhaustion went undetected until it triggered the service event.
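
The failure mode described above can be illustrated with a small model. This is a minimal sketch under stated assumptions, not F5's implementation: the `Zone` class, `ZoneLimitExceeded`, and `reconcile` are hypothetical names, and the limit values are arbitrary.

```python
from dataclasses import dataclass, field


class ZoneLimitExceeded(Exception):
    """Raised when a zone already holds its maximum number of records."""


@dataclass
class Zone:
    """Hypothetical model of a DNS zone with a per-zone record limit."""
    name: str
    record_limit: int
    records: dict = field(default_factory=dict)  # hostname -> IP address

    def create_a_record(self, hostname, ip):
        # Simplification: only new hostnames count against the limit.
        # The report does not describe how updates are handled internally.
        if hostname not in self.records and len(self.records) >= self.record_limit:
            raise ZoneLimitExceeded(
                f"{self.name}: limit of {self.record_limit} records reached"
            )
        self.records[hostname] = ip


def reconcile(zone, virtual_hosts):
    """Try to create an A record per virtual host; return the hostnames that failed."""
    failed = []
    for hostname, ip in virtual_hosts.items():
        try:
            zone.create_a_record(hostname, ip)
        except ZoneLimitExceeded:
            failed.append(hostname)
    return failed
```

In this model, a zone at its limit keeps serving its existing records while `reconcile` reports every new hostname as failed. Raising `record_limit` and calling `reconcile` again creates the missing records, which mirrors the sequence described in this report.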

RESOLUTION AND NEXT STEPS

RESOLUTION

The F5 Distributed Cloud team increased the DNS records-per-zone threshold, which allowed DNS A records to be created again. This restored hostname resolution.

NEXT STEPS: FUTURE EVENT PREVENTION

We are taking several measures to prevent this service event from recurring and to ensure that we can react to and recover from similar scenarios more quickly.

  • The F5 Distributed Cloud team will tune the metric collection procedure for DNS A records. This will be addressed in upcoming platform releases.
  • The F5 Distributed Cloud team is working to enhance the graphical representation of the DNS A record dashboard for advanced tracking. This is anticipated to be added in upcoming platform releases.
  • The F5 Distributed Cloud team is working on the code base to improve error handling for faster detection and restoration of such events. This will also be addressed in upcoming platform releases.
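
One way such tracking could work is to alert on per-zone utilization before the limit is reached. The sketch below is an assumption for illustration only: the function names and the 80% and 95% thresholds are hypothetical and do not describe the planned F5 changes.

```python
def zone_utilization(record_count, record_limit):
    """Return the fraction of the per-zone record limit currently in use."""
    if record_limit <= 0:
        raise ValueError("record_limit must be positive")
    return record_count / record_limit


def alert_level(record_count, record_limit, warn_at=0.80, critical_at=0.95):
    """Map zone utilization to an alert level.

    The default thresholds are hypothetical values chosen for this example.
    """
    usage = zone_utilization(record_count, record_limit)
    if usage >= critical_at:
        return "critical"
    if usage >= warn_at:
        return "warning"
    return "ok"
```

Alerting on a ratio rather than on an absolute count keeps the check valid after the limit itself is raised, as it was during this event.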

CLOSING

F5® understands how important reliability of the Distributed Cloud Platform is for customers. F5 will ensure the recommended changes in this document are canonized into our operational Methods of Procedure (MoP) moving forward. We are grateful you have chosen to partner with F5® for critical service delivery and are committed to evolving our platform and tooling to better anticipate and mitigate disruptions to Distributed Cloud Platform services.

APPENDICES

F5 Glossary

https://www.f5.com/services/resources/glossary

Posted Apr 23, 2024 - 23:40 UTC

Resolved
The F5 Distributed Cloud team validated and confirmed that the name resolution failures on load balancers have been resolved and that no further issues are observed. All other services remain fully operational. This incident has been resolved.
Posted Apr 11, 2024 - 22:26 UTC
Monitoring
The issue with load balancers exhibiting name resolution failures has been resolved. We are continuing to monitor the system.
Posted Apr 11, 2024 - 20:04 UTC
Identified
The issue has been identified and a fix is being prepared. As a workaround, we advise using the IP address instead of the CNAME. We'll provide updates as we progress.
Posted Apr 11, 2024 - 17:58 UTC
Investigating
The F5 Distributed Cloud team is currently investigating issues affecting DNS resolution for a subset of newly created and recently updated TCP load balancers. Our team is actively working to resolve this matter, and we'll provide regular updates as we progress.
Posted Apr 11, 2024 - 16:51 UTC
This incident affected: Services (DNS).