HTTP Load Balancer with HTTP/1.1 health checks failing after release
Incident Report for F5 Distributed Cloud
Postmortem

Incident Started: 2023-05-11 11:32 UTC

Resolution Started: 2023-05-11 11:55 UTC 

Incident Resolved: 2023-05-11 23:02 UTC

 

Summary: 

Following the release of a new software version by F5XC on May 11th, the servers on our regional edge (RE) were upgraded. Subsequently, several customers utilising the HTTP Load Balancer with HTTP/1.1 health checks encountered health check failures. With no healthy origin endpoints available, the affected HTTP Load Balancers returned 503 errors to client requests.

Root cause: 

The issue arose from a bug in a recent change to the default configuration of HTTP health checks. Previously, health checks used HTTP/1.1, but because of the bug, TLS sessions for health checks began to include ALPN negotiation. As a result, the TLS sessions were negotiated as HTTP/2 while the HTTP requests inside those sessions were still sent as HTTP/1.1. This mismatch caused the health checks to fail, which in turn resulted in HTTP 503 responses to user requests. The problem affected only configurations with HTTP/1.1 health check settings. Because TCP and HTTPS health checks continued to work normally, it took longer to identify the issue.
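The mismatch described above can be illustrated with a small sketch. It is not the F5XC health checker; the function names are hypothetical, the server-preference selection rule is a simplification of how many TLS servers choose an ALPN protocol, and the assumption that the health checker offered "h2" is ours, inferred from the statement that sessions were negotiated as HTTP/2. Only the Python standard library `ssl` calls are real APIs.

```python
import ssl
from typing import Optional, Sequence

# ALPN protocol identifiers for HTTP/1.1 and HTTP/2 over TLS.
ALPN_HTTP1 = "http/1.1"
ALPN_HTTP2 = "h2"


def negotiate_alpn(client_offer: Sequence[str],
                   server_prefs: Sequence[str]) -> Optional[str]:
    """Simplified model: the server picks its most preferred protocol
    that the client also offered. Returns None if nothing matches."""
    for proto in server_prefs:
        if proto in client_offer:
            return proto
    return None


def probe_is_consistent(negotiated: Optional[str], request_protocol: str) -> bool:
    """A probe is consistent if no ALPN protocol was selected, or if the
    selected protocol matches the protocol the request is written in."""
    return negotiated is None or negotiated == request_protocol


def http1_health_check_context() -> ssl.SSLContext:
    """Hypothetical client context for an HTTP/1.1 probe over TLS.
    It advertises only http/1.1, so an origin cannot select h2."""
    ctx = ssl.create_default_context()
    ctx.set_alpn_protocols([ALPN_HTTP1])
    return ctx


if __name__ == "__main__":
    origin_prefs = [ALPN_HTTP2, ALPN_HTTP1]  # assumed origin preference order

    # Assumed buggy behaviour: h2 is offered, but the probe is HTTP/1.1.
    buggy = negotiate_alpn([ALPN_HTTP2, ALPN_HTTP1], origin_prefs)
    print(buggy, probe_is_consistent(buggy, ALPN_HTTP1))

    # Offering only http/1.1 keeps the session and the request aligned.
    pinned = negotiate_alpn([ALPN_HTTP1], origin_prefs)
    print(pinned, probe_is_consistent(pinned, ALPN_HTTP1))
```

In a real probe, `SSLSocket.selected_alpn_protocol()` reports what the origin actually selected after the handshake, which could be compared against the request protocol in the same way.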

Incident flow: 

  • On May 11th at 08:00 UTC, we initiated the upgrade of our regional edge and completed it by 11:00 UTC, after which customer traffic started to use the new software release.
  • At 11:32 UTC, F5 support received a ticket from a customer reporting a health check failure. Since all locations had already been upgraded, rolling back to the previous release would have taken considerable time. Consequently, the focus shifted towards providing a workaround.
  • Over the next few hours, similar reports were received from multiple customers.  

  • At 14:10 UTC, we provided an update to customers with a workaround to remove the health check. We observed that customers who removed the health check no longer experienced issues.  

  • At 16:30 UTC, the engineering team identified the issue with ALPN on the health check, which affected only HTTP/1.1 communication towards the upstream server. Because TCP and HTTPS health checks continued to work normally, it took longer to identify the issue. A fixed image that reverted the ALPN change was prepared and tested.

  • From 19:20 UTC, we began applying the hotfix to the regional edge and confirmed that the system was back to normal by May 11th at 23:02 UTC. 

Corrective measures:  

  • We are committed to continuously improving our testing processes by enhancing our testing infrastructure to include more comprehensive origin server simulations and protocol compatibility checks. 
  • Additionally, we will implement enhanced monitoring and alerting mechanisms to proactively detect and respond to similar issues in the future.
Posted May 13, 2023 - 16:28 UTC

Resolved
The issue has been resolved and the fix has been applied. We are monitoring the platform closely.
Posted May 12, 2023 - 00:41 UTC
Monitoring
We have applied the fix. We are validating and monitoring the changes.
Posted May 12, 2023 - 00:06 UTC
Update
Remediation for the issue is being applied.
Posted May 11, 2023 - 19:25 UTC
Update
We have a fix and are testing it.
Posted May 11, 2023 - 17:16 UTC
Identified
The issue has been identified and we are working on fixing it.
Posted May 11, 2023 - 16:42 UTC
Update
If you see this issue, please remove the health check. We are investigating.
Posted May 11, 2023 - 14:11 UTC
Investigating
We are currently investigating this problem.
Posted May 11, 2023 - 14:10 UTC
This incident affected: Services (Secure Mesh).