Incident Started:2023-05-11 11:32 UTC
Resolution Started: 2023-05-11 11:55 UTC
Incident Resolved:2023-05-11 23:02 UTC
Summary:
Following the release of a new software version by F5XC on May 11th, the servers on our region edge (RE) were upgraded. Subsequently, several customers utilising the HTTP Load Balancer with HTTP/1.1 health checks encountered failures. Consequently, the HTTP Load Balancer no longer had any healthy origin endpoints available, leading to the return of 503 errors to client requests.
Root cause:
The issue arose from a bug in the recent change to the default configuration of HTTP health checks. Previously, HTTP/1.1 was used for health checks, but due to the bug, TLS sessions for health checks started including ALPN negotiations. As a result, the TLS sessions were negotiated using HTTP/2 while the HTTP requests within the sessions remained in HTTP/1.1. This inconsistency led to the failure of health checks and resulted in HTTP 503 responses for user requests. It's important to note that this problem specifically affected configurations with HTTP/1.1 health check settings. The smooth operation of TCP and HTTPS health checks caused a delay in identifying this issue.
Incident flow:
Over the next few hours, similar reports were received from multiple customers.
At 14:10 UTC, we provided an update to customers with a workaround to remove the health check. We observed that customers who removed the health check no longer experienced issues.
At 16:30 UTC, the engineering team identified the issue with ALPN on the health check affecting only HTTP/1.1 communication towards to upstream server. The efficient functioning of TCP and HTTPS health checks caused a delay in identifying the issue. A new fixed image which has reverted the ALPN was prepared and tested.
From 19:20 UTC, we began applying the hotfix to the regional edge and confirmed that the system was back to normal by May 11th at 23:02 UTC.
Corrective measures: