HTTP Load Balancers with HTTP/1.1 health checks were failing after release

Incident Report for F5

Postmortem

Incident Started: 2023-05-11 11:32 UTC

Resolution Started: 2023-05-11 11:55 UTC

Incident Resolved: 2023-05-11 23:02 UTC

Summary:

Following the release of a new software version by F5XC on May 11th, the servers on our regional edge (RE) were upgraded. After the upgrade, several customers utilising the HTTP Load Balancer with HTTP/1.1 health checks encountered health check failures. As a result, the affected HTTP Load Balancers no longer had any healthy origin endpoints available and returned 503 errors to client requests.

Root cause:

The issue arose from a bug in a recent change to the default configuration of HTTP health checks. Previously, HTTP/1.1 was used for health checks, but due to the bug, TLS sessions for health checks started including ALPN negotiation. As a result, the TLS sessions were negotiated for HTTP/2 while the HTTP requests sent within those sessions remained HTTP/1.1. This mismatch caused the health checks to fail, which in turn resulted in HTTP 503 responses for user requests. The problem affected only configurations with HTTP/1.1 health check settings. Because TCP and HTTPS health checks continued to work normally, the issue took longer to identify.
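
The mismatch described above can be illustrated with a minimal Python sketch. This is not the F5XC implementation; the function names and the ALPN_TO_HTTP mapping are hypothetical and exist only for illustration. It shows how a health check client might restrict the ALPN protocols it offers and verify that the negotiated protocol agrees with the HTTP version it is about to send.

```python
import ssl
from typing import List, Optional

# Hypothetical mapping from ALPN protocol IDs to the HTTP version the peer will expect.
ALPN_TO_HTTP = {"http/1.1": "HTTP/1.1", "h2": "HTTP/2"}


def make_health_check_context(offer_alpn: Optional[List[str]] = None) -> ssl.SSLContext:
    """Build a client TLS context, optionally offering a list of ALPN protocols."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    if offer_alpn:
        # Offering only "http/1.1" prevents the server from selecting "h2".
        ctx.set_alpn_protocols(offer_alpn)
    return ctx


def request_matches_alpn(negotiated: Optional[str], request_version: str) -> bool:
    """Return True if the HTTP version to be sent agrees with the negotiated ALPN protocol.

    When no protocol was negotiated (None), HTTP/1.1 is assumed, which is the
    conventional default for HTTP over TLS without ALPN.
    """
    if negotiated is None:
        return request_version == "HTTP/1.1"
    return ALPN_TO_HTTP.get(negotiated) == request_version
```

In a real client, the negotiated value would come from SSLSocket.selected_alpn_protocol() after the handshake. In this sketch, request_matches_alpn("h2", "HTTP/1.1") returns False, which corresponds to the failing case in this incident.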

Incident flow:

On May 11th at 08:00 UTC, we initiated the upgrade of our regional edge and completed it by 11:00 UTC, at which point customer traffic started to use the new software release.
At 11:32 UTC, F5 support received a ticket from a customer reporting health check failures. Since all locations had already been upgraded, rolling back to the previous release would have taken considerable time. Consequently, the focus shifted towards providing a workaround.
Over the next few hours, similar reports were received from multiple customers.
At 14:10 UTC, we provided an update to customers with a workaround to remove the health check. We observed that customers who removed the health check no longer experienced issues.
At 16:30 UTC, the engineering team identified the issue with ALPN on the health check, which affected only HTTP/1.1 communication towards the upstream server. Because TCP and HTTPS health checks kept working normally, identifying the issue took longer. A fixed image that reverted the ALPN change was prepared and tested.
From 19:20 UTC, we began applying the hotfix to the regional edge and confirmed that the system was back to normal by May 11th at 23:02 UTC.

Corrective measures:

We will enhance our testing infrastructure to include more comprehensive origin server simulations and protocol compatibility checks.
Additionally, we will implement enhanced monitoring and alerting mechanisms to proactively detect and respond to similar issues in the future.
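
As a rough illustration of the kind of alerting described above, the following Python sketch flags a rollout when the fraction of healthy origin endpoints drops sharply compared with the pre-upgrade baseline. It is only an assumption about how such a check might look; the function names and the 20% threshold are hypothetical and do not describe F5's monitoring.

```python
from typing import Sequence


def healthy_fraction(statuses: Sequence[bool]) -> float:
    """Fraction of endpoints whose health check passed (0.0 for an empty list)."""
    if not statuses:
        return 0.0
    return sum(1 for ok in statuses if ok) / len(statuses)


def rollout_regressed(before: Sequence[bool], after: Sequence[bool],
                      max_drop: float = 0.2) -> bool:
    """Return True if the healthy fraction fell by more than max_drop after a rollout.

    max_drop is a hypothetical threshold chosen for illustration.
    """
    return healthy_fraction(before) - healthy_fraction(after) > max_drop
```

For example, if all ten origins were healthy before the upgrade and none are healthy afterwards, rollout_regressed returns True and the rollout could be paused or escalated.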

Posted May 13, 2023 - 16:28 UTC

Resolved

The issue has been resolved and the fix has been applied. We are monitoring the platform closely.

Posted May 12, 2023 - 00:41 UTC

Monitoring

We have applied the fix. We are validating and monitoring the changes.

Posted May 12, 2023 - 00:06 UTC

Update

Remediation for the issue is being applied.

Posted May 11, 2023 - 19:25 UTC

Update

We have a fix and are testing it.

Posted May 11, 2023 - 17:16 UTC

Identified

The issue has been identified and we are working on a fix.

Posted May 11, 2023 - 16:42 UTC

Update

If you see this issue, please remove the health check as a workaround. We are investigating.

Posted May 11, 2023 - 14:11 UTC

Investigating

We are currently investigating this problem.

Posted May 11, 2023 - 14:10 UTC

This incident affected: Services (Secure Mesh).