F5 Distributed Cloud - Service Degradation - Issues with Endpoint Configurations - INC-20250630-356

Incident Report for F5

Postmortem

F5® Distributed Cloud Services – Endpoint Configurations

Root Cause Analysis - Multiple Distributed Cloud Customers experiencing traffic processing issues

Report Date: 2025-07-02
Incident Date: 2025-06-30

EVENT SUMMARY

On 2025-06-30 at 06:08 UTC, the F5 Distributed Cloud Support team started responding to multiple monitoring alerts and customer reports regarding difficulties accessing various websites and applications. Upon investigation, the team identified issues with endpoint configurations within the Regional Edge (RE) network, which resulted in a disruption of traffic processing functionality. Customers were provided with a manual mitigation via K000152246 as an immediate solution while we continued troubleshooting.

Investigation revealed that the issue arose after scheduled maintenance to upgrade the F5 Distributed Cloud Global Controller software to the latest version. During troubleshooting, the F5 Distributed Cloud team initiated a reboot of the Configuration Propagator service. Following the reboot, synchronization for the affected tenants resumed successfully, fully restoring service for all customers by 19:10 UTC on June 30, 2025.

F5 will maintain ongoing monitoring of the environment to ensure the incident is fully resolved and to promptly identify and address any potential impacts or unrelated issues that may arise.

WHAT HAPPENED?

INCIDENT DETAILS

Start time of Service Event: 2025-06-30 06:08 UTC
Mitigation available: 2025-06-30 09:12 UTC
Conclusion of Service Event: 2025-06-30 19:10 UTC
Event Mitigation time: 3 hours 4 minutes
Event duration: 13 hours 2 minutes
Impact: Several Distributed Cloud customers experienced service disruptions affecting their production environments deployed on the Distributed Cloud platform.
Root cause: Planned maintenance caused service degradation and customer impact.
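The mitigation time and event duration above follow directly from the three timestamps. As a quick check, a minimal Python sketch (the variable and function names are illustrative, not part of any F5 tooling):

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M"

# Timestamps (UTC) taken from the incident details above.
start = datetime.strptime("2025-06-30 06:08", FMT)
mitigation_available = datetime.strptime("2025-06-30 09:12", FMT)
resolved = datetime.strptime("2025-06-30 19:10", FMT)

time_to_mitigation = mitigation_available - start
event_duration = resolved - start


def fmt(delta: timedelta) -> str:
    """Render a timedelta as 'H hours M minutes'."""
    total_minutes = int(delta.total_seconds()) // 60
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} hours {minutes} minutes"


print(fmt(time_to_mitigation))  # 3 hours 4 minutes
print(fmt(event_duration))      # 13 hours 2 minutes
```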

TIMELINE OF EVENTS

Date Time (UTC) Action
2025-06-30 04:00 Start of planned upgrades to Global Controller
2025-06-30 06:08 Start time of Service Incident
2025-06-30 07:00 F5 worked on the mitigation strategy
2025-06-30 07:30 F5 worked on reproducing the issue in staging environment
2025-06-30 09:12 Customer self-mitigation steps were made available via document created on AskF5
2025-06-30 09:15 F5 worked on creating and testing an automation script
2025-06-30 09:30 While service was down, F5 worked on restoring service for our customers using both automated and manual workarounds on the console.
2025-06-30 12:15 F5 continued to recover tenant configurations for our customers while in parallel worked on resolving the issue
2025-06-30 18:50 Configuration Propagator Service was restarted
2025-06-30 19:10 Resolution of Service Event

IS THE SERVICE EVENT FULLY RESOLVED?

Yes, Distributed Cloud Traffic processing has been restored for all impacted customers. All Distributed Cloud Console core services are operational. The F5 Distributed Cloud team continues to monitor the situation to ensure service stability.

ROOT CAUSE

Detailed Analysis:

  1. Root Cause:

    ·       Global Controller (GC) Configuration Management Service Memory Pressure: The Configuration Management Service experienced memory contention, leading to degraded performance and a failure to provide the complete set of configuration data to the Configuration Propagator Service. The memory contention was traced to an upgraded software dependency within the Configuration Management Service. F5 performs rigorous testing in its test and staging environments, and this memory contention issue was not observed within those environments during testing.

    ·       Partial synchronization by Configuration Propagator Service: The service proceeded with propagation despite receiving an incomplete vhost list.

  2. Impact:

    ·       Partial synchronization during maintenance caused the service to be instantiated with an incomplete list of vhosts, thereby causing a temporary service disruption.
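The failure mode described above, propagating an incomplete vhost list, can be illustrated with a simple completeness check. The following is a sketch only, under the assumption that the sender declares an expected vhost count and digest alongside the list; the `Snapshot` type, its fields, and the hostnames are hypothetical and do not describe F5's internal interfaces.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical payload received by a propagator from a config service."""
    vhosts: tuple[str, ...]
    expected_count: int    # assumption: sender declares how many vhosts exist
    expected_digest: str   # assumption: digest over the sorted vhost names


def digest(vhosts) -> str:
    """Order-independent SHA-256 digest over vhost names."""
    h = hashlib.sha256()
    for name in sorted(vhosts):
        h.update(name.encode("utf-8") + b"\n")
    return h.hexdigest()


def is_complete(snapshot: Snapshot) -> bool:
    """True only if the received list matches what the sender declared."""
    return (
        len(snapshot.vhosts) == snapshot.expected_count
        and digest(snapshot.vhosts) == snapshot.expected_digest
    )


full = ("app.example.com", "api.example.com", "shop.example.com")
good = Snapshot(full, expected_count=3, expected_digest=digest(full))
partial = Snapshot(full[:2], expected_count=3, expected_digest=digest(full))

print(is_complete(good))     # True
print(is_complete(partial))  # False
```

A propagator that refuses to act on a snapshot failing such a check would keep serving its previous configuration rather than instantiating a service with missing vhosts.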

RESOLUTION AND NEXT STEPS

  1. Mitigation: F5 provided both manual and automated script-based methods for resolving the issue experienced by customers. These solutions were made available through the Knowledge Base article K000152246, ensuring customers could easily access and implement the mitigation steps. The manual method instructed customers on how to handle specific configuration issues step-by-step, while the automated script offered a more streamlined and efficient way to address the problem. This approach minimized disruptions and provided flexibility to customers based on their technical expertise and operational needs.
  2. Service Migration: To address resource constraints and alleviate memory pressure, the Global Controller Configuration Management service was migrated to a higher-capacity node. By doing so, F5 ensured that this critical control-plane service had sufficient resources to process configuration changes effectively and maintain stability. This measure improved the service's responsiveness, reduced the likelihood of crashes, and will help prevent similar issues in the future.
  3. Monitoring: F5 implemented proactive monitoring mechanisms to detect and respond to memory pressure incidents in real-time. These monitoring tools were configured to track resource utilization, performance metrics, and potential anomalies in the memory usage of critical services. Alerts were set to trigger when thresholds were approached, allowing for immediate remediation before issues impacted the system's functionality. This monitoring infrastructure reduces downtime risk and enables faster detection and resolution of any resource-related incidents in the Global Controller environment.
  4. Configuration Propagator Service Reboot: The Configuration Propagator Service was rebooted to resolve inconsistencies stemming from partial synchronization during the issue. The reboot initiated a fresh configuration sync, ensuring that the complete dataset, including all virtual hosts (vhosts), was properly retrieved and propagated across the system. This restored the correct system state, resolving discrepancies caused by missing configurations during the memory pressure incident. Rebooting the service not only resolved the immediate issue but also ensured proper synchronization going forward.
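The threshold-based alerting described in the Monitoring item above can be sketched as a small classifier over memory samples. This is an illustration under stated assumptions: the 80% and 90% thresholds, the `MemorySample` type, and the service name are hypothetical and are not F5's actual monitoring configuration.

```python
from dataclasses import dataclass

# Assumed thresholds, chosen for illustration only.
WARN_RATIO = 0.80
CRIT_RATIO = 0.90


@dataclass(frozen=True)
class MemorySample:
    """Hypothetical memory reading for one service."""
    service: str
    used_bytes: int
    limit_bytes: int


def classify(sample: MemorySample) -> str:
    """Return 'ok', 'warning', or 'critical' based on memory utilization."""
    ratio = sample.used_bytes / sample.limit_bytes
    if ratio >= CRIT_RATIO:
        return "critical"
    if ratio >= WARN_RATIO:
        return "warning"
    return "ok"


GIB = 1024 ** 3
samples = [
    MemorySample("config-management", 5 * GIB, 10 * GIB),
    MemorySample("config-management", 85 * GIB // 10, 10 * GIB),
    MemorySample("config-management", 95 * GIB // 10, 10 * GIB),
]
for s in samples:
    print(s.service, classify(s))
```

Alerting on the "warning" level, before the limit is reached, is what allows remediation ahead of customer impact.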

 

NEXT STEPS: FUTURE EVENT PREVENTION

We will be taking several measures to prevent this service event from recurring and to ensure that we are better prepared to react to and recover from similar scenarios more quickly. 

  1. Preventive Measures:
    ·       F5 has identified the source of the memory contention in the Global Controller Configuration Management service and will be addressing it via service change maintenance.

    ·       Adjust resource allocation for Global Controller Configuration Management service during maintenance to prevent memory pressure or performance degradation.

    ·       Conduct stress testing to simulate high memory scenarios and optimize Global Controller Configuration Management service’s handling of requests under load.

  2. Monitoring Enhancements:
    ·       Implement closer monitoring of resource utilization for critical components like Global Controller Configuration Management service, especially during planned maintenance events, to proactively address issues.

  3. Fail-Safe Mechanisms:
    ·       Improve Configuration Management service’s retry logic to ensure it verifies the completeness of configurations before proceeding with propagation.

    ·       Introduce safeguards against incomplete configuration synchronization scenarios.
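One way to picture the fail-safe mechanisms above is a retry loop that only accepts a verified-complete configuration and otherwise keeps the last known good one. This is a sketch under assumptions, not F5's implementation: the `fetch` and `is_complete` callables, the attempt count, and the backoff delays are all hypothetical.

```python
import time
from typing import Callable, Optional, Sequence


def fetch_complete(
    fetch: Callable[[], Sequence[str]],
    is_complete: Callable[[Sequence[str]], bool],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[Sequence[str]]:
    """Retry with exponential backoff until a complete snapshot arrives."""
    for attempt in range(attempts):
        snapshot = fetch()
        if is_complete(snapshot):
            return snapshot
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return None  # never obtained a complete snapshot


def propagate(fetch, is_complete, last_known_good, **kwargs):
    """Use a fresh snapshot only if verified; else keep the previous one."""
    fresh = fetch_complete(fetch, is_complete, **kwargs)
    return fresh if fresh is not None else last_known_good


# Simulated source that returns partial data twice, then the full list.
responses = iter([["a"], ["a", "b"], ["a", "b", "c"]])
result = propagate(
    fetch=lambda: next(responses),
    is_complete=lambda vhosts: len(vhosts) == 3,
    last_known_good=["a", "b", "c"],
    sleep=lambda _seconds: None,  # skip real waiting in this demo
)
print(result)  # ['a', 'b', 'c']
```

The design choice here is that an incomplete result is never propagated: after the retries are exhausted, the caller falls back to the previous state instead of a partial one.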

CLOSING

F5® understands how important reliability of the Distributed Cloud Platform is for customers, and specifically how critical the F5 Distributed Cloud Global Controller is to your services. F5 will ensure the recommended changes in this document are incorporated into our operational Methods of Procedure (MoP) moving forward. We are grateful you have chosen to partner with F5® for critical service delivery and are committed to evolving our platform and tooling to better anticipate and mitigate disruptions to Distributed Cloud Platform services.

APPENDICES

F5 Glossary 
https://www.f5.com/services/resources/glossary 

Posted Jul 03, 2025 - 01:59 UTC

Resolved

The F5 Distributed Cloud team has successfully concluded the monitoring stage, and the Distributed Cloud Console core services, along with traffic processing, have remained in a steady state. We appreciate your patience and understanding during this time. Our team is committed to continuously improving our services for you. Thanks again for your trust in F5.
Posted Jul 03, 2025 - 01:06 UTC

Update

Our service remains fully operational, and we are actively monitoring the environment to ensure its ongoing reliability. For any concerns or issues, please contact F5 Support for assistance.

Thank you for your patience and understanding.
F5 Distributed Cloud
Posted Jul 02, 2025 - 19:00 UTC

Update

Our service remains stable and we will continue to monitor the environment to ensure service stability. If you experience any issue, please contact F5 Support for further assistance. Thank you.
Posted Jul 02, 2025 - 07:03 UTC

Update

Dear customer,
F5 knows that this event impacted the operational and reputational commitments made for services reliant on our platform. We are committed to correcting the controllable elements to avoid similar lapses moving forward. F5 strives to provide the highest quality of service for our customers, and we are thankful for the opportunity to partner with you to improve the services you have entrusted to us to operate on your behalf. As part of service continuity, the F5 Distributed Cloud team will extend our proactive monitoring to ensure all services remain fully operational. In the meantime, we would like to share important information associated with the preliminary findings of the incident.


WHAT HAPPENED?

EVENT DESCRIPTION
On 2025-06-30 at 06:08 UTC, the F5 Distributed Cloud Support team started responding to multiple monitoring alerts and customer reports regarding difficulties accessing various websites and applications. Upon investigation, the team identified issues with endpoint configurations within the Regional Edge (RE) network, which resulted in a disruption of traffic processing functionality. Customers were provided with a manual mitigation (https://my.f5.com/manage/s/article/K000152246) as an immediate solution while we continued troubleshooting.

Preliminary investigation revealed that the issue arose after scheduled maintenance to upgrade the F5 Distributed Cloud Global Controller software to the latest version.

During troubleshooting efforts, the F5 Distributed Cloud team initiated a reboot of the configuration propagator service. Following the reboot, synchronization for the affected tenants resumed successfully, restoring services by 19:10 UTC on June 30, 2025. F5 will maintain ongoing monitoring of the environment to ensure the incident is fully resolved and to promptly identify and address any potential impacts or unrelated issues that may arise.


IS THE SERVICE EVENT FULLY RESOLVED?
Yes, Distributed Cloud Traffic processing has been restored for all impacted customers. All Distributed Cloud Console core services are operational. The F5 Distributed Cloud team continues to monitor the situation to ensure service stability.

WHAT HAPPENS NEXT?

ONGOING INVESTIGATION
Preliminary investigation indicates the issue was triggered after scheduled maintenance to upgrade the F5 Distributed Cloud Global Controller to the latest software version (https://www.f5cloudstatus.com/incidents/97pw5jnc6w7s). We will provide a detailed Root Cause Analysis (RCA) once the root cause has been identified.

For now, updates on the monitoring outcome will be shared twice a day to provide relevant information on the stability of our XC Console services. Feel free to reach out to our XC support team if you’re facing any service issues or have questions. We’re here to help and support you.

We appreciate your patience and understanding during this time. Your satisfaction is our top priority, and we are dedicated to continuously improving our services for you. We are implementing additional measures to ensure this does not happen again.

Best regards,
The F5 Distributed Cloud Team
Posted Jul 01, 2025 - 21:25 UTC

Update

Our services remain operational as we continue to monitor the situation in order to ensure ongoing service stability. If you experience any issue, please contact F5 Support for further assistance.
Posted Jul 01, 2025 - 05:23 UTC

Monitoring

Dear customers:

The F5 Distributed Cloud support team has identified and fixed the issue affecting XC VIPs.

We will maintain proactive monitoring to ensure ongoing service stability. If you continue to experience any issues, please contact F5 Support for further assistance.
Posted Jun 30, 2025 - 20:45 UTC

Identified

Dear customers:

We strongly recommend implementing the fix outlined in the following article: https://my.f5.com/manage/s/article/K000152246.
Updating the description on the load balancer, as described in the article, will trigger the config synchronization, effectively addressing the issue.
Please note that this incident may be impacting other customers as well, which could result in delays in support responses from our SOC team. To help us prioritize urgent cases, we kindly ask that you reach out to support only if the recommended fix does not resolve the problem.

For customers seeking further reassurance about routing traffic back, we can confirm that other customers who have applied the fix have experienced stable outcomes. As an additional precaution, we recommend waiting until service is declared fully restored before routing traffic back through the load balancer, to ensure stability for your operations.
Posted Jun 30, 2025 - 17:55 UTC

Update

Please note that the workaround recommended in our previous updates is designed to trigger a configuration sync, which will help in resolving the issue. Kindly proceed with implementing the change as outlined in the link below:

https://my.f5.com/manage/s/article/K000152246

Updating the description on the load balancer will initiate the synchronization of the configuration database with our central database.

Please contact support if you require any assistance in implementing the workaround.
Posted Jun 30, 2025 - 14:07 UTC

Update

Please note that the workaround has been reinstated and customers can proceed with applying the change as per the link below:
https://my.f5.com/manage/s/article/K000152246

Please contact support if you require any assistance to implement the workaround.

We are committed to keeping you informed with updates as they become available. Thank you for your patience and understanding.
Posted Jun 30, 2025 - 11:41 UTC

Update

Please note that we encountered a temporary issue, and the current workaround is not performing as expected. We continue to work tirelessly to resolve the issue and will share further updates once the workaround has been reinstated.
Posted Jun 30, 2025 - 10:00 UTC

Update

As mentioned in our last update, customers can perform a no-impact configuration change on the load balancer and the origin object to restore the services as a workaround.

Please refer to this link below for more information regarding the implementation of the workaround:
https://my.f5.com/manage/s/article/K000152246
Posted Jun 30, 2025 - 09:28 UTC

Update

The F5 Distributed Cloud support team is actively working to resolve this issue. In the meantime, as a workaround, customers can perform a no-impact configuration change on the load balancer and the origin object to restore the services.

We are committed to keeping you informed with updates as they become available. Thank you for your patience and understanding.
Posted Jun 30, 2025 - 08:38 UTC

Update

The F5 Distributed Cloud support team has identified an issue with some of the Distributed Cloud endpoint configurations. Our team is actively working to resolve this incident, and we are committed to keeping you informed with updates as they become available. Thank you for your patience and understanding.
Posted Jun 30, 2025 - 07:53 UTC

Investigating

This advisory is to inform you that we are currently investigating reports of service degradation affecting some of our services.

We understand this may be affecting your operations and we are committed to providing transparent and timely updates as more information becomes available. While we gather more information, we recommend monitoring the progress on this site (https://www.f5cloudstatus.com/) for the latest update.

Investigation Status:
• We have identified potential service impact at 07:00 UTC.
• Our incident response team has been fully mobilized.
• Initial investigation and impact assessment efforts are underway.
Next Steps:
• A detailed incident notification will be provided within 30 minutes.
• Our teams are working to determine the root cause of the incident.
• We will share mitigation steps as soon as they become available.

We appreciate your patience as we work to resolve this situation. If you are experiencing a critical business impact, please contact our support team through your established channels.

Thank you for your continued support and trust in F5.
Posted Jun 30, 2025 - 07:18 UTC
This incident affected: Services (HTTP Load Balancer).