Console Login Disruption
Incident Report for F5 Distributed Cloud
Postmortem

Incident started at:  September 12th, 9:52 AM PST
Workaround applied:   September 12th, 10:20 AM PST

Total Outage Duration: 0h 28m

Summary:

On September 12th, 9:52 AM PST, we identified that some F5 Distributed Cloud customers were facing login issues. We started investigating and observed that one of the micro-services responsible for decrypting the client secret information was crashing due to excessive memory utilization.

Root cause:

The root cause was traced to a backend job, run during the scheduled maintenance window, that tunes backward compatibility. The job caused excessive memory utilization in one of the micro-services responsible for decrypting the client secret information and forced it to crash.

The incident was detected on September 12th, 9:52 AM PST, prompting an immediate investigation by the F5XC Team. By 10:20 AM PST, the team had applied corrective measures by increasing the memory allocated to the affected service, and the login service was restored.

Incident flow:

Incident Time                  Description
September 12th, 9:52 AM PST    Our monitoring system detected excessive memory utilization on a couple of micro-services.
September 12th, 10:05 AM PST   The login issue was observed across the platform, affecting more than one customer.
September 12th, 10:13 AM PST   The F5XC Team identified the root cause as excessive memory utilization and crashing of a micro-service responsible for decrypting the client secret.
September 12th, 10:20 AM PST   The F5XC Team increased the memory limits for this service across the platform and was able to restore the service.

Corrective measures:

  • The identified backend job has been tuned and tested to confirm that it no longer causes excessive memory utilization.
  • A process change has been put in place to perform stress tests on these micro-services after bulk operations.
  • We are also making changes in our observability stack to ensure we have traces to allow issues to be pinpointed faster.
Posted Sep 16, 2023 - 23:20 UTC

Resolved
We noticed an issue with login to the tenant/console. We were able to identify the issue and fix it. The issue persisted from 9:52 AM PST to 10:20 AM PST. We will publish the detailed RCA/Postmortem here soon.
Posted Sep 12, 2023 - 16:30 UTC