Console Login Disruption

Incident Report for F5

Postmortem

Incident started at: September 12th, 9:52 AM PST
Workaround applied: September 12th, 10:20 AM PST

Total Outage Duration: 0h 28m

Summary:

On September 12th at 9:52 AM PST, we identified that some F5 Distributed Cloud customers were experiencing login issues. Our investigation showed that one of the micro-services responsible for decrypting client secret information was crashing due to excessive memory utilization.

Root cause:

The root cause was traced to a backend job that ran during the scheduled maintenance window to tune backward compatibility. The job caused excessive memory utilization in one of the micro-services responsible for decrypting client secret information and forced it to crash.

The incident was detected on September 12th at 9:52 AM PST, prompting an immediate investigation by the F5XC Team. By 10:20 AM PST, the team had increased the memory resources for the affected service, and the login service was restored.

Incident flow:

Incident Time	Description
September 12th, 9:52 AM PST	Our monitoring system detected excessive memory utilization on a couple of micro-services
September 12th, 10:05 AM PST	The login issue was observed across the platform, affecting more than one customer
September 12th, 10:13 AM PST	F5XC Team identified the root cause as excessive memory utilization and crashing of a micro-service responsible for decrypting the client secret
September 12th, 10:20 AM PST	F5XC Team increased the memory limits for this service across the platform and restored the service
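
The intervals in the timeline above can be checked with a few lines of Python. This is only an illustrative sketch: the year 2023 is taken from the post dates, the times are treated as naive local times exactly as reported, and all variable and function names are hypothetical.

```python
from datetime import datetime

# Timeline entries as reported (local time as stated; year assumed from the post date).
detected = datetime(2023, 9, 12, 9, 52)
login_issue_observed = datetime(2023, 9, 12, 10, 5)
root_cause_identified = datetime(2023, 9, 12, 10, 13)
restored = datetime(2023, 9, 12, 10, 20)


def minutes_between(start, end):
    """Return the whole number of minutes elapsed between two timestamps."""
    return int((end - start).total_seconds() // 60)


time_to_root_cause = minutes_between(detected, root_cause_identified)
total_outage = minutes_between(detected, restored)

print(f"Detection to root cause: {time_to_root_cause} minutes")
print(f"Detection to restoration: {total_outage} minutes")
```

The detection-to-restoration interval matches the reported total outage duration of 0h 28m.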

Corrective measures:

The identified backend job has been tuned and tested to confirm that it no longer causes excessive memory utilization.
A process change has been put in place to stress test these micro-services after bulk operations.
We are also making changes in our observability stack to ensure we have traces to allow issues to be pinpointed faster.
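
As a rough illustration of what a memory check around a bulk operation might look like, the sketch below measures the peak memory a job allocates and compares it against a limit. It is a minimal sketch under stated assumptions, not the tooling used by the F5XC Team: the limit value, the stand-in workload, and every function name are hypothetical, and Python's `tracemalloc` only tracks allocations made by the Python interpreter, not the full memory footprint of a process or container.

```python
import tracemalloc

# Hypothetical limit for illustration only; not a value from the incident.
MEMORY_LIMIT_BYTES = 64 * 1024 * 1024


def peak_memory_of(job, *args):
    """Run job(*args) and return (result, peak bytes traced during the call)."""
    tracemalloc.start()
    try:
        result = job(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def sample_bulk_job(record_count):
    """Stand-in workload: holds record_count small byte strings in memory."""
    return [bytes(64) for _ in range(record_count)]


def within_limit(job, *args, limit=MEMORY_LIMIT_BYTES):
    """Return True if the job's peak traced memory stays at or below the limit."""
    _, peak = peak_memory_of(job, *args)
    return peak <= limit


if __name__ == "__main__":
    ok = within_limit(sample_bulk_job, 100_000)
    print("within limit" if ok else "limit exceeded")
```

A check of this kind could run after a bulk job in a test environment, so that a regression in memory use shows up before the job reaches production.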

Posted Sep 16, 2023 - 23:20 UTC

Resolved

We noticed an issue with login to the tenant/console. We identified the issue and fixed it. The issue persisted from 9:52 AM PST to 10:20 AM PST. We will publish the detailed RCA/Postmortem here soon.

Posted Sep 12, 2023 - 16:30 UTC