Please find below a brief summary of our analysis of the recent outage and what went wrong:
There was a sudden spike in the number of user requests to the application, which translated into a proportionate spike in cache server hits. The cache server activity logs indicate that only the hits corresponding to user requests started to slow down and eventually stopped receiving responses.
Since hits from the other modules of the application were working normally, the monitoring system did not flag the cache server as down, which would otherwise have triggered a cache server failover. The slowdown and eventual unresponsiveness of the cache server cascaded to the application server, resulting in the application becoming unavailable to users.
Reporting servers, mail fetching, and other scheduled job servers were working normally during this outage.
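To illustrate the monitoring gap described above, here is a minimal sketch of a health check that evaluates cache traffic per request source instead of treating the cache server as healthy whenever any source is being served. It is not our actual monitoring code: the `SourceStats` shape, the source names, and both thresholds are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class SourceStats:
    """Recent cache-hit statistics for one request source (hypothetical shape)."""
    requests: int
    timeouts: int
    p95_latency_ms: float


# Hypothetical thresholds; real values would depend on the deployment.
MAX_TIMEOUT_RATIO = 0.05
MAX_P95_LATENCY_MS = 500.0


def unhealthy_sources(stats_by_source):
    """Return the sorted names of sources whose cache traffic looks degraded."""
    degraded = []
    for source, stats in stats_by_source.items():
        if stats.requests == 0:
            continue  # no traffic from this source, nothing to judge
        timeout_ratio = stats.timeouts / stats.requests
        if (timeout_ratio > MAX_TIMEOUT_RATIO
                or stats.p95_latency_ms > MAX_P95_LATENCY_MS):
            degraded.append(source)
    return sorted(degraded)


def should_fail_over(stats_by_source):
    """Flag the cache server if ANY source is degraded, not only if all are."""
    return len(unhealthy_sources(stats_by_source)) > 0
```

With a check of this kind, a situation where user-request hits are timing out while reporting and mail-fetching hits remain healthy would still mark the cache server as a failover candidate.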
We can share the RCA document with customers who have signed an NDA with us. Please send an email to our support team at firstname.lastname@example.org to initiate the NDA; once the NDA is signed, we will share the RCA document.