Monitoring Intelligence using Anomaly Detection

Most of the clients I have consulted with take a traditional approach to performance monitoring. They rely on a threshold-based alerting mechanism and fail to improve their service. The problem with this method is that it is reactive: your IT team gets involved only when there is a failure or when the system reaches a critical state. This often allows resource performance to degrade gradually without anyone noticing.

For example, response time takes a hit when the load on the server increases over a period of time. This leads to customer frustration, and your IT team is still in the dark unless someone informs them. Anomaly detection is the key to addressing this performance problem, and it adds business intelligence to your system. An anomaly is any metric value that deviates from the defined baseline value.

Baseline values cannot be derived as soon as the application is implemented. Allow the tool to run for a certain period of time to collect performance metrics. I'd recommend 4-6 weeks, because performance patterns differ from one application to another. For instance, a helpdesk application may experience heavy load on weekdays, whereas an HR application may have an even load across all days of the week.

Run reports against the various metrics and study the patterns to arrive at a baseline value. Typically, an average of the performance metric could be the baseline. I will be happy to hear how you specify the baselines for resources on your network.
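As a rough illustration of the averaging idea, here is a minimal Python sketch that derives one baseline per weekday, so that an application with heavy weekday load gets a different baseline for weekends. The function name `baseline_by_weekday` and the sample numbers are my own assumptions for illustration, not output from any particular monitoring tool.

```python
from statistics import mean


def baseline_by_weekday(samples):
    """Return {weekday: average value} for (weekday, value) pairs.

    weekday follows Python's convention: 0 = Monday ... 6 = Sunday.
    """
    grouped = {}
    for weekday, value in samples:
        grouped.setdefault(weekday, []).append(value)
    return {day: mean(values) for day, values in grouped.items()}


# Hypothetical response times in ms, collected over the observation period.
history = [(0, 420.0), (0, 460.0), (0, 440.0), (5, 180.0), (5, 220.0)]
baselines = baseline_by_weekday(history)
print(baselines)
```

A plain overall average works just as well when the load is even across the week; grouping by weekday is only one possible refinement.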

You can now define anomaly profiles to compare the current set of data with the previously reported best data (the data from when the system worked optimally). The reference could be a fixed value or a moving one, such as the previous week's data.
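The difference between a fixed and a moving reference can be sketched in a few lines of Python. This is only an assumption about how such a profile might work: the 20% tolerance, the function names and the sample values are all hypothetical, and a real monitoring tool will have its own way of defining profiles.

```python
def deviation_pct(current, reference):
    """Percentage by which current differs from reference."""
    return (current - reference) / reference * 100.0


def is_anomaly(current, reference, tolerance_pct=20.0):
    """True when current deviates from reference by more than the tolerance."""
    return abs(deviation_pct(current, reference)) > tolerance_pct


def flag_against_fixed(current_values, reference, tolerance_pct=20.0):
    """Fixed profile: every value is compared with the same reference."""
    return [is_anomaly(v, reference, tolerance_pct) for v in current_values]


def flag_against_previous(current_values, previous_values, tolerance_pct=20.0):
    """Moving profile: each value is compared with the same day last week."""
    return [
        is_anomaly(cur, prev, tolerance_pct)
        for cur, prev in zip(current_values, previous_values)
    ]


# Hypothetical daily response times in ms, Monday to Sunday.
last_week = [400.0, 410.0, 390.0, 405.0, 395.0, 200.0, 210.0]
this_week = [410.0, 420.0, 400.0, 520.0, 400.0, 205.0, 330.0]

fixed_flags = flag_against_fixed(this_week, reference=400.0)
moving_flags = flag_against_previous(this_week, last_week)
print(fixed_flags)
print(moving_flags)
```

With these made-up numbers, the fixed profile flags the quiet Saturday as a deviation, while the moving profile ignores it and instead catches the Sunday that is much slower than the previous Sunday. Which behaviour you want depends on how regular the weekly pattern of the application is.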

When an anomaly is detected, you get an alert and can jump into action before any delay is noticed by the end users.

The above graph indicates the response time (in ms) of a webpage. You can notice performance degradation towards the month end. Setting up a threshold doesn't help identify the deviation, as the threshold value is set to >45000 ms. Anomaly detection, on the other hand, helps you analyze the deviation in the pattern and fix the root cause before the issue becomes critical.
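To make the contrast concrete, here is a small Python sketch that runs both checks over the same series. Only the 45000 ms threshold comes from the example above; the daily values, the 7-day baseline window and the 25% tolerance are made-up assumptions and are not read off the graph.

```python
from statistics import mean

THRESHOLD_MS = 45000  # static alert threshold from the example above


def threshold_alerts(values, threshold=THRESHOLD_MS):
    """Indices of the samples that exceed the static threshold."""
    return [i for i, v in enumerate(values) if v > threshold]


def baseline_alerts(values, baseline_window=7, tolerance_pct=25.0):
    """Indices of later samples that exceed the baseline by the tolerance.

    The baseline is the average of the first baseline_window samples.
    """
    baseline = mean(values[:baseline_window])
    limit = baseline * (1 + tolerance_pct / 100.0)
    return [
        i for i, v in enumerate(values)
        if i >= baseline_window and v > limit
    ]


# Hypothetical daily response times in ms, degrading towards the end.
daily_ms = [
    20000, 20500, 19800, 20200, 20100, 19900, 20300,
    21000, 22500, 24000, 26500, 29000, 33000, 38000,
]

print(threshold_alerts(daily_ms))  # no sample crosses 45000 ms
print(baseline_alerts(daily_ms))   # the degradation is flagged days earlier
```

In this made-up series the static threshold never fires, even though the response time has almost doubled, while the baseline comparison starts flagging samples several days before the end.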


Rameshkumar Ramachandran