Most of the clients I have consulted with, take a traditional approach towards performance monitoring. They are focused towards threshold based alerting mechanism and fail to improve their service. The problem with this method is its reactive approach. Your IT gets involved only when there is a failure or when the system reaches a critical state. This often results in a gradual degradation of resource performance.
For example, the response time takes a hit when the load of the server increases over a period of time. It leads to customer frustration and your IT is still in the dark, unless informed. Anomaly detection is the key to address this performance problem and it adds business intelligence to your system. Anomaly is any metric that deviates from the defined baseline value.
Baseline values are not derived as soon as the application is implemented. Allow the tool to run for a certain period of time to collect performance metric s. I'd recommend 4-6 weeks because, performance differs from one application to another. For intance , a Helpdesk application may be experiencing heavy load during weekdays whereas an HR application will have an even load over all days in a week.
Run report against the various metrics and understand the pattern to arrive at a baseline value. Typically, an average of the performance metric could be the baseline. Will be happy to hear how yo u specify the baselines for resources on your network.
You can now define anomaly profiles to compare current set of data with the previously reported best data (the one where the system worked optimally). It could be a fixed one or a moving value like previous week 's data.
When anomaly is detected get an alert and jump into action before any delay is noticed by the end consumers/users.
The above graph indicates the response time(in ms) of a webpage . You can notice performance degradation towards the month end. Setting up a thresholds doesn't help identify the deviation as the value is set to >45000ms. On the other hand, anomaly detection will help analyze the deviation in pattern and fix the root cause before the issue becomes critical.