How good is your threshold?

One question I am asked on almost every consulting engagement is what thresholds I recommend for the monitored resources. Most, if not all, users want to set optimal thresholds for monitoring server, network, and application performance. When I draw a blank, they look at me as if I were an alien: I come across either as a thorough dud who doesn't know his stuff or as a total rookie. The truth, however, is that there is no universal standard threshold for most monitored metrics. It all depends on the monitored environment and the resources within it.

Those of us responsible for deploying and implementing IT management solutions come across a lot of random content on the web that prescribes thresholds as part of best practices. We assume that these recommendations cannot go wrong and update our tools with the 'best' thresholds. The monitoring tools then start firing alerts based on those thresholds, and now you face a new set of challenges.

The thresholds you configured may not be appropriate for your environment at all. Let's say the server that runs Oracle DB has an expected memory usage of 90%, but, following a recommended standard, you set the threshold at 75%. You end up bombarded with unwanted alerts and are finally forced to reconfigure the threshold to a value that actually suits your environment.

It's important that we understand each metric and the reasoning behind its configuration. I have seen many users configure a threshold only for the overall disk utilization (all drives combined) in OpManager and AppManager. One bad day, they come back complaining that the tool didn't alert them when the F drive, where the SQL log files are written, was full. The problem is simple and straightforward: a combined figure can stay well below the threshold while a single drive fills up. These users overlooked configuring thresholds for individual drives and ended up in deep trouble.
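
To make the disk example concrete, here is a minimal sketch in Python. The drive sizes, the 80% threshold, and the function names are all made up for illustration; this is not how OpManager or AppManager compute anything internally. It only shows why a combined figure can hide a full drive.

```python
# Hypothetical sample data: (used GB, total GB) per drive.
drives = {"C": (40.0, 200.0), "D": (100.0, 500.0), "F": (49.5, 50.0)}

THRESHOLD = 80.0  # percent; an arbitrary value chosen for this illustration


def combined_pct(drives):
    """Utilization of all drives added together, in percent."""
    used = sum(u for u, _ in drives.values())
    total = sum(t for _, t in drives.values())
    return 100.0 * used / total


def drives_over(drives, threshold_pct):
    """Names of the individual drives at or above the threshold."""
    return sorted(
        name
        for name, (used, total) in drives.items()
        if 100.0 * used / total >= threshold_pct
    )


print(f"combined: {combined_pct(drives):.1f}%")
print("over threshold:", drives_over(drives, THRESHOLD))
```

With these numbers the combined utilization is only about a quarter, so a threshold on the combined metric stays silent, while the per-drive check flags F, which is 99% full.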

I'd therefore suggest that you stop searching for standards. Allow your tool to monitor performance for a few days, observe the behavior, and set baseline values based on that past performance. Then configure thresholds for the relevant metrics. After all, no one knows your network better than you!
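
One way to turn a few days of observations into a threshold can be sketched as follows. This is an assumption on my part, not a prescribed method: it takes the mean of the observed samples plus a few standard deviations, capped at 100%. The sample values, the function name, and the multiplier `k` are hypothetical, and a percentile-based rule would be an equally reasonable choice.

```python
import statistics


def baseline_threshold(samples, k=3.0, ceiling=100.0):
    """Suggest a threshold from observed samples (in percent).

    Uses mean + k * population standard deviation, capped at `ceiling`.
    This is one common heuristic, assumed here for illustration only.
    """
    mean = statistics.mean(samples)
    spread = statistics.pstdev(samples)
    return min(mean + k * spread, ceiling)


# Hypothetical memory usage (%) observed on a database server over a few days.
observed = [88.0, 90.0, 91.0, 89.0, 92.0, 90.0]

threshold = baseline_threshold(observed)
print(f"suggested threshold: {threshold:.1f}%")
```

For a server that normally sits around 90%, this suggests a threshold a few points above 90% rather than a generic 75%, so alerts fire only when usage departs from the server's own normal behavior.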

Thank you

Rameshkumar Ramachandran