RCA and Error Message in Alert when Monitor is down
I have wanted these features for quite a while now. We are a consulting firm, and we help many customers install and set up Applications Manager to monitor the availability and health of their applications using custom monitors and scripts. These scripts call automation frameworks to run tests against the customers' apps. When an error occurs, we set the script_availability and script_message attributes in the result file to trigger downtime and an alert (this is necessary so that an SLA violation is triggered as well). This has worked very well for several years, and we have many happy customers.
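To make the current setup concrete, here is a minimal sketch of how one of our wrapper scripts could write the result file. It is illustrative only: the function name, the file path, and the example message are made up, and it assumes a key=value result file with "=" as the delimiter, where 0 for script_availability means up and any non-zero value means down. Please check this against the output settings of your own script monitor.

```python
from pathlib import Path


def write_result_file(path, available, message, delimiter="="):
    """Write a minimal script-monitor result file (hypothetical helper).

    Assumptions: one key/value pair per line, and script_availability
    is 0 when the test passed and non-zero when it failed.
    """
    # Keep the message on a single line so it stays one attribute.
    clean_message = " ".join(str(message).split())
    lines = [
        f"script_availability{delimiter}{0 if available else 1}",
        f"script_message{delimiter}{clean_message}",
    ]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines
```

A failing test run would then call something like write_result_file("result.txt", False, "Login page timed out"), and the monitor picks up the file on its next poll.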
Our problem, however, is fault analysis. When a monitor goes down, the only thing logged in the alert is that the monitor is not available. I need the alert to include the script_message provided, as well as a complete RCA if a result file was available (which would show which additional attributes are faulty or unhealthy). Furthermore, I would like to customize the alert logged when the monitor is back up and health is clear, so that it is a simpler alert saying just "Monitor is up, no anomalies detected" rather than the entire RCA message.
Once this is possible, I would also like reporting on script_message values: statistics on how often each error occurs, and the ability to filter on a specific error message and show its occurrences (or how many times it occurred per hour of the day, per day, or per week). This would be similar to the response time reports, where I can see whether there are differences depending on hour of day, day of week, etc. Also, in the availability view for the last 24 hours or last 30 days, hovering over a period of unavailability should show at least the first error message that caused the outage.
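As a rough illustration of the kind of statistics I mean, the sketch below counts error messages overall, per hour of day, and per day of week. This is not an existing Applications Manager report; the function name and the input format (pairs of ISO timestamp and script_message) are assumptions made for the example.

```python
from collections import Counter
from datetime import datetime


def summarize_errors(events):
    """Count error messages overall, per hour of day, and per weekday.

    events: iterable of (iso_timestamp, script_message) pairs
    (a hypothetical export format, not an Applications Manager API).
    """
    by_message = Counter()
    by_hour = Counter()
    by_weekday = Counter()
    for timestamp, message in events:
        when = datetime.fromisoformat(timestamp)
        by_message[message] += 1
        by_hour[(message, when.hour)] += 1
        # weekday() returns 0 for Monday through 6 for Sunday.
        by_weekday[(message, when.weekday())] += 1
    return by_message, by_hour, by_weekday
```

With counts like these, a report could show for example that a given error mostly occurs around 09:00 on Mondays, the same way the response time reports break down by hour and weekday.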
My last feature request is to be able to supply a link with the error, i.e. to set script_link together with script_message. The link would be a URL to more information about the error, pointing back to the specific test automation framework, which in our case can provide snapshots of the errors and additional log files with stack traces etc. If this link is included in the alert/alarm, it would help a lot during fault analysis. Today I have to get the specific time of the error from the Applications Manager downtime history first and then search for the corresponding log files manually.
I realize this is a lot, but I think it would give Applications Manager a real edge over similar tools. Even if just some of these features are implemented, it would help us a lot :)