JMX notification listener bug

I am an Applications Manager production customer running build 8200 on Linux, although the issue described below also occurs on Windows.

Within a "JMX Applications" type monitor, I have a JMX notification listener defined within AdventNet. The intent of the listener is to provide asynchronous notifications of important events within a remote Java VM. To achieve this on a continuous basis, the notification listener must always be listening.

Under the covers a JMX notification listener is implemented via a thread on the Applications Manager server that opens a socket to the RMI server on the remote machine, typically with a timeout. These threads are named "ClientNotifForwarder-N" (where N is an incrementing integer) in the Applications Manager thread dump.
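
For reference, here is a minimal sketch of how a listener is registered through the standard javax.management API. I do not know how Applications Manager does this internally, so treat it as an illustration only. To keep it self-contained it runs against an in-process MBeanServer and listens to the MBean server delegate; the class name ListenerSketch, the attach helper and the "example:type=Timer" object name are made up for this example. Against a remote Java VM the MBeanServerConnection would instead come from JMXConnectorFactory.connect(...), and that remote case is the one serviced by a ClientNotifForwarder-N thread.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import javax.management.MBeanServer;
import javax.management.MBeanServerConnection;
import javax.management.MBeanServerDelegate;
import javax.management.MBeanServerFactory;
import javax.management.ObjectName;
import javax.management.timer.Timer;

class ListenerSketch {

    /**
     * Registers a listener on the given MBean and returns a list that
     * collects the type of every notification received. (Hypothetical helper.)
     */
    static List<String> attach(MBeanServerConnection conn, ObjectName source) throws Exception {
        List<String> seen = Collections.synchronizedList(new ArrayList<>());
        // No filter and no handback object: every notification is delivered.
        conn.addNotificationListener(source, (n, handback) -> seen.add(n.getType()), null, null);
        return seen;
    }

    public static void main(String[] args) throws Exception {
        // In-process stand-in for the remote VM's MBean server.
        MBeanServer mbs = MBeanServerFactory.newMBeanServer();

        // The delegate MBean emits a notification whenever an MBean is
        // registered or unregistered, which makes it a convenient source.
        List<String> seen = attach(mbs, MBeanServerDelegate.DELEGATE_NAME);

        // Registering any MBean triggers a registration notification.
        mbs.registerMBean(new Timer(), new ObjectName("example:type=Timer"));

        System.out.println("Notification types received: " + seen);
    }
}
```

The point is that once the connection behind the MBeanServerConnection goes away, nothing in this registration re-establishes it; something has to call the equivalent of attach again on a fresh connection.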

If a notification listener socket is interrupted (typically by the remote Java VM going down), the ClientNotifForwarder-N thread exits immediately. The next time that the JMX Applications monitor is polled in Applications Manager, it notices that the resource is unavailable and marks it as such. If the Java VM is then started on the remote box (after Applications Manager notices it was down via the poll just mentioned), then the next time the JMX Applications monitor is polled, it successfully fetches any JMX attributes and it also re-connects the Notification Listener (thus creating a new ClientNotifForwarder-N thread). The scenario just described works fine.

However, a significant bug occurs within Applications Manager in a slightly altered (but very common) scenario. If the remote Java VM is stopped, then the ClientNotifForwarder-N thread associated with the Applications Manager notification listener exits as described above. If the remote Java VM is then restarted before the next poll of the JMX Applications monitor, then Applications Manager (at the next poll) never sees that the resource was briefly down. Because of this, it appears that the logic within Applications Manager that would start a new ClientNotifForwarder-N thread (to replace the one that exited when the socket died) is not executed in this scenario. The net effect is that JMX notifications from this resource are ignored from then on.

In practice this is a very significant problem. Our polling interval is 5 minutes for the JMX Applications monitor. If the remote Java VM is stopped and started within that interval (a common occurrence, or at least one that should be handled), then all subsequent JMX notifications from that Java VM are lost because no ClientNotifForwarder-N thread is running for it.

I would appreciate a fix for this bug in the next service pack. This is preventing us from expanding our Applications Manager managed network from the current few dozen machines to many hundreds. We need to know that JMX Notifications (which we rely upon heavily) will make it to the Applications Manager server regardless of whether the Java VM has gone down briefly between monitoring polls.

My guess is that the fix is to ensure at every poll that the notification listener thread is still running for each configured notification listener (regardless of whether the resource was previously considered up or down). If it is not running, then start a new one. This still leaves a small window where JMX Notifications will be ignored, but that window is now bounded by the polling interval (5 minutes in our case). That's an acceptable tradeoff.
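
To make the suggestion concrete, here is a rough sketch of the check I have in mind. It is not based on any knowledge of the Applications Manager source: ListenerWatchdog, ListenerHandle, onPoll and FakeHandle are all hypothetical names, and the fake handle only simulates a forwarder thread dying. The idea is simply that the poll asks each configured listener whether it is still alive and reconnects it if not, without consulting the previous up/down state of the resource.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ListenerWatchdog {

    /** Hypothetical handle for one configured notification listener. */
    interface ListenerHandle {
        boolean isAlive();                  // is the forwarder thread still running?
        void reconnect() throws Exception;  // open a new connection and re-register
    }

    private final Map<String, ListenerHandle> listeners = new LinkedHashMap<>();

    void register(String resource, ListenerHandle handle) {
        listeners.put(resource, handle);
    }

    /**
     * Intended to run at every poll, whether the resource was previously
     * considered up or down. Returns how many listeners were restarted.
     */
    int onPoll() {
        int restarted = 0;
        for (Map.Entry<String, ListenerHandle> e : listeners.entrySet()) {
            ListenerHandle handle = e.getValue();
            if (handle.isAlive()) {
                continue;
            }
            try {
                handle.reconnect();
                restarted++;
            } catch (Exception ex) {
                // Remote VM is probably still down; try again at the next poll.
                System.err.println("Reconnect failed for " + e.getKey() + ": " + ex);
            }
        }
        return restarted;
    }

    /** Stand-in used only to exercise the sketch without a remote VM. */
    static class FakeHandle implements ListenerHandle {
        boolean alive = true;
        int reconnects = 0;

        public boolean isAlive() { return alive; }

        public void reconnect() {
            alive = true;
            reconnects++;
        }
    }

    public static void main(String[] args) {
        ListenerWatchdog watchdog = new ListenerWatchdog();
        FakeHandle handle = new FakeHandle();
        watchdog.register("remote-vm", handle);

        System.out.println("Restarted at first poll: " + watchdog.onPoll());

        // Simulate the remote VM bouncing between polls: the forwarder
        // thread has exited, but the resource looks "up" at the next poll.
        handle.alive = false;
        System.out.println("Restarted at second poll: " + watchdog.onPoll());
    }
}
```

In the quick-restart scenario described above, the second poll is where the listener would be re-established, even though the monitor never recorded the resource as down.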

Thanks!
Brett Peterson
VisionShare Inc.