Thought I'd share an interesting issue I faced back when I was handling product support.
The interesting part was that the alerts were raised only at a particular time of day, between 3:00 AM and 3:30 AM. Things would return to normal after 3:30 AM.
After ruling out points 2-4 (data collection worked fine after 3:30 AM), I had a hunch that the issue had something to do with the network environment. Next, I checked the Event Log messages on the NMS server. There were a lot of events with the following message during that exact time window!
" DCOM was unable to communicate with the computer <Computer_ FQDN > using any of the configured protocols"
It was therefore obvious that the remote servers were not reachable from the NMS server, which explained the alerts. The Application logs confirmed this as well.
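If you want to do this kind of check yourself, here's a minimal sketch using Python with pywin32 (an assumption on my part; the actual check was done by eyeballing Event Viewer). It scans for DCOM event 10009, which typically lands in the System log, and keeps only entries that fall inside the problem window. The server name and time window are placeholders:

```python
import datetime
import win32evtlog  # pip install pywin32

SERVER = "localhost"   # the NMS server; use a hostname for a remote box
LOG_TYPE = "System"    # DCOM 10009 events are usually in the System log
WINDOW = (datetime.time(3, 0), datetime.time(3, 30))  # placeholder window

handle = win32evtlog.OpenEventLog(SERVER, LOG_TYPE)
flags = win32evtlog.EVENTLOG_BACKWARDS_READ | win32evtlog.EVENTLOG_SEQUENTIAL_READ

try:
    while True:
        events = win32evtlog.ReadEventLog(handle, flags, 0)
        if not events:
            break
        for ev in events:
            event_id = ev.EventID & 0xFFFF  # mask off severity/facility bits
            stamp = ev.TimeGenerated
            if event_id == 10009 and WINDOW[0] <= stamp.time() <= WINDOW[1]:
                # the first insertion string is the unreachable computer name
                target = ev.StringInserts[0] if ev.StringInserts else "?"
                print(f"{stamp}  DCOM could not reach {target}")
finally:
    win32evtlog.CloseEventLog(handle)
```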
Since no recent changes had been made to the servers, the cause was narrowed down to the network. Using NetFlow Analyzer, we noticed a surge in traffic during that period and traced the source to a backup server. On further analysis, the client found that a large backup job for an Exchange server was overloading the SAN and the switch. This caused significant packet drops on the switch's outbound queue, which is what made the servers unreachable and triggered the alerts.
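For the curious, the idea behind spotting the surge is just comparing traffic inside the problem window against each source's usual baseline. Here's a toy sketch of that (not how NetFlow Analyzer works internally), assuming a hypothetical CSV export of flow records with columns timestamp, src_ip, and bytes:

```python
import csv
from collections import defaultdict
from datetime import datetime, time

WINDOW = (time(3, 0), time(3, 30))  # the problem window

in_window = defaultdict(int)  # bytes per source inside 3:00-3:30 AM
baseline = defaultdict(int)   # bytes per source at all other times

with open("flows.csv", newline="") as f:  # hypothetical flow export
    for row in csv.DictReader(f):
        stamp = datetime.fromisoformat(row["timestamp"])
        bucket = in_window if WINDOW[0] <= stamp.time() <= WINDOW[1] else baseline
        bucket[row["src_ip"]] += int(row["bytes"])

# Rank sources by how much they dominate the window relative to their
# usual traffic; a nightly backup server should float to the top.
top = sorted(in_window, key=lambda s: in_window[s] / (baseline[s] + 1), reverse=True)
for src in top[:5]:
    print(f"{src}: {in_window[src]:,} bytes in window vs {baseline[src]:,} baseline")
```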
After discussing with the vendor, the client configured appropriate QoS policies on the network and, as expected, the alerts ceased :)
Hope you found this interesting. I'm sure you've come across many such interesting, and sometimes complex, scenarios too. Go ahead and share your story!