Looking Beyond the Obvious - Troubleshooting RPC Errors

Thought I'd share an interesting issue I faced back when I was handling product support.

 

The client I was attending to reported that the monitoring tool was generating RPC errors for most of the monitored servers. This was the error message he received:

 

Error # The RPC server is unavailable.

 

The interesting fact was that the alerts were raised only at a particular time of the day, between 3:00 AM and 3:30 AM. Things would come back to normal after 3:30 AM.

 

Here are the typical troubleshooting steps we'd follow when analyzing RPC errors:

 

1. Whether or not the device is reachable.

2. Host name resolution (forward and reverse nslookup should return matching results).

3. Firewall configuration.

4. WMI security settings rolled out in the network.
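The first two checks in the list above are easy to script. Here is a minimal sketch, assuming Python is available on the monitoring server. The function names are my own for illustration and are not part of any monitoring tool. Port 135 is the standard RPC endpoint mapper port; the name comparison assumes you pass the host name in the same form (FQDN) that reverse DNS returns.

```python
import socket


def forward_reverse_match(hostname,
                          forward=socket.gethostbyname,
                          reverse=socket.gethostbyaddr):
    """Return True if the forward lookup of hostname and the reverse
    lookup of the resulting IP point back to the same name.

    The resolver functions are parameters so the check can be tried
    without touching real DNS.
    """
    ip = forward(hostname)
    name, aliases, _addresses = reverse(ip)
    candidates = {name.lower()} | {alias.lower() for alias in aliases}
    return hostname.lower() in candidates


def rpc_port_open(host, port=135, timeout=3.0):
    """Return True if a TCP connection to the RPC endpoint mapper
    (port 135 by default) succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Running both checks on a schedule, during the failure window as well as outside it, would show whether the result changes with the time of day.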

 

Since data collection worked properly after 3:30 AM, I ruled out points 2-4 and had a hunch that the issue could have something to do with the network environment. I checked the Event Log messages on the NMS server next. There were a lot of events with the following message during that exact time window!

 

DCOM was unable to communicate with the computer <Computer_FQDN> using any of the configured protocols.

 

It was therefore obvious that the remote servers were not reachable from the NMS server during that window. This was the reason for the alerts. The Application logs confirmed the same.
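If you want to confirm that such errors really cluster in one time window rather than eyeballing the log, you can bucket the event timestamps by time of day. This is only a sketch: the timestamps below are made-up sample values, and how you export them from the Event Log is up to you.

```python
from collections import Counter
from datetime import datetime


def bucket_by_half_hour(timestamps):
    """Count events per half-hour slot of the day, ignoring the date."""
    counts = Counter()
    for ts in timestamps:
        slot_minute = 0 if ts.minute < 30 else 30
        counts[f"{ts.hour:02d}:{slot_minute:02d}"] += 1
    return counts


# Hypothetical event times, for illustration only.
events = [
    datetime(2024, 1, 10, 3, 2),
    datetime(2024, 1, 10, 3, 17),
    datetime(2024, 1, 11, 3, 5),
    datetime(2024, 1, 11, 14, 40),
]

counts = bucket_by_half_hour(events)
print(counts.most_common(1))  # [('03:00', 3)]
```

A slot that dominates across several days points to something scheduled, not a random fault.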

 

As no recent changes were made to the servers, the cause was narrowed down to the network. Using NetFlow Analyzer, we noticed a surge in traffic during that period and tracked the source to a backup server. Upon analysis, the client identified that a large backup job of an Exchange server was overloading the SAN and the switch. This resulted in significant packet drops on the switch's outbound queue, which in turn caused the RPC errors.
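The idea behind tracking down the source is a plain "top talkers" aggregation: sum the bytes per source address within the suspect window and sort. The sketch below illustrates that idea over exported flow records. It is not how NetFlow Analyzer works internally, and the record layout (timestamp, source, destination, bytes) and all addresses are assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime


def top_talkers(flows, start, end, n=3):
    """Sum bytes per source IP for flows with start <= timestamp < end
    and return the n largest senders as (source, total_bytes) pairs."""
    totals = defaultdict(int)
    for ts, src, _dst, nbytes in flows:
        if start <= ts < end:
            totals[src] += nbytes
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Hypothetical flow records: (timestamp, source IP, destination IP, bytes).
flows = [
    (datetime(2024, 1, 10, 3, 1), "10.0.0.20", "10.0.0.50", 900_000_000),
    (datetime(2024, 1, 10, 3, 10), "10.0.0.20", "10.0.0.50", 700_000_000),
    (datetime(2024, 1, 10, 3, 12), "10.0.0.31", "10.0.0.5", 2_000_000),
    (datetime(2024, 1, 10, 9, 0), "10.0.0.31", "10.0.0.5", 50_000_000),
]

window_start = datetime(2024, 1, 10, 3, 0)
window_end = datetime(2024, 1, 10, 3, 30)
print(top_talkers(flows, window_start, window_end, n=1))
# [('10.0.0.20', 1600000000)]
```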

 

After discussing with the vendor, the client configured appropriate QoS on his network and, as expected, the alerts ceased :)

 

Hope you found this interesting. I'm sure you have come across many such interesting and sometimes complex scenarios. Go on and share your story too!

 

 

Regards,

Rameshkumar Ramachandran