HeartBeat not reachable !!


Administrators would agree that network monitoring tools are the living embodiment of the “heart” in a service provider ecosystem. This bionic heart monitors and manages the servers and applications of customer networks without interruption and keeps them afloat.



 

Thought I'd share an instance where more than one heart skipped a beat :)

 

OpManager is an application that has been trusted by thousands of service providers for more than a decade. In one of my recent consulting engagements, I worked closely with one such customer, who has been using OpManager since 2007.

 

The customer is a service provider monitoring the networks of credit unions and financial institutions. Their live instance has a ginormous 25 GB OpManager database and monitors over 1,000 devices in a heterogeneous environment spread across 40 client networks. My task was to upgrade this OpManager instance to the latest build and then migrate the OpManager server to a ‘hip’ and beefy VM.

 

Detailed procedures with backup plans were drafted and revisited umpteen times before D-Day. As planned, the application was stopped at 6:30 pm after approval from the management and their clients.

 

The application upgrades went smoothly, with test plans executed between upgrades. The migration of the OpManager instance was a breeze, and the whole upgrade and migration task was completed by 11 pm.

 

The new instance of OpManager was live and monitoring all client networks except one critical client network. During testing, we found that this client’s network was not reachable from the OpManager server, and all its devices were reported as down. A manual ping from any other server, however, reached the client’s network without trouble.

 

In this hair-trigger situation, the customer’s client was apprised of the issue and, in parallel with the troubleshooting, a temporary instance of OpManager was installed on a different server to monitor this network.

 

Several rounds of troubleshooting, involving network routes, DNS lookups, adding static routes and a rollback to the old instance, were in vain: the heartbeat stayed offline for this client’s network.

 

The ISP’s priority helpline was put to full use, and by 5 am we had made headway: the client’s network was back online in OpManager.

 

A deep-dive analysis of the client network's firewall logs showed that the client’s private cloud network allows only 128 connections per IP per second. When OpManager was started after the migration (around 11 pm), it triggered a burst of polls that exceeded this limit, so the IPS/firewall blacklisted the OpManager server's IP address and blocked it. That is why the availability polls were failing. Once the IP was unblocked, OpManager was back online and the heartbeat was restored.
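
To make the arithmetic behind that limit concrete, here is a minimal Python sketch of how a burst of polls could be spread out so that it stays under a 128-connections-per-second cap. This is not how OpManager schedules its polls and it uses no OpManager API. The function names, the 100-per-second headroom value and the 1,000 placeholder device names are all assumptions made up for illustration.

```python
# Illustrative only: a generic pacing sketch, not OpManager's scheduler.
FIREWALL_LIMIT = 128   # connections per IP per second, from the firewall logs
PACED_RATE = 100       # assumed headroom below the limit (hypothetical value)


def schedule_polls(devices, max_per_second=PACED_RATE):
    """Give each device a start offset in seconds, evenly spaced so that
    roughly max_per_second polls begin in any one-second window."""
    if max_per_second < 1:
        raise ValueError("max_per_second must be at least 1")
    return [(index / max_per_second, device)
            for index, device in enumerate(devices)]


def busiest_window(plan, window=1.0):
    """Return the largest number of polls starting within any window
    of the given length (in seconds) that begins at a scheduled poll."""
    offsets = [offset for offset, _ in plan]
    return max(
        (sum(1 for other in offsets if start <= other < start + window)
         for start in offsets),
        default=0,
    )


# Hypothetical inventory: placeholder names, not real devices.
devices = [f"device-{n}" for n in range(1000)]
plan = schedule_polls(devices)

print("polls scheduled:", len(plan))
print("last poll starts at (s):", plan[-1][0])
print("busiest one-second window:", busiest_window(plan))
```

Fired all at once, those 1,000 polls would land in the same second and sail past the limit. Paced this way, they are spread over about ten seconds instead, which is a small price compared with a night spent on the ISP helpline.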

 

This esteemed customer continues to love OpManager for its robust monitoring and alerting abilities, and is now also evaluating our plugins and applications that integrate with OpManager. Migration to the highly scalable OpManager Large Enterprise Edition (LEE) is also on the cards right now. I'm sure you'll agree that for those in customer-facing roles, there is nothing more gratifying than putting a smile on the customer's face. I headed home with a smile too :)

 

Thanks for reading, and I'd love to hear your story of the time your network skipped a heartbeat!