I recently came across a blog post by morgajel from February on The Philosophy of Monitoring, and I enjoyed it quite a bit. I really could have used this document several years ago when I had my own run-in with problems caused by over-monitoring and the difficulty of explaining them to management.
At the time I was updating and managing a What’s Up Gold installation for a small Managed Services company that leased SonicWall VPNs to small businesses and created VPN links for key office personnel. Part of the package was notification and escalation if select services were unavailable. I also covered part of the overnight monitoring window to give the engineers a break after work before they went on call.
We were monitoring the status of the VPN endpoints out of band to catch ISP failures: a simple is_available check. The endpoints on commercial ISP accounts were pinged every 5 minutes; the endpoints at residential locations were pinged every 15. While an endpoint was down it was pinged once a minute to watch for it coming back up. Most of the endpoints were served by a popular cable ISP. Between hosts and services there were probably 300 checks in total.
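What’s Up Gold handled the scheduling itself, but the polling logic boiled down to something like the rough Python sketch below. The 5/15 minute intervals and the once-a-minute retry while down are from the setup described above; the endpoint names and addresses are made up for illustration.

```python
import subprocess
import time

# Hypothetical endpoints: name -> (address, normal poll interval in seconds)
ENDPOINTS = {
    "office-vpn-01": ("203.0.113.10", 300),   # commercial ISP account: every 5 minutes
    "home-user-07":  ("198.51.100.22", 900),  # residential endpoint: every 15 minutes
}

DOWN_RETRY = 60  # while an endpoint is down, retry once a minute looking for up status


def is_available(address: str) -> bool:
    """A single ping, treated as a plain is_available check (Linux-style flags)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", address],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def poll_forever() -> None:
    # Track when each endpoint is next due and whether it was up last time.
    next_check = {name: 0.0 for name in ENDPOINTS}
    was_up = {name: True for name in ENDPOINTS}

    while True:
        now = time.time()
        for name, (address, interval) in ENDPOINTS.items():
            if now < next_check[name]:
                continue
            up = is_available(address)
            if up != was_up[name]:
                print(f"{name} changed state: {'UP' if up else 'DOWN'}")
                was_up[name] = up
            # Down endpoints get the fast retry; up endpoints go back to
            # their normal commercial or residential interval.
            next_check[name] = now + (interval if up else DOWN_RETRY)
        time.sleep(1)


if __name__ == "__main__":
    poll_forever()
```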
Word came down that management felt we were not monitoring the residential endpoints often enough and wanted the interval changed to 5 minutes as a service to our customers. I pushed back that residential customers were not given an SLA by the ISP, and all we could do was contact the ISP and report the connection down. From my own experience with that ISP’s residential service, it would drop randomly for short periods, and when support was called they would claim there was no outage (sound familiar?) even when the cable light was out on my end. I eventually got fed up and switched to DSL.
I was told to do it anyway, so I asked for, and received, the requesting manager’s SMS number (a company-provided cellphone) to include in the alerts, so that they could see for themselves how often small outages happened at that ISP, something they were completely unaware of.
Jump to 1:05 am the following morning. 50 of the endpoints reported down for 5 minutes; a minute later they all reported up (a 5-6 minute recorded outage), and 15 minutes later they reported as still up. 150 SMS messages crashed the phones of both myself and the manager. We could not delete messages fast enough. At 1:30 am it happened again with a different set of endpoints, and at 2:00 am a third time. This was in 2005; I don’t know if modern cellphones can take this kind of abuse. Later that morning I received a call to set it back to 15 minutes and remove the manager from that alert group.
What was the cause? An uncommunicated maintenance window by the ISP. I guess they figured no one would be awake then. The routers serving each group of endpoints were being rebooted; the ISP claimed it was for routing table updates. If routing table updates required a 5-minute reboot, they needed to upgrade their hardware. That ISP still doesn’t report maintenance windows to customers. The claim I heard was that they could not tell when work on their network would affect particular customers. My counter was that they should announce when maintenance windows would be happening so users could plan for the potential downtime, and there would be no issues. That received no response, so I avoid recommending them whenever possible. Of course, if you don’t know when work on your network will affect customers, you have other issues, or too many levels of management.
I later figured out which routers were serving each group of endpoints and put in a rule to check the router’s status if any of the affected endpoints reported down, and alert accordingly.
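In the tool that was just a dependency rule, but the idea can be sketched in a few lines of Python (the endpoint-to-router mapping and the function names here are hypothetical):

```python
# Hypothetical mapping of residential endpoints to the ISP router serving them.
ROUTER_FOR_ENDPOINT = {
    "home-user-07": "isp-router-a",
    "home-user-12": "isp-router-a",
    "home-user-21": "isp-router-b",
}


def classify_outage(endpoint, router_is_up):
    """Decide what kind of alert to raise when an endpoint reports down.

    router_is_up is a callable (e.g. the same ping check used for endpoints),
    passed in so the rule stays easy to test.
    """
    router = ROUTER_FOR_ENDPOINT.get(endpoint)
    if router is not None and not router_is_up(router):
        # The upstream router is down: raise one alert for the router
        # instead of one alert per endpoint behind it.
        return f"upstream outage: {router}"
    return f"isolated outage: {endpoint}"


if __name__ == "__main__":
    # Pretend isp-router-a is down; both endpoints behind it roll up to one cause.
    print(classify_outage("home-user-07", lambda r: r != "isp-router-a"))
    print(classify_outage("home-user-21", lambda r: r != "isp-router-a"))
```

The point is simply that a shared upstream failure produces one actionable alert instead of a flood of per-endpoint messages.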