This is the story of my afternoon yesterday, Friday 12/5/08. It’s a case where the simple act of “keeping an eye on things” caught an anomaly that would’ve attracted a lot of negative attention from the business and scaled it down to a slight inconvenience.
$employer doesn’t have a NOC or really any sort of enterprise-wide centralized monitoring infrastructure. Everybody who takes care of a critical system is pretty much left on their own to make sure it keeps operating properly. The only pseudo-mandate is that each caretaker needs to have one or more backups, and that’s enforced mostly via peer pressure.
In order to help me keep an eye on the network and telecom infrastructure that I help to maintain, I put together a web page full of charts and graphs that I refer to as my heads-up display. Two of the four(!) monitors on my desk are dedicated to displaying status information. It’s laid out in such a fashion that I routinely glance across them and my brain simply knows how everything is supposed to look on a typical day. It’s kinda like the habitual glancing in the mirrors while driving a car – I don’t even realize I’m doing it until something is way too close to my bumper or flashing lights are closing in from a distance.
Yesterday at 12:45, one of those graphs caught my attention. This particular graph shows the status of a group of PRIs that I refer to as “voice group #3.” These PRIs handle all of our inbound calls to our direct dial numbers as well as the vast majority of our outbound calls. The phone company sends us inbound calls starting from one end of that group and we send our outbound calls starting at the opposite end of the group. The graph uses the color red to show traffic on the PRI that those two sets of calls don’t usually reach – if my brain notices red on the graph, something abnormal is happening. If you look at 12:45 on that graph, you can see why it caught my attention:
I’ll note that the actual graph on my heads-up display uses a finer resolution – it shows a data point for every minute. I chose to post a copy of the 5-minute resolution graph here so that you can see how wildly different yesterday’s 12:00 and 13:00 hours were from the previous day. We normally have between 20 and 40 free channels in that group during our peak usage. This is intentional. I know that $employer is starting to do more distance learning and web demos using our internal voice conferencing system (it’s less expensive than external voice conferencing systems), so I’ve padded our capacity a bit in response (yes, it’s still less expensive, even with my capacity padding).
Another typical behavior is that we see a major peak in utilization near the top of the 13:00 hour. Usually our utilization jumps by 40-60 channels around that time. You can see why the red on the graph prompted me to pay attention, and then I got concerned because it was only 12:45 and we were down to fewer than 20 available channels on that group.
Focus your eyes on the big jump that happened right at 12:30. It’s obvious from the colors involved that this was a mad rush of about 40 inbound phone calls (each color band represents a PRI, which is capable of carrying 23 simultaneous calls). A quick check of our voice conferencing system revealed a web demo that started at 12:30 and had 42 people dialed in to it. The really bad news: The conference call was scheduled to last until 13:30.
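To make the fill-from-both-ends picture concrete, here is a minimal Python sketch of how a trunk group like this one fills up. It is an illustration only, not anything the phone system actually runs: the function name `pri_occupancy`, the four-PRI group size, and the call counts are made up for the example, and it ignores churn by assuming calls pack contiguously from each end.

```python
CHANNELS_PER_PRI = 23  # simultaneous calls per PRI, as noted above


def pri_occupancy(num_pris, inbound_calls, outbound_calls):
    """Return the number of busy channels on each PRI in the group.

    Inbound calls fill channels starting at the first PRI; outbound calls
    fill channels starting at the last PRI. Hypothetical model: real
    channel selection and call churn are messier than this.
    """
    total = num_pris * CHANNELS_PER_PRI
    if inbound_calls + outbound_calls > total:
        raise ValueError("more calls than channels in the group")

    busy = [False] * total
    for i in range(inbound_calls):
        busy[i] = True                # inbound: from the low end
    for i in range(outbound_calls):
        busy[total - 1 - i] = True    # outbound: from the high end

    return [
        sum(busy[p * CHANNELS_PER_PRI:(p + 1) * CHANNELS_PER_PRI])
        for p in range(num_pris)
    ]


# Illustrative numbers only: a 4-PRI group, 30 inbound and 25 outbound calls.
occupancy = pri_occupancy(4, 30, 25)
free_channels = 4 * CHANNELS_PER_PRI - sum(occupancy)
print(occupancy, free_channels)  # [23, 7, 2, 23] 37
```

In this model, the PRIs in the middle of the group are the ones that stay idle on a normal day, which is why traffic showing up on them is worth coloring red.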
This suggested that our typical 40-60 channel surge in activity around 13:00 was not going to go well. Fortunately, we have a backup PRI that comes in to our system via copper, primarily intended to act as a lifeboat if our SONET ring goes down and takes all of our other circuits with it. Our phone system is already configured to overflow outbound calls to that PRI – but inbound calls on voice group #3 do not overflow. If voice group #3 is full or out of service, inbound calls that try to come in via that group fail with a busy signal or other non-good result. Thus, the outbound portion of our typical 13:00 rush would be able to make use of it, but any additional inbound calls (for example, calls coming in to our voice conferencing system for the dozen or so conference calls scheduled for 13:00 that day) would be out of luck.
In order to try to mitigate the situation, I logged in to CallManager and changed the route list for outgoing long distance calls to prefer the backup copper PRI ahead of voice group #3 – fortunately this is a non-disruptive change. The vast majority of our outbound calling is long distance, so this effectively took our first 23 outbound calls and kicked them over to that PRI, leaving more channels open for inbound calls.
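The effect of reordering the route list can be sketched in a few lines of Python. This is a toy model, not CallManager configuration or its API: the function `place_outbound`, the group names, and the channel counts are all assumptions chosen for illustration. It simply hands outbound calls to trunk groups in preference order and reports what is left over for inbound calls, which can only use voice group #3.

```python
def place_outbound(calls, route_list, free):
    """Assign outbound calls to trunk groups in route-list order.

    calls:      number of outbound calls to place
    route_list: trunk group names, most preferred first
    free:       dict mapping group name -> free channels (not modified)
    Returns (free channels after placement, number of blocked calls).
    """
    free = dict(free)  # work on a copy
    remaining = calls
    for name in route_list:
        used = min(remaining, free[name])
        free[name] -= used
        remaining -= used
    return free, remaining


# Illustrative snapshot: 20 free channels on voice group #3, an idle backup PRI.
free_now = {"voice_group_3": 20, "backup_pri": 23}

# Original order: outbound calls consume voice group #3 first.
before, blocked_before = place_outbound(
    30, ["voice_group_3", "backup_pri"], free_now)

# Changed order: outbound calls consume the backup PRI first.
after, blocked_after = place_outbound(
    30, ["backup_pri", "voice_group_3"], free_now)

print(before["voice_group_3"], after["voice_group_3"])  # 0 13
```

With these made-up numbers, no outbound call is blocked in either case, but the reordered list leaves 13 channels on voice group #3 for inbound calls instead of none.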
As the graph above shows, we still maxed out voice group #3 at 13:00, but only very briefly. Given how much churn there is with people hanging up and new calls taking their places, I would be willing to bet that any given caller hit at most one attempt that ended in a busy signal.
As if that wasn’t enough drama, more red showed up on my heads-up display just after 13:00. Our main reception desk number comes in on its own dedicated group of PRIs (voice group #2). $employer has a strong cultural policy that all customers get a live human when they call and an equally strong policy that nothing shall detract from the experience a customer has when they place that call. Here is the relevant graph, again in its 5-minute resolution format:
Recall the backup copper PRI that I mentioned earlier. Its primary reason for existing is actually to act as an overflow or backup route for the main reception number. That purpose alone is more than sufficient to justify the $800/month that it costs to have that circuit there. Unfortunately, I had just borrowed that overflow capacity to rescue voice group #3. Not cool.
At 13:12, I quickly sent a company-wide E-mail asking that people refrain from placing or receiving personal phone calls until 14:00. I also changed the long distance call routing back to preferring voice group #3. At that point, there were only two or three channels available on the backup PRI, so the main reception number wasn’t going to have much to work with if it actually did overflow. Looking back at the logs, I see that exactly one incoming main reception call overflowed on to the backup PRI (at 13:22).
As you can see from the graphs, call volumes quieted down and returned to normal levels rather quickly after that. My next moves were to submit a purchase request for two more PRIs (one each for voice groups #2 and #3) and to start a conversation with our vendor to decide what hardware to purchase to plug them into (our existing voice gateways are full). I have no intention of waiting to see if this becomes a pattern before I take action to accommodate such behavior.