Making the Broken Net a Network at LISA ’09

Each LISA brings a new challenge or two…or four…so far. The conference network started out barely usable, then went offline entirely twice, then came back up with sporadic outages before finally performing acceptably. I feel like I owe an explanation, so here goes.

As a bit of background, I worked with the Hotel and the Hotel’s in-house ISP weeks in advance to cover what our requirements were, what infrastructure I’d be working with, and how to best provide a good network for our attendees. The negotiated connection is a /29 with 10Mbps guaranteed, bursting into the 45Mbps total pipe, and access to the patch panels to install USENIX’s switches. Not the best connection we’ve ever had (that would be LISA ’08 at the T&C – 45Mbps all to ourselves), but far from the worst.
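
To put the /29 in perspective, here’s a quick sketch using Python’s ipaddress module; the prefix is a documentation placeholder, not our actual block, and the note about the ISP’s gateway taking one of the usable addresses (leaving five for us) is my read of the arrangement rather than anything spelled out in the contract.

```python
import ipaddress

# Placeholder documentation prefix, not the block we were actually assigned.
net = ipaddress.ip_network("198.51.100.88/29")

print(net.num_addresses)        # 8 addresses in a /29
print(len(list(net.hosts())))   # 6 once network and broadcast are dropped
# Assumption: the ISP's gateway occupies one of those 6 host addresses,
# which is what leaves 5 for the conference side.
```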

I did the bulk of the network setup Friday night and Saturday morning, and everything seemed fine. Once the network came under load on Sunday, it exhibited packet loss, wouldn’t pull more than about 6Mbps, and pushed much less, around 1–2Mbps. It felt like 300 people on 5 ADSL lines rather than a 10Mbps routed subnet.
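
For what it’s worth, you don’t need anything fancy to put a number on “won’t pull more than about 6Mbps.” A crude probe like the sketch below, which just times a bulk HTTP download with the Python standard library, gets you in the right ballpark; the URL is a placeholder, and this isn’t what I actually ran, just the flavor of the measurement.

```python
import time
import urllib.request

# Placeholder URL -- any large file on a well-connected server will do.
URL = "http://example.com/100MB.bin"
SAMPLE_SECONDS = 10

start = time.monotonic()
received = 0
with urllib.request.urlopen(URL) as resp:
    while time.monotonic() - start < SAMPLE_SECONDS:
        chunk = resp.read(64 * 1024)
        if not chunk:
            break
        received += len(chunk)

elapsed = time.monotonic() - start
print(f"pulled {received * 8 / elapsed / 1e6:.1f} Mbps over {elapsed:.1f}s")
```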

I emailed and called the Hotel’s ISP, but unfortunately the support staff that deals with problems like mine wasn’t available on the weekend. To make matters worse, their network is centrally managed from Salt Lake City — two time zones away — so support wouldn’t be available until Monday around 11am EST. At about 11:30am Monday I was contacted by the local onsite tech and my Sales rep. The onsite tech, who I’ll call “OT”, was in Virginia and couldn’t get to me until 12:30. So I waited, and watched the packets drop.

OT’s first step was to disconnect our network, take over an IP in my allocated subnet, and run Speakeasy’s speed test. Surprisingly, he got poor results, just like me… He got on the phone with his “WAN team” to further investigate the issue. They determined that the bandwidth allocation wasn’t exactly what we wanted: they had set up the 10Mbps connection as five 2Mbps connections, one for each of our allocated IPs. After the WAN team configured the connection correctly, the onsite tech ran the speed test again…and got the same results. At that point I plugged our network back in so the problem could be further investigated. Downtime: 30 minutes.

By this time my main hotel contact, “HT”, has joined the party, and we all go to the 4th floor A/V room where my router is patched into their switch. OT plugs into the switch to get outside of our allocated IP space and gets the same results. So, the switch in the A/V closet can’t pull more than 6Mbps. Now why might that be? Rather than go through their whole switch fabric to figure that out, I say, “hey, how about we use some of that dark fiber and hop me down to the MPOE and directly into the core switch?” So, we tromp down two stories to check out the MPOE, now joined by the Marriott’s IT guy, who I’ll call “MT”. None of the fiber is labeled correctly, so we try to match things by color for a while since we don’t have any fiber tools. No luck. OT suggests we run a dedicated Ethernet connection following the same path as the one that feeds the 4th floor switch. MT, HT and I say “noooo, that run will be way too long”. OT says let’s do it and see.

The new run hops from the MPOE to a PBX room, and then up to the 4th floor. Surprise! It pulls only 6Mbps from the core switch, even though plugging in directly at the MPOE gives us 22Mbps. MT and OT go back downstairs and put a dumb switch in the path to boost the signal, and we get 22Mbps on the 4th floor. It’s about 15:00 now, so we decide to wait for the 15:30 break to cut the connection over. There’s one more little problem to work through, though.

OT and I noticed that the IP he borrowed from my pool is extra sticky, ARP-wise. His switches won’t let go of the notion that his laptop’s MAC address and that IP are bound together for eternity. He plays with his switches some, but after 10 minutes can’t get them to release the IP/MAC mapping. Since we’re going to move my router from the 4th floor switch to the 2nd floor core switch, the same thing is going to happen to the router. OT decides to reset something on all of the switches when we do the cutover to keep that from happening. 15:30 rolls around, I unplug the network, he does the reset, I plug the network back in and… it doesn’t work. I reboot my router. Now it can ping the outside world, and the LAN, but hosts on the LAN can’t reach the outside world.
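
If you’d rather nudge stale mappings from the host side than reset switches, the usual trick is a gratuitous ARP announcing the IP/MAC pair to the broadcast domain. We didn’t do it this way, but here’s a rough sketch with scapy; the interface name and addresses are placeholders, and whether the hotel’s gear would have honored it is anyone’s guess.

```python
from scapy.all import ARP, Ether, sendp  # pip install scapy; needs root

# Placeholders -- substitute the router's real interface and addresses.
IFACE = "eth0"
IP = "198.51.100.90"
MAC = "00:11:22:33:44:55"

# Gratuitous ARP: broadcast "IP is at MAC" so neighbors and any gear that
# tracks IP/MAC pairs can refresh a stale entry.
garp = Ether(src=MAC, dst="ff:ff:ff:ff:ff:ff") / ARP(
    op=2, hwsrc=MAC, psrc=IP, hwdst="ff:ff:ff:ff:ff:ff", pdst=IP
)
sendp(garp, iface=IFACE, verbose=False)
```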

This is a long story, so I’ll cut out an hour and a half of troubleshooting, which now includes me, OT, HT, and David Nolan, who I’ve worked with on past LISA networks. David and I figure out that packets are going out from the LAN and coming back to the external interface of the router, but not destined for one of my router’s IPs! Rather, they come back addressed to an IP in the 10.foo private range used by the hotel guestroom network. This isn’t your run-of-the-mill ARP cache problem, as the IP/MAC mappings have lasted over an hour, so something fancier is wrong. The thing that saves us is David saying, “Hey, let’s drop the router back to one IP and see if it works.” We do, and it does. Huzzah. NAT’ing a few hundred people behind a single IP doesn’t make me happy, so we still need to figure out why the other IPs in the block aren’t working, but the conference is back online. At 16:55. That’s 1.5 hours of downtime for those counting. Added to the initial outage, that’s over 2 hours of downtime due to an out-of-spec Ethernet run. I now have full respect for the 100-meter limit.
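
For anyone wanting to reproduce the check: the giveaway was traffic arriving on the router’s outside interface addressed to a 10.x address instead of one of our public IPs. Below is a sketch of that watch using scapy; the interface name is a placeholder, and an ordinary packet capture on the external interface shows the same thing.

```python
import ipaddress
from scapy.all import IP, sniff  # pip install scapy; needs root

WAN_IFACE = "eth1"  # placeholder: the router's external interface
HOTEL_PRIVATE = ipaddress.ip_network("10.0.0.0/8")

def check(pkt):
    # The symptom: return traffic showing up on the outside interface
    # addressed to a hotel-side 10.x address rather than one of our IPs.
    if IP in pkt and ipaddress.ip_address(pkt[IP].dst) in HOTEL_PRIVATE:
        print(f"suspect packet: {pkt[IP].src} -> {pkt[IP].dst}")

sniff(iface=WAN_IFACE, filter="dst net 10.0.0.0/8", prn=check, store=False)
```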

You thought we were done, right? Wrong. Now the LAN hosts can pull down over 6Mbps, and first thing Tuesday, they do. Or at least they do until the network starts pausing for a few seconds periodically. Oh, and the Cacti graphs for my main switch start having holes in them. And the blinken lights on the switch start staying lit rather than blinking. And occasionally the blinken lights just go off for a few seconds. This all corresponds with the network traffic going above about 20Mbps. I’ve deployed all of my managed switches save one, a Linksys SRW2008, which isn’t configured to replace the main switch and whose password I can’t remember. Besides, I’m not sure I trust a Linksys switch to do the job, managed or not. Still, I spend a bunch of time with a console cable getting back into the switch and giving it a config fit to serve as the main switch. I’m still leery of using it in place of a Cisco, and decide to call it a night without replacing the main switch.

Wednesday morning is worse than Tuesday. The keynote starts and the network dies every few minutes. I’m now running ntop to track stats, and it’s showing 30Mbps. I decide to swap one of the Cisco 3500 switches for the Linksys, AND fortunately the 3500 serving the exhibition only has one device connected, AND the exhibition doesn’t start until noon. By now you’re expecting a “BUT…”, so here it is: BUT I don’t have a key for the room my router is in, there is no ETA on getting one, and the IDF closet is locked and the person who can open it is busy helping other clients. After a lot of standing around in front of doors waiting for someone with keys, I get the switches swapped a little before noon. And lo, it works. And keeps working. I spend lunch poking at our Xirrus Wireless Arrays and watching the mrtg graphs climb back up beyond 20Mbps without a hiccup. Now, how do I simulate the load of 300 SysAdmins on a wireless network to catch these problems early on?
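
I don’t have a great answer yet. The traffic half is easy enough to fake: a few hundred concurrent HTTP fetchers, along the lines of the sketch below (URLs, client count, and think times are all placeholders), will happily saturate an uplink. What it won’t reproduce is the RF side of 300 laptops fighting over the same airspace, which is the part I’d really like to rehearse.

```python
import concurrent.futures
import random
import time
import urllib.request

# Placeholders: point these at files that exercise the uplink the way
# attendees would, ideally hosted outside the venue.
URLS = [
    "http://example.com/large-iso.bin",
    "http://example.com/photo-dump.tar",
]
CLIENTS = 300           # one worker per simulated attendee
THINK_TIME = (1, 10)    # idle seconds between fetches, like a human reading

def attendee(n):
    # Each simulated attendee fetches random files forever, with pauses.
    while True:
        try:
            with urllib.request.urlopen(random.choice(URLS), timeout=30) as r:
                while r.read(64 * 1024):
                    pass
        except OSError:
            pass  # a real harness would count and log these
        time.sleep(random.uniform(*THINK_TIME))

# Runs until interrupted (Ctrl-C); watch the uplink graphs while it does.
with concurrent.futures.ThreadPoolExecutor(max_workers=CLIENTS) as pool:
    pool.map(attendee, range(CLIENTS))
```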