Scouting the wily melim.

(… or how to shed 25 thousand NFSOPS per second off your filer in 3 easy steps)

So, we did a large upgrade of the environment over the weekend. It mostly went OK, despite some dramatic schedule slips. We still fit inside the overall window; it just didn’t end up as short as we wanted.

LSF got upgraded to the latest 6.X maintenance pack. RHEL3U8 was pushed out to thousands of hosts. Filers got upgraded. We “created customer success”. (Well, not really …)

The day after the upgrade, we noticed that the newest filer, the one that holds our LSF installs, was doing a sustained 40k NFSOPS for no apparent reason. Nothing pointed to any specific host doing something stupid. This meant ALL the hosts were doing something specifically stupid. Hard to tell what, though; we’d upgraded everything.

We began poking around, first with ethereal. We noticed that the host we were on was doing a getattr to the affected filer every five seconds. That’s not out of the ordinary, really. The environment is chatty. We kept poking around, doing short tcpdumps and trying to correlate what we saw with anything that might be generating all that NFS traffic. We thought it might have been some of the LSF jobs that account for 80% of our utilization.

We ran further tcpdumps while bstopping the jobs on hosts and couldn’t find any net decrease in traffic. So it couldn’t have been the jobs. We began digging deeper on this one host in the grid. Hey look, if we stop the LSF daemons, that getattr every five seconds stops.

But, the problem: a getattr every five seconds across a few thousand hosts didn’t account for the 40k NFSOPS we were seeing. We straced the LSF daemons and found that melim was statting one of our custom elims every five seconds. OK, maybe we’re on the right track. But this still didn’t explain the usage pattern we were seeing. By this point, we had all gathered around to throw out ideas on where to go from here. We had left ethereal running and capturing packets, and during one two-minute window there was a storm of traffic to the filer that didn’t make any sense. More stracing and we lucked out: every so often, after a round of statting the elim, melim would invoke it.
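A quick back-of-the-envelope check shows how big the gap was. The host count below is an assumption (all we can honestly say is "a few thousand"), so treat this as a sketch rather than our measured numbers:

```python
# Can one getattr per host every 5 s explain 40k NFSOPS?
# HOSTS is an assumed figure; the story only says "a few thousand".
HOSTS = 3000
GETATTR_INTERVAL_S = 5
OBSERVED_NFSOPS = 40_000


def steady_getattr_rate(hosts: int, interval_s: float) -> float:
    """Filer-wide ops/s if every host sends one getattr per interval."""
    return hosts / interval_s


rate = steady_getattr_rate(HOSTS, GETATTR_INTERVAL_S)
share = rate / OBSERVED_NFSOPS
print(f"{rate:.0f} ops/s, {share:.1%} of the observed load")
```

With these assumed numbers the getattrs come to a few hundred ops per second, a rounding error next to 40k. Something much noisier had to be hiding somewhere.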

And then it dawned on us: we were looking at only short snippets of traffic from ethereal and strace. And each time, we happened to be looking in between melim’s attempts to invoke the elim.
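To see why short captures kept missing it, here is a rough model. It assumes a fixed 60-second invocation period and a one-second burst, and the capture lengths are made-up examples; the real timing was jittered, so this is only an illustration:

```python
# Chance that a capture started at a random moment sees no burst at all.
# Assumed model: one burst of burst_s seconds every period_s seconds.
def miss_probability(window_s: float,
                     period_s: float = 60.0,
                     burst_s: float = 1.0) -> float:
    """Probability that a window of window_s seconds overlaps no burst."""
    # The window overlaps a burst if it starts anywhere within
    # (window_s + burst_s) seconds of each period.
    hit = min(1.0, (window_s + burst_s) / period_s)
    return 1.0 - hit


for window in (5, 10, 30, 120):
    p = miss_probability(window)
    print(f"{window:>4}s capture: {p:.0%} chance of missing the burst")
```

Under this model a five- or ten-second look misses the burst most of the time, while anything longer than a minute is guaranteed to catch it. That matches what happened to us: only the capture we left running showed the storm.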

D’oh.

We had forgotten our LSF training: when you install LSF, the melim gets invoked on each system, and it in turn invokes every file in the install directory whose name starts with elim. This particular elim only needed to run on the LSF master. Instead, because of our install architecture, the elim was available to every host. Every host was statting this file every five seconds to make sure it was still there. About once a minute (happily randomized throughout the grid), melim would go and invoke the elim. Because the elim was only designed to run on the master, it would check to see where it was run, find itself not on the master, and quit. Our one getattr every five seconds turned into a few hundred NFS ops occurring in a one-second period every minute. Not so bad on one system. Multiply this by a few thousand and you start making your filer sweat. We quickly removed the elim and the load immediately dropped by 25k NFSOPS.
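The multiplication is worth spelling out. The figures below are assumptions picked to be consistent with "a few hundred" ops, "a few thousand" hosts and the 25k drop; they are not measurements:

```python
# Average filer load from a per-host burst once a minute.
# HOSTS and OPS_PER_INVOCATION are illustrative assumptions.
HOSTS = 3000
OPS_PER_INVOCATION = 500
INVOKE_INTERVAL_S = 60


def averaged_burst_rate(hosts: int, ops_per_invocation: int,
                        interval_s: float) -> float:
    """Filer-wide ops/s when each host's burst lands at a random time."""
    # Randomized start times spread the bursts evenly, so the filer
    # sees the average rather than one giant spike per minute.
    return hosts * ops_per_invocation / interval_s


print(averaged_burst_rate(HOSTS, OPS_PER_INVOCATION, INVOKE_INTERVAL_S))
```

Because the invocations were randomized across the grid, the filer never saw a clean once-a-minute spike. It saw a steady wall of traffic, which is part of why no single host stood out.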

So, in three easy steps:

1. Find process being invoked everywhere and leaving no trace of itself.
2. Stop process from being invoked every minute.
3. Profit!

Moral of the story?

It’s the little details that will screw you. One little symlink in the wrong directory caused us a few days of consternation and head scratching. So, watch where you install your elims.