I’ve been posting some of my Tae Kwon Do belt test thesis papers to my blog, and for some reason they seem to be fairly popular. I’ve had an idea to start a series of posts that tie some of our martial arts lessons and philosophy to systems administration. I’m not a black belt (yet); I just got my green belt (6th Gup) on July 20th. Still, I don’t think anyone will send the Ninja squads after me for titling this series “Black Belt Systems Administration”.
So, for the first installment, I pay homage to the best known martial artist of all time, Bruce Lee. Lee is quoted as saying, __“Before I learned martial arts, a punch was just a punch and a kick was just a kick. When I studied martial arts, a punch was no longer just a punch and a kick was no longer just a kick. Now I understand martial arts, and a punch is just a punch and a kick is just a kick.”__ This quote applies to almost anything in life, not just systems administration. What Lee is saying is that when we first start something, it often seems very simple and we concentrate on the basics. As we learn more about it, we focus on the details and forget the basics. Once we truly understand it, we stop getting lost in the details and go back to the basics again.
This is especially true in systems administration. When we first start out, we only know a little bit about the system. If something isn’t working correctly, we start with the basics: we check cables and connections, and maybe finally just reboot. As we learn more, we start finding out about inter-process communication, shared memory, paging algorithms, sendmail rewriting rules, database blobs versus text, kernel tunables, and a million other things. When something isn’t working correctly, we don our wizard robes and hat, prepare for the thrill of battle, and start running __sar__ and checking disk I/O rates and swap rates versus page rates. Ten hours later, we finally emerge victorious with the root cause: someone bumped a SAN cable and it was loose. As we begin to truly understand our systems, we draw on our past experiences. We start with the basics again and check cables and connections, we go straight to the one or two things that are most likely to cause the problem, and maybe we finally just reboot.
A few months ago, I was asked to rescue a project that had gone horribly wrong. The project had been going on for months, and there were lots of hurt feelings and lots of finger pointing. A migration that was supposed to have been done wasn’t: we were still on the old equipment, but some of it had been moved and wasn’t working properly. The contractor/developer was about to walk away from the whole thing, and the three-month period each year when this system was actually used was fast approaching. My first goal was to calm everyone down and find out what the actual issues were. Although we came up with a list of about 40 things, everyone agreed there were only three real issues: the web server couldn’t talk to the database server, the database server had a failed SCSI controller, and the optical scanning application didn’t work. Although the SA team had been trying all the advanced tools they could think of, no one had concentrated on the basics. No one had looked at the database instances file to see that it had been hard-coded with the old IP address of the database server rather than the hostname. No one had looked at the permissions on the directory that the optical scanning application was trying to write to and seen that they had been changed to match security policy (and that the scanning user was not in the database group). So, within a few hours of starting, I was able to resolve two of the major issues, simply by starting with the basics.
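For anyone who would rather script those two basic checks than eyeball them, here is a minimal Python sketch. It is not what was run on that project, and it makes some assumptions: the config is plain text that can be scanned line by line, the host is a POSIX system, and the names `find_hardcoded_ips` and `describe_dir_access` are made up for illustration. The IP search is a heuristic, so a dotted version string such as 1.2.3.4 would also be flagged.

```python
import ipaddress
import os
import re
import stat

# Loose pattern for dotted-quad candidates; the ipaddress module does the
# real validation so that strings like 999.1.1.1 are rejected.
_IPV4_CANDIDATE = re.compile(r"(?<![\d.])\d{1,3}(?:\.\d{1,3}){3}(?![\d.])")


def find_hardcoded_ips(config_text):
    """Return (line_number, address) pairs for each IPv4 literal in the text."""
    hits = []
    for lineno, line in enumerate(config_text.splitlines(), start=1):
        for candidate in _IPV4_CANDIDATE.findall(line):
            try:
                ipaddress.IPv4Address(candidate)
            except ValueError:
                continue  # looks like an address but is not a valid one
            hits.append((lineno, candidate))
    return hits


def describe_dir_access(path):
    """Summarise the owner, group, and write bits of a directory."""
    st = os.stat(path)
    return {
        "uid": st.st_uid,
        "gid": st.st_gid,
        "mode": stat.filemode(st.st_mode),
        "group_writable": bool(st.st_mode & stat.S_IWGRP),
        "other_writable": bool(st.st_mode & stat.S_IWOTH),
    }
```

Feeding `find_hardcoded_ips` the contents of a config file lists every line where an address appears in place of a hostname. `describe_dir_access` shows at a glance whether the group write bit is set, after which the remaining question is whether the writing user is actually a member of that group (the `id` command answers that).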
Remember, a punch is just a punch. Check cables, configurations, permissions, and log files first. Don’t start by running kernel debuggers and application profilers. And sometimes, you might find that you just have to reboot.