So what are we talking about when we talk about problem solving?
There are all sorts of problems sysadmins have to deal with on a regular
basis: technical problems and personal problems, immediate problems and
distant problems, easy problems and difficult problems. What I want to
address are problems of a more technical nature. I’m not going to talk much about specific tools or things like that, but rather more general topics that come to mind when I think about problem solving.
I’m going to outline the general methodology I tend to use when issues
come up. This is mostly just the way my brain works, and not necessarily
something I’ve consciously developed.
!!Define your problem
This is a step that I have seen skipped a lot of times in the heat of
the moment. Your server is down, your boss is breathing down your neck, you
don’t have time to figure out what’s wrong, you just have to ''fix it''.
Don’t let this happen. Maybe, just maybe, you restore service, but you
likely don’t really know what you did, or how to keep it from happening
again. Step back, take a breath, and spend a moment ''thinking'' instead
of reacting. What symptoms are you seeing? Is something actually broken,
or is it a bad problem report?
Defining the bounds of your problem can help you get it fixed more
quickly. For example, maybe it appears that DNS is broken. If you
take a few measurements, you may find that the problem is only
affecting one of your two DNS servers. I’ve noticed that people tend
to assume the worst, e.g. “LDAP IS BROKEN” rather than taking the
time to figure out that most of their LDAP queries are working fine,
but a particular server is returning bad results for a certain subset
of queries. Slowing down will help you gain perspective.
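For instance, a couple of quick queries against each server will tell you whether both resolvers are misbehaving or just one. The hostnames below are placeholders for your own servers and a record you know should resolve:

{{{
# Query each DNS server directly and compare the answers.
for ns in ns1.example.com ns2.example.com; do
    echo "== $ns =="
    # +short prints just the answer, which makes the results easy to compare
    dig @"$ns" www.example.com A +short
done
}}}

If one server answers correctly and the other doesn’t, you’ve just shrunk the problem considerably.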
!!Break the problem into pieces
Once I have identified the problem, or at least the scope of the problem,
I often find it helpful to break it down into pieces. Say we’ve tracked
the problem down to a particular server, but still don’t know exactly
what’s wrong. In situations like this, I like to look at all of the
components of a system and rule them out as possible sources of the
problem. Start with the big pieces: is the network functional, are the
major apps A, B, and C on this server all running, is anything useful
showing up in the log files, etc.
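A first pass at ruling out those big pieces might look something like this. The service names are stand-ins for whatever A, B, and C actually are, and __journalctl__ assumes a systemd host, so adjust for your environment:

{{{
# Is the network even functional?
ping -c 3 gateway.example.com

# Are the major applications running? (placeholder service names)
systemctl status httpd slapd postfix

# Anything interesting in the recent logs?
journalctl --since "1 hour ago" -p err
tail -n 100 /var/log/messages
}}}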
I suppose it’s here and in the previous step that I usually go through the typical “problem
solving steps” that you hear about. These are in the general order that I tend to try them in.
* Is this something I have seen before? If so, I probably know how I fixed it previously.
* Is this documented in my group’s troubleshooting documentation? You do have these docs, don’t you?
* Does this ''feel'' like a familiar problem? After being in the profession for 10+ years, I’ve learned that some symptoms are typical of certain types of problems. A lot of that is simply a matter of experience and familiarity with certain applications.
* Do __strace(1)__ or __lsof(1)__ turn up anything useful? More on these in a yet-to-be-written post, but there’s a quick sketch just after this list.
* What helpful hints do the other admins in my group have to offer?
* Does Google turn up anything? Have I tried searching for the errors that show up in the log files?
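Since I promised a quick sketch of the __strace(1)__/__lsof(1)__ item above, here is the sort of thing I mean. The PID and port are placeholders:

{{{
# Watch the system calls a misbehaving process is making
strace -f -p 12345 -o /tmp/strace.out

# What files and sockets does that process have open?
lsof -p 12345

# Who is using a particular port (389 is LDAP)?
lsof -i :389
}}}

Even a minute of strace output will often tell you whether a process is stuck waiting on a file, a socket, or something else entirely.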
!!Solving the problem
You’ve identified your problem, you’ve broken it down into some smaller
pieces, and now you’re left with some questions. Is this something you
can fix immediately? Have you really found the root cause, or are you
still left with a raft of symptoms? Can you let the problem persist
so that you can develop a more permanent fix?
!Fix the immediate problem
This is generally our first instinct. In some situations, the right
answer, unfortunately, is that you need to get service restored as soon
as possible. Sometimes there is simply no time to get to the root of
the problem. Perhaps you work in health care, and doctors need to get
access to their patient data, or you work for a top 10 website where
non-trivial amounts of cash are lost for every hour the site is down.
In these situations it may be appropriate to resolve the immediate issue.
However, I suggest that after everything is functional again, you do a
post-mortem of the situation in an effort to determine the root cause.
Recently we had a problem at work with our LDAP servers. In the
middle of the day, our LDAP server processes were dying inexplicably.
It happens that in this case, those particular servers were running 5
year old versions of the LDAP server software, and were scheduled for
an upgrade. Rather than spending time trying to find the root cause,
we wrote a simple script to check that the LDAP server was running,
and if it was not, restart it. Along with this, we heavily accelerated
our schedule for upgrading these servers. While everyone on the team
would have rather had the time to identify the real source of the issue,
time constraints, and the necessity of keeping the computing center
functional, dictated that we fix the immediate problem, and come up with
a mitigation strategy.
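Our actual script was specific to our environment, but the idea was roughly this sort of watchdog run out of __cron__ every few minutes. The process name, log tag, and init script below are illustrative, not exactly what we used:

{{{
#!/bin/sh
# Crude watchdog: if slapd isn't running, log it and start it back up.
# Illustrative only: adjust the process name and init script for your setup.
if ! pgrep -x slapd > /dev/null; then
    logger -t ldap-watchdog "slapd not running, restarting it"
    /etc/init.d/slapd start
fi
}}}

It’s not pretty, but a script like this keeps the service answering queries while you work toward the real fix.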
!Determine the underlying cause
Ideally, once we have identified the problem, we are given time to
identify the root cause. Furthermore, it’s often possible for us
to __make the time__ to do this.
Let me give you an example. You’re running a cluster of web servers
behind a load balancer, and customers are intermittently experiencing
issues with your site. By identifying the bounds of your problems, you’ve
narrowed it down to a single web server in this cluster. Further breaking
the problem into pieces, you’ve been able to determine that this server is
running out of memory. Normally, we would be tempted to simply restart
the server processes, or maybe even reboot the entire server.
Instead, examine your situation a little more closely. The cluster is
provisioned such that the site can easily survive a down server. Why not
take the problem server out of your load balancer configuration and spend
some time figuring out ''why'' it has run out of memory? Use __top(1)__
to see which processes are using the most memory. Look at your
__sar(1)__ data and figure out when the memory starvation began. Maybe it
can be correlated to a __cron__ job that ran in the morning. Check your
log files in more depth, maybe you’ll turn up something subtle you missed
in your first pass.
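In shell terms, that digging might start out something like the following. I’ve used __ps__ here instead of interactive __top__ so the output is easy to save off, and the log and cron paths vary by distribution:

{{{
# Biggest memory consumers right now, sorted by resident set size
ps aux --sort=-rss | head -n 15

# sar keeps historical samples, so you can see when memory got tight
sar -r

# Did a cron job kick off around that time? (log location varies by distro)
grep -i cron /var/log/syslog
ls /etc/cron.d /etc/cron.daily
}}}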
!Fix the underlying cause right now or wait
Congratulations! You’ve figured out what went wrong, and now you’ve got
the chance to fix it. Now it’s a judgement call – is this something you
can fix permanently right now, or will it require a lot of work that you
don’t have time for at the moment? Maybe you’ve been able to keep your webserver
out of the load balancer rotation for the last 4 hours, but it’s almost
peak time for your website and you really need that last server in there
to handle the increased load.
In such situations, I’ll often return to the easy solution (e.g. reboot
the server) once I know what’s wrong. In particular, I’ll open a
ticket in our ticketing system (you do have a ticketing system, don’t
you?) detailing the problem as well as my proposed resolution. During
the next off-peak time, I’ll work on implementing the permanent fix.
!!The take-away
What I’ve written here is simply my approach to solving problems.
It’s more important to develop your own process than to follow
exactly what I’ve laid out.
Don’t just react; think it through and develop a plan of action.
Reactive measures can often cause more harm than doing nothing at all,
and as the systems we maintain become more critical, it is imperative
that we make the right decisions.
Leave some comments and let me know how much you disagree with what I’ve said here 🙂