A few days ago, someone on irc.lopsa.org asked for something like [Cacti|http://www.cacti.net], but smaller, easier to manage, and that plugs into [Nagios|http://www.nagios.org]. Having just brought up such a system a few weeks ago, I was able to recommend [NagiosGraph|http://sourceforge.net/projects/nagiosgraph/].
In our situation, we make very little use of SNMP, so Cacti would have been of limited usefulness. Moreover, we already had 246 services on 58 hosts defined in Nagios, so we wanted to capitalize on that work. [NagiosExchange|http://www.nagiosexchange.org] lists about a dozen graphing utilities, but I eventually settled on NagiosGraph. While other plugins promised more automation, they were also less mature and consequently difficult or impossible to install and configure in the first place.
NagiosGraph, on the other hand, was a breeze. It’s all Perl, so you might need a module or two. (I didn’t need to install any extra modules on my SuSE 9.3 Nagios box; YMMV.) My Nagios installation is a pretty vanilla install from source in /usr/local/nagios/, so I was able to follow the instructions in the INSTALL file very closely.
!Linking to Graphs
Note that NagiosGraph will be collecting data even if you never set up any links to its graphs from Nagios; the two functions are totally separate. It automatically records most data (although see the map file, below), but you’ll need to add the external service links — called serviceext definitions in Nagios — manually.
Because of that, I wanted adding serviceext definitions to be as easy as possible, so I harnessed Nagios’ template-based configuration, which is the best part about it, IMHO. I created a directory, /usr/local/nagios/etc/serviceext, and added it to my nagios.cfg:
{{{
cfg_dir=/usr/local/nagios/etc/serviceext
}}}
In that directory, I created template.cfg, which contains one definition:
{{{
define serviceextinfo {
    name                 basic
    notes_url            /nagios/cgi-bin/show.cgi?host=$HOSTNAME$&service=$SERVICEDESC$
    icon_image           graph.png
    icon_image_alt       View graphs
    register             0
}
}}}
Using this, I can add other service definitions *very* easily. For instance, one of the things I most wanted to graph was disk space, in order to figure out when I’d need to buy new disk. So I created /usr/local/nagios/etc/serviceext/disk.cfg, to which I added a number of stanzas like this:
{{{
define serviceextinfo {
    service_description  Disk space – /var
    host_name            howard,job,newman,ravi,stout,students,thor,gentoo
    use                  basic
}
}}}
These are all the machines on which I’m monitoring /var space. (“Disk space – /var” is the name that I gave the check in Nagios.) It’s a little verbose, but you can also specify membership in a serviceextinfo definition by hostgroup. For instance, I know I’m monitoring disk space on / on every machine that runs NRPE, the Nagios Remote Plugin Executor (which could be a blog entry unto itself). I also have a hostgroup predefined for all machines that run NRPE, so my serviceext definition looks like this:
{{{
define serviceextinfo {
    service_description  Disk space – /
    hostgroup            nrpe
    use                  basic
}
}}}
Now when I add a host or service to Nagios, I just have to add it to the nrpe hostgroup, and a lot of the NagiosGraph links configure themselves.
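To make the template mechanism a bit more concrete, here is a rough Python sketch of how a stanza with "use basic" ends up with the template's directives. It only illustrates the inheritance idea and is not how Nagios itself is implemented; the resolve() function and the dictionaries are hypothetical names invented for this example.

```python
# Hypothetical illustration of Nagios-style template inheritance.
# resolve() and TEMPLATES are made-up names, not part of Nagios.

TEMPLATES = {
    "basic": {
        "notes_url": "/nagios/cgi-bin/show.cgi?host=$HOSTNAME$&service=$SERVICEDESC$",
        "icon_image": "graph.png",
        "icon_image_alt": "View graphs",
        "register": "0",
    },
}


def resolve(stanza, templates=TEMPLATES):
    """Merge a stanza with the template named in its 'use' directive."""
    merged = {}
    template_name = stanza.get("use")
    if template_name is not None:
        merged.update(templates[template_name])
    # "register 0" only marks the template itself as abstract,
    # so this sketch does not pass it on to the stanza.
    merged.pop("register", None)
    # Directives set in the stanza itself win over inherited ones.
    merged.update({k: v for k, v in stanza.items() if k != "use"})
    return merged


root_disk = resolve({
    "service_description": "Disk space – /",
    "hostgroup": "nrpe",
    "use": "basic",
})
print(root_disk["notes_url"])
```

The point is that each stanza only has to say what is different (the service and the hosts); the link, icon, and alt text all come from the one template.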
It bears noting that the latest version of Nagios, 2.3.1, handles serviceext definitions much better than some previous versions. If you’re planning on using NagiosGraph extensively, it’s worth upgrading.
!Aggregating Graphs
One of NagiosGraph’s weaknesses is that it doesn’t aggregate any of the graphs. If you want to see if there are disturbing trends in any of your graphs, you have to look at them all individually. Of my 246 monitored services, 63 are disk space. It would be absurd for me to click through 63 individual pages looking at graphs. So I hacked show.cgi to show an aggregation page with all of my graphs. (Yes, I mean *all*. It takes a while to come up. If you have a larger installation than I do, you’ll probably want to hack it further so that it doesn’t try to render 10,000 graphs at once.) You can find my script at http://www.nebrwesleyan.edu/people/stpierre/aggregate.cgi.
My todo list for this script:
* Group by check type, so you can just look at, e.g., all load average graphs.
* Change time period easily, e.g., with a drop-down menu.
If you want to help with either of these changes, shoot me an email. I’m pretty unfamiliar with the Perl CGI interface, so I can use any help I can get.
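For what it’s worth, the first to-do item boils down to bucketing (host, service) pairs by service description before rendering anything. Here is a minimal sketch of that idea in Python rather than in the Perl of the actual script; group_by_check() and the sample pairs (including the “Load average” check name) are made up for illustration.

```python
from collections import defaultdict


def group_by_check(pairs):
    """Bucket (host, service_description) pairs by service description."""
    groups = defaultdict(list)
    for host, service in pairs:
        groups[service].append(host)
    # Sort the hosts in each bucket so the page order stays stable.
    return {service: sorted(hosts) for service, hosts in groups.items()}


# Hypothetical sample data, loosely based on the hosts named earlier.
sample_pairs = [
    ("thor", "Disk space – /var"),
    ("howard", "Disk space – /var"),
    ("job", "Load average"),
]

for service, hosts in group_by_check(sample_pairs).items():
    print(service, hosts)
```

An aggregation page could then render one bucket at a time instead of every graph at once.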
!The map file
The map file is where the real magic happens. It’s basically a list of regular expressions that parse the output from Nagios’ check commands so that NagiosGraph can properly munge the values into the database. It already works on most commands, but if you added your own proprietary command, you’ll probably need to add it to the map file.
In my case, I wrote a check command called “mailping” (which may be featured in a future blog), which basically sends an email and then receives it, reporting irregularities. If all is well, it simply outputs: “OK: All email received”, or maybe “OK: Email from 16:35:02 not received at 16:37:03”. (This means that the oldest email it sent has taken two minutes thus far; I’ve got it configured to start freaking out when email takes ten minutes.)
That’s not terribly useful if I want a longitudinal study of how long email usually takes, though. Luckily, Nagios contains a little-used feature called “performance data.” Your output can contain a vertical bar (“|”), and everything after that is considered performance data, and can be used by plugins _in addition to_ the pretty output. So I added something to my mailping script that added the trip length (in seconds) to the end of the output. So now, it might say something like: “OK: All email received|trip=60”. Then I added the following stanza to the map file:
{{{
# Service type: mailping
# output:OK: All email received
# perfdata:trip=60
/perfdata:trip=(\d+)/
and push @s, [ trip, [ 'trip_time', GAUGE, $1 ] ];
}}}
Perl gurus will recognize this immediately, but it bears some explanation. Basically, it fires off a regex for “trip=(\d+)” within the performance data, where “\d+” matches one or more digits. Then it adds the value captured by that regex (“(\d+)”, now $1) into a database called “trip”, associated with the key “trip_time”.
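If Perl isn’t your thing, the same two steps (split the plugin output at the vertical bar, then pull the number out of the performance data) can be sketched in Python. This is only an illustration of the idea, not NagiosGraph’s code: NagiosGraph matches against text carrying the “output:” and “perfdata:” prefixes shown in the stanza’s comments, and the function names here are invented.

```python
import re

# Same capture as the map stanza, applied to the perfdata part only.
TRIP_RE = re.compile(r"trip=(\d+)")


def split_output(line):
    """Split plugin output at the first '|' into (text, perfdata)."""
    text, _, perfdata = line.partition("|")
    return text.strip(), perfdata.strip()


def trip_seconds(line):
    """Return the trip time in seconds, or None if there is no trip= value."""
    _, perfdata = split_output(line)
    match = TRIP_RE.search(perfdata)
    return int(match.group(1)) if match else None


print(trip_seconds("OK: All email received|trip=60"))  # prints 60
```

Output without a vertical bar has no performance data at all, so trip_seconds() returns None for it and nothing would be graphed.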
!So what?
Cacti users will recognize that NagiosGraph doesn’t offer you all the power that’s out there. In fact, for some things, like machine room temperature, we still use homebrew graphing utilities to milk more oomph out of them. But for the user who already has an extensive Nagios installation, wants to save him- or herself a bundle of work, and needs quick, utilitarian graphs to spot trends and potential problems, NagiosGraph provides a compelling solution.