Five Things I Hate About Zabbix

Everyone has their favorite monitoring system, or at least they've selected the one they currently hate the least (often one they've never used before).  Zabbix is … not mine.  Why, you ask?

  1. Configuration is all database.
  2. … except for the bits that aren't (proxies)
  3. Scalability
  4. Bad data handling
  5. Customizability

When the configuration is all in the database, it becomes difficult to do lots of things that you would normally expect to be able to do such as revision control and easily repeated configurations.  Sure, you can dump the relevant sections of the database into one or more SQL files and keep those in your favorite VCS, but the dumped SQL is really not very helpful if you want to find what changed, when, and who changed it.  Yes, there are audit logs which you can use for blame, but they won't help you get back to the previous state, either.  Further, if the new state involves removing data points for a check, well, the data is gone unless you do a full database restore and copy the relevant data points over to the production database (not an operation for the faint of heart).  The consistent configuration issue is even more of a bear.  The system is designed for point-and-click operation, which lends itself to error, particularly when you need to do the same things over and over.  There is a clone feature, and it's a real boon to use, but you still end up hand-editing several fields each time you use it.

 

Proxies are rather odd ducks.  You create an entry in Zabbix for a proxy, and on the relevant proxy host you then alter the configuration so that it identifies itself as the new proxy.  So, what's wrong with this, you ask?  Nothing, except for the bit where there's no record in the Zabbix database of which server matches a given proxy.  Again, why does this matter?  Let me offer a real-world example.  You want your Bacula server to report the status of each backup job to Zabbix.  To make the reports useful, you want them to be associated with the clients that are being backed up.  However, if a host is assigned to a proxy, Zabbix will only accept data for your client from its proxy!  And how do you find out which server that proxy is?  Good question.

 

Scalability has always been one of Zabbix's ugly points.  Simply tuning your database to handle 4k inserts/second isn't enough, as each new data item actually turns into several inserts (pre-2.2).  Further, your history table is going to need to be sharded by time, possibly daily.  Zabbix is working on providing a Cassandra backend to work around this, but as of today, it's simply not there.  Finally, the high insert rate into the database frequently makes the web UI painfully slow.  Yes, you can add PCI SSDs, and sharding, and lots of memory, but eventually you will hit the wall, and it hurts.  All that said, I will admit that I'm working from a rather larger environment than most (20k+ systems/1M+ items), so perhaps you won't find this as painful.

 

Bad data handling is kind of an ugly one.  I've run into countless orphaned data items and triggers from deletions that somehow skipped certain hosts.  Worse, though, are the cases where you add an item to a template and it propagates to only 200 out of 205 hosts.  How do you know that all of your hosts have all of the items?  You don't!  The only way to actually know that your configuration is complete is to use the API to query all of the items, triggers, and graphs for each host and compare them with the relevant templates.  The amount of fun to be had doing this is not describable in professional terms.
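
For what it's worth, the comparison half of that audit is the easy part.  Here is a minimal sketch in Python, assuming you have already pulled the item keys for a template and for each of its hosts.  The function names and the sample host names are mine, not anything Zabbix ships; item.get is the API method I'd expect to use for the fetch, and authentication is deliberately left out because how you pass it depends on your Zabbix version.

```python
from typing import Dict, Set


def item_get_request(host_id: str, request_id: int = 1) -> dict:
    """Build a JSON-RPC body asking for the item keys of one host.

    Hypothetical helper: authentication is omitted on purpose, and the
    caller still has to POST this to the API endpoint.
    """
    return {
        "jsonrpc": "2.0",
        "method": "item.get",
        "params": {"output": ["key_"], "hostids": host_id},
        "id": request_id,
    }


def missing_items(template_keys: Set[str],
                  host_keys: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Return {host: keys the template defines but the host lacks}.

    Hosts that carry every template key are left out of the result.
    """
    gaps = {}
    for host, keys in host_keys.items():
        absent = template_keys - keys
        if absent:
            gaps[host] = absent
    return gaps


if __name__ == "__main__":
    # Made-up data standing in for what the API would return.
    template = {"vfs.fs.size[/srv,pfree]", "system.cpu.load"}
    hosts = {
        "web01": {"vfs.fs.size[/srv,pfree]", "system.cpu.load"},
        "web02": {"system.cpu.load"},
    }
    for host, absent in sorted(missing_items(template, hosts).items()):
        print(host, "is missing", sorted(absent))
```

The same set difference works for triggers and graphs once you have picked something stable to compare on.  The painful part remains fetching everything for every host.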

 

The deal breaker for me, though, is customizability.  Let's say, for instance, that your standard filesystem layout has /srv as a mountpoint where you want to warn at 10% free and alarm at 5% free.  Now, if all of your servers are virtually identical, you have no problem.  However, what if you have a group of servers where disk usage tends to grow rather quickly so that you want to set the thresholds at 40% and 30%?  Guess what?  You have to use a separate template for that, *and* the standard /srv items cannot exist in any of the other templates attached to the host!  In my case, I ended up with seven templates: one for each of six special cases, and one for the normal hosts.  Then you come back to the auditing problem: which of the hosts don't have any of those seven templates attached to them?  Another customizability case is screens.  It is not uncommon to want to have a screen that shows, e.g., /data/bacula for all of your Bacula storage servers.  Zabbix provides no way of collecting all of the (template-created!) graphs from a group of servers.
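
The "which hosts have none of the seven templates" audit boils down to a set intersection once you have a host-to-templates mapping in hand.  A rough sketch, assuming you have already built that mapping (from the API or from SQL); the function name, host names, and template names below are all invented for illustration:

```python
from typing import Dict, List, Set


def hosts_without_any(host_templates: Dict[str, Set[str]],
                      candidates: Set[str]) -> List[str]:
    """Return, sorted, the hosts linked to none of the candidate templates."""
    return sorted(
        host
        for host, templates in host_templates.items()
        if not (templates & candidates)
    )


if __name__ == "__main__":
    # Hypothetical names standing in for the seven /srv threshold templates.
    srv_templates = {"srv-default", "srv-fast-growth"}
    linked = {
        "web01": {"os-linux", "srv-default"},
        "db01": {"os-linux"},
        "bk01": {"srv-fast-growth"},
    }
    print(hosts_without_any(linked, srv_templates))
```

With the sample data above, only db01 is reported, since it is the one host with no /srv template at all.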

 

The bottom line with Zabbix is that if you have a small to mid-sized installation, and are willing to write the API and SQL tools needed to ensure both coverage and data integrity, you can end up with a decent monitoring system.  If you have a large installation, if change management is important, or if monitoring is central to your business, you should probably look elsewhere.