Jan 08

In our ongoing effort to speed up Opsview, we found a bug in NSCA’s handling of aggregate writes when run in --single mode.

The specific failure scenario is this:

  1. NSCA and Nagios are told to start up
  2. A send_nsca request is received by NSCA before Nagios has created the nagios.cmd command pipe
  3. NSCA tries to open the command file for writing, but sees it is not there
  4. NSCA opens the alternate dump file instead

Now when Nagios does create the nagios.cmd file, NSCA uses that … unless aggregate mode is on and daemon mode is --single. In this case, it continues to use the alternate dump file, so Nagios never sees the results from the slaves.
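To make the difference concrete, here is a rough Python sketch of the two behaviours. It is only an illustration: NSCA itself is written in C, this is not the patch, and the names StickyWriter, RecheckingWriter and pick_target are made up for the example.

```python
import os


def pick_target(command_file, dump_file):
    """Use the Nagios command pipe if it exists, otherwise the alternate dump file."""
    return command_file if os.path.exists(command_file) else dump_file


class StickyWriter:
    """Failing behaviour: the target chosen for the first write is kept forever."""

    def __init__(self, command_file, dump_file):
        self.command_file = command_file
        self.dump_file = dump_file
        self.target = None

    def target_for_write(self):
        if self.target is None:
            self.target = pick_target(self.command_file, self.dump_file)
        return self.target


class RecheckingWriter(StickyWriter):
    """Intended behaviour: keep looking for the command pipe until it appears."""

    def target_for_write(self):
        # Only re-check while we are still falling back to the dump file.
        if self.target != self.command_file:
            self.target = pick_target(self.command_file, self.dump_file)
        return self.target
```

If the first result arrives before nagios.cmd exists, both writers start on the dump file. Only the second one switches over once Nagios has created the pipe.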

Here’s the patch, which we’ve also added into our source for Opsview.

As we are very keen on good testing, we’ve managed to recreate the failing behaviour in a test script. You also need a test configuration file and a patch to the test framework. Run the test before applying the patch and it shows the error; once the patch is applied, the test should pass.

Nov 02

The problem

For one customer, we had a major scaling issue with distributed monitoring and NSCA. The initial setup was one master and 5 slaves, with the slaves using send_nsca to send passive service check results back to the master. This is the standard setup, with the ocsp_command set to something like the submit_check_result script.

But we started to see some bad figures in the Nagios performance. The average Check Latency was showing 9.5 seconds, which seemed far too long. On the master, we could see 50+ nsca daemon processes, though they didn’t appear to be doing anything.

The revelation

The revelation came when we looked on the slave. At any one time, there was only one send_nsca running! So even though the service checks were being run in parallel, it looked like ocsp_commands were being sent serially. This had to be our bottleneck.

The solution

So we wrote a script called send_nsca_cached to cache the passive check results. The idea is that the script takes the results as usual, but writes them to a cache file instead of running send_nsca. The cache file holds a start time, so when a result arrives after the start time + cache period, send_nsca is invoked and sends all the cached passive results at once.
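The caching idea can be sketched in a few lines of Python. This is a minimal sketch under our own assumptions, not the send_nsca_cached script itself: the function names, the cache file layout (start time on the first line), the master host name and the send_nsca.cfg path are all hypothetical. The NAGIOS_* environment variables and the tab-separated input that send_nsca reads are standard. There is no file locking here, which only holds up if results are handed over one at a time.

```python
import os
import subprocess
import time


def format_result(env):
    """Build one tab-separated send_nsca input line from Nagios environment variables."""
    return "\t".join([
        env["NAGIOS_HOSTNAME"],
        env["NAGIOS_SERVICEDESC"],
        env["NAGIOS_SERVICESTATEID"],
        env["NAGIOS_SERVICEOUTPUT"],
    ]) + "\n"


def cache_result(cache_path, line, cache_period, now=None):
    """Append one result; return the whole batch once the cache period is up, else None."""
    now = time.time() if now is None else now
    if not os.path.exists(cache_path):
        with open(cache_path, "w") as fh:
            fh.write(f"{now}\n")  # first line records when this batch started
    with open(cache_path, "a") as fh:
        fh.write(line)
    with open(cache_path) as fh:
        start = float(fh.readline())
        batch = fh.read()
    # ">=" so that a cache period of 0 sends every result straight away.
    if now >= start + cache_period:
        os.remove(cache_path)
        return batch
    return None


def send_batch(batch, master="master.example.com", config="/usr/local/nagios/etc/send_nsca.cfg"):
    """Pipe the whole batch into a single send_nsca run (host and path are placeholders)."""
    subprocess.run(["send_nsca", "-H", master, "-c", config], input=batch, text=True, check=True)
```

The ocsp_command would then call something like cache_result(path, format_result(os.environ), 5) and pass any returned batch to send_batch, so one send_nsca process carries many results.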

We put the script on the slave and could see that the cache file would fill in spurts – 10 entries looked to be written within half a second, but then nothing for a few seconds. Nagios does some tricks to try and spread the service check load, but I wonder if the “traffic jam” of sending the uncached way was causing the services to be bunched up together.

When we checked again an hour later, the maximum Check Latency had dropped to under 1 second and the master had only 9 nsca daemons. And I guess it is much better for network load as well to send a whole bunch of data at once, rather than a single message at a time.

The warnings

There had to be some bad points.

  1. This script is only for Nagios 2.0+ because of the use of environment variables
  2. We don’t support passive host checks. Not sure if this is a good or bad thing
  3. Do not use this if your slave is not busy. As send_nsca_cached needs to be invoked in order to send results, if your ocsp_command is only invoked once every minute, then the quickest you will get a batch result sent to the master is every minute, regardless of your cache time. So only use this script on a busy slave. You could use a cache time of 0 to be the same as sending immediately
  4. Don’t make the cache time too large. The results have no timestamps, so when Nagios on the master receives the results, it will process them as if the checks happened just then. Also, if there is too much data being sent, you could fill the command pipe on the Nagios master
  5. On that point, make sure the master Nagios server has command_check_interval=-1 in nagios.cfg, so that the command pipe is read as quickly as possible. There is a known limitation here: if the pipe fills up, processes writing to it will hang until more space is available

The future

That last point about the command pipe is being (partially) addressed in Nagios 3.0. Ethan said at the Nagios Conference in Germany that there will be a new external command called PROCESS_FILE. The idea is that nsca can drop a file containing passive check results onto the master and put just one command into the pipe, which tells Nagios to process that entire batch.
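As a rough illustration of that idea, here is a sketch in Python. It assumes the command takes the form PROCESS_FILE;<file_name>;<delete>, which is how the Nagios 3 external command documentation describes it, and that the batch file holds ordinary PROCESS_SERVICE_CHECK_RESULT lines. The function names and the idea of returning the command line rather than writing it to nagios.cmd are ours.

```python
def passive_result_command(timestamp, host, service, return_code, output):
    """One external command line submitting a passive service check result."""
    return f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{return_code};{output}\n"


def process_file_command(timestamp, file_name, delete=True):
    """The single line that goes into the command pipe for a whole batch."""
    return f"[{timestamp}] PROCESS_FILE;{file_name};{1 if delete else 0}\n"


def write_batch(file_name, results, timestamp):
    """Write all results to a batch file and return the one command that processes it.

    results is an iterable of (host, service, return_code, output) tuples.
    """
    with open(file_name, "w") as fh:
        for host, service, return_code, output in results:
            fh.write(passive_result_command(timestamp, host, service, return_code, output))
    return process_file_command(timestamp, file_name)
```

However many results are in the batch, only the one line returned by write_batch has to go through nagios.cmd, which is what takes the pressure off the pipe.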

The real solution to point (3) is to let the caching be done at Nagios, rather than externally, and that is also on the radar for Nagios 3.0. So there is lots to look forward to there. But if you want something now, check out our script. It’s not a perfect script because it’s hard coded in various places and you will need to customise the send_nsca command, but we hope it helps you regardless. Enjoy!

The end?

Not quite. At the Nagios Conference, Ethan was talking to two guys who were complaining that their distributed setup had huge slowdowns. I overheard, and the symptoms sounded exactly the same, so I gave them a copy of the script. Apparently it helped, but they had some lock ups in Nagios, which they attributed to our script – so caveat emptor. They have since reverted to the standard uncached mechanism.

We haven’t had any issues for our customers, so we’re interested in what you find. If you have a distributed environment with similar symptoms and you are thinking of using this script, please take a note of your Check Latency and the number of nsca daemons and add a comment to this blog with some before and after statistics. We’d love to know if this works elsewhere. Good luck!

Nagios © 1999-2011 Nagios Enterprises LLC. Nagios, the Nagios logo, and Nagios graphics are the servicemarks,
trademarks, or registered trademarks owned by Nagios Enterprises, LLC. All Rights Reserved.
Opsview © 2008-2011 Opsera Ltd. Opsview, the Opsview Logo, and Opsview graphics are the
trademarks or registered trademarks owned by Opsera Limited. All Rights Reserved.