The problem
For one customer, we had a major scaling issue with distributed monitoring and NSCA. The initial setup was one master and 5 slaves, each using send_nsca to send passive service check results back to the master. This is the standard setup, with an ocsp_command along the lines of the submit_check_result script.
But we started to see some bad figures in the Nagios performance. The average Check Latency was showing 9.5 seconds, which seemed far too long. On the master, we could see 50+ nsca daemon processes, though they didn’t appear to be doing anything.
The revelation
The revelation came when we looked on the slave. At any one time, there was only one send_nsca running! So even though the service checks were being run in parallel, it looked like ocsp_commands were being sent serially. This had to be our bottleneck.
The solution
So we wrote a script called send_nsca_cached to cache the passive check results. The idea is that the script takes the results as usual, but writes them to a cache file instead of running send_nsca. The cache file holds a start time. When a result arrives after the start time + cache period, send_nsca is invoked once and sends all the cached passive results in a single batch.
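To make the idea concrete, here is a minimal sketch of that caching logic in Python. It is not the actual send_nsca_cached script: the cache file path, the cache period, the master host name and the function names are all assumptions made for illustration. It relies on the service macros that Nagios 2.0+ exports as environment variables and on send_nsca reading tab-separated results from standard input. There is no file locking, on the assumption (from what we saw on the slave) that only one ocsp_command runs at a time.

```python
import os
import subprocess
import time

CACHE_FILE = "/tmp/send_nsca.cache"  # hypothetical location
CACHE_PERIOD = 5                     # seconds; hypothetical value
# Host name and config path are placeholders; customise for your site.
SEND_NSCA = ["send_nsca", "-H", "nagios-master", "-c", "/etc/nagios/send_nsca.cfg"]


def format_result(env):
    """Build one tab-separated result line in the form send_nsca reads on stdin."""
    fields = [
        env["NAGIOS_HOSTNAME"],
        env["NAGIOS_SERVICEDESC"],
        env["NAGIOS_SERVICESTATEID"],
        env["NAGIOS_SERVICEOUTPUT"],
    ]
    return "\t".join(fields) + "\n"


def send_batch(batch):
    """Pipe every cached result to a single send_nsca process."""
    subprocess.run(SEND_NSCA, input=batch, text=True, check=True)


def cache_result(line, now, cache_file, period, send):
    """Add one result to the cache; flush once the cache period has elapsed.

    Returns True if the batch was sent, False if it was only cached.
    """
    if os.path.exists(cache_file):
        with open(cache_file) as fh:
            start = float(fh.readline())  # first line holds the start time
            batch = fh.read()             # the rest is cached results
    else:
        start, batch = now, ""
    batch += line
    # ">=" rather than ">" so that a period of 0 sends immediately.
    if now - start >= period:
        send(batch)
        if os.path.exists(cache_file):
            os.remove(cache_file)
        return True
    with open(cache_file, "w") as fh:
        fh.write(f"{start}\n{batch}")
    return False


def main():
    """Entry point a wrapper used as the ocsp_command could call."""
    line = format_result(os.environ)
    cache_result(line, time.time(), CACHE_FILE, CACHE_PERIOD, send_batch)
```

Note that a flush only ever happens inside a call to cache_result, so nothing is sent until the next result arrives. That is the reason for the "busy slave" warning further down.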
We put the script on the slave and could see that the cache file would fill in spurts – 10 entries looked to be written within half a second, but then nothing for a few seconds. Nagios does some tricks to try and spread the service check load, but I wonder if the “traffic jam” of sending the uncached way was causing the services to be bunched up together.
When we checked again an hour later, the maximum Check Latency had dropped to under 1 second and the master had only 9 nsca daemons. And I guess it is much better for network load as well to send a whole bunch of data at once, rather than a single message at a time.
The warnings
There had to be some bad points.
- This script is only for Nagios 2.0+ because of the use of environment variables
- We don’t support passive host checks. Not sure if this is a good or bad thing
- Do not use this if your slave is not busy. As send_nsca_cached needs to be invoked in order to send results, if your ocsp_command is only invoked once every minute, then the quickest you will get a batch result sent to the master is every minute, regardless of your cache time. So only use this script on a busy slave. You could use a cache time of 0 to be the same as sending immediately
- Don’t make the cache time too large. The results have no timestamps, so when Nagios on the master receives them, it will process them as if the checks happened just then. Also, if there is too much data being sent, you could fill the command pipe on the Nagios master
- On that point, make sure the master Nagios server has command_check_interval=-1 in nagios.cfg, so that the command pipe is read as often as possible. There is a known limitation: if the pipe fills up, processes writing to the pipe will hang until more space is available
The future
That last point about the command pipe is being (partially) addressed in Nagios 3.0. Ethan said at the Nagios Conference in Germany that there will be a new external command called PROCESS_FILE. The idea is that nsca can drop a file containing passive check results on the master and then put only one command into the pipe, which tells Nagios to process that entire batch.
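As a rough illustration of how that could look from the writer's side, here is a Python sketch that puts a batch of results into a file and then writes a single PROCESS_FILE line to the command pipe. It assumes the command takes the form PROCESS_FILE;<file_name>;<delete>, as in the Nagios 3 external command documentation, and that the batch file holds ordinary PROCESS_SERVICE_CHECK_RESULT lines. The function names and any paths you pass in are hypothetical, and this is not how nsca itself is implemented.

```python
import time


def passive_result_command(host, service, code, output, now):
    """One external command line carrying a passive service check result."""
    return f"[{int(now)}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{code};{output}\n"


def process_file_command(path, now, delete=True):
    """The single line for the command pipe; delete=True asks Nagios to
    remove the batch file after processing it."""
    return f"[{int(now)}] PROCESS_FILE;{path};{int(delete)}\n"


def submit_batch(results, batch_path, command_pipe, now=None):
    """Write all results to batch_path, then one command to the pipe.

    results is an iterable of (host, service, return_code, output) tuples.
    """
    now = time.time() if now is None else now
    with open(batch_path, "w") as fh:
        for host, service, code, output in results:
            fh.write(passive_result_command(host, service, code, output, now))
    # However large the batch, only this one short line enters the pipe.
    with open(command_pipe, "w") as pipe:
        pipe.write(process_file_command(batch_path, now))
```

The point of the design is that the size of the batch no longer matters to the pipe: a thousand results still cost one short line.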
The real solution to the third warning above (the slave has to be busy) is to let the caching be done in Nagios itself, rather than externally, and that is also on the radar for Nagios 3.0. So there is lots to look forward to there. But if you want something now, check out our script. It’s not a perfect script: some values are hard coded in various places and you will need to customise the send_nsca command, but we hope it helps you regardless. Enjoy!
The end?
Not quite. At the Nagios Conference, Ethan was talking to two guys who were complaining that their distributed setup had huge slowdowns. I overheard, and the symptoms sounded exactly the same, so I gave them a copy of the script. Apparently it helped, but they had some lock ups in Nagios, which they attributed to our script, so caveat emptor. They have since reverted to the standard uncached mechanism.
We haven’t had any issues at our customers, so we’re interested in what you find. If you have a distributed environment with similar symptoms and you are thinking of using this script, please take a note of your Check Latency and the number of nsca daemons and add a comment to this blog with some before and after statistics. We’d love to know if this works elsewhere. Good luck!