Jan 11

We’ve just fixed a bug in Nagios® that an Opsview user reported to us. A change made in Nagios 3.2.2 caused a service alert to be written to nagios.log for every result that came back from a host that was down. These extra alerts were flooding Opsview’s event views.

Dec 07

The JNRPE server provides an open source Java implementation of the Nagios Remote Plugin Executor (NRPE). This is much more efficient for performing JMX checks than regular NRPE as you only need to start one JVM rather than a JVM instantiation per check (as performed by check_jmx invoking java -jar JMXQuery.jar).

Feb 03

Back in October 2008, Opsera acquired Altinity on the strength of the Opsview product and our customer and user base. It was a great marriage: Opsera were providing IT consultancy and wanted to expand into products, which now include Opsview alongside Ops Mail Manager. And there are some really smart people working at Opsera, which is always a good thing!

Aug 05

NRPE is great for getting plugin information from a remote host. We wanted to use it to get passive data regarding events, such as syslog entries that SEC had highlighted. This meant we needed two things: multi-line support and larger amounts of output.

Mar 13

Michael Prochaska was having trouble compiling NDOutils on Solaris 10. Since we have an interest in getting Opsview working on Solaris (the upcoming 2.12 release will add Solaris 10 as a supported platform), we offered to help. This is the result of his company, Bacher Systems, sponsoring our work.

Jan 15

NetFlow is a great feature of Cisco IOS that gives you a view into the traffic flowing over your Cisco network devices: what that traffic is, where it came from and where it is going.

We wanted to make good use of this information and so we started looking for a way for Opsview to monitor it.

With a little configuration of IOS and some open source magic we achieved just that. Now our Opsview servers are keeping tabs on the data moving across our Cisco devices.

So, true to our open source way of life, we published our setup as part of the Opsview documentation.

Jan 08

In our continuing effort to speed up Opsview, we found a bug in NSCA’s handling of aggregate writes when run in --single mode.

The specific failure scenario is this:

  1. NSCA and Nagios are told to start up
  2. A send_nsca request is received by NSCA before Nagios has created the nagios.cmd command pipe
  3. NSCA tries to open the command file for writing, but sees it is not there
  4. NSCA opens the alternate dump file instead

Now when Nagios does create the nagios.cmd file, NSCA uses that … unless aggregate mode is on and daemon mode is --single. In this case, it continues to use the alternate dump file, so Nagios doesn’t see the results from the slaves.
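The failure scenario above can be modelled in a few lines. This is only an illustrative sketch in Python, not NSCA's actual C code: the `ResultWriter` class and its `recheck` flag are hypothetical names, and ordinary files stand in for the command pipe and the alternate dump file. With `recheck=False` the writer sticks with whatever target it picked first, which is the behaviour described for aggregate mode; with `recheck=True` it looks for the command file again on every write.

```python
import os


class ResultWriter:
    """Toy model of a daemon choosing between a command file and a dump file."""

    def __init__(self, command_file, dump_file, recheck=True):
        self.command_file = command_file
        self.dump_file = dump_file
        self.recheck = recheck  # False reproduces the "sticky" fallback
        self.target = None      # path chosen on the previous write, if any

    def choose_target(self):
        # On the first write, or whenever re-checking is enabled,
        # prefer the command file if it exists, else fall back to the dump file.
        if self.target is None or self.recheck:
            if os.path.exists(self.command_file):
                self.target = self.command_file
            else:
                self.target = self.dump_file
        return self.target

    def write(self, line):
        path = self.choose_target()
        with open(path, "a") as handle:
            handle.write(line + "\n")
        return path
```

A writer created with `recheck=False` before the command file exists keeps returning the dump file path even after the command file appears, so the results never reach the consumer of the command file.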

Here’s the patch, which we’ve also added into our source for Opsview.

As we are very keen on good testing, we’ve managed to recreate the failing behaviour in a test script. You also need a test configuration file and a patch to the test framework. If you run this test, it will show the error and then after the patch is applied, the test should pass.

Jun 21

With Opsview, one of the big features is the simple distributed monitoring – you just select a drop down to associate a host with a slave server and then when you hit the Opsview reload button, all the Nagios configurations are generated as you’d expect (slaves monitoring, master with freshness checking, automatic distribution to slaves, synchronized reloading). It works amazingly well.

But one of the niggly issues we have is that some services go stale before we think they should. So we’ve been tweaking some of the algorithms for setting the freshness_threshold.

One situation we found was that a busy master can lose some slave results while it is being restarted (due to the infamous “command pipe full” limitation). So when the master comes back, it could have lost one polling cycle’s result from the slave, and it marks the service as stale before the slave has had a chance to send the next result.

So we patched it by adding the freshness_threshold to the program_start time instead of the service’s last check time, and we sent an email to the nagios-devel mailing list to let the developers know. The patch was accepted into Nagios 2.1. And we got fewer stale results – hooray!

Roll forward a year. Michelle Craft then discovered that this patch caused a problem – if you set a passive service to have a freshness_threshold of 1 day, but you restart Nagios every day, then the service never expires its freshness threshold. That’s a bad bug, and I’m quite ashamed that slipped through.
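To see why the two approaches behave so differently, here is a simplified model of the calculation. It is a sketch only, not the Nagios source: the function names are made up for illustration, and it assumes the patched version measures from whichever is later, the last check or the program start.

```python
DAY = 24 * 60 * 60  # seconds


def expiry_from_last_check(last_check, program_start, threshold):
    # Original behaviour: freshness is measured from the last result only.
    return last_check + threshold


def expiry_from_program_start(last_check, program_start, threshold):
    # Assumed model of the patched behaviour: a restart resets the clock.
    return max(last_check, program_start) + threshold


def is_stale(expiry, now):
    return now > expiry


# A passive service last reported at t=0 with a one-day threshold.
# Nagios is restarted daily; the latest restart was half a day ago.
now = 3 * DAY
last_restart = now - DAY // 2

stale_old = is_stale(expiry_from_last_check(0, last_restart, DAY), now)
stale_patched = is_stale(expiry_from_program_start(0, last_restart, DAY), now)
print(stale_old, stale_patched)
```

Under this model the original calculation reports the service as stale after three silent days, while the restart-based calculation never does, because each daily restart pushes the expiry time another day into the future.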

Fortunately, we had a solution. Ethan wrote a patch very quickly for Nagios 3, but we wanted something a bit more robust.

At Altinity, we’re big fans of testing. This is not because we like to test – heck, we hate testing as much as the next developer. But we hate regression and unintended consequences more. With the Nagios Plugins, there’s a really large set of tests that get run for every nightly build, with a nice web page that displays the state. One of the tools that makes it happen is libtap, a library written by Nik Clayton. It lets you test C code and emits the results in the same output format that Perl tests use. Apparently, a lot of FreeBSD tests are being written with libtap to prove there are no regressions.

There are some instructions on the Nagios Plugins site for installing libtap on your development servers.

So we’ve fixed this problem now by moving the freshness calculation algorithm into a separate file and then writing a small C program with dummy services and hosts to test that the right thresholds are being returned. The benefits were immediate – I found I had put a wrong bracket around an if statement when one of the tests failed.

The patch, which consists of a patch file, a new freshness.c file and a tarball for the new test directory, applies cleanly onto Nagios 2.9. You need to run autoconf afterwards. ./configure will detect the existence of libtap and compile the test executables. Then when you run make test, it should execute the test and make sure it works properly (you may need to export LD_LIBRARY_PATH=/usr/local/lib to get the libtap library detected properly at runtime).

Tests are hard to do, but worthwhile in the long run. I see it as making sure things still continue to work the same way you expect. And that has to be a good thing.

Hopefully this can be the start of some automated testing for our favourite open source monitoring system!

Apr 27

We have been asked by a customer if it is possible to change a check command for a service depending on the time of day.

Why would this be useful?

Well, suppose a server runs time-critical processes during the day and slow-running batch processes overnight. How can a service check report on CPU or memory usage without generating false alerts? Yes, you could write your own plugins that take account of the time and react accordingly. But these would have to be installed on each host for each service, the wealth of plugins from http://www.nagiosexchange.org/ could not easily be used, setting the system up would take longer, and it would all be much harder to maintain.

Instead, we have made changes to the service stanza within the Nagios configuration files to include a “check_timeperiod_command <timeperiod>,<command>” entry:

define service {
    host_name                 server1
    service_description       Free Widgets
    check_command             check_widget -w 40% -c 20%
    check_timeperiod_command  nonworkhours,check_widget -w 5% -c 2%
    .....
}

You get the idea….

check_command provides the default check for the day. During the nonworkhours period, the alternative command and arguments are used instead.
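The selection logic can be sketched roughly as follows. This is an illustration in Python rather than the patch itself: the function names are invented here, and the 18:00 to 08:00 window for nonworkhours is an assumed example, since the real period comes from your own timeperiod definitions.

```python
from datetime import time

# Assumed example window for the "nonworkhours" time period.
PERIODS = {"nonworkhours": (time(18, 0), time(8, 0))}


def parse_timeperiod_command(value):
    """Split '<timeperiod>,<command>' at the first comma."""
    name, command = value.split(",", 1)
    return name.strip(), command.strip()


def in_period(now, period):
    start, end = period
    if start <= end:
        return start <= now < end
    # The window wraps past midnight, e.g. 18:00 to 08:00.
    return now >= start or now < end


def pick_command(now, check_command, timeperiod_command, periods=PERIODS):
    """Return the alternative command inside the period, else the default."""
    name, alternative = parse_timeperiod_command(timeperiod_command)
    if in_period(now, periods[name]):
        return alternative
    return check_command


default = "check_widget -w 40% -c 20%"
override = "nonworkhours,check_widget -w 5% -c 2%"
print(pick_command(time(12, 0), default, override))
print(pick_command(time(23, 0), default, override))
```

At midday the default thresholds apply; at 23:00 the looser overnight thresholds are used instead, without a second plugin or a second service definition.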

This seems far too useful to the community to keep to ourselves, so we offer the patch for Nagios 2.8 here, for peer review and comments (all of which are very welcome).

And here is a patch for ndoutils 1.4b2 that goes with it.

Enjoy!

Update: Patches for Nagios 3.0.6 and NDOutils 1.4b7 are available

Apr 02

We’ve encountered some problems with MySQL detection in NDOUtils – it doesn’t work on one of our Red Hat servers. The specific problem is that the ceil function is not found, because -lm is missing from the list of libraries to add at link time:


utils.o(.text+0x14e): In function `ndo_dbuf_strcat':
: undefined reference to `ceil'
collect2: ld returned 1 exit status

Rather than adding that library in manually (along with the -lz library that we found earlier for Mac OS X), we should use information from mysql_config to construct the compile flags. However, this is a bit tricky because of the various permutations.

Fortunately, the Nagios Plugins have a solution already. They have an m4 file, called np_mysqlclient.m4, that detects mysql_config and passes the data it returns back to configure.

So we’ve patched NDOUtils so that it uses this m4 file now. To use it, you have to apply the patch to configure.in, add a new m4/ directory to the top level and copy np_mysqlclient.m4 into m4/. Then run:

aclocal -I m4
autoconf
./configure --with-mysql=DIR

The detection is the same as in the Nagios Plugins: ./configure will try to find mysql_config in DIR/bin/mysql_config, and otherwise will look for it in the PATH.
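The lookup order, and the idea of asking mysql_config for the flags rather than hard-coding -lm and -lz, can be sketched like this. It is a Python illustration of the approach, not a translation of np_mysqlclient.m4: the function names are hypothetical, and falling back to the PATH when DIR is given but has no mysql_config is an assumption about how the macro behaves. The `--cflags` and `--libs` options are the standard mysql_config switches.

```python
import os
import shutil
import subprocess


def find_mysql_config(prefix=None):
    """Look in PREFIX/bin first, then fall back to searching the PATH."""
    if prefix:
        candidate = os.path.join(prefix, "bin", "mysql_config")
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    # Returns None if mysql_config is not on the PATH either.
    return shutil.which("mysql_config")


def mysql_build_flags(mysql_config):
    """Ask mysql_config for compile and link flags instead of guessing them."""
    def ask(option):
        result = subprocess.run(
            [mysql_config, option], capture_output=True, text=True, check=True
        )
        return result.stdout.strip()

    return ask("--cflags"), ask("--libs")
```

Because the link line comes from mysql_config itself, platform differences such as a missing -lm or -lz are handled by the MySQL installation rather than by the package's configure script.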

The nice thing is that if the logic for detection needs to be enhanced, we can update the m4 file and propagate the changes back to the Nagios Plugins as well. So everyone wins!

There’s also a patch for CFLAGS in src/Makefile.in (which were getting overridden – presumably for testing), a small header change in config.h.in and some Makefile.in changes because make errors were getting lost by the cd .. command.

We’ve tested this on a Mac OS X server, a Debian Etch server, and 32-bit and 64-bit Red Hat, and it is looking good.

Unfortunately, it means deprecating the --with-mysql-inc and --with-mysql-lib configure options. Hopefully, you’ll see why this way is so much nicer.

Here’s the patch against CVS HEAD.

Update: Here’s the patch, reworked for NDOutils 1.4b3

Update: You can get the tarball with just this patch here

Nagios © 1999-2011 Nagios Enterprises LLC. Nagios, the Nagios logo, and Nagios graphics are the servicemarks,
trademarks, or registered trademarks owned by Nagios Enterprises, LLC. All Rights Reserved.
Opsview © 2008-2011 Opsera Ltd. Opsview, the Opsview Logo, and Opsview graphics are the
trademarks or registered trademarks owned by Opsera Limited. All Rights Reserved.