Aug 24

The Nagios Plugins project recently released a new version. Among the changes is a new feature which we added for a customer. The requirement was to measure the rate of change for SNMP counters. The standard check_snmp plugin is great at getting information, but only at a specific moment in time. For some things, you want to check and alert on the rate of change. There are a lot of interesting metrics that you can get from SNMP which are Counter32 or Counter64 values. An example is IP-MIB::ipInAddrErrors.0. This counts the number of packets that are discarded due to invalid IP addresses – probably due to network errors or an attempt at infiltrating your network.
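
To make the rate idea concrete, here is a minimal sketch in Python of turning two counter samples into a per-second rate. It is not how check_snmp implements the feature; the function name and parameters are hypothetical, and it assumes the counter wrapped at most once between the two samples.

```python
def counter_rate(prev_value, prev_time, curr_value, curr_time, bits=32):
    """Return the per-second rate of change between two counter samples.

    bits is 32 for Counter32 and 64 for Counter64.
    """
    elapsed = curr_time - prev_time
    if elapsed <= 0:
        raise ValueError("samples must be taken at increasing times")
    delta = curr_value - prev_value
    if delta < 0:
        # Assumption: the counter wrapped exactly once. A device reboot
        # would look the same from here and give a misleading rate.
        delta += 2 ** bits
    return delta / elapsed
```

For example, a Counter32 that moves from 4294967286 to 20 in 10 seconds has advanced by 30, so the rate is 3 per second.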

Continue reading »

Jul 14

In a standard Nagios plus database implementation, you use NDOutils to store information in a database. While we think NDOutils is fantastic, there are some major limitations with it as you monitor more hosts. With Opsview, we want to scale. We’ve already done lots of work with NDOutils, including adding view-like helper tables, updating the database asynchronously, improving indices and speeding up the time to load the configuration at a Nagios reload. Now we want to share an amazing improvement we’ve discovered.

Continue reading »

Oct 09

At the heart of Opsview is the Nagios monitoring engine. One of the policies we have with Opsview is to keep the number of changes of our dependent software as low as possible. We do this by keeping track of all the patches we apply and pushing these back upstream (though recently we haven’t had as much time as we’d like…).
Continue reading »

Aug 21

We have a long history of working with Nagios. Most of our technical staff has used Nagios for many years – both here at Opsera and at previous employers. Our Opsview product is based on Nagios because we believe it is the best open source monitoring engine available.

Continue reading »

Feb 03

Back in October 2008, Opsera acquired Altinity on the strength of the Opsview product and our customer and user base. It was a great marriage, as Opsera were providing IT consultancy and wanted to expand into products; Opsview now sits alongside Ops Mail Manager in that range. And there are some really smart people working at Opsera, which is always a good thing!

Continue reading »

Aug 05

NRPE is great for getting plugin information from a remote host. We wanted to use it to get passive data regarding events, such as syslog entries that SEC had highlighted. This meant we needed two things: multi-line support and larger amounts of output.

Continue reading »

Mar 13

Michael Prochaska was having trouble with compiling NDOutils on Solaris 10. Since we have an interest in getting Opsview working on Solaris (the upcoming 2.12 release will add Solaris 10 as a supported platform), we offered to help. So this is the result of his company, Bacher Systems, sponsoring our work.

Continue reading »

Jan 15

Netflow is a great feature of Cisco IOS that allows you a view into the traffic that flows over your Cisco network devices, what that traffic is, where it came from and where it is going.

We wanted to make good use of this information and so we started looking for a way for Opsview to monitor it.

With a little configuration of IOS and some open source magic we achieved just that. Now our Opsview servers are keeping tabs on the data moving across our Cisco devices.

So true to our open source way of life we published our setup as part of the Opsview documentation.

Jan 08

In our continual task to try and speed up Opsview, we found a bug in NSCA’s handling of aggregate writes when run in --single mode.

The specific failure scenario is this:

  1. NSCA and Nagios are told to start up
  2. A send_nsca request is received by NSCA before Nagios has created the nagios.cmd command pipe
  3. NSCA tries to open the command file for writing, but sees it is not there
  4. NSCA opens the alternate dump file instead

Now when Nagios does create the nagios.cmd file, NSCA uses that … unless aggregate mode is on and daemon mode is --single. In this case, it continues to use the alternate dump file, thus Nagios doesn’t see the results from the slaves.
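
The failure mode can be modelled in a few lines of Python. This is a toy model of the behaviour described above, not NSCA code and not the patch itself; the class and its names are hypothetical. The point is simply that a writer which picks its target once keeps using the dump file forever, while one that re-checks on every batch switches back as soon as the command pipe appears.

```python
class ResultSink:
    """Toy model of where incoming check results get written."""

    def __init__(self, recheck_each_batch):
        # recheck_each_batch=False models the buggy behaviour,
        # recheck_each_batch=True models the intended behaviour.
        self.recheck_each_batch = recheck_each_batch
        self.target = None

    def write_batch(self, pipe_exists):
        """Return the file this batch of results would be written to."""
        if self.target is None or self.recheck_each_batch:
            self.target = "command_file" if pipe_exists else "dump_file"
        return self.target
```

With the buggy setting, a first batch that arrives before the pipe exists pins every later batch to the dump file, which matches the symptom of Nagios never seeing the slave results.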

Here’s the patch, which we’ve also added into our source for Opsview.

As we are very keen on good testing, we’ve managed to recreate the failing behaviour in a test script. You also need a test configuration file and a patch to the test framework. If you run this test, it will show the error and then after the patch is applied, the test should pass.

Sep 29

With Nagios 3 rapidly approaching and Opsview celebrating being a full open source project (GPL licensed, source code repository online, Sourceforge project), we think it is time to share some of our Nagios patches.

These are the latest patches you can find for Opsview within our code repository. Some are Opsview specific, but a lot can be incorporated into the core code – we’ll say which is which. You can see all these on our SVN site (we’ve even tagged the current version so this will stay in our repository), but here’s the lowdown:

Freshness checking, with separated file and tests

If there’s a patch we definitely want to have applied to the core code, it will be this one. Not because of freshness checking per se (though we’ll explain why later), but because of the included libtap tests.

As much as we love Nagios, we’re always a bit concerned that regressions may occur. We have complete faith in Ethan, but he’s human and unintended effects may occur. In fact, we made one when we originally suggested this freshness patch, so with testing, future changes should hopefully not cause regressions.

It requires work to add in tests and to separate out files, and we intend to stick to our commitment to add new tests in. But the framework needs to be put in place to encourage other tests, otherwise the overhead for Altinity is too high.

Think of tests like this: the code is the generalised form; the test is under specific conditions. The key is to try and get more and more conditions to prove that things work as expected.

Have a look at the test – we think it is easy to see what is being tested here (to get it from our svn repository, you need to extract the tarball). And note how comprehensive it is – we think every case is considered. A change in logic anywhere will be immediately spotted.

We’ve refactored how the calculated freshness_threshold is arrived at so that we can run tests against it.

There’s also an arbitrary 15 seconds added to the freshness threshold. We’ve made that a new variable called additional_freshness_latency in nagios.cfg, so you can tweak it without recompiling Nagios.
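
As a rough illustration of what the new variable controls, the sketch below models the threshold in Python. The formula (check interval plus latency plus the extra allowance, used only when no explicit threshold is configured) is an assumption about the general shape of the calculation, not a copy of the Nagios source, and the function names are hypothetical.

```python
def freshness_threshold(configured_threshold, check_interval, latency,
                        additional_freshness_latency=15):
    """Return the number of seconds after which a result counts as stale.

    All arguments are in seconds. A configured_threshold of 0 means
    "work it out automatically" in this sketch.
    """
    if configured_threshold > 0:
        return configured_threshold
    return check_interval + latency + additional_freshness_latency


def is_stale(last_check, now, threshold):
    """True when the last result is older than the threshold."""
    return now - last_check > threshold
```

Pulling the calculation out into its own function like this is what makes it testable in isolation: a test can feed in specific intervals and latencies and assert the exact threshold that comes back.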

Could be applied to core. Please :)

More freshness tweaks

Another thing we found was that Nagios is very fast at reading 10,000 services (5 seconds), but slows down dramatically with NDOutils integrated (2 minutes). It appears to read the configuration and then send it to the broker modules. Since NDOutils is updated synchronously, Nagios waits while MySQL runs the necessary SQL. We’ve updated the freshness code by introducing a new variable called monitoring_start. This is when Nagios actually starts monitoring, as opposed to program_start, which is the HUP time. This gives us a better idea of how long Nagios takes to start up.

We’ve written a little plugin that returns performance data about the startup time.

Also, we’ve pushed the threshold forward a little bit more to include the max_host_check_spread/max_service_check_spread, which is important for new services.

We’ve updated the tests to reflect the changes. Patches on top of other patches get really hard to maintain, which is why we need the libtap tests integrated into the core code.

Could be applied to core.

Initial passive state as OK

This is one where we change the Nagios CGIs to show passive states as OK. We just like everything green.

We don’t expect this in core.

Issue commands

This has been applied to Nagios 3.

Status link to Nagiosgraph

This helps with our integration to Nagiosgraph.

We don’t expect this in core.

Passive checks do not check host

We’ve discussed this before.

We don’t expect this in core.

Ignore certain retained data

We’ve mentioned this before on the nagios-devel mailing list. Ethan has made changes to Nagios 3 to support this behaviour.

Adding a time=X to the statusmap

With the AJAX goodies we have in Opsview, we found the statusmap wasn’t updating correctly. It appears that some browsers use cached data for an XMLHttpRequest if the URL is the same. We’ve added a time=X parameter to the URL so that it is always unique.
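
The cache-busting trick itself is generic. Here is a minimal sketch in Python of adding a time parameter to a URL; it is only an illustration of the technique (the real change lives in the Nagios CGI and the Opsview front end), and the function name and the example URL are hypothetical.

```python
import time
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_time_param(url, now=None):
    """Return url with a time=<epoch seconds> query parameter.

    Any existing time parameter is replaced, so repeated calls do not
    pile up duplicates. Pass now explicitly to get a predictable result.
    """
    stamp = int(time.time() if now is None else now)
    parts = urlsplit(url)
    query = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key != "time"]
    query.append(("time", str(stamp)))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

Because the value changes from one request to the next, the browser sees a different URL each time and cannot answer from its cache.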

This is AJAX specific, so we don’t expect this in core.

W3 validation: history.cgi

We’re big on valid HTML. Partly this is because we wrap the CGI output and remove the use of framesets in Opsview. However, it means the HTML has to be valid. We found several problems in the validity of history.cgi and other CGIs below.

A great tool is HTML Validator, which runs as a Firefox plugin – this tells you if your HTML is valid.

Could be applied to core.

Escalation via notification levels

This is an extra field to the contact stanza where you can specify that they will receive notifications only after the Nth notification. This makes it an easy way of doing escalations.
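
The selection rule can be sketched in Python as follows. The field name notification_level and the data layout are hypothetical stand-ins for illustration; they are not the names used in the patch.

```python
def contacts_to_notify(contacts, notification_number):
    """Return the names of contacts due a notification.

    contacts is a list of dicts with a "name" and an optional
    "notification_level" (hypothetical field). A contact with level N
    is only notified from the Nth notification onwards; contacts with
    no level set are notified from the first.
    """
    return [contact["name"]
            for contact in contacts
            if notification_number >= contact.get("notification_level", 1)]
```

With an on-call engineer at level 1 and a manager at level 3, the first two notifications go to the engineer alone and the third goes to both, which gives a simple escalation without separate escalation definitions.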

We’ve spotted an issue where if no notifications are sent, the notification number doesn’t get incremented. Maybe this is best as a different macro.

Could be applied to core, but requires a bit more thinking.

Documentation patches for validation

The use of markup caused problems for us, so we’ve fixed some of the docs.

Could be applied to core.

W3 validation: extinfo.c

We’ve fixed some validation errors with divs. Have you seen HTML Validator? :)

Could be applied to core.

Trust authentication

This patch stops the Author box from being altered by the logged in user. Ethan has applied something similar to Nagios 3.

Already in Nagios 3.

Slice services within hosts

This patch allows a contact in a contactgroup to only see a subset of services. Normally, a contact to a host sees all the services, but this allows the contact to only see the services specified.

This is possible by setting the contact to not have the host in the contactgroup, but then that stops the contact from taking action on that host.

This could be applied to core, but is a (relatively) major change to the use of contactgroups.

Extinfo icon links to service notes

We find that users click on the extinfo icon and then get a bit worried when nothing happens. We make it a clickable link. There are a few validation fixes here too (these should really be separated out).

Could be applied to core.

Object dump

This is a good one! As you know, we love tests. One thing we do with Opsview is make sure that the configuration being generated is the same after we’ve tinkered with the rules. We tried to find a good way of doing this – initially we thought about using Nagios::Object to read the config data and then do a diff to find the changes. However, this didn’t take into account all the relationships.

What we really wanted was some expanded form of the config files.

It then hit us – Nagios already does this! It uses the object.cache file as an expanded version of the configuration objects for the CGIs to use. So we’ve patched the core nagios executable so that -o now writes this cache file to stdout and then exits. It works great in our testing.
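
Once you have two dumps, comparing them is straightforward. Here is a minimal sketch in Python using the standard difflib module. It assumes the two dumps have already been captured as strings (how you capture them is up to you), and the function name is hypothetical.

```python
import difflib


def config_diff(before, after):
    """Return unified-diff lines between two expanded config dumps.

    An empty list means the generated configuration did not change.
    """
    return list(difflib.unified_diff(
        before.splitlines(),
        after.splitlines(),
        fromfile="before",
        tofile="after",
        lineterm="",
    ))
```

A regression test can then assert that the list is empty after a change to the generation rules that is not supposed to alter the output.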

Could be applied to core.

Retain status file over a reload

In our quest to make Nagios more friendly, there’s nothing worse than getting the dreaded “Nagios is not running” screen on your browser. This patch adds a new command line option -F, for fast-reload.

It does two things:

  • It doesn’t delete the status file on a HUP signal. This gives the impression that Nagios is still running even though no new status information is being updated. We think this is acceptable – after all, CGIs are displaying the “latest” data, it just so happens that there is no update at this precise moment. The status file age doesn’t change, so nagiostat will show that the data is getting older, but it removes that scary screen
  • We ignore the pre-flight check. As part of Opsview, we validate the config before we send a HUP signal, so this is redundant. Along with the long startup times for Nagios, we find this makes Nagios a lot more responsive for large scale systems

Could be applied to core, possibly as two different command line options.

Check command by time period

This is a nice feature which we’ve discussed before. We have customers asking to run a different command based on a timeperiod. The most obvious use is altering the thresholds for the load of a server – a server may run batch work overnight thus increasing its load.
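
The idea can be sketched in Python like this. It only illustrates the selection logic; the data layout, the function name and the example check_load arguments are hypothetical, and the patch itself works on Nagios timeperiod objects rather than plain clock times.

```python
import datetime


def command_for(now, overrides, default_command):
    """Pick the check command that applies at the clock time now.

    overrides is a list of (start, end, command) tuples of
    datetime.time values. A period whose start is later than its end
    is treated as wrapping past midnight.
    """
    for start, end, command in overrides:
        if start <= end:
            inside = start <= now < end
        else:
            inside = now >= start or now < end
        if inside:
            return command
    return default_command
```

For the batch-server example, a single override from 22:00 to 06:00 with looser load thresholds is enough; outside that window the normal command applies.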

Could be applied to core.

Using relative path names for config files

We run tests internally on new versions of Opsview, trying to prove that our generated config files do not change unexpectedly. One thing we hit was the use of full path names in nagios.cfg. This meant we either had to change the path on the fly or move directories around.

This patch allows the use of a relative path. The path is taken as relative to the directory that holds nagios.cfg. We find this works really well.

There is a dependency on dirname(), which will probably have to be changed to a cross platform implementation.
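
The resolution rule is simple enough to show in a few lines of Python. This is a sketch of the rule as described above, not the C code from the patch; the function name and example paths are hypothetical, and it assumes a Unix-like system.

```python
import os.path


def resolve_config_path(main_config, value):
    """Resolve a path found inside the main config file.

    Absolute paths are returned unchanged. Relative paths are taken as
    relative to the directory that holds main_config.
    """
    if os.path.isabs(value):
        return value
    base_dir = os.path.dirname(os.path.abspath(main_config))
    return os.path.normpath(os.path.join(base_dir, value))
```

This is what lets a whole configuration tree be copied to a scratch directory for testing without rewriting any paths inside it.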

Could be applied to core.

Making forcecheck option

By default, force check is on when you Reschedule an active check. In a distributed environment where you have a “set to stale” script as the active check, this is not wanted. We change it so that the form enables it only if the field is passed through.

We then alter some of the links so that the field is off by default based on whether the service is actively checked.

Could be applied to core.

Add hosts to hostgroup in same order

We make the members field in the contactgroups stanza optional. What this means is that we can add the members of a contactgroup via the contact instead. This turns out to be significantly faster in our configuration generation scripts. Thus we also remove the error in the nagios configuration about the stanza information.

We also add the contacts into the list in the same order as they are processed. When it was added in reverse order, our tests were failing because the order was not preserved.

Could be applied to core.

Handle initial state

In NDO, if a service starts up in an error state, a state change is recorded. However, if a service starts up in OK, a state change is not. This patch will cause a state change to occur.

Technically it is a state change from PENDING to OK, so it should be recorded. This helps us in the NDO nagios_statehistory table, which we’ll discuss more in a future blog post.

Could be applied to core.

Validation error in statusmap cgi

An incorrectly placed </form> caused problems with our AJAX screens. This fixes it. Did we mention HTML Validator?

Could be applied to core.

Latency values for passive checks

While working on freshness checking, we discovered that the latency values were incorrect. In fact, looking in the NDO db told us this. This fixes the calculation.

Could be applied to core.

Do not resend retained status to NDO

On startup, Nagios writes all the current host/service status to NDO. However, the database already knows this. This causes problems on large scale systems.

A side effect is that if NDO is switched on after Nagios has been running for a long time, each object needs to have a new status result before NDO sees it, but this is probably acceptable.

Another impact is that other future broker modules might want the retained status information, so maybe this is best implemented at the broker level, but we couldn’t see an easy way of passing only this particular case.

This also has an impact on NDO, so there’s a patch required there.

Could be applied to core.

Segfault when processing no output

We had a big problem with a customer’s system where it was crashing occasionally. We had to analyse coredumps and eventually found the problem: on the master server, if the plugin output is only “|” for a passive host check, then sometimes a segfault would occur.

We think this is related to parsing the plugin output, but only if passive checks are processed with a backtrace from check_host.

Anyway, we’ve fixed it by changing the algorithm for parsing the plugin output. Our guess is that strtok is causing the problem, but we really don’t understand why. Sigh.
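
To show the kind of explicit split we mean, here is a small sketch in Python. It is an illustration of the approach only, not the C code from the patch, and the function name is hypothetical. The useful property is that the degenerate input “|” yields two empty strings rather than a missing value. For comparison, C’s strtok returns NULL when the string consists only of delimiters, which is one way an unchecked pointer could arise, though we have not confirmed that this was the cause here.

```python
def split_plugin_output(raw):
    """Split raw plugin output into (text, perfdata) at the first "|".

    Both parts are always strings, so callers never have to handle a
    missing value, even for empty or pipe-only output.
    """
    text, _separator, perfdata = raw.partition("|")
    return text.strip(), perfdata.strip()
```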

With this patch, our customer’s Nagios has not crashed for a month – so we’re safe!

Could be applied to core.

Returning passive latency values in nagiostats

With the fix to the passive latency values, we then want to find out what the values are for passive latency over a long period of time.

This patch updates nagiostats.

Could be applied to core.

Is that all?

Yes, for now! We’ve made lots of changes to Nagios over the last 12 months, which we think are suitable for core. Sorry for not publishing them sooner.

If you want to have an Altinity compiled version of Nagios, just do this:

cd /tmp
svn export http://svn.opsview.org/opsview/trunk/opsview-base
cd opsview-base
make nagios

This will patch Nagios and run ./configure with our usual settings. There are some dependencies, such as autoconf and automake, but we’ll leave those for you to work out! You’ll get the exact version of Nagios that we use in Opsview – in fact, you’ll get it before our customers do!

We’ll do a similar Patch Day for NDO soon and talk about some of the performance tuning we’ve been doing for our large customers.

Enjoy!

Opsview © Opsera Limited 2010 All Rights Reserved
Nagios © 1999-2009 Ethan Galstad. Respective copyrights apply to third party source code
Opsview is a registered trademark of Opsera Limited. Nagios is a registered trademark of Nagios Enterprises. All Rights Reserved