This Monday morning, we got lots of calls from users whose Opsview slave systems running Nagios were raising freshness alerts because checks weren’t being run within their specified period.
Suspiciously, this was also the weekend where clocks went back one hour due to daylight savings changes.
On our internal system, hosts and services that were meant to be scheduled at 5 minute intervals were instead marked to run on 26th October at 23:00, rather than on 25th October.
So there seems to be a problem within Nagios where the rescheduling has been adversely affected by the clock changes. We expect this will affect any countries still using Daylight Saving Time when they revert over the next few weeks.
This affects all active checks on all Nagios systems, not just a distributed environment.
We’re going to investigate this more deeply within Nagios because this definitely worked fine over the last few years. From the Nagios mailing lists, the bug seems to be present in Nagios 3.2.0, but not in Nagios 3.0.6. However, it has affected Opsview systems that run Nagios 3.0.6, and I think that is due to this patch which we brought forward from Nagios 3.2.0.
The workaround at the moment is to recheck all your hosts and services again. In Opsview 3.3.2, you can use the Mass recheck functionality.
Alternatively, you can use this script which we’ve written. Download it, put it in /usr/local/nagios/bin and run it. It will submit a SCHEDULE_HOST_CHECK and SCHEDULE_HOST_SVC_CHECKS for all hosts in your system, using the objects.cache file to get all the host names. There is a random 5 minute difference applied, so that not all the checks will run at the same time.
You may need to adapt the script if you use different paths, but it will work on all Opsview systems.
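For readers who want to see the idea rather than download anything, here is a minimal sketch of the approach in Python. It is not the script linked above. The paths in OBJECTS_CACHE and COMMAND_FILE are assumptions based on a default source install of Nagios, the helper names are ours, and applying one random offset per host (shared by the host check and its service checks) is a guess at the behaviour, so treat it as an illustration only.

```python
import random
import re
import time

# Assumed paths for a default source install; adjust them for your system.
OBJECTS_CACHE = "/usr/local/nagios/var/objects.cache"
COMMAND_FILE = "/usr/local/nagios/var/rw/nagios.cmd"
SPREAD_SECONDS = 300  # spread the rechecks over 5 minutes

# Matches "define host {" but not "define hostgroup {" and similar.
HOST_START = re.compile(r"^define\s+host\s*\{")


def host_names(cache_text):
    """Return the host_name of every 'define host' block in objects.cache."""
    names = []
    in_host = False
    for line in cache_text.splitlines():
        stripped = line.strip()
        if HOST_START.match(stripped):
            in_host = True
        elif stripped == "}":
            in_host = False
        elif in_host:
            parts = stripped.split(None, 1)
            if len(parts) == 2 and parts[0] == "host_name":
                names.append(parts[1])
    return names


def recheck_commands(hosts, now, spread=SPREAD_SECONDS, rng=random):
    """Build external command lines that reschedule each host and its services."""
    lines = []
    for host in hosts:
        # One random offset per host so the checks do not all run at once.
        check_time = now + rng.randint(0, spread)
        for command in ("SCHEDULE_HOST_CHECK", "SCHEDULE_HOST_SVC_CHECKS"):
            lines.append(f"[{now}] {command};{host};{check_time}")
    return lines


def main():
    """Read the host list and write the commands to the Nagios command pipe."""
    with open(OBJECTS_CACHE) as cache:
        hosts = host_names(cache.read())
    with open(COMMAND_FILE, "w") as pipe:
        for line in recheck_commands(hosts, int(time.time())):
            pipe.write(line + "\n")
```

Calling main() as a user that can write to the command pipe would submit the rechecks. The two helper functions can be tried on their own without a running Nagios.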
In a distributed environment, you only need to run this on the Opsview master server and the requests will be sent to slaves automatically.
Update: We’ve been testing this and have found the following results:
- If you use a timezone such as Europe/London, then the bug was triggered between 2009-10-25 23:00:00 and 2009-10-26 00:00:00
- If you use a timezone such as America/New_York, then the bug will be triggered between 2009-11-01 23:00:00 and 2009-11-02 00:00:00
- If you use UTC, then the bug will not be triggered
Basically, it occurs on the day when the time for the nagios daemon goes back an hour.
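To illustrate why only some timezones are affected, the following sketch measures how long the local calendar day is on the dates above. It does not reproduce the Nagios bug itself. It assumes Python 3.9 or later with the system timezone database available, and local_day_length_hours is a helper name made up for this example.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo


def local_day_length_hours(tz_name, year, month, day):
    """Hours that really elapse between local midnight and the next local midnight."""
    tz = ZoneInfo(tz_name)
    start = datetime(year, month, day, tzinfo=tz)
    # Adding a day moves the wall clock to the next local midnight.
    end = start + timedelta(days=1)
    # Convert to UTC before subtracting so the offset change is counted.
    elapsed = end.astimezone(timezone.utc) - start.astimezone(timezone.utc)
    return elapsed.total_seconds() / 3600


for name, date in [
    ("Europe/London", (2009, 10, 25)),
    ("America/New_York", (2009, 11, 1)),
    ("UTC", (2009, 10, 25)),
]:
    print(name, date, local_day_length_hours(name, *date))
```

On the day the clocks go back, the local day lasts 25 hours in Europe/London and America/New_York, while a UTC day is always 24 hours, which fits with the bug never being triggered under UTC.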
We’ve managed to recreate the bug in a libtap test, so we’re working on a fix to Nagios.
Update: We’ve applied a patch to core Nagios to fix this. Using our tests, we’ve found that the bug happens only between 23:00 and 00:00 on the day when the clocks go back.
Since it is an automated test, we ran it a million times, checking every half an hour, and … it started to fail … in 2038. This is the known Year 2038 problem. I think there will be changes to Nagios before then.
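The arithmetic behind that result can be sketched as follows. The starting point of 2009-10-25 00:00 UTC is an assumption made for illustration, since the test's actual start time is not stated here. The limit itself is the largest value a signed 32-bit time_t can hold.

```python
from datetime import datetime, timezone

MAX_32BIT_TIME_T = 2**31 - 1   # largest signed 32-bit timestamp
STEP_SECONDS = 30 * 60         # one check every half an hour
ITERATIONS = 1_000_000

# Assumed starting point for the run; the real test may have started elsewhere.
start = int(datetime(2009, 10, 25, tzinfo=timezone.utc).timestamp())

# Moment at which a 32-bit time_t runs out.
limit = datetime.fromtimestamp(MAX_32BIT_TIME_T, tz=timezone.utc)

# First half-hour step whose timestamp no longer fits in 32 bits.
steps_until_overflow = (MAX_32BIT_TIME_T - start) // STEP_SECONDS + 1

# Timestamp reached after the full million iterations.
last_step = start + ITERATIONS * STEP_SECONDS

print("32-bit time_t limit:", limit.isoformat())
print("first step past the limit:", steps_until_overflow)
print("run extends past the limit:", last_step > MAX_32BIT_TIME_T)
```

A million half-hour steps cover roughly 57 years, so a run starting in 2009 passes the 2038 limit about halfway through.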




I suspected the time change was the cause of my slaves being stale, thank you for the workaround!!
Thank you very much for your workaround script for this bug. Works nicely!
Thanks for the script! It works well on a Nagios install…which is useful until I get Opsview to work (on Solaris)
Thanks for the script. I saw this post but didn’t fully read it until this morning, when everything went to an unknown state at the same time. We aren’t quite up to the current version, so I’m glad you provided the mass recheck script.
Good job, keep up the great work.
Thanks for posting this. I’ve been trying all morning to figure out why my checks suddenly started ignoring their check_interval. I’m really surprised (and disappointed) that this is not mentioned on nagios.org anywhere.
Our master is working fine with the script, but the slave hasn’t picked up the changes. Will keep looking.
Fixed now. Thanks! Just needed to restart everything.