For most companies, websites are their corporate face to the world. Any downtime can be costly, especially if the sites are used for e-commerce. Web monitoring checks can be set up quickly and easily in Opsview, giving you powerful alerting capabilities to check on crashed servers, website attacks and more. Here are 10 easy steps to set up website monitoring in Opsview:
We had a customer problem where there were hundreds of NRPE processes on one of their monitored servers. It was quite bizarre because strace wouldn’t attach to the processes. lsof said there was an established connection from the Nagios server, but when we looked on that server, netstat said there was no such connection! I’ve never seen anything like that before!
Well, the customer’s network team said there was significant packet loss between the Nagios server and the NRPE box (obvious really, when all services to that host were complaining about SSL handshakes). Syslog showed lots of errors too:
Oct 17 10:41:46 host nrpe[2300]: Error: Could not complete SSL handshake. 5
Oct 17 10:42:26 host nrpe[2317]: Could not read request from client, bailing out...
Oct 17 10:42:26 host nrpe[2317]: INFO: SSL Socket Shutdown.
It looks to me like the SSL handshake was probably continuing to retry, but the connection must have been severed because it took too long. When our client tried to ssh to the NRPE server, it took so long that he had to Ctrl-C to break out. We realised that NRPE should have some sort of timeout of its own too.
So we’ve created a patch to NRPE 2.5.2. There is a new parameter in nrpe.cfg called connection_timeout. NRPE now sets an alarm just before handling a connection and then resets it before running the check command. It would have been best to have a single alarm set over the entire session, but my_system sets its own alarm handler to make sure the command being executed does not exceed its timeout, so the two would clash. This problem is probably SSL only, but the patch applies the timeout to every connection regardless.
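The alarm-then-reset pattern can be sketched roughly as follows. This is an illustration in Python rather than the actual C patch: handle_connection, read_request, run_command and stalled_client are hypothetical names, the default of 300 seconds is an arbitrary placeholder, and the sketch returns a message where the real daemon process would exit. It assumes a Unix system and that the code runs in the main thread, since that is where Python delivers signals.

```python
import signal
import time


class ConnectionTimeout(Exception):
    """Raised by the SIGALRM handler when a connection takes too long."""


def _on_alarm(signum, frame):
    raise ConnectionTimeout()


def handle_connection(read_request, run_command, connection_timeout=300):
    """Guard the handshake/read phase with an alarm, then clear it.

    read_request and run_command are hypothetical callables standing in
    for NRPE's connection handling and command execution.
    """
    previous_handler = signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(connection_timeout)  # armed just before handling the connection
    try:
        request = read_request()      # e.g. SSL handshake plus reading the query
    except ConnectionTimeout:
        return "connection timed out"
    finally:
        signal.alarm(0)               # cleared before the command runs
        signal.signal(signal.SIGALRM, previous_handler)
    # The command is expected to enforce its own timeout (as my_system does
    # in NRPE), so no connection alarm is pending at this point.
    return run_command(request)


def stalled_client():
    """Simulate a peer that never finishes the handshake."""
    time.sleep(5)
    return "never reached"
```

Calling handle_connection(stalled_client, print, connection_timeout=1) gives up after about a second instead of hanging, which is the behaviour the new parameter is meant to provide.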
Testing on our customer’s servers, we found that check_nrpe returned "CHECK_NRPE: Error - Could not complete SSL handshake" as expected, and the nrpe daemon then died gracefully when it exceeded the connection_timeout parameter.
It would be too ironic for a monitoring system to cause a box to die, although I hear BMC Patrol has that feature.
