Jan 15

It has been some time since we last talked about SNMP trap handling, but there have been some major developments. Recall we use the Perl module SNMP::Trapinfo to process an incoming trap. We think this works really well, but there was a major piece of functionality our customer wanted:


Complex calculation of whether a trap passes a test

And by complex, we mean complex. Here’s an example trap:


dastardly.altinity.net
10.243.196.251
SNMPv2-MIB::sysUpTime.0 119:2:04:40.34
SNMPv2-MIB::snmpTrapOID.0 CERENT-454-MIB::remoteAlarmIndication
CERENT-454-MIB::cerent454NodeTime.0 20060814114937D
CERENT-454-MIB::cerent454AlarmState.9216.remoteAlarmIndication notAlarmedNonServiceAffecting
CERENT-454-MIB::cerent454AlarmObjectType.9216.remoteAlarmIndication ds1
CERENT-454-MIB::cerent454AlarmObjectIndex.9216.remoteAlarmIndication 9216
CERENT-454-MIB::cerent454AlarmSlotNumber.9216.remoteAlarmIndication 2
CERENT-454-MIB::cerent454AlarmPortNumber.9216.remoteAlarmIndication port36
CERENT-454-MIB::cerent454AlarmLineNumber.9216.remoteAlarmIndication 0
CERENT-454-MIB::cerent454AlarmObjectName.9216.remoteAlarmIndication DS1-2-36-7
SNMP-COMMUNITY-MIB::snmpTrapAddress.0 216.243.196.251
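
A trap in this form is easy to pull apart by hand: the first line is the sending hostname, the second its IP address, and each remaining line is an OID name followed by its value. The sketch below only illustrates that layout. parse_trap is a hypothetical helper, not the interface of SNMP::Trapinfo.

```perl
use strict;
use warnings;

# Hypothetical helper: split trap text into hostname, IP address and
# a hash of OID name => value. This is not the SNMP::Trapinfo API.
sub parse_trap {
    my ($text) = @_;
    my @lines = grep { /\S/ } split /\n/, $text;   # drop blank lines
    my %trap = (
        hostname => shift @lines,
        ipaddr   => shift @lines,
        vars     => {},
    );
    for my $line (@lines) {
        # The OID name runs up to the first whitespace; the rest is the value.
        my ($oid, $value) = split ' ', $line, 2;
        $trap{vars}{$oid} = defined $value ? $value : '';
    }
    return \%trap;
}

my $example = <<'END';
dastardly.altinity.net
10.243.196.251
SNMPv2-MIB::snmpTrapOID.0 CERENT-454-MIB::remoteAlarmIndication
CERENT-454-MIB::cerent454AlarmSlotNumber.9216.remoteAlarmIndication 2
END

my $trap = parse_trap($example);
print "$trap->{hostname} slot: ",
    $trap->{vars}{'CERENT-454-MIB::cerent454AlarmSlotNumber.9216.remoteAlarmIndication'},
    "\n";
```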

Our customer wanted to be able to say: “Give me a critical alert if cerent454AlarmState.9216.remoteAlarmIndication is not ‘cleared’ and the cerent454AlarmSlotNumber is greater than 5”. Well, this was impossible with our previous setup. I still don’t know why it is called Simple Network Management Protocol…

We sat down to think about this and then realised we probably need an arbitrary way of evaluating an SNMP trap, but the last thing we wanted to do was write a syntax parser. That would involve a whole new language, all the parsing work involved, etc, etc. This would take months of work!

Looking for inspiration, we realised OpenNMS claims this type of functionality. We downloaded a copy and tried to install it, but hit loads of prerequisites. We’re very lazy – we should evaluate other technologies, but if something is too much of a pain to install, then we’ll give up right away!

Undeterred, we went for the next best thing – their documentation! Searching around, we found the section on evaluating traps. It appears that OpenNMS have a table called events, which is a list of all the things that happened. Then there are various filters which evaluate against those events to work out whether something needs to be alerted on. SNMP traps are converted into this event format and dropped into that table.

(As an aside, Nagios holds no such processing logic. All that complicated processing is handled by the plugins. Nagios only cares about the result. This is a feature :) )

Then the beauty of OpenNMS’ design dawned on us: rules are expressed as SQL statements.

Let me repeat that again: rules are just SQL statements. If the SQL evaluates to 1, then an alert is raised, otherwise ignored. Fantastic! This does away with all the “design your own syntax” work, with a clear, recognised language! No duplication of work!

So the above requirement could be met with a rule in OpenNMS (we think! We haven’t actually tried this!) that says:

(cerent454AlarmState != 'cleared') & (cerent454AlarmSlotNumber > 5)

which equates to a SQL statement like:

SELECT ipaddr
FROM ipinterface
WHERE ipaddr in (SELECT ipaddr FROM ipinterface, node
WHERE cerent454AlarmState != 'cleared'
AND ipinterface.nodeid =node.nodeid)
AND ipaddr in (SELECT ipaddr FROM ipinterface, snmpInterface
WHERE cerent454AlarmSlotNumber > 5
AND ipinterface.ipaddr = snmpInterface.ipaddr);

But we couldn’t do that with SNMP::Trapinfo – no SQL database. Tacking on DBI.pm support would be terrible. But then it hit us – why not use Perl? Most sysadmins know Perl syntax and it would allow useful functionality like regular expressions, which are not as powerful in SQL.

How do we express the SNMP trap variables? Well, we already have that in SNMP::Trapinfo – macros. ${CERENT-454-MIB::cerent454AlarmState.9216.remoteAlarmIndication} evaluates as notAlarmedNonServiceAffecting in the example trap, but instead of making it a line to display, wrap it up in some perl code:

"${CERENT-454-MIB::cerent454AlarmState.9216.remoteAlarmIndication}" eq "cleared"

(These Cerent devices also make it difficult to find a specific variable because they encode the object index number, 9216, into the OID name. Sigh – no one said SNMP had to be Simple or consistent. To overcome this, we introduced the idea of a wildcard for an OID tuple, so the above could be written as "${CERENT-454-MIB::cerent454AlarmState.*.remoteAlarmIndication}" eq "cleared". There are some issues if multiple OIDs match this name, but we assume that only one matches…)
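
To make the wildcard idea concrete, here is a minimal sketch of macro expansion over a hash of trap variables. The helpers expand_macros and lookup are hypothetical names, and the choices made here (a "*" matches exactly one dot-separated component, the first match in sorted order wins, unknown names expand to nothing) are assumptions for illustration rather than a description of how SNMP::Trapinfo does it.

```perl
use strict;
use warnings;

# Hypothetical helper: replace each ${NAME} in $template with the value of
# the matching trap variable. Unknown names expand to the empty string.
sub expand_macros {
    my ($template, $vars) = @_;
    $template =~ s/\$\{([^}]+)\}/lookup($1, $vars)/ge;
    return $template;
}

# Hypothetical helper: find a variable by exact name, or by a name in
# which "*" stands for one dot-separated component.
sub lookup {
    my ($name, $vars) = @_;
    return $vars->{$name} if exists $vars->{$name};
    # Quote the literal parts, then let each "*" match one component.
    my $pattern = join '[^.]+', map { quotemeta($_) } split /\*/, $name, -1;
    # Sorted so the choice is stable if several OIDs happen to match.
    for my $oid (sort keys %$vars) {
        return $vars->{$oid} if $oid =~ /\A$pattern\z/;
    }
    return '';
}

my %vars = (
    'CERENT-454-MIB::cerent454AlarmState.9216.remoteAlarmIndication'
        => 'notAlarmedNonServiceAffecting',
);
print expand_macros(
    '"${CERENT-454-MIB::cerent454AlarmState.*.remoteAlarmIndication}" eq "cleared"',
    \%vars,
), "\n";
```

The rule is kept in single quotes so that Perl does not try to interpolate ${...} itself before the expansion runs.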

There’s a new method in SNMP::Trapinfo called eval. This evaluates the string as a snippet of Perl code and gets the return value. There are three possible results that come back from the eval:

  • 1 = true – the Perl snippet runs and evaluates as true
  • 0 = false – the Perl snippet evaluates as false
  • undef = error – the Perl code did not run correctly (most likely because of a syntax error)

This last case can happen if the variable name does not exist in the incoming trap. For instance, ‘${CERENT-454-MIB::cerent454AlarmSlotNumber.*.remoteAlarmIndication} > 5’ would expand to ‘ > 5’, which is not valid Perl code.
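
The three outcomes can be sketched with Perl's own string eval. This is only an illustration of the idea: evaluate_rule is a hypothetical function, it expects a rule whose macros have already been expanded, and it runs the code with no sandboxing at all.

```perl
use strict;
use warnings;

# Hypothetical helper: evaluate an already-expanded rule string and map
# the outcome to 1 (true), 0 (false) or undef (the code failed to run).
# Plain string eval with no restrictions, so only for trusted rules.
sub evaluate_rule {
    my ($code) = @_;
    my $result = eval $code;
    return undef if $@;          # syntax error or runtime die
    return $result ? 1 : 0;
}

for my $rule ('"notAlarmed" ne "cleared" && 7 > 5', '2 > 5', ' > 5') {
    my $r = evaluate_rule($rule);
    printf "%-40s => %s\n", $rule, defined $r ? $r : 'undef';
}
```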

So our way of expressing the required rule is:


"${CERENT-454-MIB::cerent454AlarmState.9216.remoteAlarmIndication}" ne "cleared" && ${CERENT-454-MIB::cerent454AlarmSlotNumber.9216.remoteAlarmIndication} > 5

We have a basic wrapper script: if this code returns true, it sends a passive check to Nagios.
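
For readers who have not submitted a passive check before, the following sketch shows the shape of that last step. It uses the standard Nagios external command PROCESS_SERVICE_CHECK_RESULT. The function names and the idea of passing the command file path in as an argument are assumptions for illustration, not our actual wrapper script.

```perl
use strict;
use warnings;

# Nagios service states used as the return code of a passive check.
my %STATE = (OK => 0, WARNING => 1, CRITICAL => 2, UNKNOWN => 3);

# Build one external command line in the format Nagios expects:
# [time] PROCESS_SERVICE_CHECK_RESULT;host;service;return_code;output
sub passive_check_line {
    my ($host, $service, $state, $output, $time) = @_;
    $time = time unless defined $time;
    $output =~ s/[\r\n]+/ /g;    # the command must stay on one line
    return sprintf "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n",
        $time, $host, $service, $STATE{$state}, $output;
}

# Hypothetical wrapper step: append the command to the Nagios command file.
sub submit_passive_check {
    my ($command_file, @args) = @_;
    open my $fh, '>>', $command_file or die "Cannot open $command_file: $!";
    print {$fh} passive_check_line(@args);
    close $fh or die "Cannot close $command_file: $!";
}

# Print the line instead of submitting it, so this runs without Nagios.
print passive_check_line('dastardly.altinity.net', 'Interface', 'CRITICAL',
    'remoteAlarmIndication on slot 7');
```

In a real setup submit_passive_check would be given the path of the command file configured for Nagios, which depends on the installation.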

One final thing: we have a front end application to configure the Perl snippet. This input is obviously tainted. We don’t necessarily know what the code contains, so it could do things like system('rm -fr $HOME'). We added the Safe module, so the snippet is now restricted to running only specific operators, such as comparisons, regexps and mathematical functions. Good security lets us sleep at night :)
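
As a rough illustration of what the Safe module buys you, the sketch below evaluates a rule inside a compartment with Safe's default operator mask. The function name evaluate_rule_safely is hypothetical, and the default mask is an assumption made to keep the example short: it is more permissive than a hand-picked list of operators, but it already refuses system(), backticks and open().

```perl
use strict;
use warnings;
use Safe;

# Hypothetical helper: run a rule inside a Safe compartment and return
# 1 (true), 0 (false) or undef (refused or failed to run).
# Safe->new starts with Opcode's ":default" set, which allows comparisons
# and arithmetic but traps system(), backticks and open().
sub evaluate_rule_safely {
    my ($code) = @_;
    my $compartment = Safe->new;
    my $result = $compartment->reval($code);
    return undef if $@;          # trapped operator, syntax error or die
    return $result ? 1 : 0;
}

for my $rule ('"notAlarmed" ne "cleared" && 7 > 5', 'system("true"); 1') {
    my $r = evaluate_rule_safely($rule);
    printf "%-40s => %s\n", $rule, defined $r ? $r : 'undef';
}
```

A tighter compartment can be built with the module's permit_only method and an explicit list of operator tags, at the cost of having to test carefully which operators your rules really need.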

SNMP::Trapinfo is now released on CPAN. We use this for our SNMP trap processing and we think it works fantastically well. And this continues our aim of making the base portions of Opsview as solid as possible.

Jun 07

Last time we looked at how to get SNMP traps received into Nagios. This time we’ll show how Opsview handles their configuration.

Recall that the new design is:

  • SNMP packet received by snmptrapd
  • snmptrapd’s traphandle calls snmptrap2nagios
  • snmptrap2nagios, if applicable, will write to the Nagios command file

In Opsview, we use a web interface to configure the traps we are interested in. On this screen, we define the traps we expect to receive.
list_traps.png


Each trap has an alert level and a message. The message can use macros which are supported by our perl module SNMP::Trapinfo. You define them on this screen.

define_trap.png


If desired, you can deny this trap, so when the trap gets to snmptrap2nagios, it will be discarded. It is possible to deny the trap at snmptrapd, but we haven’t done that, as you normally have to be root to change snmptrapd’s configuration file. However, this would be a worthwhile enhancement if lots of traps are received.

But that’s not all! If a trap is allowed, you can then select to process it or ignore it at the host level. Here’s the configuration screen you get for defining your service check. This service check is then associated with a host.
defining_servicecheck.png

What is the defer option? Well, we thought there are 3 possible actions when a trap is received:

  1. You want to process it
  2. You want to ignore it
  3. You weren’t expecting it!

So defer means you haven’t said either way. This is the default for any new traps.

When this servicecheck has been linked to a host, Opsview will then configure Nagios with a service check called Interface which will accept the traps linkUp and linkDown. The state of this servicecheck in Nagios will change depending on the alert level defined for these traps.

So we have 2 levels of filtering:

  • globally deny a trap
  • ignore a trap on a host basis
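
Put together with the defer default, the two levels amount to a small decision function. The sketch below is one way to picture it. The data layout (a hash of denied trap names, and a hash of per-host rules holding 'process' or 'ignore') and the name trap_action are hypothetical, not Opsview's schema.

```perl
use strict;
use warnings;

# Hypothetical data model: %$denied lists globally denied traps, and
# %$host_rules maps host => trap name => 'process' or 'ignore'.
sub trap_action {
    my ($denied, $host_rules, $host, $trapname) = @_;
    return 'deny' if $denied->{$trapname};           # level 1: global deny
    my $rules = $host_rules->{$host} || {};          # level 2: per host
    my $rule  = $rules->{$trapname};
    return 'defer' unless defined $rule;             # nobody has decided yet
    return $rule eq 'process' ? 'process' : 'ignore';
}

my %denied     = (coldStart => 1);
my %host_rules = (switch1 => { linkUp => 'process', linkDown => 'process' });
for my $trapname (qw(coldStart linkDown warmStart)) {
    print "$trapname: ", trap_action(\%denied, \%host_rules, 'switch1', $trapname), "\n";
}
```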

What if, say, a Security trap arrives for a host that does not have the service associated with it? This is an exception, which needs manual intervention. Opsview has a table called snmptrapexceptions which stores all the traps that snmptrap2nagios hasn’t been told what to do with. When we thought about it, there were 5 distinct error conditions:

  • A trap was received with no valid trapname
  • A trap was received where the trapname was not fully translated – this usually means a MIB file has not been loaded
  • A trap was not recognised – it has not been defined
  • A trap was received, but it was not expected for this host (defer)
  • A trap was received for a host that is not defined to Nagios
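
The five conditions can be read as a chain of checks. The following sketch is purely illustrative: the argument names, the order of the checks and especially the test for a trapname that was "not fully translated" (here, a name ending in two or more numeric components) are assumptions, not the checks snmptrap2nagios actually performs.

```perl
use strict;
use warnings;

# Hypothetical classifier for the five exception conditions.
# Returns a reason string, or undef if the trap is not an exception.
sub exception_reason {
    my (%arg) = @_;
    my $name = $arg{trapname};
    return 'no valid trapname' if !defined $name || $name eq '';
    # Assumed heuristic: leftover numeric OID components mean the MIB
    # that names this trap was not loaded.
    return 'not fully translated' if $name =~ /(?:\.\d+){2,}\z/;
    return 'trap not defined' if !$arg{defined_traps}{$name};
    return 'host not known to Nagios' if !$arg{known_hosts}{ $arg{host} };
    my $rules = $arg{host_rules}{ $arg{host} } || {};
    return 'not expected for host' if !defined $rules->{$name};
    return undef;
}

my %config = (
    defined_traps => { 'IF-MIB::linkDown' => 1, 'IF-MIB::linkUp' => 1 },
    known_hosts   => { switch1 => 1 },
    host_rules    => { switch1 => { 'IF-MIB::linkDown' => 'process' } },
);
my $reason = exception_reason(%config, host => 'switch1', trapname => 'IF-MIB::linkUp');
print defined $reason ? "exception: $reason\n" : "no exception\n";
```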

We have a screen which shows all the exceptions and then gives operational options on what actions to take next.
exceptions_list.png


Notice there is a Promote Mib button. When we distribute Opsview, we put all our known MIBs into /usr/local/nagios/snmp/all. However, there is a penalty with loading unnecessary MIBs. So we configure snmpd to only load MIBs in the default area and /usr/local/nagios/snmp/load.

When you click Promote Mib, we use a perl module called Net::Dev::MIBLoadOrder, which can tell you which MIB a specific OID belongs to. We then copy that MIB file into /usr/local/nagios/snmp/load and restart snmpd. This is one major administrative headache reduced!

Once you’ve redefined the actions you want, we tell Opsview to reprocess all the snmptrap exceptions based on the new rules (but no passive checks are submitted to Nagios). This shrinks the exceptions table, and an administrator can continue to set new rules until there are no exceptions left.

Astute readers may be wondering: “what happens if I receive a bad trap (linkDown) and then a good trap (linkUp) on the same service check?” The answer is that the bad trap will make the service go CRITICAL/WARNING, but the good trap will then make the service OK. This means the bad state may potentially get lost. We’ve decided to accept this limitation, rather than use the is_volatile or stalking_options settings. However, we’re in discussions with Ethan to see if we can enhance Nagios to cope with this type of event.

Our aim is to make Opsview as easy to use as possible, while we continue to improve Nagios and the general open source universe, through software or knowledge sharing. We hope this gives you an insight into how you can get SNMP traps working with Nagios.

Nagios © 1999-2011 Nagios Enterprises LLC. Nagios, the Nagios logo, and Nagios graphics are the servicemarks,
trademarks, or registered trademarks owned by Nagios Enterprises, LLC. All Rights Reserved.
Opsview © 2008-2011 Opsera Ltd. Opsview, the Opsview Logo, and Opsview graphics are the
trademarks or registered trademarks owned by Opsera Limited. All Rights Reserved.