Oct 24

Setting up distributed monitoring in mission-critical production environments is a complex task; configuration can be challenging and mistakes costly. The Opsview Enterprise edition and the Opsview Syncmaster module make deploying an enterprise monitoring system easy and reduce the risks associated with migrating configuration objects from development to production environments. Here’s how: Continue reading »

Oct 03

This post outlines how to get SNMP traps from ESX hosts and monitor them in Opsview. The first part deals with configuring SNMP traps to get them working correctly with ESX hosts; part 2 tells you how to monitor them with Opsview.

The following steps worked on ESX 4.1. Results may differ on other versions. For simplicity, I used 10.0.0.1 as the IP address of my ESX host, and 10.0.0.99 for my SNMP trap handler. Continue reading »

Mar 24

SNMP is one of those useful but misunderstood technologies. I think it doesn’t help that the name is Simple Network Management Protocol, yet when you first start, you get hit by these ridiculous OIDs like .1.3.6.1.2.1.1.1.0 for the system description. It just doesn’t feel simple. Sigh.

However, every networking device manufacturer supports it – and there’s an open source network management system based on it – so we looked into how we could integrate SNMP into Opsview. Polling SNMP devices is already supported through active checks. The next step was receiving traps, which are passive by nature.

There’s a good article in Sysadmin magazine by Francois Meehan, where he describes how to get SNMP traps integrated into Nagios. His design is:

  • snmptrap received by snmptrapd
  • snmptrapd calls snmptt (snmp trap translator)
  • snmptt defines what alert levels each trap should take and then writes to syslog
  • SEC can handle correlation of events, but in this case is configured to read syslog and then pass any single event to a custom python script called snmptraphandling.py
  • snmptraphandling.py then puts an entry on Nagios’ command file based on the hostname and the alert level

That’s a lot of layers! I’m a big fan of the KISS approach, so we went further into how these things worked.

Snmptrapd is from the Net-SNMP project. Though there are other (mainly commercial) implementations, this seems to be the most popular. You configure snmptrapd to invoke a command, called a traphandle, when it receives an SNMP trap. The interface to the traphandle is simple: snmptrapd runs any executable you name and passes it, on stdin:

  1. the host name of the originating packet
  2. the ip of the originating packet
  3. the contents of the packet

An example packet:

cisco2611.lon.altinity
192.168.10.20
RFC1213-MIB::sysUpTime.0 0:18:14:45.66
SNMPv2-MIB::snmpTrapOID.0 IF-MIB::linkDown
RFC1213-MIB::ifIndex.2 2
RFC1213-MIB::ifDescr.2 "Serial0/0"
RFC1213-MIB::ifType.2 ppp
OLD-CISCO-INTERFACES-MIB::locIfReason.2 "administratively down"
SNMP-COMMUNITY-MIB::snmpTrapAddress.0 192.168.10.20
SNMP-COMMUNITY-MIB::snmpTrapCommunity.0 "public"
SNMPv2-MIB::snmpTrapEnterprise.0 CISCO-SMI::ciscoProducts.186
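As a rough illustration of how little a traphandle has to do with this input, here is a minimal Python sketch that splits it into the three parts listed above. The function name `parse_trap` and the shortened `SAMPLE` text are made up for this example, and it assumes the layout shown in the packet above (one `name value` pair per line).

```python
def parse_trap(text):
    """Split traphandle input into hostname, IP and a list of (name, value) pairs."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    hostname, ip = lines[0], lines[1]
    varbinds = []
    for line in lines[2:]:
        # Each remaining line is "<name> <value>"; split on the first space only
        # so that values such as "administratively down" stay in one piece.
        name, _, value = line.partition(" ")
        varbinds.append((name, value.strip().strip('"')))
    return hostname, ip, varbinds


# Shortened copy of the example packet, for illustration only.
SAMPLE = """cisco2611.lon.altinity
192.168.10.20
RFC1213-MIB::sysUpTime.0 0:18:14:45.66
SNMPv2-MIB::snmpTrapOID.0 IF-MIB::linkDown
RFC1213-MIB::ifDescr.2 "Serial0/0"
"""

if __name__ == "__main__":
    host, ip, varbinds = parse_trap(SAMPLE)
    print(host, ip)
    for name, value in varbinds:
        print(f"{name} = {value}")
```

In a real traphandle the text would come from `sys.stdin.read()` rather than a constant.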

However, snmptt’s documentation suggests that you run snmptrapd with the -On flag, which means “do not translate OIDs to names”.

So the above equivalent would be received by snmptt as:

cisco2611.lon.altinity
192.168.10.20
.1.3.6.1.2.1.1.3.0 0:18:13:59.95
.1.3.6.1.6.3.1.1.4.1.0 .1.3.6.1.6.3.1.1.5.3
.1.3.6.1.2.1.2.2.1.1.2 2
.1.3.6.1.2.1.2.2.1.2.2 "Serial0/0"
.1.3.6.1.2.1.2.2.1.3.2 ppp
.1.3.6.1.4.1.9.2.2.1.1.20.2 "administratively down"
.1.3.6.1.6.3.18.1.3.0 192.168.10.20
.1.3.6.1.6.3.18.1.4.0 "public"
.1.3.6.1.6.3.1.1.4.3.0 .1.3.6.1.4.1.9.1.186

The reason for this is that snmptt has its configuration file indexed by OID. If you do not use the -On flag, snmptt will translate back into OIDs before finding the right entry.

In order for snmptt to know the OIDs, you have to import MIBs into snmptt and then define what the message and alert level is, using the OID as the key. It will then give you a set of macros which you can use to define your message.

Here’s where we disagreed with snmptt’s design – why bother importing MIBs? Obviously, snmptrapd needs to understand MIBs and it does a good job of translating OIDs. Giving snmptt that MIB information too means maintaining MIB imports in two places.

When I get stuck trying to understand the point of something, I ask myself: What is the custom data? This is important because this data needs to be maintained, and it leads to the answer to: What is the value?

Snmptt’s value is that lookup between the OID and the message and alert level (and the default message is not that helpful – it takes the 1st line of the description of the MIB and adds the arguments at the end). This is called the snmptt_conf_files in their language, but I’ll call it the message catalogue.

But there is a performance impact with parsing the message catalogue. If snmptrapd calls a perl script which is reading this catalogue at every invocation, then there’s going to be a hit if there are lots of traps being received. This is why snmptt has a daemon mode. The last thing we want is another daemon!

So then we thought: “What about leaving snmptrapd to do the translation?” Instead of indexing by OID, we could index by the trapname itself. This leaves all the MIB information at the snmptrapd level – removing our administrative nightmare – and our glue code would just be text parsing, which perl, our tool of choice, is ideally suited for.
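To make the idea concrete, here is a minimal Python sketch of a catalogue keyed by trap name instead of OID. The catalogue entries, messages and function name are hypothetical; the state numbers are the standard Nagios return codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
# Hypothetical message catalogue: trap name -> (Nagios state, message).
CATALOGUE = {
    "IF-MIB::linkDown": (2, "Link down"),
    "IF-MIB::linkUp": (0, "Link up"),
}

# snmptrapd reports the trap's own name as the value of this variable.
TRAP_OID_KEY = "SNMPv2-MIB::snmpTrapOID.0"


def lookup(varbinds, catalogue=CATALOGUE):
    """Return (state, message) for the trap, or None if it is not catalogued."""
    for name, value in varbinds:
        if name == TRAP_OID_KEY:
            return catalogue.get(value)
    return None


if __name__ == "__main__":
    print(lookup([(TRAP_OID_KEY, "IF-MIB::linkDown")]))
```

Because the key is the translated name, no MIB knowledge is needed at this layer; an uncatalogued trap simply returns `None`.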

This message catalogue is precisely the type of Nagios configuration data that we want Opsview to excel at. In fact, snmptt missed a trick in that it doesn’t know which host/service to submit the passive check to. This is left to the snmptraphandling.py script, which simply keys the check on the hostname and then the alert level (so every host has three, and only three, services for SNMP traps).

Our traphandle, which we call snmptrap2nagios, therefore needs to:

  • be fast – it could be invoked hundreds of times a minute
  • process the textual data to convert to a message and an alert level
  • know which service on which host wants this alert
  • submit a passive check to Nagios
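The last requirement can be sketched in a few lines of Python. This is not the snmptrap2nagios code, only an illustration of the standard Nagios external command `PROCESS_SERVICE_CHECK_RESULT`. The service name "SNMP Traps" and the command file path are assumptions; the path depends on your installation.

```python
import time


def passive_check_line(host, service, state, output, now=None):
    """Format a PROCESS_SERVICE_CHECK_RESULT external command line."""
    timestamp = int(time.time() if now is None else now)
    # A newline would end the command early, so flatten the output to one line.
    output = " ".join(output.splitlines())
    return f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{state};{output}\n"


def submit(line, command_file="/usr/local/nagios/var/rw/nagios.cmd"):
    """Write one command to the Nagios command file (path is an assumption)."""
    # The command file is a named pipe, so opening it for writing truncates nothing.
    with open(command_file, "w") as handle:
        handle.write(line)


if __name__ == "__main__":
    print(passive_check_line("cisco2611.lon.altinity", "SNMP Traps", 2,
                             "Link down on Serial0/0"), end="")
```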

Since snmptt has some useful code regarding macros, we need to emulate that. This is generic information and is not tied to the rest of Opsview, so we’ve written this as a perl module called SNMP::Trapinfo and we’ve published this on CPAN.
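For readers unfamiliar with message macros, the following Python sketch shows the general idea of expanding placeholders in a message template from the trap's data. It does not reproduce SNMP::Trapinfo's interface; the macro names `${HOSTNAME}`, `${TRAPNAME}` and `${Vn}` (the n-th value in the packet) are illustrative assumptions.

```python
import re


def expand(template, hostname, trapname, values):
    """Replace ${HOSTNAME}, ${TRAPNAME} and ${Vn} in a message template."""

    def replace(match):
        key = match.group(1)
        if key == "HOSTNAME":
            return hostname
        if key == "TRAPNAME":
            return trapname
        if key[0] == "V" and key[1:].isdigit():
            index = int(key[1:]) - 1  # ${V1} is the first value
            if 0 <= index < len(values):
                return values[index]
        return match.group(0)  # leave unknown macros untouched

    return re.sub(r"\$\{(\w+)\}", replace, template)


if __name__ == "__main__":
    print(expand("${TRAPNAME} on ${HOSTNAME}: ${V2} (${V3})",
                 "cisco2611.lon.altinity", "IF-MIB::linkDown",
                 ["2", "Serial0/0", "administratively down"]))
```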

In Francois’ design, SEC was not used for any filtering so we’ve removed it. This removes the need to write to syslog as well.

So now the architecture looks like this:

  • SNMP packet received by snmptrapd
  • snmptrapd’s traphandle calls snmptrap2nagios
  • snmptrap2nagios, if applicable, will write to the Nagios command file

Much cleaner!

Stay tuned for the next post when we discuss how we handle filtering and exceptions.

Update: We forgot to credit Alex Burger for his work on SNMPTT, which lots of users appreciate. Also, Ethan has got a page on integration of SNMPtraps in the Nagios documentation which we didn’t see until recently.

Update: Part 2 posted here.

Feb 24

Continuing our run of useful SNMP OIDs…

One of the most commonly monitored statistics is filesystem usage. Here is how you do it with SNMP. All OIDs listed sit under the MIB-II tree, in the storage table of the Host Resources MIB.

Note: <int> is an integer corresponding to the filesystem number. Most systems will have multiple partitions / filesystems.

Description

.1.3.6.1.2.1.25.2.3.1.3.<int>

Description of the filesystem. On a Unix system, examples would be / or /home. Under Windows, expect C:\, D:\ etc.

Capacity

.1.3.6.1.2.1.25.2.3.1.5.<int>

Capacity of filesystem in blocks

Usage

.1.3.6.1.2.1.25.2.3.1.6.<int>

How many blocks are currently being used to store data

Blocksize

.1.3.6.1.2.1.25.2.3.1.4.<int>

Blocksize in bytes. Important because other stats are in blocks.

Maths

So to find capacity of filesystem in bytes you need to multiply the size in blocks with the block size. Same principle applies to calculating how much of the filesystem is in use.

If you want to display values in KB / MB / GB, remember to divide by 1024 each time.
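The arithmetic above, as a small Python sketch. The function names and the sample numbers are made up for illustration; the three inputs correspond to the Capacity, Usage and Blocksize values polled for one filesystem.

```python
def filesystem_usage(size_blocks, used_blocks, block_size):
    """Return (capacity in bytes, used bytes, percentage used)."""
    capacity_bytes = size_blocks * block_size
    used_bytes = used_blocks * block_size
    percent_used = 100.0 * used_blocks / size_blocks if size_blocks else 0.0
    return capacity_bytes, used_bytes, percent_used


def human(num_bytes):
    """Scale a byte count by dividing by 1024 until it fits the unit."""
    units = ["B", "KB", "MB", "GB", "TB"]
    value = float(num_bytes)
    for unit in units:
        if value < 1024 or unit == units[-1]:
            return f"{value:.1f} {unit}"
        value /= 1024


if __name__ == "__main__":
    capacity, used, percent = filesystem_usage(2621440, 1310720, 4096)
    print(f"{human(used)} of {human(capacity)} used ({percent:.0f}%)")
```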

Jan 19

This post assumes a basic knowledge of SNMP and describes MIB-II OIDs that are handy for monitoring network devices – mainly switches and routers. These OIDs should be present on all SNMP capable devices.

These OIDs all sit under the MIB-II hierarchy. The first three belong to the system group:

.iso.org.dod.internet.mgmt.mib-2.system.

.sysName.0

String containing system name, if configured. Useful for working out which device you are querying.


.sysLocation.0

String containing system location, if configured. Again, useful for working out which device you are querying.


.sysUpTime.0

System uptime in hundredths of a second. Useful for detecting recently restarted equipment. This counter actually runs from the time SNMP was started, but that is usually close to the system uptime.
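A short Python sketch of working with this value. The function names and the ten-minute restart threshold are arbitrary choices for the example.

```python
def format_uptime(timeticks):
    """Convert hundredths of a second into days, hours, minutes and seconds."""
    seconds, hundredths = divmod(timeticks, 100)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    days, hours = divmod(hours, 24)
    return f"{days}d {hours:02d}:{minutes:02d}:{seconds:02d}.{hundredths:02d}"


def recently_restarted(timeticks, threshold_seconds=600):
    """True if the device (or its SNMP agent) started within the threshold."""
    return timeticks < threshold_seconds * 100


if __name__ == "__main__":
    print(format_uptime(6568566))
    print(recently_restarted(6568566))
```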


NOTE:
For the following OIDs, <int> is an integer corresponding to the interface number. So to find the description of interface three you need to query ifDescr.3


.ifDescr.<int>

String containing interface description, eg:

  • FastEthernet0/1
  • Serial0/2
  • Loopback0


.ifType.<int>

Similar to ifDescr, but gives more specific technical information on the interface. Eg:

  • ethernetCsmacd
  • frameRelay
  • softwareLoopback

A full list of interface types can be found here:
http://www.iana.org/assignments/ianaiftype-mib


.ifSpeed.<int>

Speed of interface in bits per second.


.ifOperStatus.<int>

Operational status of interface – up or down. Whether the interface is actually connected or not.


.ifAdminStatus.<int>

Administrative status of interface – whether the interface has been configured to be up or down. (For Cisco: shutdown / no shutdown)
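The two status values are most useful when read together. The Python sketch below shows one possible way to turn them into a Nagios state. The numeric codes (1 = up, 2 = down) come from the MIB definitions, but the alerting policy and the function name are assumptions for this example.

```python
# Values shared by ifAdminStatus and ifOperStatus.
UP, DOWN = 1, 2


def interface_state(admin_status, oper_status):
    """Return (Nagios state, text) for an interface's admin/oper status pair."""
    if admin_status == DOWN:
        # Someone shut the interface down on purpose, so do not alert.
        return 0, "administratively down"
    if oper_status == UP:
        return 0, "up"
    if oper_status == DOWN:
        # Configured up but not working: this is the case worth alerting on.
        return 2, "down but configured up"
    return 3, f"operational status {oper_status}"


if __name__ == "__main__":
    print(interface_state(UP, DOWN))
```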


.ifInUcastPkts.<int>

Number of inbound unicast packets received. An entry also exists for outbound packets: ifOutUcastPkts. For traffic statistics it is necessary to monitor the change in this value over time.
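Here is a minimal Python sketch of turning two polls of a counter into a rate. It assumes a 32-bit counter that wraps back to zero, as the MIB-II packet counters do; the function name is made up for the example.

```python
COUNTER32_MODULUS = 2 ** 32


def counter_rate(previous, current, interval_seconds):
    """Return the per-second rate between two polls of a 32-bit counter."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    # The modulo handles a single wrap of the counter between the two polls.
    delta = (current - previous) % COUNTER32_MODULUS
    return delta / interval_seconds


if __name__ == "__main__":
    print(counter_rate(1000, 4000, 300))
```

A device restart also resets the counters and is indistinguishable from a wrap here, so it may be worth checking sysUpTime before trusting a sudden spike.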


.ifInErrors.<int>

Total inbound packets with errors on this interface. Again, an equivalent entry exists for outbound packets: ifOutErrors.


.ipInReceives.0

Total number of received IP packets


.ipInHdrErrors.0

Inbound IP packets discarded because of errors in header.


.ipInAddrErrors.0

Inbound IP packets discarded because of addressing issues.


.ipInDiscards.0

Inbound IP packets discarded for other reasons (not header or address)


.ipOutNoRoutes.0

Number of IP packets discarded because no route to the destination could be found. High values indicate a routing issue.

And that is just about it…

Nagios © 1999-2011 Nagios Enterprises LLC. Nagios, the Nagios logo, and Nagios graphics are the servicemarks,
trademarks, or registered trademarks owned by Nagios Enterprises, LLC. All Rights Reserved.
Opsview © 2008-2011 Opsera Ltd. Opsview, the Opsview Logo, and Opsview graphics are the
trademarks or registered trademarks owned by Opsera Limited. All Rights Reserved.