<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Opsview Labs &#187; Nagios</title>
	<atom:link href="http://labs.opsview.com/tag/nagios/feed/" rel="self" type="application/rss+xml" />
	<link>http://labs.opsview.com</link>
	<description>Opsview&#039;s Engineering Blog</description>
	<lastBuildDate>Fri, 20 Jan 2012 09:32:54 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Virtual Insanity: How To Remain Lean And Green</title>
		<link>http://labs.opsview.com/2012/01/virtual-insanity-how-to-remain-lean-and-green/</link>
		<comments>http://labs.opsview.com/2012/01/virtual-insanity-how-to-remain-lean-and-green/#comments</comments>
		<pubDate>Thu, 12 Jan 2012 09:39:30 +0000</pubDate>
		<dc:creator>James Peel</dc:creator>
				<category><![CDATA[Green IT]]></category>
		<category><![CDATA[cloud monitoring]]></category>
		<category><![CDATA[Nagios]]></category>
		<category><![CDATA[Opsview]]></category>
		<category><![CDATA[Virtualization]]></category>

		<guid isPermaLink="false">http://labs.opsview.com/?p=2106</guid>
		<description><![CDATA[
			
				
			
		With the likes of cloud computing and virtualisation starting to become staples for today’s business, IT environments are continuing to grow in complexity. Furthermore, there is growing pressure on many organisations to reduce the environmental impact of their IT systems.
In response to these developments, organisations need to change the way they manage and therefore monitor [...]]]></description>
			<content:encoded><![CDATA[<p><strong><a style="font-weight: bold;" href="http://labs.opsview.com/wp-content/uploads/2012/01/green-it.jpg"><img class="size-full wp-image-2108 alignleft" style="margin-bottom: 8px; margin-right: 10px;" title="green-it" src="http://labs.opsview.com/wp-content/uploads/2012/01/green-it.jpg" alt="" width="175" height="117" /></a>With the likes of cloud computing and virtualisation starting to become staples for today’s business, IT environments are continuing to grow in complexity. Furthermore, there is growing pressure on many organisations to reduce the environmental impact of their IT systems.</strong></p>
<p>In response to these developments, organisations need to change the way they manage and therefore monitor their IT infrastructure. For example, with the advent of virtualisation, organisations now have a fundamentally different infrastructure platform from which they are running business systems. This in turn requires a different monitoring approach.<span id="more-2106"></span></p>
<p>If organisations can’t adapt their physical world approaches to monitoring, they could find that they aren’t made aware of systems problems until users start complaining about downtime.</p>
<p>When it comes to cloud, they need to consider how they are going to monitor both the performance and the use of cloud services. Finally, because organisations are unlikely to move all services to the cloud, they also need ways to monitor environmental factors such as the temperature of data centres and the CO2 emissions of IT.</p>
<p>This data will not only allow them to find ways to reduce energy costs, but also report back up the organisation when it comes to the environmental targets that have been set.</p>
<p>At the same time as being presented with an increasingly complex environment to monitor, IT departments are under pressure to dramatically reduce the cost of day to day operations. The result is a conundrum that requires new ways of thinking to solve.</p>
<h3>Taking The Pain Out Of Virtualisation</h3>
<p>Virtualisation has been utilised by many organisations looking to improve operational efficiency. Thus, with physical and virtual machines now present in many businesses, monitoring and managing both effectively is vital.</p>
<p>When implementing and monitoring virtualisation, businesses face a number of challenges – all of which add to the complexity of IT monitoring. Virtualisation allows you to improve server utilisation. Yet, when you increase the utilisation of a previously under-used server, it can be difficult to know how that server will manage the increased load.</p>
<p>Monitoring the performance of the VM host and all the virtual machines running on it is therefore imperative; otherwise organisations could suffer performance slowdowns or, worse still, downtime. The problem is that virtualisation doesn’t behave like physical hardware or conform to the same rules, so traditional approaches to monitoring both infrastructure and applications often don’t make the grade.</p>
<p>You often find that the VM itself can be monitored, but there is no insight into what’s going on within it or into the applications it’s hosting.</p>
<p>A further challenge for a lot of organisations is staying on top of the growing sprawl of virtualisation, due to the ease of creating virtual machines. To date, this is something that many have struggled with using older proprietary IT monitoring and management tools.</p>
<p>To combat these potential issues, organisations need to update their approaches to IT monitoring. They need tools that provide insight into virtualisation and enable them to understand how every application on every virtual machine is running, what problems could occur and how they can be remedied.</p>
<p>Without this investment in IT monitoring, virtualisation will not bring the ROI expected and the increased complexity could in fact result in ongoing performance issues.</p>
<h3>Up In The Air – The Cloud Effect</h3>
<p>The acceptance of cloud computing has increased over the past few years. Now, the terms public, private and hybrid are understood by the majority of organisations, with data centre association Afcom citing that more than 70 per cent of UK businesses are already implementing cloud, or seriously considering it.</p>
<p>Adopting a cloud model allows businesses to move some applications and services off-premise, resulting in reduced costs, with less IT equipment required on-premise. However, cloud does potentially present a number of challenges when it comes to monitoring the overall performance of an organisation’s IT infrastructure.</p>
<p>With this in mind, organisations should start thinking about how they are going to monitor and assess cloud performance in the future, in order to guarantee IT performance. The difficulty at the moment is that there are not many standards around cloud when it comes to moving applications between public and private clouds, and consequently monitoring them.</p>
<p>Therefore, before choosing a monitoring tool, organisations need to ensure that the solution has the flexibility to adapt to any future changes once standards do eventually emerge.</p>
<p>Another challenge organisations should look out for as they begin to use cloud-based services is cloud sprawl. Although one of the benefits of the cloud model is that users can buy services on a pay-as-you-go basis, organisations need to control their cloud deployments, as it is very easy to continue paying for cloud services long after they have finished using them.</p>
<p>This is for the simple reason that people often forget to tell the cloud provider that they have finished using the service and therefore keep getting billed. Ultimately, businesses need a consolidated view of all their cloud services so that they can monitor how much each one is being used. This way, they can make sure they pay only for what they are actually using rather than for what they have deployed.</p>
<h3>Going Green Without Feeling Blue – Green IT Made Simple</h3>
<p>In addition to adopting new technologies or IT models to reduce costs, organisations are also under pressure to become greener. This pressure comes not just from the board as they look to achieve cost savings, but also through growing public and legislative pressure on them to reduce carbon emissions.</p>
<p>The UK Government has set a target of cutting CO2 emissions to 34% below 1990 levels by 2020. Businesses play a key part in this, and the pressure is on to adhere to the new rules and regulations that aim to ensure we hit those targets.</p>
<p>The main challenge for many organisations, however, is that they don’t know how much power their IT consumes, especially as data centres are now often located away from an organisation’s main site. Organisations must be able to construct a better picture of energy usage in their data centres – building environmental factors into monitoring to identify areas where further energy savings can be made.</p>
<p>For example, organisations need to be able to tell when and where they can power down servers that are not in use. They also need to monitor data centre temperature very carefully – ensuring cooling systems are working efficiently and keeping their servers at optimum temperature.</p>
<p>Furthermore, IT monitoring can also help organisations build a picture to help them decide which old systems can be decommissioned if they are not performing sufficiently. Being green does not need to be difficult. The right tools will indicate the right processes to improve your environmental credentials without the need for more investment or even more IT complexity.</p>
<p>As the IT landscape continues to change, it is becoming increasingly clear that the monitoring tools and techniques that worked in the past simply don’t suit the modern environment. More flexibility and agility are needed in IT monitoring, at a lower overall cost. IT monitoring can help organisations improve both their business and environmental performance. The challenge now is to ensure they have the right tools and techniques in place to do this.</p>
<div style="border: 1px solid #ccc; background-color: #f5f5f5; padding: 8px;">
<h3>About the author</h3>
<p>James Peel is product manager at Opsview. He has over 12 years&#8217; experience in IT services and infrastructure management, with a focus on building data centres and developing automated monitoring and management systems.</p>
<p>Article first published in <a href="http://www.businesscomputingworld.co.uk/virtual-insanity-how-to-remain-lean-and-green/">Business Computing World, 12 Jan 2011</a></p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://labs.opsview.com/2012/01/virtual-insanity-how-to-remain-lean-and-green/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Monitoring Apache Solr with Opsview</title>
		<link>http://labs.opsview.com/2011/12/monitoring-apache-solr-with-opsview/</link>
		<comments>http://labs.opsview.com/2011/12/monitoring-apache-solr-with-opsview/#comments</comments>
		<pubDate>Mon, 12 Dec 2011 14:58:29 +0000</pubDate>
		<dc:creator>rbramley</dc:creator>
				<category><![CDATA[Configuration]]></category>
		<category><![CDATA[DevOps]]></category>
		<category><![CDATA[Development]]></category>
		<category><![CDATA[Grails]]></category>
		<category><![CDATA[Hudson]]></category>
		<category><![CDATA[JMX]]></category>
		<category><![CDATA[Nagios]]></category>
		<category><![CDATA[System Management]]></category>
		<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[agentless checks]]></category>
		<category><![CDATA[apache solr]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[nrpe]]></category>
		<category><![CDATA[Opsview]]></category>

		<guid isPermaLink="false">http://labs.opsview.com/?p=2012</guid>
		<description><![CDATA[
			
				
			
		Apache Solr is an open source enterprise search service from the Lucene project. Solr is written in Java and runs as a standalone full-text search server within a servlet container such as Tomcat.
Like any service or component in your architecture, you’ll want to monitor it to ensure that it’s available and gather performance data to [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://labs.opsview.com/wp-content/uploads/2011/12/solr.jpg"><img class="alignleft size-full wp-image-2027" title="solr" src="http://labs.opsview.com/wp-content/uploads/2011/12/solr.jpg" alt="monitoring Apache Solr" width="150" height="83" /></a><a title="Apache Solr" href="http://lucene.apache.org/solr/">Apache Solr</a> is an open source enterprise search service from the Lucene project. Solr is written in Java and runs as a standalone full-text search server within a servlet container such as Tomcat.</p>
<p>Like any service or component in your architecture, you’ll want to monitor it to ensure that it’s available and gather performance data to help with tuning.</p>
<p>In this post, we’ll look at how we can monitor Solr, what performance metrics we might want to gather and how we can easily achieve this with Opsview.</p>
<p><span id="more-2012"></span></p>
<div style="border: 1px solid #ccc; background-color: #f5f5f5; padding: 8px;">
<h2>Requirements</h2>
<ul>
<li>Installed version of Opsview <a title="Download Opsview" href="http://www.opsview.com/downloads">[download]</a></li>
<li>Apache Solr Custom Plugin <a title="Download Apache Solr Custom Plugin" href="https://github.com/rbramley/Opsview-solr-checks">[download]</a></li>
</ul>
</div>
<p><br /></p>
<h2>A check list for service checks</h2>
<p>Solr is built on Lucene, so it follows the same layout: an index contains documents, which are composed of fields. As part of the value it adds over Lucene as a search service, Solr provides a number of useful ways of obtaining health status and monitoring metrics:</p>
<ol>
<li>Health-check status using the <em>/admin/ping</em> handler</li>
<li>The admin statistics page <em>/admin/stats.jsp</em> (XML styled with XSL)</li>
<li><a href="http://wiki.apache.org/solr/SolrJmx">JMX MBeans</a></li>
</ol>
<p>The list of applicable checks could be defined by whether it is a health check or a data gathering check – but this would lead to a lot of overlap. Instead the list is divided into the checks that can be performed remotely (without an installed agent on the server) and those that are best performed locally to the Solr server.</p>
<h2>Remote (agent-less) checks</h2>
<p>What should we look for over the network?</p>
<p>Firstly we can have a host-level check which may perform a network level ping. Next we can check TCP connectivity to the servlet container port and then make an HTTP GET request to the Solr ‘front page’ and check for a known string (e.g. Welcome to Solr).</p>
<p>Now we’ve made it up to the application layer so can start to perform Solr specific checks.</p>
<p>Items to monitor may include (delete as applicable):</p>
<ol>
<li>Ping status</li>
<li>Number of docs</li>
<li>Number of queries / queries per second</li>
<li>Average response time</li>
<li>Number of updates</li>
<li>Cache hit ratios</li>
<li>Replication status</li>
<li>Synthetic queries</li>
</ol>
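<p>As a rough illustration of the first item, the Python sketch below maps the body of a ping response to a Nagios-style exit code. It is not the check_solr plugin’s own code. It assumes the ping handler returns XML with a top-level element named status whose text is OK when Solr is healthy, and the function name ping_result is hypothetical.</p>

```python
import xml.etree.ElementTree as ET

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def ping_result(xml_text):
    """Map the body of a Solr /admin/ping response to (exit_code, message)."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return UNKNOWN, "SOLR PING UNKNOWN - response is not valid XML"
    # Assumption: the handler reports its state in a top-level
    # str element whose name attribute is "status".
    node = root.find("./str[@name='status']")
    if node is None:
        return UNKNOWN, "SOLR PING UNKNOWN - no status element found"
    status = (node.text or "").strip()
    if status == "OK":
        return OK, "SOLR PING OK"
    return CRITICAL, "SOLR PING CRITICAL - status is %s" % status
```

<p>Fetching the response body over HTTP is deliberately left out, so the mapping logic can be exercised without a running Solr server.</p>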
<h2>Agent-based checks</h2>
<p>Installing an <a title="Opsview Agents" href="http://www.opsview.com/downloads/opsview-agents">Opsview agent</a> on the Solr server means we can run additional checks over NRPE (Nagios Remote Plugin Executor). This could be operating system level checks such as memory/disk utilisation or CPU load, or the following:</p>
<ol>
<li>Java servlet container process is running</li>
<li>JMX checks e.g. heap memory or custom MBeans</li>
<li>File age</li>
<li>Log parsing for exceptions</li>
</ol>
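<p>The file age item can be sketched in a few lines of Python. This is an illustration only: which file to watch (for example, something in the index directory that changes on every commit) and the thresholds are assumptions, and file_age_state is a hypothetical name rather than an existing plugin.</p>

```python
import os
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def file_age_state(path, warn_seconds, crit_seconds, now=None):
    """Return (exit_code, message) based on how long ago path was modified."""
    if now is None:
        now = time.time()
    try:
        age = now - os.path.getmtime(path)
    except OSError:
        # Missing or unreadable file: we cannot tell how old it is.
        return UNKNOWN, "FILE_AGE UNKNOWN - cannot stat %s" % path
    if age >= crit_seconds:
        return CRITICAL, "FILE_AGE CRITICAL - %s is %d seconds old" % (path, age)
    if age >= warn_seconds:
        return WARNING, "FILE_AGE WARNING - %s is %d seconds old" % (path, age)
    return OK, "FILE_AGE OK - %s is %d seconds old" % (path, age)
```

<p>Run over NRPE, a check like this would flag a file that has stopped being updated even though the servlet container is still answering requests.</p>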
<p>The Solr wiki describes how to configure JMX support: <a title="Configure JMX support" href="http://wiki.apache.org/solr/SolrJmx">http://wiki.apache.org/solr/SolrJmx</a>.</p>
<h2>Opsview configuration</h2>
<p>For the rest of this article you&#8217;ll need to have <a title="Download Opsview" href="http://www.opsview.com/downloads">Opsview</a> installed (or the <a title="Opsview VMWare Appliance" href="http://www.opsview.com/downloads/opsview-3-vmware-virtual-appliance">Opsview VMware appliance</a>) and have completed the <a title="Opsview Quick Start Guide" href="http://docs.opsview.com/doku.php?id=opsview3.14:quickstart">Quick Start</a>.</p>
<h2>Solr-specific Plugin</h2>
<p>Install the Solr plugin from <a title="Opsview Solr Plugin" href="https://github.com/rbramley/Opsview-solr-checks">https://github.com/rbramley/Opsview-solr-checks</a> into /usr/local/nagios/libexec/.</p>
<p>The check_solr plugin was developed using Perl, so that it could be contributed back to Opsview. It requires the CPAN XML::XPath module (sudo cpan -i XML::XPath).</p>
<p>The plugin includes usage instructions (check_solr -h), which can also be viewed in Opsview by selecting the ‘Show Plugin Help’ link beneath the Plugin drop-down (see Figure 1). The -u option can be used to specify the URL path for multi-core set-ups.</p>
<h2>Service check setup</h2>
<p>Figure 1 gives an example of a service check configuration.</p>
<p><a href="http://labs.opsview.com/wp-content/uploads/2011/12/figure_1_with_help.png"><img class="aligncenter size-full wp-image-2013" title="figure_1_with_help" src="http://labs.opsview.com/wp-content/uploads/2011/12/figure_1_with_help.png" alt="Opsview service check configuration." width="542" height="699" /></a></p>
<p>Figure 2 shows the <em>agentless</em> service check group with plugins and their arguments.</p>
<p><a href="http://labs.opsview.com/wp-content/uploads/2011/12/solr-agentless-monitoring1.png"><img class="aligncenter size-full wp-image-2015" title="solr-agentless-monitoring" src="http://labs.opsview.com/wp-content/uploads/2011/12/solr-agentless-monitoring1.png" alt="solr agentless monitoring" width="500" height="252" /></a></p>
<h2>Host configuration</h2>
<p>Figure 3 shows a simplistic host setup with a ping check.</p>
<p><a href="http://labs.opsview.com/wp-content/uploads/2011/12/set_up_host1.png"><img class="aligncenter size-full wp-image-2017" title="set_up_host" src="http://labs.opsview.com/wp-content/uploads/2011/12/set_up_host1.png" alt="set up host" width="500" height="596" /></a></p>
<p>Figure 4 is an extract from the <strong>Monitors</strong> tab, where we select the checks we want performed for the current host.</p>
<p><a href="http://labs.opsview.com/wp-content/uploads/2011/12/monitors.png"><img class="aligncenter size-full wp-image-2018" title="monitors" src="http://labs.opsview.com/wp-content/uploads/2011/12/monitors.png" alt="monitors" width="288" height="226" /></a></p>
<h2>Viewing output</h2>
<p>The check results shown in Figure 5 are visible by navigating through the host group hierarchy.</p>
<p><a href="http://labs.opsview.com/wp-content/uploads/2011/12/viewing-output.png"><img class="aligncenter size-full wp-image-2019" title="viewing-output" src="http://labs.opsview.com/wp-content/uploads/2011/12/viewing-output.png" alt="" width="500" height="192" /></a></p>
<p>If you click on the graph icon of <em>Solr Cache Hit Ratios</em>, this will drill down to the graph shown in Figure 6.</p>
<p>Clicking on the graph icon for <em>Solr Avg Response Time – standard</em> will take you to the graphs in Figure 7.</p>
<p><a href="http://labs.opsview.com/wp-content/uploads/2011/12/cache_hit_ratios.png"><img class="aligncenter size-full wp-image-2021" title="cache_hit_ratios" src="http://labs.opsview.com/wp-content/uploads/2011/12/cache_hit_ratios.png" alt="cache hit ratios" width="500" height="209" /></a><a href="http://labs.opsview.com/wp-content/uploads/2011/12/avg_req_time.png"><img class="aligncenter size-full wp-image-2022" title="avg_req_time" src="http://labs.opsview.com/wp-content/uploads/2011/12/avg_req_time.png" alt="average request time" width="500" height="449" /></a></p>
<p>If you shut down Solr, the check results will start to turn critical and show in red, as per Figure 8.</p>
<p><a href="http://labs.opsview.com/wp-content/uploads/2011/12/post-shutdown-alert.png"><img class="aligncenter size-full wp-image-2023" title="post-shutdown-alert" src="http://labs.opsview.com/wp-content/uploads/2011/12/post-shutdown-alert.png" alt="post shutdown alert" width="500" height="197" /></a></p>
<h2>Alternatives</h2>
<p>There are a few other plugins available for monitoring Solr from Opsview, depending on your needs:</p>
<ul>
<li><a href="http://code.google.com/p/nagios-plugins-shamil/">http://code.google.com/p/nagios-plugins-shamil</a> – provides ping, replication status and num docs</li>
<li><a href="http://code.google.com/p/solr-nagios-check">http://code.google.com/p/solr-nagios-check</a> – provides QPS, response time and num docs</li>
</ul>
<p>Also, chapter 8 of the recently published <a href="http://www.amazon.co.uk/gp/product/1849516065/ref=as_li_ss_tl?ie=UTF8&amp;tag=leanjavaengi-21&amp;linkCode=as2&amp;camp=1634&amp;creative=19450&amp;creativeASIN=1849516065">Apache Solr 3 Enterprise Search Server</a><img src="http://www.assoc-amazon.co.uk/e/ir?t=leanjavaengi-21&amp;l=as2&amp;o=2&amp;a=1849516065" border="0" alt="" width="1" height="1" /> book includes a section on Monitoring Solr Performance.</p>
<h2>Summary</h2>
<p>Using <em>check_solr</em> in conjunction with <a title="Opsview Open Source Monitoring" href="http://www.opsview.com">Opsview</a> allows you to ensure that your Solr server is available and provides you with metrics that can help you tune your Solr configuration.</p>
<p>This can be complemented with additional agent-based operating system and JMX checks to give you a full picture.</p>
<div>
<div style="border: 1px solid #ccc; background-color: #f5f5f5; padding: 8px;">
<h3>About the Author</h3>
<p>Robin Bramley is a hands-on Technical Manager / Lead Architect at an Open Source software &amp; services company who has spent the majority of the last decade working with Java, mobile &amp; Open Source across sectors including Financial Services &amp; High Growth / start-ups. You can view Robin&#8217;s personal blog at <a href="http://leanjavaengineering.wordpress.com/">www.leanjavaengineering.com</a></p>
<h4>Legal Disclaimer</h4>
<p>This blog post is contributed by a member of the Opsview community.  The Opsview project and Opsera Ltd accept no responsibility for the  accuracy of its content and are not liable for any direct or indirect  damages caused by its use.</p>
</div>
</div>
]]></content:encoded>
			<wfw:commentRss>http://labs.opsview.com/2011/12/monitoring-apache-solr-with-opsview/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>6 ways to get the most out of Opsview Alerts</title>
		<link>http://labs.opsview.com/2011/09/how-to-get-the-most-out-of-opsview-alerts/</link>
		<comments>http://labs.opsview.com/2011/09/how-to-get-the-most-out-of-opsview-alerts/#comments</comments>
		<pubDate>Mon, 19 Sep 2011 12:10:05 +0000</pubDate>
		<dc:creator>brian.king</dc:creator>
				<category><![CDATA[Nagios]]></category>
		<category><![CDATA[Opsview]]></category>
		<category><![CDATA[android]]></category>
		<category><![CDATA[Apache]]></category>
		<category><![CDATA[distributed monitoring]]></category>
		<category><![CDATA[Host groups]]></category>
		<category><![CDATA[Linux]]></category>
		<category><![CDATA[mobile]]></category>
		<category><![CDATA[monitoring alerts]]></category>
		<category><![CDATA[service groups]]></category>
		<category><![CDATA[system administrators]]></category>
		<category><![CDATA[Windows]]></category>

		<guid isPermaLink="false">http://labs.opsview.com/?p=1093</guid>
		<description><![CDATA[
			
				
			
		Alerts happen. They are the reason why monitoring applications were created: to alert us when servers need attention. The difference between an effective network monitoring system and an annoying one is a fine line between information and noise. Alerts should be descriptive and prompt an administrative action, not elicit a huff of frustration.  Here [...]]]></description>
			<content:encoded><![CDATA[<p><strong><a href="http://labs.opsview.com/wp-content/uploads/2011/09/alert120px.png"><img class="alignleft size-full wp-image-1125" style="margin-bottom: 8px; margin-right: 10px;" title="alert120px" src="http://labs.opsview.com/wp-content/uploads/2011/09/alert120px.png" alt="" width="120" height="120" /></a></strong>Alerts happen. They are the reason why monitoring applications were created: to alert us when servers need attention. The difference between an effective <a href="http://www.opsview.com/learn/network-monitoring">network monitoring</a> system and an annoying one is a fine line between information and noise. Alerts should be descriptive and prompt an administrative action, not elicit a huff of frustration.  Here are a few ways to keep your Opsview installation (and you) effective and relevant in your company.<span id="more-1093"></span></p>
<h2>Use a Smartphone</h2>
<p>A smartphone should be a tool on every system administrator’s bat belt. The more mobile you are, the more time you spend away from the Operations Center. Why not take Opsview with you? With <a href="http://www.opsview.com/products/enterprise-modules/sms-messaging">Opsview Mobile for Android</a>, you can do just that. There is an ambitious roadmap for the mobile app, including support for other devices, but if you have an Android phone there is no reason to wait to install it.</p>
<p>The app handles basic needs very well, including a real-time overview of all hosts and services and alert acknowledgement. If you are away from the office (like at the beach!) and get an alert on your phone, acknowledge it and then make a call to your backup (hopefully you have one!) who can begin corrective action. If you don’t have a backup, you at least have a heads-up to an issue at work and can go back to sipping a drink from a coconut.</p>
<p style="text-align: center;"><a href="http://www.opsview.com/products/opsview-mobile"><img class="aligncenter" src="http://www.opsview.com/sites/default/files/images/bannerOpsviewMobile_released.png" alt="Opsview Mobile" width="600" height="361" /></a></p>
<h2>Use a Real Email Address</h2>
<p>Create an email address that can be dedicated to your mobile. For example, create a Gmail address and configure your smartphone for an audible notification on new messages to that address. Smartphone text messages don’t give you the entire story, only a few characters to let you know a host or service is having a problem. The Additional Information section of the alert may contain more detail. A disk utilization alert at 95% may be something that can wait until you get back to your office, whereas 100% would prompt you to boot your laptop and resolve it as soon as possible. The only way to know is to have all the alert information in hand (literally).</p>
<h2>Modify Alert Templates</h2>
<p>The more information you put in the alert, the better chance you can delegate action. Modify the default alert templates to include more information that can help other people, such as a help desk, route tickets more effectively. Since Opsview has <a href="http://www.opsview.com/learn/opsview-for-nagios-users">Nagios</a> under the hood, all Nagios macros are available. (A complete macro list can be found on <a href="http://nagios.sourceforge.net/docs/3_0/macrolist.html">SourceForge</a>.)</p>
<p>An example would be inserting comments using the <a href="http://www.opsview.com/products/screenshots">Opsview UI</a> on a host group or individual host, then changing the template to include the macro output for $HOSTGROUPNOTES$ or $HOSTNOTES$. Comments could include where to route tickets or links to documents to solve common problems that first level support can handle. If the issue is too complex, level one support will know which direction to escalate the ticket. The default template is located in /usr/local/nagios/libexec/notifications/com.opsview.notificationmethods.email.tt.</p>
<p>It’s a good idea to keep a backup of any changes you make since the file will be overwritten with each Opsview upgrade.</p>
<h2>Set up Layered Email Profiles with Time Periods</h2>
<p>The rub with any <a href="http://www.opsview.com/learn/server-monitoring">server monitoring</a> system is that no one wants a critical alert at three in the morning, but proper administration can’t be done without notifications. Administrators should embrace alerts, especially warning alerts, since they allow proactive work that prevents critical alerts. That being said, no one wants a warning alert at three in the morning. Fortunately, each Contact in Opsview can have multiple Profiles, which can carry different layers of alerts. For your work email, create profiles for warning and critical alerts to be sent 24&#215;7. For the Gmail address that your phone accesses, create one profile for warning alerts 8&#215;5 and another for critical alerts 24&#215;7. Be sure to name your profiles logically for easier administration, such as EmailPhoneWarning or EmailWorkCritical.</p>
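<p>To make the layering concrete, here is a small Python model of which profiles would fire for a given alert. It is only a sketch of the idea, not how Opsview stores or evaluates profiles. The profile table, the 09:00 to 17:00 Monday to Friday reading of 8&#215;5, and the function name profiles_to_notify are all assumptions.</p>

```python
from datetime import datetime

WEEKDAYS = range(0, 5)  # Monday=0 .. Friday=4
ALL_DAYS = range(0, 7)

# Hypothetical profiles: (name, alert level, days, start hour, end hour).
# The end hour is exclusive, so 0-24 means round the clock.
PROFILES = [
    ("EmailWorkWarning", "WARNING", ALL_DAYS, 0, 24),
    ("EmailWorkCritical", "CRITICAL", ALL_DAYS, 0, 24),
    ("EmailPhoneWarning", "WARNING", WEEKDAYS, 9, 17),
    ("EmailPhoneCritical", "CRITICAL", ALL_DAYS, 0, 24),
]


def profiles_to_notify(level, when):
    """Return the names of profiles that would fire for this alert."""
    matched = []
    for name, profile_level, days, start_hour, end_hour in PROFILES:
        in_period = (when.weekday() in days
                     and when.hour in range(start_hour, end_hour))
        if profile_level == level and in_period:
            matched.append(name)
    return matched
```

<p>With this table, a warning raised at three in the morning reaches only the work mailbox, while a critical alert at the same hour also reaches the phone.</p>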
<h2>Send Alerts to Host and Service Administrators</h2>
<p>An IT shop may have specific administrators, such as Windows or Linux admins. Windows administrators may not care to get alerts when Apache is down and Linux admins may not want to be woken up because a Windows server blue-screened.</p>
<p>Digging deeper into the Contact Profile, notifications can be set up for Host Groups and Service Groups. Configure each user to get alerts only for the services they are responsible for. Every time someone receives an alert that they ignore because it is someone else’s responsibility, noise is created; people start to assume alerts are not meant for them and disregard them, lowering the value of the entire monitoring system. If you want people to feel your pain, correct the issue and then send an email to work addresses saying that you were up all night dealing with the problem. It’s not a bad idea to show people that the system is working as it should, notifying only the responsible parties of critical issues (plus it gets you off the hook for coming in late the next morning).</p>
<h2>Test New Checks Before Enabling Notifications</h2>
<p>New checks are rolled out constantly in a changing environment. But new checks put immediately in production may produce false alarms that annoy other administrators and the help desk. Since Opsview includes built-in features to help monitor and trend check results, every addition should go through a testing period with notifications disabled. After a week, check the Service Graph to find the highest and lowest values to appropriately tune warning and critical thresholds.</p>
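<p>One possible way to turn a week of observed values into starting thresholds is sketched below in Python. The margins are an arbitrary heuristic chosen for illustration, not an Opsview feature or recommendation, and the sketch assumes a metric where higher values are worse. The name suggest_thresholds is hypothetical.</p>

```python
def suggest_thresholds(samples, warn_margin=0.10, crit_margin=0.25):
    """Suggest (warning, critical) thresholds above the highest observed value.

    Each threshold sits above the observed peak by a fraction of the
    observed spread (highest minus lowest value).
    """
    if not samples:
        raise ValueError("need at least one sample")
    peak = max(samples)
    spread = peak - min(samples)
    warning = peak + spread * warn_margin
    critical = peak + spread * crit_margin
    return round(warning, 2), round(critical, 2)
```

<p>For a week in which a value ranged between 40 and 60, this suggests a warning threshold of 62 and a critical threshold of 65, which can then be adjusted by hand before notifications are enabled.</p>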
<p>Using the Alert Summary, you can determine if time periods should be used for a check. For example, a service may become unavailable during a nightly backup. The check_interval must remain the same, but checks need to be suspended for two hours each night while the backup occurs. You will be able to confidently tune the time period rather than make an uninformed guess at a black out range. Making accurate adjustments before a check “goes live” with notifications enabled greatly reduces unnecessary alerts and allows administrators to maintain faith in the system.</p>
<p>Find out more about <a href="http://www.opsview.com/products/enterprise-modules/sms-messaging">Opsview alerts.</a></p>
<div style="border: 1px solid #ccc; background-color: #f5f5f5; padding: 8px;">
<h3>About the Author</h3>
<p>Paul Fleetwood started as a Unix Administrator in 1999. He has rolled out Opsview at small and large companies including a distributed installation that monitored 600 hosts and 5000 services. Paul currently works for an award-winning custom content publisher in North Carolina and spends all his free time with his wife and three very active sons.</p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://labs.opsview.com/2011/09/how-to-get-the-most-out-of-opsview-alerts/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>10 Ways to Make Your Monitoring System Scale</title>
		<link>http://labs.opsview.com/2011/09/10-ways-to-make-your-monitoring-system-scale/</link>
		<comments>http://labs.opsview.com/2011/09/10-ways-to-make-your-monitoring-system-scale/#comments</comments>
		<pubDate>Tue, 06 Sep 2011 09:00:27 +0000</pubDate>
		<dc:creator>tcallway</dc:creator>
				<category><![CDATA[Forked software]]></category>
		<category><![CDATA[Frameworks]]></category>
		<category><![CDATA[MSPs]]></category>
		<category><![CDATA[Nagios]]></category>
		<category><![CDATA[Opsview]]></category>
		<category><![CDATA[SNMP]]></category>
		<category><![CDATA[System Management]]></category>
		<category><![CDATA[api]]></category>
		<category><![CDATA[business systems]]></category>
		<category><![CDATA[icinga]]></category>
		<category><![CDATA[multi-tenancy]]></category>
		<category><![CDATA[security]]></category>
		<category><![CDATA[network monitoring]]></category>
		<category><![CDATA[open source monitoring]]></category>

		<guid isPermaLink="false">http://labs.opsview.com/?p=1009</guid>
		<description><![CDATA[
			
				
			
		
Freeware IT monitoring tools are used by thousands of organisations worldwide; however, using them to monitor complex network, server and application installations can be quite a challenge.  This blog post takes the basic capabilities of one such tool, Nagios® Core, and shows how you can scale it with Opsview for use in enterprise environments.

Distributed [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://labs.opsview.com/wp-content/uploads/2011/09/hyper-scalability_2.jpg"><img class="alignleft size-full wp-image-1012" style="margin-bottom: 8px; margin-right: 10px;" title="hyper scalability_2" src="http://labs.opsview.com/wp-content/uploads/2011/09/hyper-scalability_2.jpg" alt="" width="104" height="127" /></a></p>
<p>Freeware IT monitoring tools are used by thousands of organisations worldwide; however, using them to monitor complex network, server and application installations can be quite a challenge.  This blog post takes the basic capabilities of one such tool, <a href="http://www.opsview.com/company/legal/trademarks#NagiosTrademarkStatement">Nagios® Core</a>, and shows how you can scale it with Opsview for use in enterprise environments.</p>
<p><span id="more-1009"></span></p>
<h2>Distributed monitoring</h2>
<p>Building and managing a complex <a href="http://www.opsview.com/learn">distributed monitoring</a> environment with Nagios Core is no mean feat. With Opsview you get distributed monitoring that’s easy to set up and simple to maintain.  You can monitor your devices and applications from a central location and grow the system without growing the monitoring complexity.</p>
<h2>Slave server clustering</h2>
<p>Opsview can automatically load-balance across multiple slaves and reallocate monitoring duties if a slave server fails, giving you high availability and scalability without additional overhead.</p>
<div id="attachment_1050" class="wp-caption alignright" style="width: 208px"><a href="http://labs.opsview.com/wp-content/uploads/2011/09/Clustering_diagram550px.png"><img class="size-full wp-image-1050   " style="margin-bottom: 8px; margin-left: 10px; border: 1px solid #ccc;" title="Clustering_diagram550px" src="http://labs.opsview.com/wp-content/uploads/2011/09/Clustering_diagram550px.png" alt="" width="198" height="152" /></a><p class="wp-caption-text">Example Clustering Model</p></div>
<h2>Master server clustering</h2>
<p>Management of Opsview is performed on a single master server; however, master servers can be clustered, giving you the high availability and redundancy needed for mission-critical monitoring.</p>
<h2>Separate database server</h2>
<p>Opsview’s database can be run on a separate server, so you can move intensive reporting activity to a dedicated machine and fine-tune that server for better performance.</p>
<h2>Efficient configuration UI</h2>
<p>Nagios Core is capable of monitoring thousands of devices, but maintaining configuration on expanding systems can quickly become a problem. Opsview handles this with an easy to use interface and middleware layer which tackles the complexity of configuring individual software components.</p>
<h2>‘Single pane of glass’ monitoring</h2>
<p>Unlike Nagios Core where data may be gathered from a number of systems and presented in different ways, Opsview’s intuitive web interface displays all your monitoring information in one place, with a top down view on system status.  Devices and applications can be easily grouped by business process and their status displayed using simple &#8216;traffic lights&#8217; so you can easily see the health of critical and non-critical groups. This makes monitoring and maintaining large, complex systems less time consuming and more efficient with a scalable architecture to cover all your systems and locations.</p>
<div id="attachment_1031" class="wp-caption aligncenter" style="width: 560px"><a href="http://labs.opsview.com/wp-content/uploads/2011/09/configurationUI550px.png"><img class="size-full wp-image-1031" style="border: 1px solid #ccc;" title="configurationUI550px" src="http://labs.opsview.com/wp-content/uploads/2011/09/configurationUI550px.png" alt="" width="550" height="248" /></a><p class="wp-caption-text">Opsview&#39;s Host Group Hierarchy View</p></div>
<h2>Distributed alerting</h2>
<p>Slave servers monitored by Opsview can handle their own notifications, allowing autonomy if communication is lost between master and slave servers. Alerts can be sent by the master server or a slave server via email or SMS, so you’re always in touch with the health of your systems, no matter where they are located.</p>
<h2>Automated APIs</h2>
<p>Opsview APIs speed up system configuration by automatically populating and updating host information, saving you time and effort as your system grows.</p>
<div id="attachment_1048" class="wp-caption aligncenter" style="width: 560px"><a href="http://labs.opsview.com/wp-content/uploads/2011/09/API_diagram_Opsview550px1.jpg"><img class="size-full wp-image-1048" title="API_diagram_Opsview550px" src="http://labs.opsview.com/wp-content/uploads/2011/09/API_diagram_Opsview550px1.jpg" alt="" width="550" height="285" /></a><p class="wp-caption-text">Example use cases for Opsview&#39;s RESTful API</p></div>
<h2>SNMP trap processing</h2>
<p>Nagios Core has no native support for SNMP trap processing. Opsview’s SNMP engine accepts incoming traps, analyses the data and decides how to handle them. In-built SNMP discovery allows SNMP objects to be detected and monitored easily and rules can be configured through the management UI.</p>
<h2>Notification profiles</h2>
<p>With Nagios Core you can be inundated with monitoring information, not all of it useful. In Opsview you can set up notification profiles so the right people get the right information at the right time. Only want to know about email server status during business hours? No problem. Need SMS alerts about your webstore? It’s covered. Notification profiles can also be combined with Opsview’s service desk module to automatically assign support tasks to engineers, helping streamline incident management.</p>
<div style="border: 1px solid #ccc; background-color: #e6e6e6; padding: 6px;"><strong>IMPORTANT LEGAL NOTICE: No affiliation, partnership, joint-venture or any other commercial relationship exists between Opsera Ltd, the makers of Opsview, and Nagios Enterprises LLC, the trademark holders of Nagios.</strong></div>
]]></content:encoded>
			<wfw:commentRss>http://labs.opsview.com/2011/09/10-ways-to-make-your-monitoring-system-scale/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>10 Ways to Make Your Monitoring System Easier</title>
		<link>http://labs.opsview.com/2011/08/10-ways-to-make-your-monitoring-system-easier/</link>
		<comments>http://labs.opsview.com/2011/08/10-ways-to-make-your-monitoring-system-easier/#comments</comments>
		<pubDate>Tue, 30 Aug 2011 10:18:12 +0000</pubDate>
		<dc:creator>tcallway</dc:creator>
				<category><![CDATA[Forked software]]></category>
		<category><![CDATA[Nagios]]></category>
		<category><![CDATA[Opsview]]></category>
		<category><![CDATA[SNMP]]></category>
		<category><![CDATA[System Management]]></category>
		<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[api]]></category>
		<category><![CDATA[business systems]]></category>
		<category><![CDATA[network monitoring]]></category>
		<category><![CDATA[open source monitoring]]></category>

		<guid isPermaLink="false">http://labs.opsview.com/?p=995</guid>
		<description><![CDATA[
			
				
			
		
Many freeware IT monitoring tools are great, but using them to manage complex systems can be a real challenge. They can also be unforgiving to anyone less than expert in configuring the system, with mistakes punished by a complete stop in monitoring activity.

Distributed monitoring

Opsview takes the complexities of its core engine, Nagios® Core, [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://labs.opsview.com/wp-content/uploads/2011/08/staples-easy-button.png"><img class="alignleft size-full wp-image-1001" style="margin-bottom: 8px; margin-right: 10px;" title="staples-easy-button" src="http://labs.opsview.com/wp-content/uploads/2011/08/staples-easy-button.png" alt="" width="120" height="120" /></a></p>
<p>Many freeware <a href="http://www.opsview.com/learn/whitepapers/importance-it-monitoring-assuring-key-business-services-availability">IT monitoring</a> tools are great, but using them to manage complex systems can be a real challenge. They can also be unforgiving to anyone less than expert in configuring the system, with mistakes punished by a complete stop in monitoring activity.</p>
<p><span id="more-995"></span></p>
<h2>Distributed monitoring</h2>
<p><img class="alignright size-full wp-image-1020" style="margin-bottom: 8px; margin-left: 10px;" title="distributedMonitoring" src="http://labs.opsview.com/wp-content/uploads/2011/08/distributedMonitoring.jpg" alt="" width="169" height="163" /></p>
<p>Opsview takes the complexities of its core engine, <a href="http://www.opsview.com/company/legal/trademarks#NagiosTrademarkStatement">Nagios® Core</a>, and makes distributed monitoring simple. All management is performed on a single master server and communication with slaves is handled by Opsview&#8217;s middleware layer. Provision is included for geographically diverse monitoring and to cope with potentially unreliable WAN connections between servers.</p>
<h2>Host attributes</h2>
<p>A feature you won’t find in Nagios Core, <a href="http://labs.opsview.com/2011/10/5-steps-to-organising-your-server-monitoring-with-attributes/">host attributes</a> help simplify configurations by allowing you to create multiple services based on a set of pre-defined attributes.  You can assign one or many attributes to a host, and a service check that uses those attributes will then create multiple services for monitoring on that host.</p>
<h2>Keywords</h2>
<p>Opsview’s keyword function gives you a flexible way of grouping hosts and services. You can tag devices, business processes and applications giving you a convenient way of seeing the status of the groups, e.g. critical IT systems, network circuits or business users and customers.</p>
<p><img class="alignleft" style="margin-bottom: 8px; margin-right: 10px;" title="Cloning" src="http://www.opsview.com/sites/all/themes/opsview/images/opsviewApplianceIcon90px.png" alt="" width="90" height="82" /></p>
<h2>Cloning capability</h2>
<p>Chances are when you’re configuring or adding devices and services to your network many of them will be quite similar.  To save time you can simply choose to clone an existing device or service monitored with Opsview and add it to the network.</p>
<h2>SNMP discovery</h2>
<p>Nagios Core provides support for SNMP via its plugin project, but it doesn’t provide support for processing SNMP traps. Opsview does this automatically. A powerful processing engine accepts incoming traps, analyses the data and then decides how they should be processed.  In-built SNMP discovery also means SNMP objects can be detected and monitored with ease without the need for human intervention.</p>
<h2><img class="alignright" style="margin-bottom: 8px; margin-left: 10px;" title="Service Desk Connector" src="http://www.opsview.com/sites/default/files/service-desk-connector600px.png" alt="" width="252" height="83" />Notification profiles</h2>
<p>Opsview helps you avoid information overload by letting you easily create complex business rules that define who gets alerts, how they get them and why. Combined with the Opsview service desk module you get a powerful notification tool that helps speed up mean time to repair and streamline workflows.</p>
<h2>Configuration UI</h2>
<p>Configuring and maintaining a system with Nagios Core becomes more difficult as the monitoring environment grows bigger and more complex. Opsview’s configuration UI means you don’t need to be a Nagios Core expert to get your monitoring up and running.   All the software processes are kept ‘under the hood’ and presented via an intuitive interface so you can see the information that makes a difference to your business without getting caught up with software.</p>
<h2>APIs for Automation</h2>
<p>Opsview includes automated APIs for configuration, monitoring and notification, which make system set-up pain-free and scaling simple.  The APIs also make integrating with other IT management tools easy.</p>
<h2>SLA reporting</h2>
<p><img class="alignright" style="margin-bottom: 8px; margin-left: 10px; border: 1px solid #ccc;" title="Reports Module" src="http://www.opsview.com/sites/default/files/reports800px.jpg" alt="Reports Module" width="182" height="128" />Opsview’s <a href="http://www.opsview.com/products/enterprise-modules/reports">Reports Module</a> can automatically generate custom reports in line with business requirements. If you have to produce regular reports for your management or customers, this module will save you hours by generating the reports you need when you need them. The reports can be sent out automatically by email in PDF, HTML, Excel, ODT or XML format to your chosen distribution list.</p>
<h2>Service desk integration</h2>
<p>When you <a href="http://www.opsview.com/products/enterprise-modules/service-desk-connector">integrate your service desk with Opsview</a> you get a powerful tool for automating incident reporting. Tickets can be created in your system based on alerts generated by Opsview, saving time and freeing up resources. Out-of-the-box support is included for Service-Now.com, Best Practical’s Request Tracker and Atlassian JIRA.</p>
<div style="border: 1px solid #ccc; background-color: #e6e6e6; padding: 6px;"><strong>IMPORTANT LEGAL NOTICE: No affiliation, partnership, joint-venture or any other commercial relationship exists between Opsera Ltd, the makers of Opsview, and Nagios Enterprises LLC, the trademark holders of Nagios.</strong></div>
]]></content:encoded>
			<wfw:commentRss>http://labs.opsview.com/2011/08/10-ways-to-make-your-monitoring-system-easier/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Next generation distributed monitoring, the Opsview way</title>
		<link>http://labs.opsview.com/2011/01/next-generation-distributed-monitoring-the-opsview-way/</link>
		<comments>http://labs.opsview.com/2011/01/next-generation-distributed-monitoring-the-opsview-way/#comments</comments>
		<pubDate>Tue, 25 Jan 2011 10:50:01 +0000</pubDate>
		<dc:creator>tonvoon</dc:creator>
				<category><![CDATA[Development]]></category>
		<category><![CDATA[Opsview]]></category>
		<category><![CDATA[distributed monitoring]]></category>
		<category><![CDATA[Nagios]]></category>

		<guid isPermaLink="false">http://labs.opsview.com/?p=751</guid>
		<description><![CDATA[
			
				
			
		One of Opsview&#8217;s great features is distributed monitoring, which we&#8217;ve had for over 5 years now. From the web user interface, you can assign hosts to a slave system and Opsview will take care of all the configuration work for you: from the slave configuration files, to the slave results sent to the master, to [...]]]></description>
			<content:encoded><![CDATA[<div id="attachment_765" class="wp-caption alignleft" style="width: 126px"><a class="lightbox" title="communityIcon" href="http://www.opsview.com/downloads/opsview-3-vmware-virtual-appliance"><img class="size-full wp-image-765   " style="margin: 0pt 10px 5px 0pt;" title="communityIcon" src="http://labs.opsview.com/wp-content/uploads/2011/01/communityIcon.png" alt="Opsview Community 3.11 - the next generation of distributed monitoring for free" width="116" height="116" /></a><p class="wp-caption-text">Opsview Community 3.11 - Free distributed monitoring</p></div>
<p>One of Opsview&#8217;s great features is distributed monitoring, which we&#8217;ve had for over 5 years now. From the web user interface, you can assign hosts to a slave system and Opsview will take care of all the configuration work for you: from the slave configuration files, to the slave results sent to the master, to the master configuration with freshness checking.</p>
<p>We do all the system integration work, so you don&#8217;t have to.</p>
<p><span id="more-751"></span>However, there are some limitations in our chosen technologies. We use <a href="http://nagios.sourceforge.net/docs/3_0/addons.html#nsca">NSCA</a>, which is the most <a href="http://nagios.sourceforge.net/docs/3_0/distributed.html">common method</a> in the Nagios® world, and while we&#8217;ve <a href="http://labs.opsview.com/tag/nsca/">made improvements</a> to it that have gone back upstream, there are some baked-in limitations:</p>
<ul>
<li>Only the first 511 bytes of plugin output were returned to the master, limiting the usefulness of the information you could display</li>
<li>Only the first line of data was returned, meaning you had to cram output together</li>
<li>NSCA communication used fixed size packets which were inefficient</li>
<li>While results were sent, Nagios would wait for completion, introducing a bottleneck</li>
<li>If there was a communication problem with the master, results were dropped</li>
</ul>
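<p>To make the first two limitations concrete, here is a minimal Java sketch of that kind of truncation. It is an illustration only and not NSCA code: the class name, the method name and the sample plugin output are hypothetical, and it assumes plain ASCII output for simplicity.</p>

```java
// Hypothetical illustration of the limits listed above; this is not NSCA source code.
final class NscaStyleTruncation {

    // Byte limit quoted in the list above.
    static final int MAX_BYTES = 511;

    // Keeps only the first line of the plugin output, then cuts it to MAX_BYTES.
    // Assumes ASCII output, so one character is one byte.
    static String truncate(String pluginOutput) {
        int newline = pluginOutput.indexOf('\n');
        String firstLine = newline == -1 ? pluginOutput : pluginOutput.substring(0, newline);
        byte[] bytes = firstLine.getBytes(java.nio.charset.StandardCharsets.US_ASCII);
        int kept = Math.min(bytes.length, MAX_BYTES);
        return new String(bytes, 0, kept, java.nio.charset.StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // Made-up multi-line plugin output: everything after the first line is lost.
        String output = "DISK OK - 12% used\n/var 40% used\n/home 7% used";
        System.out.println(truncate(output));
    }
}
```

<p>Anything a plugin reports beyond the first line, or beyond the byte limit, never reaches the master in this model, which is why output had to be crammed together.</p>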
<p>Sometimes to move forward, you have to leave the past behind.</p>
<p>So we did that &#8211; we ripped out NSCA from Opsview&#8217;s slave communications and we&#8217;ve addressed every one of these limitations &#8211; and added a few nice extras too!</p>
<p>We&#8217;ve chosen <a href="http://code.google.com/p/nrd/">NRD</a> (Nagios Result Distributor) as our core technology. This is a library, written by one of our partners, <a href="http://capside.com/">CAPSiDE</a>. There are many reasons we chose this, but the top four are:</p>
<ul>
<li>It is based on perl, which is our language of choice</li>
<li>It has taken the test suite we developed for <a href="http://labs.opsview.com/2007/01/the-importance-of-being-earnestly-tested/">NSCA</a> and enhanced it, demonstrating a mature approach to code development</li>
<li>The client and server code is a thin shim over the libraries, which means you can easily create your own clients</li>
<li>We have a good relationship with CAPSiDE and they have given us access to their code repository</li>
</ul>
<p>We&#8217;ve spent some time understanding the core NRD code, enhancing it, fixing some issues and adding in some great new features. CAPSiDE have also released it on <a href="http://search.cpan.org/dist/NRD-Daemon/lib/NRD/Daemon.pm">CPAN</a> for wider consumption.</p>
<p>So Opsview&#8217;s new process for sending results from a slave is:</p>
<p style="text-align: center;"><a class="lightbox" title="NRD architecture" href="http://labs.opsview.com/wp-content/uploads/2011/01/NRD-architecture.png"><img class="size-medium wp-image-754 aligncenter" title="NRD architecture" src="http://labs.opsview.com/wp-content/uploads/2011/01/NRD-architecture-300x279.png" alt="" width="300" height="279" /></a></p>
<p>A couple of other amazing features we&#8217;ve squeezed in:</p>
<ul>
<li>A known Nagios limitation is the named pipe to submit results. We&#8217;ve overcome this by writing directly to the checkresults spool directory &#8211; this reduces a Nagios processing cycle on the Opsview master</li>
<li>We&#8217;ve implemented transactions in the results, so if the client fails to communicate with the server, it will back off and retry in 5 seconds. This guarantees you do not have duplicated results</li>
<li>The nrd daemon on the master will dynamically add more servers as workload increases, thanks to the features of <a href="http://search.cpan.org/dist/Net-Server/lib/Net/Server.pod">Net::Server</a></li>
<li>As all communication between master and slaves is over a tunnelled SSH session, we&#8217;ve updated our Opsview check scripts to restart these tunnels if the slave is exhibiting communication errors</li>
</ul>
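<p>The back-off-and-retry behaviour can be sketched roughly as follows. This is a hypothetical Java illustration, not NRD&#8217;s Perl code or API: <code>Transport</code>, <code>sendWithRetry</code> and the sample result string are invented names, and the sketch assumes the receiver discards any batch it has not acknowledged, which is what keeps a retry from producing duplicates.</p>

```java
// Hypothetical sketch of transactional sending with back-off; not NRD's API.
final class RetryingSender {

    // Assumed contract: returns true only when the receiver acknowledged the whole batch;
    // an unacknowledged batch is assumed to be discarded on the receiving side.
    interface Transport {
        boolean sendBatch(String[] results);
    }

    // Resends the whole batch until it is acknowledged, pausing between attempts.
    // Returns the number of attempts used, or -1 if the batch was never acknowledged.
    static int sendWithRetry(Transport transport, String[] results, long backoffMillis, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (transport.sendBatch(results)) {
                return attempt;
            }
            if (attempt == maxAttempts) {
                break;
            }
            try {
                Thread.sleep(backoffMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return -1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated flaky link: the first two attempts fail, the third is acknowledged.
        Transport flaky = batch -> ++calls[0] >= 3;
        // The post describes a 5-second pause; 50 ms keeps the demo quick.
        int attempts = sendWithRetry(flaky, new String[] {"web01;HTTP;0;OK"}, 50L, 5);
        System.out.println("acknowledged after " + attempts + " attempts");
    }
}
```

<p>Because the batch is only counted as delivered once it has been acknowledged as a whole, the client can safely resend it after a failure.</p>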
<p>With all these extra capabilities, you would think there is a cost in performance. But in fact, our testing shows that performance has got better!</p>
<p style="text-align: center;"><a class="lightbox" title="nrd performance table" href="http://labs.opsview.com/wp-content/uploads/2011/01/nrd-performance-table.png"><img class="size-medium wp-image-755 aligncenter" title="nrd performance table" src="http://labs.opsview.com/wp-content/uploads/2011/01/nrd-performance-table-300x62.png" alt="" width="300" height="62" /></a></p>
<p>(Based on sending 2016 results in a single transaction over an SSH tunnel from a slave to a master. Times measured on the client.)</p>
<p>This shows that we are getting an average 62% improvement in all aspects of slave communication back to the master!</p>
<p>We are thrilled we&#8217;ve added this major new functionality into Opsview and have taken distributed monitoring another huge step further over any of our competitors.</p>
<p>But the best thing is: this is available immediately with our Opsview Community 3.11 release. Install the VM, add a slave and you will get this new architecture set up as part of the process. And if you are an existing Opsview user, you get a silky smooth switch-over. We&#8217;ve done a lot of testing to ensure that as part of the upgrade, Opsview will automatically switch any slaves to this architecture and start sending results in the new NRD way.</p>
<p style="text-align: center;"><a title="Download Opsview VM Appliance" href="http://www.opsview.com/downloads/opsview-3-vmware-virtual-appliance"><img class="size-full wp-image-762 aligncenter" title="downloadNow_gb" src="http://labs.opsview.com/wp-content/uploads/2011/01/downloadNow_gb.png" alt="" width="197" height="42" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://labs.opsview.com/2011/01/next-generation-distributed-monitoring-the-opsview-way/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Nagios bugs and how to fix them permanently</title>
		<link>http://labs.opsview.com/2011/01/nagios-bugs-and-how-to-fix-them-permanently/</link>
		<comments>http://labs.opsview.com/2011/01/nagios-bugs-and-how-to-fix-them-permanently/#comments</comments>
		<pubDate>Tue, 11 Jan 2011 16:53:25 +0000</pubDate>
		<dc:creator>tonvoon</dc:creator>
				<category><![CDATA[Development]]></category>
		<category><![CDATA[Nagios]]></category>
		<category><![CDATA[bug]]></category>
		<category><![CDATA[continuous integration]]></category>
		<category><![CDATA[fix]]></category>
		<category><![CDATA[Hudson]]></category>

		<guid isPermaLink="false">http://labs.opsview.com/?p=707</guid>
		<description><![CDATA[
			
				
			
		We&#8217;ve just fixed a bug in Nagios® which an Opsview user had raised to us. A change made to Nagios in version 3.2.2 caused an issue where service alerts were being raised in the nagios.log file for every result that came back from a host that was down. This had the impact of adding lots [...]]]></description>
			<content:encoded><![CDATA[<p>We&#8217;ve just fixed a bug in Nagios® which an <a href="http://opsview.com">Opsview</a> user had raised to us. A <a href="http://tracker.nagios.org/view.php?id=128">change made to Nagios</a> in version 3.2.2 caused an issue where service alerts were being raised in the <em>nagios.log</em> file for every result that came back from a host that was down. This had the impact of adding lots of extra alerts that were overwhelming <a href="http://www.opsview.com/learn/screenshots">Opsview&#8217;s event views</a>.</p>
<p><span id="more-707"></span>To reproduce the problem in Nagios 3.2.3:</p>
<ol>
<li>Create a host with 2 service checks</li>
<li>Let this run normally</li>
<li>Shut down the host</li>
<li>The first service check will notice the state change and set the host to be checked. It will go into a SOFT state and the service will go into a check attempt of 2 and continue into a hard state correctly</li>
<li>The 2nd service check will see that the host is DOWN and force a hard state failure with check attempt 1 of a maximum 4. However, this hard state change did not set the last_hard_state flag correctly, which meant every subsequent check was considered to be a new hard state failure and hence a SERVICE ALERT was raised every time in <em>nagios.log</em></li>
</ol>
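<p>The effect of the missing flag update can be shown with a deliberately simplified model. This Java sketch is not Nagios source code (Nagios Core is written in C) and the class and method names are hypothetical; it only mimics the symptom described in the last step, where an alert is logged whenever the hard state differs from the remembered last hard state.</p>

```java
// Simplified, hypothetical model of the symptom described above; not Nagios source code.
final class HardStateModel {

    static final int OK = 0;
    static final int CRITICAL = 2;

    int lastHardState = OK;
    int alertsLogged = 0;

    // false reproduces the bug (the flag is never updated), true models the fix.
    final boolean recordHardState;

    HardStateModel(boolean recordHardState) {
        this.recordHardState = recordHardState;
    }

    // Processes one failing service result while the host is already DOWN.
    void processResultWhileHostDown() {
        int state = CRITICAL; // forced hard failure
        if (state != lastHardState) {
            alertsLogged++; // stands in for a SERVICE ALERT line in nagios.log
        }
        if (recordHardState) {
            lastHardState = state;
        }
    }

    // Runs the given number of checks and returns how many alerts were logged.
    static int alertsAfter(int checks, boolean recordHardState) {
        HardStateModel model = new HardStateModel(recordHardState);
        for (int i = 0; i != checks; i++) {
            model.processResultWhileHostDown();
        }
        return model.alertsLogged;
    }

    public static void main(String[] args) {
        System.out.println("flag not recorded: " + alertsAfter(5, false) + " alerts"); // 5 alerts
        System.out.println("flag recorded:     " + alertsAfter(5, true) + " alerts");  // 1 alert
    }
}
```

<p>In the model, five results produce five alerts when the flag is never recorded and a single alert when it is, which matches the flood of entries seen in <em>nagios.log</em>.</p>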
<p>This took a long time to track down, but we&#8217;ve found the problem and fixed it. Our fix is pushed to Nagios <a href="http://article.gmane.org/gmane.network.nagios.cvs/3045">already</a>.</p>
<p>While this bug is annoying, we&#8217;re upset that this had an impact on a customer system. We make it our principle to keep as up to date with Nagios as possible because Opsview is a <em>shallow fork</em> of Nagios &#8211; we make only the changes that are necessary to support our customers and we push our changes back upstream where we can.</p>
<p>We&#8217;ve developed a lot of trust with our users &#8211; we make the upgrade process for Opsview as easy as possible because we want all our users to get to the latest version (in fact, we&#8217;ve just had one user update their Opsview from 4 years ago, right up to the latest version, going through over a hundred database changes!).</p>
<p>One thing we do to make sure our systems work as expected, is to continuously test our latest versions of Opsview. We use <a href="http://hudson-ci.org/">Hudson</a> to test Opsview on every change &#8211; currently this runs 5269 individual tests, taking 1 hour 46 minutes.</p>
<p>We want to bring this level of quality assurance to Nagios &#8211; included in our fix is a <a href="http://article.gmane.org/gmane.network.nagios.cvs/3046">test case</a> that checks exactly this issue. Running tests on Nagios will now show that this problem is fixed forever, and our nightly builds of Opsview include these too.</p>
<p>So now everyone can sleep easier knowing that this problem is never going to happen again.</p>
<p><small>Please note: Nagios, the Nagios logo, and Nagios graphics are the servicemarks,  trademarks, or registered trademarks owned by Nagios Enterprises</small></p>
]]></content:encoded>
			<wfw:commentRss>http://labs.opsview.com/2011/01/nagios-bugs-and-how-to-fix-them-permanently/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>JNRPE &amp; check_jmx on 64-bit JVMs</title>
		<link>http://labs.opsview.com/2010/12/jnrpe-check_jmx-on-64-bit-jvms/</link>
		<comments>http://labs.opsview.com/2010/12/jnrpe-check_jmx-on-64-bit-jvms/#comments</comments>
		<pubDate>Tue, 07 Dec 2010 14:47:57 +0000</pubDate>
		<dc:creator>rbramley</dc:creator>
				<category><![CDATA[Frameworks]]></category>
		<category><![CDATA[JMX]]></category>
		<category><![CDATA[Nagios]]></category>
		<category><![CDATA[Opsview]]></category>
		<category><![CDATA[java]]></category>

		<guid isPermaLink="false">http://labs.opsview.com/?p=675</guid>
		<description><![CDATA[
			
				
			
		The JNRPE server provides an open source Java implementation of the Nagios Remote Plugin Executor (NRPE). This is much more efficient for performing JMX checks than regular NRPE as you only need to start one JVM rather than a JVM instantiation per check (as performed by check_jmx invoking java -jar JMXQuery.jar).

However this efficiency gain comes [...]]]></description>
			<content:encoded><![CDATA[<p><a class="lightbox"  title ="java-logo-small" href="http://labs.opsview.com/wp-content/uploads/2010/12/java-logo-small.png"><img src="http://labs.opsview.com/wp-content/uploads/2010/12/java-logo-small.png" alt="" title="java-logo-small" width="125" height="166" class="alignleft size-full wp-image-690" /></a>The JNRPE server provides an open source Java implementation of the Nagios Remote Plugin Executor (NRPE). This is much more efficient for performing JMX checks than regular NRPE as you only need to start one JVM rather than a JVM instantiation per check (as performed by check_jmx invoking java -jar JMXQuery.jar).</p>
<p><span id="more-675"></span></p>
<p>However, this efficiency gain comes with a few compromises: it doesn’t handle composite data as well as other variants of check_jmx, and it doesn’t support performance data. It would be nice to see some community action to produce a definitive check_jmx.</p>
<p>Luckily, with <a href="http://www.opsview.com">Opsview</a> you can work around the lack of performance data using the map.local file, so you still get performance graphs to help with correlation or to view the rate of change over time.</p>
<p>In JNRPE version 0.6.3 there is also an integer overflow bug in the check_jmx base plugin.</p>
<p>The maximum value of a Java int is 2^31 - 1 (2,147,483,647). Because int is a signed 32-bit type, a larger value narrowed to an int wraps around to a negative number. For example, on a JVM configured with a 5GB heap, we saw the following output:</p>
<p><code>JMX OK - HeapMemoryUsage.used is -2037300184</code></p>
<p>Amusingly, with the initial warning/critical thresholds set too low, this caused the state to flap between OK (when heap usage exceeded the signed 32-bit maximum and was therefore reported as negative) and CRITICAL (after garbage collection had brought it back beneath 2GB).</p>
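<p>The wrap-around and the resulting flapping can be reproduced in a few lines. This is a minimal sketch, not JNRPE code: the class name, the simplified threshold check and the threshold values are all assumptions made for illustration. The "before GC" figure is chosen because it is the heap usage below 5GB that narrows to exactly the value in the output above.</p>

```java
// Hypothetical demo class; not part of JNRPE.
class HeapOverflowDemo {

    // Mimics the unpatched behaviour: narrow any Number to a 32-bit int.
    static int asInt(Number value) {
        return value.intValue();
    }

    // Simplified threshold check (an assumption, not the plugin's real logic):
    // CRITICAL at or above critical, WARNING at or above warning, otherwise OK.
    static String state(long value, long warning, long critical) {
        if (value >= critical) {
            return "CRITICAL";
        }
        if (value >= warning) {
            return "WARNING";
        }
        return "OK";
    }

    public static void main(String[] args) {
        long warning = 1_000_000_000L;   // hypothetical thresholds, set "too low"
        long critical = 1_200_000_000L;

        long beforeGc = 2_257_667_112L;  // above Integer.MAX_VALUE
        long afterGc = 1_500_000_000L;   // fits in an int

        int wrapped = asInt(beforeGc);
        int intact = asInt(afterGc);

        // Prints: before GC: -2037300184 -> OK
        System.out.println("before GC: " + wrapped + " -> " + state(wrapped, warning, critical));
        // Prints: after GC: 1500000000 -> CRITICAL
        System.out.println("after GC: " + intact + " -> " + state(intact, warning, critical));
    }
}
```

<p>The larger heap usage is reported as the healthier one, which is exactly the inverted behaviour described above.</p>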
<p>This has been raised with the JNRPE project as issue <a href="http://sourceforge.net/tracker/?func=detail&amp;aid=3131380&amp;group_id=204486&amp;atid=989804">3131380</a>. Unfortunately, having raised it without signing in, I now can’t attach the following simple patch to the parseData method.</p>
<p><code>Index: CCheckJMX.java</code><br />
<code>=================================================================== </code><br />
<code>--- CCheckJMX.java (revision 243)</code><br />
<code>+++ CCheckJMX.java (working copy)</code><br />
<code>@@ -334,7 +334,7 @@</code><br />
<code>- private int parseData(Object o)</code><br />
<code>+ private long parseData(Object o)</code><br />
<code>{</code><br />
<code>if (o instanceof Number)</code><br />
<code>- return ((Number) o).intValue();</code><br />
<code>+ return ((Number) o).longValue();</code><br />
<code>else</code><br />
<code>- return Integer.parseInt(o.toString());</code><br />
<code>+ return Long.parseLong(o.toString());</code><br />
<code>}</code></p>
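<p>For readers who want to try the patched logic outside JNRPE, here is a standalone sketch of the same method. The wrapper class and the main method are assumptions added so it compiles on its own, and the method is made static for the same reason; the body mirrors the "+" lines of the patch.</p>

```java
// Hypothetical wrapper class so the patched method can run on its own.
class ParseDataDemo {

    // Same logic as the patched parseData: keep the full 64-bit value.
    static long parseData(Object o) {
        if (o instanceof Number) {
            return ((Number) o).longValue();
        }
        return Long.parseLong(o.toString());
    }

    public static void main(String[] args) {
        // A Number above Integer.MAX_VALUE is no longer wrapped.
        System.out.println(parseData(Long.valueOf(2_257_667_112L))); // prints 2257667112
        // A string holding 5 GiB in bytes parses without a NumberFormatException.
        System.out.println(parseData("5368709120")); // prints 5368709120
    }
}
```

<p>Any caller that stores the result in an int would need widening too; whether CCheckJMX has such callers is not visible from the patch alone.</p>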
]]></content:encoded>
			<wfw:commentRss>http://labs.opsview.com/2010/12/jnrpe-check_jmx-on-64-bit-jvms/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Enhancing NRPE for large output</title>
		<link>http://labs.opsview.com/2008/08/enhancing-nrpe-for-large-output/</link>
		<comments>http://labs.opsview.com/2008/08/enhancing-nrpe-for-large-output/#comments</comments>
		<pubDate>Tue, 05 Aug 2008 00:28:35 +0000</pubDate>
		<dc:creator>tonvoon</dc:creator>
				<category><![CDATA[Nagios]]></category>
		<category><![CDATA[Opsview]]></category>
		<category><![CDATA[nrpe]]></category>

		<guid isPermaLink="false">http://labs.opsview.com/2008/08/enhancing-nrpe-for-large-output.html</guid>
		<description><![CDATA[Problem: How to enhance NRPE to allow more data to be sent in each query, while still being backwards compatible
]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.nagios.org/download/addons/">NRPE</a> is great for getting plugin information from a remote host. We wanted to use it to get passive data regarding events, such as syslog entries that <a href="http://www.estpak.ee/~risto/sec/">SEC</a> had highlighted. This meant we needed two things: multi-line support and larger amounts of output.</p>
<p><span id="more-83"></span></p>
<p>Multi-line is already in NRPE 2.12 &#8211; this was added by Matthias Flacke last year. However, the limit for data is 1K.</p>
<p>We wanted to be able to bump that figure up to 16K. common.h defines a constant called MAX_PACKETBUFFER_LENGTH, which is set to 1024. We found that increasing this value did return more data, but there were two problems with it: </p>
<ul>
<li>it broke backwards compatibility</li>
<li>it increased the size of each packet</li>
</ul>
<p>The second had an impact on the network: instead of 1K packets travelling between client and server, every packet was now 16K, even if the data it contained was small.</p>
<p>The first was worse: it meant you needed to update the client (check_nrpe) and the server (nrpe) at the same time; otherwise, with only one side changed, you&#8217;d get lots of NRPE errors in Nagios.</p>
<p>So we&#8217;ve designed a compatible way: we&#8217;ve added a new packet type called RESPONSE_PACKET_WITH_MORE.</p>
<p>The idea is that check_nrpe checks whether the packet returned is of the type RESPONSE_PACKET_WITH_MORE. If so, it reads subsequent packets and appends them to the existing data until it gets a RESPONSE_PACKET. So to read 16K worth of data, check_nrpe reads 16 x 1K packets. Of course, only updated nrpe daemons will send the new packet type, so an updated check_nrpe remains fully backwards compatible with existing nrpe daemons.</p>
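<p>The read loop can be sketched as follows. This is an illustration of the idea only, written in Java rather than the C of the real check_nrpe; the Packet class, the enum and the method name are hypothetical, and the packets come from an in-memory list instead of a socket.</p>

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical model of the client-side loop; not the actual check_nrpe code.
class MultiPacketDemo {

    // Stand-ins for the two response packet types named in the post.
    enum PacketType { RESPONSE_PACKET, RESPONSE_PACKET_WITH_MORE }

    // Simplified packet: a type plus its text payload.
    static final class Packet {
        final PacketType type;
        final String payload;

        Packet(PacketType type, String payload) {
            this.type = type;
            this.payload = payload;
        }
    }

    // Append each payload and stop once a plain RESPONSE_PACKET arrives.
    static String readResponse(Iterator<Packet> packets) {
        StringBuilder output = new StringBuilder();
        while (packets.hasNext()) {
            Packet packet = packets.next();
            output.append(packet.payload);
            if (packet.type == PacketType.RESPONSE_PACKET) {
                break; // final packet: nothing more to read
            }
        }
        return output.toString();
    }

    public static void main(String[] args) {
        // An updated daemon splits large output across several packets.
        List<Packet> fromUpdatedDaemon = Arrays.asList(
                new Packet(PacketType.RESPONSE_PACKET_WITH_MORE, "line 1\n"),
                new Packet(PacketType.RESPONSE_PACKET_WITH_MORE, "line 2\n"),
                new Packet(PacketType.RESPONSE_PACKET, "line 3"));
        // An existing daemon only ever sends a single RESPONSE_PACKET.
        List<Packet> fromExistingDaemon = Arrays.asList(
                new Packet(PacketType.RESPONSE_PACKET, "OK"));

        System.out.println(readResponse(fromUpdatedDaemon.iterator()));
        System.out.println(readResponse(fromExistingDaemon.iterator()));
    }
}
```

<p>Because a daemon that has not been updated sends exactly one RESPONSE_PACKET, the loop ends after the first read and behaves just like the old client.</p>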
<p>The patch is <a href="http://altinity.blogs.com/dotorg//nrpe_multiline.patch" title="nrpe_multiline.patch">here</a>. We&#8217;ve also cleaned up some of the graceful_close calls.</p>
<p>Now the process to update your NRPE agents would be: </p>
<ol>
<li>update the central check_nrpe, then</li>
<li>update your agents at your leisure</li>
</ol>
<p>And you won&#8217;t get any alerts during this period!</p>
<p>Note: during testing, we found that the limit for returned data on some Linux kernels was 4K, even though nrpe was coded with 16K as the limit. This is due to kernel limitations in using pipe() for the interprocess communication.</p>
]]></content:encoded>
			<wfw:commentRss>http://labs.opsview.com/2008/08/enhancing-nrpe-for-large-output/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>NSCA&#8217;s aggregate writing</title>
		<link>http://labs.opsview.com/2008/01/nscas-aggregate-writing/</link>
		<comments>http://labs.opsview.com/2008/01/nscas-aggregate-writing/#comments</comments>
		<pubDate>Tue, 08 Jan 2008 15:05:59 +0000</pubDate>
		<dc:creator>tonvoon</dc:creator>
				<category><![CDATA[Nagios]]></category>
		<category><![CDATA[Opsview]]></category>
		<category><![CDATA[distributed]]></category>
		<category><![CDATA[nagconf]]></category>
		<category><![CDATA[NSCA]]></category>

		<guid isPermaLink="false">http://labs.opsview.com/2008/01/nscas-aggregate-writing.html</guid>
		<description><![CDATA[
			
				
			
		In our continuing effort to speed up Opsview, we found a bug in NSCA&#8217;s handling of aggregate writes when run in --single mode.
The specific failure scenario is this:

NSCA and Nagios are told to start up
A send_nsca request is received by NSCA before Nagios has created the nagios.cmd command pipe
NSCA tries to write to [...]]]></description>
			<content:encoded><![CDATA[<p>In our continuing effort to speed up <a href="http://opsview.org">Opsview</a>, we found a bug in NSCA&#8217;s handling of aggregate writes when run in --single mode.</p>
<p>The specific failure scenario is this:</p>
<ol>
<li>NSCA and Nagios are told to start up</li>
<li>A send_nsca request is received by NSCA before Nagios has created the nagios.cmd command pipe</li>
<li>NSCA tries to open the command file, but sees it is not there</li>
<li>NSCA opens the alternate dump file instead</li>
</ol>
<p>Now when Nagios <em>does</em> create the nagios.cmd file, NSCA uses that &#8230; unless aggregate mode is on and daemon mode is --single. In this case, it continues to use the alternate dump file, so Nagios never sees the results from the slaves.</p>
<p>Here&#8217;s the <a href="http://altinity.blogs.com/dotorg//nsca_aggregate_writes_to_alternate.patch" title="nsca_aggregate_writes_to_alternate.patch">patch</a>, which we&#8217;ve also <a href="http://trac.opsview.org/changeset/697">added</a> into our source for Opsview.</p>
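<p>To illustrate the failure mode (and not the patch itself, which is C and is linked above), here is a minimal sketch in Java of choosing the write target afresh on each write instead of remembering the first choice. The class name, method name and file names are hypothetical, and a regular file stands in for the named pipe.</p>

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration; NSCA itself is written in C.
class CommandFileDemo {

    // Re-check on every write: prefer the command file once it exists,
    // otherwise fall back to the alternate dump file.
    static Path chooseTarget(Path commandFile, Path dumpFile) {
        return Files.exists(commandFile) ? commandFile : dumpFile;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("nsca-demo");
        Path commandFile = dir.resolve("nagios.cmd");
        Path dumpFile = dir.resolve("nsca.dump");

        // Before Nagios has created the command file, the dump file is used.
        System.out.println(chooseTarget(commandFile, dumpFile).getFileName());

        // Simulate Nagios creating the command file.
        Files.createFile(commandFile);

        // The next write goes to the command file instead.
        System.out.println(chooseTarget(commandFile, dumpFile).getFileName());
    }
}
```

<p>The bug described above corresponds to making this choice once and keeping it: a writer that cached the dump file from the first call would never switch over.</p>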
<p>As we are very keen on good <a href="http://altinity.blogs.com/dotorg/2007/01/the_importance_.html">testing</a>, we&#8217;ve managed to recreate the failing behaviour in a <a href="http://altinity.blogs.com/dotorg//nsca_alternate.t" title="nsca_alternate.t">test script</a>. You also need a test <a href="http://altinity.blogs.com/dotorg//nsca_aggregate.cfg" title="nsca_aggregate.cfg">configuration file</a> and a <a href="http://altinity.blogs.com/dotorg//nsca_aggregate_writes_to_alternate_tests.patch" title="nsca_aggregate_writes_to_alternate_tests.patch">patch</a> to the test framework. Run before the fix, the test shows the error; once the patch is applied, it should pass.</p>
]]></content:encoded>
			<wfw:commentRss>http://labs.opsview.com/2008/01/nscas-aggregate-writing/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

