Automated Monitoring with Nagios and Puppet

"If it's not monitored, it doesn't exist." That's my phrasing of a common idea; I've absorbed it over the years and I'm not sure (and couldn't easily determine) to whom I should give credit.

Recently, I set up a monitoring service that used Puppet to generate configuration for Nagios to monitor various hosts and services. Bringing up a trivial, experimental monitoring server isn't difficult exactly, but there are some details that are easily forgotten. The Getting started section below shows my notes to get as far as running nagios and monitoring at the most trivial but useful level. It's intended to be a zero-to-something guide.

The next section, Generating basic Nagios configuration with Puppet, demonstrates an approach to automating the generation of nagios configuration based on the puppet server's collection of nagios_host resources exported by each node. If you're not yet at least somewhat familiar with puppet, this article might seem rather sparse on explanation, but if you are, it should give you a head start on getting some value out of it all.

Getting started

There are a few basic steps to getting going with nagios. Here is how I start a minimally useful new nagios deployment. The details of these instructions are specific to RHEL-based Linux distributions like CentOS, but analogous approaches should work on Debian-based Linux and BSD systems. One cautionary note: don't try this with nginx unless you consider yourself an nginx expert, or you'll likely waste a lot of time. It's perfectly possible, of course, but most of the articles that rise to the top of web search results are Apache-focused, out of date, or otherwise less than helpful.

On hosts to be monitored

I use NRPE as my nagios agent. The idea is that each host to be monitored runs an NRPE agent which gathers certain information about the machine, and the nagios server contacts the NRPE agent instead of directly checking all those details itself. This moves load from the server to the clients and reduces network chattiness.

My NRPE agents are themselves managed by puppet and although it is very straightforward to do so, it is outside the scope of this article.

Ensure that the EPEL yum repo is enabled first.

Install NRPE and some nagios plugins:

yum install nrpe nagios-plugins{-swap,-users,-procs,-disk,-load}

In every environment, I modify the following directive in the nrpe config (/etc/nagios/nrpe.cfg):

allowed_hosts=<IP address of the monitoring server>, 127.0.0.1

I also modify the lines in nrpe.cfg that define commands. Here are some examples:

command[check_users]=/usr/lib64/nagios/plugins/check_users -w 5 -c 10
command[check_load]=/usr/lib64/nagios/plugins/check_load -w 15,10,5 -c 30,25,20
command[check_var]=/usr/lib64/nagios/plugins/check_disk -w 20% -c 10% -p /var
command[check_slash]=/usr/lib64/nagios/plugins/check_disk -w 20% -c 10% -p /
command[check_zombie_procs]=/usr/lib64/nagios/plugins/check_procs -w 5 -c 10 -s Z
command[check_total_procs]=/usr/lib64/nagios/plugins/check_procs -w 150 -c 200

The threshold values for warning/critical status should probably be configured by puppet as well, based on the machine's usage and resources. For example, you might have a template for nrpe.cfg that outputs a check_load command whose thresholds depend on the number of cores in the machine.
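To make the idea of core-dependent thresholds concrete, here is a minimal shell sketch rather than a puppet template. The function name nrpe_load_command and the multipliers are my own illustrative assumptions, not recommendations; the plugin path is the one used in the examples above.

```shell
#!/bin/sh
# Sketch: derive check_load thresholds from the core count.
# The multipliers below are illustrative assumptions, not tuned values.

# nrpe_load_command CORES
# Prints an nrpe.cfg check_load line for a machine with CORES cores.
nrpe_load_command() {
    cores=$1
    plugin=/usr/lib64/nagios/plugins/check_load
    # Warning at 3x/2x/1x the core count for the 1/5/15 minute load
    # averages; critical at twice the warning values.
    warn="$((cores * 3)),$((cores * 2)),$((cores * 1))"
    crit="$((cores * 6)),$((cores * 4)),$((cores * 2))"
    printf 'command[check_load]=%s -w %s -c %s\n' "$plugin" "$warn" "$crit"
}
```

On a live machine the core count could come from something like getconf _NPROCESSORS_ONLN. In a puppet template the same arithmetic would presumably be driven by a fact such as processorcount.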

Start nrpe:

service nrpe start
chkconfig nrpe on

On the monitoring host

Installing software

Ensure that the EPEL yum repo is enabled first.

Install nagios and some plugins:

yum install nagios nagios-plugins nagios-plugins-fping gd gd-devel nagios-plugins-nrpe

My php-capable webserver in this case is apache/php-fpm and it is itself managed by puppet, but if you're doing it by hand, it will look something like this. I tend to use the IUS repo for php/mysql; if you use stock php, you will want to s/php53u/php/ in the following command.

Install apache, php and create a password for your nagios administrative user:

yum install httpd httpd-tools php53u{,-cli,-fpm,-gd}
htpasswd -c /etc/nagios/passwd nagiosadmin

Here is some sample Apache config for /etc/httpd/conf.d/nagios.conf, which you can modify if you have different locations or different authentication/authorization requirements:

ScriptAlias /nagios/cgi-bin/ "/usr/lib64/nagios/cgi-bin/"

<Directory "/usr/lib64/nagios/cgi-bin/">
   #SSLRequireSSL
   Options ExecCGI
   AllowOverride None
   Order allow,deny
   Allow from all
   AuthName "Nagios Access"
   AuthType Basic
   AuthUserFile /etc/nagios/passwd
   Require valid-user
</Directory>

Alias /nagios "/usr/share/nagios/html"

<Directory "/usr/share/nagios/html">
   #SSLRequireSSL
   Options None
   AllowOverride None
   Order allow,deny
   Allow from all
   AuthName "Nagios Access"
   AuthType Basic
   AuthUserFile /etc/nagios/passwd
   Require valid-user
</Directory>

Start your webserver:

service php-fpm start
service httpd start
chkconfig php-fpm on
chkconfig httpd on

Point your web browser at http://<hostname>/nagios and verify that you can see the nagios page.

Almost minimal nagios configuration

Configure nagios with a command to check using nrpe.

In /etc/nagios/objects/commands.cfg:

define command{
        command_name    check_nrpe
        command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
}
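Before wiring this command into service definitions, it can help to run the plugin by hand from the monitoring host. The following is a small sketch, assuming the check_nrpe plugin lives at the path shown (override CHECK_NRPE if it does not); the function name nrpe_smoke_test is hypothetical. It uses the same -H and -c options as the command definition above.

```shell
#!/bin/sh
# Sketch: run a list of NRPE commands against one host and report the results.
# CHECK_NRPE may be overridden; the default path is an assumption based on
# the plugin directory used elsewhere in this article.
CHECK_NRPE=${CHECK_NRPE:-/usr/lib64/nagios/plugins/check_nrpe}

# nrpe_smoke_test HOST COMMAND...
# Nagios plugins exit 0 for OK and non-zero for WARNING, CRITICAL or UNKNOWN,
# so this returns 0 only if every check reported OK.
nrpe_smoke_test() {
    host=$1
    shift
    failed=0
    for cmd in "$@"; do
        if output=$("$CHECK_NRPE" -H "$host" -c "$cmd" 2>&1); then
            printf 'ok    %s: %s\n' "$cmd" "$output"
        else
            printf 'FAIL  %s: %s\n' "$cmd" "$output"
            failed=1
        fi
    done
    return "$failed"
}
```

For example, nrpe_smoke_test web01.example.com check_var check_slash (with a placeholder hostname) would exercise the two disk checks defined in nrpe.cfg earlier. A failure here usually points at allowed_hosts or a firewall rather than at nagios itself.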

Configure nagios with a minimal set of hostgroups. Every host will use this hostgroup initially. In /etc/nagios/objects/hostgroups.cfg:

define hostgroup {
  hostgroup_name    all-servers
  alias             All my servers
  members           *
}

Configure nagios with an almost minimal set of services. Every host will use only these initially. These checks refer to commands already defined in commands.cfg that send requests to nrpe. In /etc/nagios/objects/services.cfg:

define service{
  hostgroup_name            all-servers
  use                       generic-service
  service_description       /var freespace
  check_command             check_nrpe!check_var
}

define service{
  hostgroup_name            all-servers
  use                       generic-service
  service_description       / freespace
  check_command             check_nrpe!check_slash
}

Test with:

nagios -v /etc/nagios/nagios.cfg

Once the nagios pre-flight check succeeds to your satisfaction, run nagios and set it to start on boot:

service nagios start
chkconfig nagios on

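To ensure that nagios reads the configuration files that puppet will create, add this line to /etc/nagios/nagios.cfg. The directory itself is created by the nagios::server puppet class described below; nagios should complain about a cfg_dir that doesn't exist, so restart nagios only once the directory is in place: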

cfg_dir=/etc/nagios/resource.d
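If you script this step, it is worth making the change idempotent so that repeated runs don't pile up duplicate directives. Here is a minimal sketch; the helper name ensure_line is my own invention.

```shell
#!/bin/sh
# Sketch: append a directive to a config file only if it is not already there.

# ensure_line FILE LINE
ensure_line() {
    file=$1
    line=$2
    # -x matches whole lines only, -F treats LINE as a fixed string,
    # -q suppresses output; append only when no exact match exists.
    grep -qxF -e "$line" "$file" || printf '%s\n' "$line" >> "$file"
}
```

Usage would be along the lines of ensure_line /etc/nagios/nagios.cfg 'cfg_dir=/etc/nagios/resource.d'. I'm assuming here that your nagios.cfg does not already reference the hostgroups.cfg and services.cfg files created earlier; if it doesn't, the same helper can add a cfg_file=/etc/nagios/objects/hostgroups.cfg line and a matching one for services.cfg.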

Generating basic Nagios configuration with Puppet

In an environment where (virtual) machines are automatically instantiated, basic monitoring should come "for free" at the time of installation. I am assuming that you already have a basic Puppet installation with puppet version 2.7.12+ (on the server, at least). The approach here is preferred for Puppet 3.0+ as well.

Basic Overview

Install PuppetDB.
Configure the puppet server to use PuppetDB.
Create a puppet class for a nagios server.
Create a puppet class that exports a nagios_host resource for each node.
Each host that is to be monitored runs puppet and should "export nagios_host resources".
PuppetDB stores these exported resources.
The puppet agent running on the monitoring server collects the exported nagios_host resources and writes nagios config files accordingly.

Once this is working, we can expand to generating nagios configuration to monitor services.

Install PuppetDB

Puppet Labs publishes instructions for installing PuppetDB and for connecting a puppet server to it; what follows is a condensed version.

Ensure you have the puppetlabs yum repo enabled.

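On your PuppetDB server:

yum install puppetdb
puppetdb-ssl-setup
service puppetdb start

Configure the puppet server to use PuppetDB

On the puppet server, install the terminus package that lets it talk to PuppetDB:

yum install puppetdb-terminus

Create /etc/puppet/puppetdb.conf: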

[main]
server = puppetdb.example.com
port = 8081

Edit the puppet server's /etc/puppet/puppet.conf:

[master]
  storeconfigs = true
  storeconfigs_backend = puppetdb

Create, if necessary, /etc/puppet/routes.yaml:

---
master:
  facts:
    terminus: puppetdb
    cache: yaml

A nagios::server class

This is somewhat primitive, but it works and is simple. In modules/nagios/manifests/server.pp (the path puppet's autoloader conventionally expects for nagios::server):

class nagios::server {
  package { ['nagios', 'nagios-plugins', 'nagios-plugins-nrpe']:
    ensure => installed,
  }

  service { 'nagios':
    ensure  => running,
    enable  => true,
    require => Exec['make-nag-cfg-readable'],
  }

  # Puppet writes the generated config files with permissions that the
  # nagios user can't read, so make them readable.
  exec { 'make-nag-cfg-readable':
    command => "find /etc/nagios -type f -name '*cfg' | xargs chmod +r",
    # exec needs a search path (or fully qualified commands)
    path    => ['/bin', '/usr/bin'],
  }

  file { 'resource-d':
    ensure => directory,
    path   => '/etc/nagios/resource.d',
    owner  => 'nagios',
  }

  # Collect the nagios_host resources exported by every node
  Nagios_host <<||>> {
    require => File['resource-d'],
    notify  => [Exec['make-nag-cfg-readable'], Service['nagios']],
  }
}

With this you can have puppet create a server with nagios config files based on the collection of exported nagios_host resources as follows:

node "nagios.example.com" {
  include nagios::server
}

Export nagios_host resources

Create a puppet class that exports a nagios_host resource, to be included (or inherited) by every node that should be monitored. In the simplest case, I could have modules/nagios/manifests/export.pp:

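class nagios::export {
  @@nagios_host { $::fqdn:
    # Assumption: inherit required settings from the stock linux-server
    # template in templates.cfg; substitute your own host template if needed.
    use           => 'linux-server',
    address       => $::ipaddress,
    check_command => 'check-host-alive!3000.0,80%!5000.0,100%!10',
    hostgroups    => 'all-servers',
    target        => "/etc/nagios/resource.d/host_${::fqdn}.cfg",
  }
}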

At this point I'm able to include this in every node. In manifests/site.pp:

node default {
  include nagios::export
}

Then, once everything settles and all nodes have successfully performed a puppet run, I can run puppet on the nagios server and it should start monitoring all nodes. It's a little simplistic at this point; each node is only monitored for ping-up and free diskspace, but it is still an improvement over a non-puppetized nagios setup because every new node gets added to monitoring automatically, which relieves the administrative burden and promotes accuracy.
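A quick way to confirm that the collection worked is to list the hosts that puppet wrote into resource.d on the nagios server. This is a rough sketch that assumes the generated files follow the host_<fqdn>.cfg naming used in the target parameter above and contain ordinary "host_name <value>" lines; the function name list_generated_hosts is hypothetical.

```shell
#!/bin/sh
# Sketch: list the host_name values found in a directory of generated
# nagios host definitions.

# list_generated_hosts DIR
list_generated_hosts() {
    dir=$1
    # Print the second field of every "host_name <value>" line, sorted.
    # Errors from an empty or missing directory are silenced, so the
    # function simply prints nothing in that case.
    cat "$dir"/host_*.cfg 2>/dev/null |
        awk '$1 == "host_name" { print $2 }' |
        sort
}
```

Running list_generated_hosts /etc/nagios/resource.d after a puppet run on the nagios server should print one fqdn per exported node. Any node missing from the list has probably not completed a puppet run since the export class was added.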

AllGoodBits.org