Automated Monitoring with Nagios and Puppet
"If it's not monitored, it doesn't exist." That's my phrasing of a common idea; I've absorbed it over the years and I'm not sure (and couldn't easily determine) to whom I should give credit.
Recently, I set up a monitoring service that used Puppet to generate configuration for Nagios to monitor various hosts and services. Bringing up a trivial, experimental monitoring server isn't difficult exactly, but there are some details that are easily forgotten. The Getting started section below shows my notes to get as far as running nagios and monitoring at the most trivial but useful level. It's intended to be a zero-to-something guide.
The next section, Generating basic Nagios configuration with Puppet, demonstrates an approach to automating the generation of nagios configuration based on the puppet server's collection of nagios_host resources exported by each node. If you're not yet at least somewhat familiar with puppet, this article might seem rather sparse on explanation, but if you are, it should give a good jump on getting some value out of it all.
Getting started
There are a few basic steps to getting going with nagios. Here is how I start a minimally useful new nagios deployment. The details of these instructions are specific to RHEL-based Linux distributions such as CentOS, but analogous steps should work on Debian-based Linux and BSD systems. One cautionary note: don't try this with nginx unless you consider yourself an nginx expert, or you'll likely waste a lot of time. It's perfectly possible, but most of the articles that rise to the top of web search results are Apache-focused, out of date, or otherwise less than helpful.
On hosts to be monitored
I use NRPE as my nagios agent. The idea is that each host to be monitored runs an NRPE agent which gathers certain information about the machine, and the nagios server contacts the NRPE agent instead of directly checking all those details itself. This moves load from the server to the clients and reduces network chattiness.
My NRPE agents are themselves managed by puppet and although it is very straightforward to do so, it is outside the scope of this article.
Ensure that the EPEL yum repo is installed first.
Install NRPE and some nagios plugins:
yum install nrpe nagios-plugins{-swap,-users,-procs,-disk,-load}
In every environment, I modify the following directive in the nrpe config (/etc/nagios/nrpe.cfg):
allowed_hosts=<IP address of the monitoring server>, 127.0.0.1
I also modify the lines in nrpe.cfg that define commands. Here are some examples:
command[check_users]=/usr/lib64/nagios/plugins/check_users -w 5 -c 10
command[check_load]=/usr/lib64/nagios/plugins/check_load -w 15,10,5 -c 30,25,20
command[check_var]=/usr/lib64/nagios/plugins/check_disk -w 20% -c 10% -p /var
command[check_slash]=/usr/lib64/nagios/plugins/check_disk -w 20% -c 10% -p /
command[check_zombie_procs]=/usr/lib64/nagios/plugins/check_procs -w 5 -c 10 -s Z
command[check_total_procs]=/usr/lib64/nagios/plugins/check_procs -w 150 -c 200
The threshold values for warning/critical status should probably be configured by puppet as well, based on the machine's usage and resources. For example, you might have a template for nrpe.cfg that outputs a check_load command whose thresholds depend on the number of cores in the machine.
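As a rough illustration of scaling thresholds with core count, here is a small shell sketch that prints a check_load line for nrpe.cfg. The function name nrpe_check_load_line and the per-core multipliers are assumptions made up for this example, not tuned recommendations. In a puppet template the same arithmetic would be driven by a fact such as processorcount.

```shell
#!/bin/sh
# Hypothetical helper: print a check_load command definition for nrpe.cfg,
# with thresholds scaled by the number of cores passed as $1.
# The multipliers below are illustrative assumptions, not recommendations.
nrpe_check_load_line() {
    cores=$1
    plugin=/usr/lib64/nagios/plugins/check_load
    # Warning at 4/3/2 times the core count for the 1/5/15 minute averages.
    warn="$((cores * 4)),$((cores * 3)),$((cores * 2))"
    # Critical at twice the warning level.
    crit="$((cores * 8)),$((cores * 6)),$((cores * 4))"
    printf 'command[check_load]=%s -w %s -c %s\n' "$plugin" "$warn" "$crit"
}
```

Calling it as nrpe_check_load_line "$(getconf _NPROCESSORS_ONLN)" uses the number of online cores on the current machine.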
Start nrpe:
service nrpe start
chkconfig nrpe on
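Once the monitoring server has the check_nrpe plugin installed (it comes with nagios-plugins-nrpe, covered in the next section), it can be handy to confirm by hand that the agent answers before involving nagios itself. The sketch below is one way to do that. The function name nrpe_smoke_test and the CHECK_NRPE variable are assumptions for this example. It relies only on the nagios plugin convention that exit status 0 means OK.

```shell
#!/bin/sh
# Path to the check_nrpe plugin; overridable so the function can be tried
# against a stand-in command. The default matches the RHEL/CentOS layout.
CHECK_NRPE=${CHECK_NRPE:-/usr/lib64/nagios/plugins/check_nrpe}

# Hypothetical helper: run each named NRPE command against a host and
# report whether the plugin returned OK (exit status 0).
nrpe_smoke_test() {
    host=$1
    shift
    failed=0
    for cmd in "$@"; do
        if output=$("$CHECK_NRPE" -H "$host" -c "$cmd"); then
            echo "ok     $cmd: $output"
        else
            # Non-zero covers WARNING, CRITICAL and UNKNOWN alike.
            echo "not ok $cmd: $output"
            failed=1
        fi
    done
    return "$failed"
}
```

Run from the monitoring server, something like nrpe_smoke_test web1.example.com check_load check_var check_slash would exercise the commands defined in nrpe.cfg above (the hostname is a placeholder). A connection refusal here usually points at allowed_hosts or a firewall.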
On the monitoring host
Installing software
Ensure that the EPEL yum repo is installed first.
Install nagios and some plugins:
yum install nagios nagios-plugins nagios-plugins-fping gd gd-devel nagios-plugins-nrpe
My php-capable webserver in this case is apache/php-fpm, and it is itself managed by puppet, but if you're doing it by hand, it will look something like this. I tend to use the IUS repo for php/mysql; if you use stock php, you will want to s/php53u/php/ in the following command.
Install apache, php and create a password for your nagios administrative user:
yum install httpd httpd-tools php53u{,-cli,-fpm,-gd}
htpasswd -c /etc/nagios/passwd nagiosadmin
Here is some sample Apache config for /etc/httpd/conf.d/nagios.conf, which you can modify if you have different locations or different authentication/authorization settings:
ScriptAlias /nagios/cgi-bin/ "/usr/lib64/nagios/cgi-bin/"
<Directory "/usr/lib64/nagios/cgi-bin/">
    #SSLRequireSSL
    Options ExecCGI
    AllowOverride None
    Order allow,deny
    Allow from all
    AuthName "Nagios Access"
    AuthType Basic
    AuthUserFile /etc/nagios/passwd
    Require valid-user
</Directory>
Alias /nagios "/usr/share/nagios/html"
<Directory "/usr/share/nagios/html">
    #SSLRequireSSL
    Options None
    AllowOverride None
    Order allow,deny
    Allow from all
    AuthName "Nagios Access"
    AuthType Basic
    AuthUserFile /etc/nagios/passwd
    Require valid-user
</Directory>
Start your webserver:
service php-fpm start
service httpd start
chkconfig php-fpm on
chkconfig httpd on
Point your web browser at http://<hostname>/nagios and verify that you can see the nagios page.
Almost minimal nagios configuration
Configure nagios with a command to check using nrpe.
In /etc/nagios/objects/commands.cfg:
define command{
    command_name    check_nrpe
    command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
}
Configure nagios with a minimal set of hostgroups. Every host will use this hostgroup initially. In /etc/nagios/objects/hostgroups.cfg:
define hostgroup {
    hostgroup_name  all-servers
    alias           All my servers
    members         *
}
Configure nagios with an almost minimal set of services. Every host will use only these initially. These checks use the check_nrpe command defined in commands.cfg above, which sends requests to nrpe. In /etc/nagios/objects/services.cfg:
define service{
    hostgroup_name          all-servers
    use                     generic-service
    service_description     /var freespace
    check_command           check_nrpe!check_var
}
define service{
    hostgroup_name          all-servers
    use                     generic-service
    service_description     / freespace
    check_command           check_nrpe!check_slash
}
Make sure nagios.cfg actually loads any object files you created (for example, with a cfg_file= line for each of hostgroups.cfg and services.cfg), then test with:
nagios -v /etc/nagios/nagios.cfg
Once the nagios pre-flight check succeeds to your satisfaction, run nagios and set it to start on boot:
service nagios start
chkconfig nagios on
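Since a broken configuration will stop nagios from coming back up, it can be worth chaining the pre-flight check and the restart together once nagios is already running. This is a minimal sketch of that idea; the function name safe_nagios_restart and the NAGIOS_BIN and NAGIOS_CFG variables are assumptions for this example.

```shell
#!/bin/sh
# Overridable names so the function can be exercised without a real nagios.
NAGIOS_BIN=${NAGIOS_BIN:-nagios}
NAGIOS_CFG=${NAGIOS_CFG:-/etc/nagios/nagios.cfg}

# Hypothetical helper: restart nagios only if the pre-flight check passes.
safe_nagios_restart() {
    if "$NAGIOS_BIN" -v "$NAGIOS_CFG" >/dev/null; then
        service nagios restart
    else
        echo "pre-flight check failed; nagios left untouched" >&2
        return 1
    fi
}
```

When the check fails, rerunning nagios -v /etc/nagios/nagios.cfg by hand shows the full error output that this wrapper discards.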
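In order to ensure that nagios reads the configuration files that we will have puppet create, add this line to /etc/nagios/nagios.cfg. The directory has to exist before nagios is next restarted; the nagios::server class later in this article creates it.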
cfg_dir=/etc/nagios/resource.d
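If you script this step rather than editing the file by hand, it helps to make the edit idempotent so that repeated runs don't append the line twice. A small sketch, where the function name ensure_line is an assumption for this example:

```shell
#!/bin/sh
# Hypothetical helper: append a line to a file only if that exact line
# is not already present.
ensure_line() {
    file=$1
    line=$2
    # -x matches whole lines only, -F treats the line as a fixed string.
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}
```

For this article the call would be ensure_line /etc/nagios/nagios.cfg "cfg_dir=/etc/nagios/resource.d", run as root.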
Generating basic Nagios configuration with Puppet
In an environment where (virtual) machines are automatically instantiated, basic monitoring should come "for free" at the time of installation. I am assuming that you already have a basic Puppet installation with puppet version 2.7.12+ (on the server, at least). The approach here is preferred for Puppet 3.0+ as well.
Basic Overview
- Install PuppetDB.
- Configure the puppet server to use PuppetDB.
- Create a puppet class for a nagios server.
- Create a puppet class that exports a nagios_host resource for each node.
- Each host that is to be monitored runs puppet and should "export nagios_host resources".
- PuppetDB stores these exported resources.
- The puppet agent running on the monitoring server collects the exported nagios_host resources and writes nagios config files accordingly.
Once this is working, we can expand to Generate nagios configuration to monitor services.
Install PuppetDB
Here are Puppet Labs instructions to install PuppetDB and use PuppetDB.
Ensure you have the puppetlabs yum repo enabled.
On your PuppetDB server:
yum install puppetdb puppetdb-terminus
puppetdb-ssl-setup
service puppetdb start
Configure the puppet server to use PuppetDB
Create /etc/puppet/puppetdb.conf:
[main]
server = puppetdb.example.com
port = 8081
Edit the puppet server's /etc/puppet/puppet.conf:
[master]
storeconfigs = true
storeconfigs_backend = puppetdb
Create, if necessary, /etc/puppet/routes.yaml:
---
master:
  facts:
    terminus: puppetdb
    cache: yaml
A nagios::server class
This is somewhat primitive, but it works and is simple. In modules/nagios/manifests/server.pp:
class nagios::server {
  package { ['nagios', 'nagios-plugins', 'nagios-plugins-nrpe']:
    ensure => installed,
  }

  service { 'nagios':
    ensure  => running,
    enable  => true,
    require => Exec['make-nag-cfg-readable'],
  }

  # Puppet writes the config files with permissions that nagios can't read,
  # so make them world-readable. The exec needs a search path for the
  # unqualified commands and a shell to interpret the pipe.
  exec { 'make-nag-cfg-readable':
    command  => "find /etc/nagios -type f -name '*cfg' | xargs chmod +r",
    path     => ['/bin', '/usr/bin'],
    provider => shell,
  }

  file { 'resource-d':
    ensure => directory,
    path   => '/etc/nagios/resource.d',
    owner  => 'nagios',
  }

  # Collect the nagios_host resources
  Nagios_host <<||>> {
    require => File['resource-d'],
    notify  => [Exec['make-nag-cfg-readable'], Service['nagios']],
  }
}
With this you can have puppet create a server with nagios config files based on the collection of exported nagios_host resources as follows:
node "nagios.example.com" {
  include nagios::server
}
Export nagios_host resources
Create a puppet class that exports a nagios_host resource and is included or inherited by every node. In the simplest case, I could have modules/nagios/manifests/export.pp:
class nagios::export {
  @@nagios_host { $::fqdn:
    address       => $::ipaddress,
    check_command => 'check-host-alive!3000.0,80%!5000.0,100%!10',
    hostgroups    => 'all-servers',
    target        => "/etc/nagios/resource.d/host_${::fqdn}.cfg",
  }
}
At this point I'm able to include this in every node. In manifests/site.pp:
node default {
  include nagios::export
}
Then, once everything settles and all nodes have successfully performed a puppet run, I can run puppet on the nagios server and it should start monitoring all nodes. It's a little simplistic at this point; each node is only monitored for ping-up and free diskspace, but it is still an improvement over a non-puppetized nagios setup because every new node gets added to monitoring automatically, which relieves the administrative burden and promotes accuracy.
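After the puppet run on the nagios server, a quick way to see which nodes were picked up is to list the host files that appeared in resource.d. The sketch below relies on the host_<fqdn>.cfg naming used in the target parameter above; the function name list_generated_hosts is an assumption for this example.

```shell
#!/bin/sh
# Hypothetical helper: print the fqdn of every host that has a generated
# host_<fqdn>.cfg file in the given directory.
list_generated_hosts() {
    dir=${1:-/etc/nagios/resource.d}
    for f in "$dir"/host_*.cfg; do
        [ -e "$f" ] || continue    # the glob matched nothing
        name=${f##*/}              # strip the directory part
        name=${name#host_}         # strip the "host_" prefix
        echo "${name%.cfg}"        # strip the ".cfg" suffix
    done
}
```

Comparing this list against the nodes you expect is a cheap sanity check that every node has exported its nagios_host resource and that the collector ran.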