Basic Monitoring with Ganglia
There are two main meanings of the word "monitoring" to sysadmins:
- Wearing an operations hat, we are interested in "Event Monitoring" in order to trigger alerts when something bad happens so it can be fixed.
- Wearing an engineering hat, we are interested in performance/utilisation monitoring over time in order to better use the resources available.
Ganglia is a tool primarily designed to focus on the latter goal; while you can look at ganglia to see when a machine or even a service is unavailable, there are tools that are probably more suited to that job, such as Nagios. Ganglia is more suitable for looking at trends to help with engineering improvements and capacity planning.
Ganglia is a distributed monitoring system. It uses an agent, gmond, running on each host to collect data about that host and send it over the network to one or more collectors. On each collector, another gmond receives the data, and gmetad polls it, aggregates the results, and stores them in round-robin database (RRD) files using RRDtool. A PHP-based web interface displays the data.
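In the unicast setup described below, the data flow looks roughly like this (the collector also runs its own gmond):

node gmond ---\
node gmond ----UDP----> collector gmond <---TCP poll--- gmetad ---> RRD files ---> web UI
node gmond ---/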
Basic Configuration
For a simple setup, you need gmond configured on each machine you want to monitor, and gmond, gmetad and the web interface on the collector. I'm using unicast UDP, but ganglia happily supports multicast if your network permits it.
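If you're on a Debian-style system, the pieces typically come as separate packages; the names below are a guide rather than gospel, so check what your distribution actually calls them:

apt-get install ganglia-monitor                             # gmond, on every monitored machine
apt-get install ganglia-monitor gmetad ganglia-webfrontend  # the collector needs all three pieces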
gmond
The default configuration file, gmond.conf, is probably very close to what you need, at least initially, but I specify some identification information and the destination for my collected stats:
cluster {
  name = "Test"
  owner = "DH"
  latlong = "unspecified"
  url = "unspecified"
}

udp_send_channel {
  host = collector.example.com
  port = 8649
  ttl = 1
}
gmetad
gmetad polls gmond (or another gmetad) for data; only one line in gmetad.conf needs to be edited:
data_source "Test" localhost
This means that gmetad will be asking the gmond running locally for information about the cluster named "Test". If you have more than one head node per cluster, you can give gmetad backup sources of information for this cluster by appending additional hostnames to the data_source line:
data_source "Test" localhost anotherhost.domain.local:8650
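Under the hood, gmetad fetches the cluster state by connecting to gmond's TCP port (8649 by default, or whatever port you list in data_source) and reading back an XML dump, so a quick sanity check is to pull that XML yourself. A sketch, assuming netcat is available:

nc localhost 8649 | head                           # XML dump from the local gmond
nc anotherhost.domain.local 8650 | grep '<HOST'    # the same check against the backup source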
Permissions
Ensure that the RRD files are readable by the web server and writable by the user gmetad runs as.
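For example, assuming the RRDs live under /var/lib/ganglia/rrds and gmetad runs as a dedicated ganglia user (both are common defaults, but check yours):

chown -R ganglia:ganglia /var/lib/ganglia/rrds   # gmetad needs to write here
chmod -R o+rX /var/lib/ganglia/rrds              # the web server only needs to read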
Start it up
Initially, you might want to start the gmond/gmetad binaries by hand, possibly with the -d option to get extra debugging output, but once you're comfortable, just use your OS's normal startup-script procedure and you're golden.
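Something along these lines, with the config paths and init script names as assumptions to adapt to your system:

gmond -c /etc/ganglia/gmond.conf -d 5      # run in the foreground with debug output while testing
gmetad -c /etc/ganglia/gmetad.conf -d 5

/etc/init.d/ganglia-monitor start          # typical init script names; yours may differ
/etc/init.d/gmetad start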
Tiered Ganglia Architecture
Having successfully set up a single machine, the next step is to include all the members of the various groups' computing clusters that we support. Here's a brief reference for a unicast-based configuration for a tiered ganglia grid.
Individual machines: gmond
Each monitored server should run gmond, with a gmond.conf that contains:
cluster {
  name = "<clustername>"
  owner = "<owner>"
  latlong = "unspecified"
  url = "unspecified"
}

udp_send_channel {
  host = <clusterlead>
  port = 8649
}
subject to the following definitions:
- clustername
- The name of the cluster, for example "Numbers", "Games", "Colours"
- owner
- The owning group, for example "Computational Chemistry", "DSP"
- building
- The name of the building where it is physically located.
- clusterlead
- The fully qualified domain name of the lead host of the cluster
I should probably note that the udp_send_channel stanza might need a ttl parameter, so it's probably best to declare it explicitly. It should be equal to the number of tiers in your ganglia architecture. So if your node's gmond is handing the data to another gmond which is then polled by gmetad, it should be '2'. If your node's gmond is queried directly by gmetad, it can be just '1'.
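Putting that together, a node in the hypothetical "Numbers" cluster whose head node gmond is in turn polled by gmetad (two tiers) might carry:

cluster {
  name = "Numbers"
  owner = "Computational Chemistry"
  latlong = "unspecified"
  url = "unspecified"
}

udp_send_channel {
  host = head.numbers.example.org   /* hypothetical cluster lead */
  port = 8649
  ttl = 2                           /* node gmond -> head gmond -> gmetad */
}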
The Ganglia Master
The ganglia server should run gmetad so that the grid>cluster hierarchy works correctly.
gmetad.conf contains:
gridname "<metaclustername>"

data_source "<clustername>" <clusterlead1>
data_source "<another_clustername>" <clusterlead2>
subject to the following definitions:
- metaclustername
- The name of the grid as a whole, shown at the top level of the web interface
- clustername
- The name of the cluster, for example "Numbers", "Games", "Colours"
- clusterlead
- The fully qualified domain name of the head node of the cluster in question
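Filled in with the example names from above (the grid name and host names are purely illustrative):

gridname "Research Computing"

data_source "Numbers" head.numbers.example.org
data_source "Games" head.games.example.org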
Modification for multicast
gmond benefits from multicast because every gmond can listen on the multicast address and so holds information about the whole cluster. If you then list several gmond nodes in gmetad's data_source entry for that cluster, losing one node won't mean you lose data for the whole cluster.
Here are example stanzas for gmond.conf (select your preferred multicast address and port; I like one port per cluster, but this is not necessary):
udp_send_channel {
  bind_hostname = yes
  mcast_join = 239.2.0.1
  port = 8648
  ttl = 2
}

/* You can specify as many udp_recv_channels as you like as well. */
udp_recv_channel {
  mcast_join = 239.2.0.1
  port = 8648
  bind = 239.2.0.1
}
And then gmetad.conf will look like:
data_source "Test" node1.example.com:8648 node2.example.com:8648