Fixing Erroneous Data in Ganglia Metrics by Editing RRDtool
Once upon a time, some of my network graphs generated by ganglia showed that some of my machines managed to shunt > 450 Petabytes/second of network traffic for about 45 seconds. Given that these things have a couple of gigabit NICs, I figured that we hadn't broken Physics and that these numbers were Incorrect.
This led me to discover that, contrary to my previous understanding/assumption, the RRDtool files that ganglia uses to store its time-series data are not too difficult to work with. This is because there is a straightforward editing pattern of dump-to-xml, edit, restore-from-xml.
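For a single file, that pattern can be sketched roughly as below. This is a minimal illustration rather than a polished tool: `rrd_roundtrip` is a hypothetical helper name, the sed expression is whatever edit you need, and it assumes GNU sed (for `-i`) and that keeping the original next to the new file as `.bak` is acceptable.

```shell
# Sketch: dump one RRD to XML, edit the XML, rebuild the RRD.
# rrd_roundtrip is a hypothetical name, not part of rrdtool or ganglia.
rrd_roundtrip() {
    local rrd=$1 sed_expr=$2
    local xml
    xml=$(mktemp) || return 1
    rrdtool dump "$rrd" > "$xml" || return 1     # 1. dump to XML
    sed -i -e "$sed_expr" "$xml" || return 1     # 2. edit the XML in place
    mv "$rrd" "$rrd.bak" || return 1             # 3. keep the original as a backup
    rrdtool restore "$xml" "$rrd" || return 1    # 4. rebuild the RRD from the XML
    rm -f "$xml"
}

# Example call (path and expression are illustrative):
#   rrd_roundtrip /var/lib/ganglia/rrds/kvm/somehost/bytes_in.rrd 's/e+17 /e+05 /'
```

Moving the original aside before the restore matters, because `rrdtool restore` will not overwrite an existing file unless told to.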
I have a ganglia cluster called 'kvm' so the rrds live in /var/lib/ganglia/rrds/kvm.
Four files contained errors: bytes_in.rrd, bytes_out.rrd, pkts_in.rrd and pkts_out.rrd.
The XML dumps of these files contain lines like:
<!-- 2012-05-25 08:28:00 EDT / 1337948880 --> <row><v> 3.0768559000e+05 </v></row>
but there are 4 anomalous lines in each file like:
<!-- 2012-05-25 08:29:15 EDT / 1337948955 --> <row><v> 7.6378559000e+17 </v></row>
See that 'e+17'? 12 orders of magnitude away from the surrounding numbers? Really? No.
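Spotting such rows by eye gets tedious, so here is a rough sketch of how the dump could be scanned for them. `find_spikes` is a hypothetical helper, and the limit is an assumption you choose yourself: something comfortably above what the hardware can do (a couple of gigabit NICs top out around 2.5e8 bytes/second, so 1e9 would be a generous ceiling).

```shell
# Sketch: print every <row> of an rrdtool XML dump that contains a value above a limit.
# find_spikes is a hypothetical name; the limit is supplied by the caller.
find_spikes() {
    local xml=$1 limit=$2
    awk -v limit="$limit" '
        /<row>/ {
            rest = $0
            # walk over each <v>...</v> cell in the row
            while (match(rest, /<v>[^<]*<\/v>/)) {
                val = substr(rest, RSTART + 3, RLENGTH - 7)
                if (val + 0 > limit + 0) { print $0; break }
                rest = substr(rest, RSTART + RLENGTH)
            }
        }' "$xml"
}

# Example call (file name is illustrative):
#   find_spikes /tmp/bytes_in.xml 1e9
```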
Here's my "one-liner" to fix this mess:
for host in kvm1 kvm2 kvm3; do
  # append the part of this machine's hostname that follows its leading lowercase letters
  host=${host}$(hostname | sed -e 's/^[a-z]*//')
  for m in bytes pkts; do
    for d in in out; do
      rrd=/var/lib/ganglia/rrds/kvm/${host}/${m}_${d}.rrd
      xml=/tmp/${m}_${d}.xml
      rrdtool dump "$rrd" > "$xml"          # dump to XML
      sed -i -e 's/e+17 /e+05 /' "$xml"     # knock the bogus exponent back down
      mv "$rrd" /tmp/                       # move the original out of the way
      rrdtool restore "$xml" "$rrd"         # rebuild the RRD from the edited XML
    done
  done
done
Now all I have to do is work out why this happened in the first place. But at least I have had some more practice with RRDtool and am less averse to it.