[rrd-developers] [Ganglia-developers] Integrating rrdcached with Ganglia

Tue Sep 30 15:33:40 CEST 2008

On Tue, Sep 30, 2008 at 01:01:55PM +0100, Daniel.Pocock at barclayscapital.com wrote:
> > I get lots of errors like this, and gaps in my graphs:
> > 
> > Sep 30 11:05:42 servername rrdcached[18002]: queue_thread_main:
> > rrd_update_r 
> > (/.../rrds/unspecified/servername/disk_total.rrd) failed with 
> > status -1. (/.../rrds/unspecified/servername/disk_total.rrd:
> > illegal attempt to update using time 1222768885 when last 
> > update time is
> > 1222768885 (minimum one second step))

Daniel,

The daemon makes no attempt to ensure that you are updating the files
correctly.  Any RRD process (with or without rrdcached) that tries to
update the same file without advancing the time_t will generate this
error.
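
As a rough illustration of that rule (a minimal sketch, not rrdtool's
actual code): an update is accepted only if its timestamp is strictly
later than the file's last update time.  The function name
split_updates and the assumption that every update string starts with
a numeric "time:" field are mine, purely for illustration.

```python
def split_updates(last_update, updates):
    """Partition "time:value..." strings into (accepted, rejected).

    Hypothetical helper: mimics the "minimum one second step" check by
    requiring each timestamp to be strictly greater than the previous one.
    """
    accepted, rejected = [], []
    for update in updates:
        # Assumes a numeric leading timestamp, e.g. "1222768885:42".
        timestamp = int(update.split(":", 1)[0])
        if timestamp > last_update:
            accepted.append(update)
            last_update = timestamp
        else:
            # Same (or older) time_t as the last update: this is the
            # case that produces the error quoted above.
            rejected.append(update)
    return accepted, rejected


if __name__ == "__main__":
    ok, bad = split_updates(
        1222768884,
        ["1222768885:1", "1222768885:1", "1222768888:2"],
    )
    print("accepted:", ok)
    print("rejected:", bad)
```

The second "1222768885" string ends up in the rejected list, which is
the situation the log message describes.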

> I decided to have a closer look at what is happening:
> 
> - the errors are only logged when someone retrieves a graph (in other
> words, when rrdcached is told to flush everything)

They would also be logged when the daemon flushes periodically to disk (-w
and -f timers).

> - I increased the polling interval (now 3 seconds)
> 
> - compiling with debug symbols and setting a breakpoint on the error, I
> discover the following:
> 
> Breakpoint 1, queue_thread_main (args=0x0) at rrd_daemon.c:703
> 703           RRDD_LOG (LOG_NOTICE, "queue_thread_main: "
> (gdb) info locals
> ci = (cache_item_t *) 0x1658ded0
> file = 0x2aaab85e9b80 "/.../unspecified/__SummaryInfo__/mem_buffers.rrd"
> values = (char **) 0x2aaaab33b010
> status = -1
> i = 2
> values_num = 59739

values_num is the number of values enqueued for a particular RRD file.
You said you have been running for about 300 seconds at 3 seconds per
update, i.e. 300 / 3 = 100 updates.  Therefore, each file should not have
more than 100 values queued.

It looks like you are sending the updates multiple times to the daemon.
It would take roughly 597 strings per file in every poll interval
(59739 / 100 = 597.39) to advance values_num that far in only 100 poll
intervals.

This explains the duplicate strings here:

> (gdb) x/s *(0x2aaaab33b010)
> 0x1658dfb0:      "1222774961:314936.000:1"
> (gdb) x/s *(0x2aaaab33b010+8)
> 0x1658f460:      "1222774961:314936.000:1"

I am not familiar with Ganglia, but it looks to me like the process that
feeds the RRD file is not separating multiple targets into unique file
names.  By writing every output to the same file name, it's enqueueing
lots of duplicate results.  Are you polling ~597 targets by chance?
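
One way to check this theory from a dump of the queued strings is
sketched below.  It is only an illustration: the helper names
writes_per_timestamp and duplication_factor are hypothetical, and it
assumes each queued value has the "time:value..." form seen in the gdb
output above.

```python
from collections import Counter


def writes_per_timestamp(values):
    """Count how many queued strings share each leading timestamp."""
    return Counter(v.split(":", 1)[0] for v in values)


def duplication_factor(values_num, run_seconds, poll_interval):
    """Ratio of queued values to the number of polls that could have run."""
    expected_updates = run_seconds / poll_interval
    return values_num / expected_updates


if __name__ == "__main__":
    # The two strings taken from the gdb session above.
    queued = ["1222774961:314936.000:1", "1222774961:314936.000:1"]
    for timestamp, count in writes_per_timestamp(queued).items():
        if count > 1:
            print(f"{timestamp}: {count} writes to the same file")

    # 59739 queued values after about 300 seconds at 3 seconds per poll.
    print(duplication_factor(59739, 300, 3))
```

If most timestamps show a count close to the duplication factor, that
would be consistent with many sources being written to one file name.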

If you have Ganglia configured to log errors, it should have been logging
the same RRD errors before you started using rrdcached.  They would have
been logged to "daemon.info".  rrdcached logs the update messages to
"daemon.notice".

You may want to enable this line to confirm what Ganglia is trying to
write:

$GANGLIA/monitor-core/gmetad/rrd_helpers.c:60
  /* debug_msg("Updated rrd %s with value %s", rrd, val); */

-- 
 kevin brintnall =~ /kbrint at rufus.net/