[rrd-developers] rrd_update_r and rrdcached conspire to lose data

Wed Oct 12 21:31:32 CEST 2011

[moved from rrd-users]

I can reproduce this problem at whim.  I tracked it down to a design flaw in the rrd_update_r() error handling.

rrd_update_r() loops over a list of updates for a given file.  It calls process_arg() for each update (ie. each timestamp).  If process_arg() returns an error, then rrd_update_r() breaks out of the loop and aborts processing.  The remaining updates are not processed.

rrdcached calls rrd_update_r() to write updates to the rrd.  Regardless of whether rrd_update_r() succeeded or failed, a "wrote" record is still added to the journal.  And ci->last_update_stamp has already been incremented by handle_request_update() so both the runtime state and the journal indicates that the updates have been written when in fact nothing at all may have been written.

Contributing to the problem is that handle_request_update() writes an "update" record to the journal even if the update timestamp is older than ci->last_update_stamp.  When the journal is replayed, this record will cause the batch of updates to fail when rrd_update_r() is called.

To reproduce:

Enable rrdcached
Create an rrd
Write some data to it using timestamps: TS+0, TS+1, TS+2, TS+3
Flush and confirm the rrd "last_update" is correct
Write updates TS+3, TS+4, TS+5   (important that TS+3 update is repeated)

Restart rrdcached
rrdcached will replay journal.  It will log an error about TS+3.
Check the rrd, it won't be modified.
Try to write updates TS+4 and TS+5.  rrdcached will reject the updates with an illegal timestamp error.
You'll never be able to write updates TS+4 and TS+5, they're lost forever (unless you bypass rrdcached or replay+delete the journal)

Options for fixing:

- Change rrd_daemon.c:queue_thread_main() to loop over values/values_num, calling rrd_update_r() once for each timestamp.  Log errors, but continue looping.  This option is inefficient and only addresses the bug as it affects rrdcached.

- Change rrd_update_r() to distinguish between fatal errors and report-and-continue errors.  Not sure if rrd_update_r() has the ability to report multiple errors, maybe only the first report-and-continue type error is reported.

You could also change handle_request_update() to only journal if the timestamp is sane.  But that's not enough because you can still have situations where rrdcached does not yet know "last_update_stamp"  (the value is initialized to 0 and never initialized from the file itself).  Hmm... initializing last_update_stamp with the value in the rrd file would improve the accuracy of rrdcached's response to "update".

I'll be coding some kind of fix so my system stops losing data.  I'll post a diff when it's stable.

Steve

----- Original Message -----
From: Steve Jankowski <stevejankowski at yahoo.com>
To: "rrd-users at lists.oetiker.ch" <rrd-users at lists.oetiker.ch>
Cc: 
Sent: Tuesday, October 11, 2011 4:08 PM
Subject: [rrd-users] rrdcached notion of last_update_time out of sync with rrd file

I'm encountering situations where rrdcached's notion of the last_update_time is out of sync with what's actually in the rrd file.  This is leading to the loss of hours worth of data.  I have a rough idea of what triggers the problem, but do not yet have steps to reproduce.  I'll describe my usage and hopefully this will trigger some suspicions.

Environment: CentOS 5.4, rrdtool 1.4.3 / 1.4.5
   (problem encountered with rrdtool 1.4.3.   During debugging, I switched to 1.4.5 rrdcached using the journals created by 1.4.3)
Usage: All updates via rrdcached, rrd_fetch against subset of rrds running once a minute
Size: ~7800 rrds, each updated once a minute.  Updates applied within one second.

My stats processing application sucks up a minute's worth of application stats each minute, converts them into "update" commands and writes them directly to rrdcached via unix socket.  This app holds a long-lived connection to the rrdcached.  A separate monitoring application scrapes stats from a subset of the rrds once a minute (using librrd) with the rrdcached daemon address configured (ie. it sends a flush to rrdcached prior to fetching).

This works great unless the stat processing application dies.  The minutely application stats are buffered up and processed at high-speed when the stats processing app is restarted.  (currently a manual restart, so there could be hours worth of stats to crunch).  While catching up, in excess of 10,000 updates/sec are sent to rrdcached.

The updates appear to accepted by rrdcached (response to each update is "0 errors, enqueued 1 value(s).").  However, a large number of the rrds have no data for the whole catching up time while some have all the data.  Rrd info on the problem files shows this (even after flush and rrdcached restart):

filename = "acme.rrd"
rrd_version = "0003"
step = 60
last_update = 1318330860
header_size = 8248

However, manually adding the next time results in this error:

update /var/acme/stats/rrd/customer/acme/acme_v1.rrd 1318330920:2741387100176:2744313219528:2344931134016:2321908650392:2366106587368:2384946112584:17:73:753673980704:771235491136:357218014544:348830922000:1743477:142642:36463:213:208:5:156248210:1312447
-1 illegal attempt to update using time 1318330920 when last update time is 1318334820 (minimum one second step)

Why does rrdcached think the last update time is 1318334820 when the file is 1318330860 ?

Restarting rrdcached has no effect.  The only work-around I found was to shutdown all rrd-using apps, restart rrdcached (so journals get replayed/flushed), stop rrdcached, delete journals and restart rrdcached.  Now the manual "update" command succeeds.

The only errors logged by rrdcached are of the " illegal attempt to update using time XXXX" variety.

Theory:  Under some condition(s) rrdcached detects (incorrectly?) an error in an update, but still advances it's internal notion of the last_update_time.  This condition carries forward causing each subsequent update to get an error and push the last_update_time forward.  However, the internal last_update_time never gets written to the rrd file (which is a good thing, otherwise I would not be able to recover).  This condition persists across journal replays.

If you're still with me, any thoughts?

Steve

_______________________________________________
rrd-users mailing list
rrd-users at lists.oetiker.ch
https://lists.oetiker.ch/cgi-bin/listinfo/rrd-users