[rrd-users] rrdcached notion of last_update_time out of sync with rrd file
stevejankowski at yahoo.com
Wed Oct 12 01:08:27 CEST 2011
I'm encountering situations where rrdcached's notion of the last_update_time is out of sync with what's actually in the rrd file. This is leading to the loss of hours worth of data. I have a rough idea of what triggers the problem, but do not yet have steps to reproduce. I'll describe my usage and hopefully this will trigger some suspicions.
Environment: CentOS 5.4, rrdtool 1.4.3 / 1.4.5
(problem encountered with rrdtool 1.4.3. During debugging, I switched to 1.4.5 rrdcached using the journals created by 1.4.3)
Usage: All updates via rrdcached, rrd_fetch against subset of rrds running once a minute
Size: ~7800 rrds, each updated once a minute. Updates applied within one second.
My stats processing application sucks up a minute's worth of application stats each minute, converts them into "update" commands and writes them directly to rrdcached via unix socket. This app holds a long-lived connection to the rrdcached. A separate monitoring application scrapes stats from a subset of the rrds once a minute (using librrd) with the rrdcached daemon address configured (ie. it sends a flush to rrdcached prior to fetching).
This works great unless the stat processing application dies. The minutely application stats are buffered up and processed at high-speed when the stats processing app is restarted. (currently a manual restart, so there could be hours worth of stats to crunch). While catching up, in excess of 10,000 updates/sec are sent to rrdcached.
The updates appear to accepted by rrdcached (response to each update is "0 errors, enqueued 1 value(s)."). However, a large number of the rrds have no data for the whole catching up time while some have all the data. Rrd info on the problem files shows this (even after flush and rrdcached restart):
filename = "acme.rrd"
rrd_version = "0003"
step = 60
last_update = 1318330860
header_size = 8248
However, manually adding the next time results in this error:
update /var/acme/stats/rrd/customer/acme/acme_v1.rrd 1318330920:2741387100176:2744313219528:2344931134016:2321908650392:2366106587368:2384946112584:17:73:753673980704:771235491136:357218014544:348830922000:1743477:142642:36463:213:208:5:156248210:1312447
-1 illegal attempt to update using time 1318330920 when last update time is 1318334820 (minimum one second step)
Why does rrdcached think the last update time is 1318334820 when the file is 1318330860 ?
Restarting rrdcached has no effect. The only work-around I found was to shutdown all rrd-using apps, restart rrdcached (so journals get replayed/flushed), stop rrdcached, delete journals and restart rrdcached. Now the manual "update" command succeeds.
The only errors logged by rrdcached are of the " illegal attempt to update using time XXXX" variety.
Theory: Under some condition(s) rrdcached detects (incorrectly?) an error in an update, but still advances it's internal notion of the last_update_time. This condition carries forward causing each subsequent update to get an error and push the last_update_time forward. However, the internal last_update_time never gets written to the rrd file (which is a good thing, otherwise I would not be able to recover). This condition persists across journal replays.
If you're still with me, any thoughts?
More information about the rrd-users