[rrd-users] Bug? was RE: rddtool Heartbeat & Step

Thu Aug 9 04:19:11 MEST 2001

G'day,

> -----Original Message-----
> From: Blaise Lepeuple [mailto:blaise at yaga.com]
> Sent: Tuesday, August 07, 2001 2:10 PM
> To: don.baarda at baesystems.com
> Subject: rddtool Heartbeat & Step
> 
> 
> I'm sorry if you are the wrong person to ask this, but you 
> had your email
> address on the online man page for "rrd create".
> 
> If I should talk to somebody else, please redirect me to him.
[...]

It's been a while since I examined and understood the internals of RRD. I
did go through it a while ago and satisfied myself that it worked, and since
then have been content that it does what it is supposed to.

> So here is the creation of the rrd :
> 
> rrdtool create test.rrd -s 10 --start 997147699 DS:tik:COUNTER:10:0:U
> RRA:AVERAGE:0:1:10
[...]

I've just had a look at the man page that now includes my description. It
includes verbatim my stuff about heartbeat and step, but excluded the bit I
had on the end of that email about xff. I'll add it here for your info,
because it will help when you start making RRAs with steps > 1:

	You are right that "xff" has little effect if you have few "unknown"
PDPs, and setting "heartbeat" high is one way of reducing the number of
"unknown" PDPs. However, it is worth remembering that "unknowns" can happen
for other reasons. When setting "xff", you are deciding how many
"unknown" PDPs are acceptable when accumulating into an RRA, and an
"unknown" really means that rrd has no idea what the rate for the PDP was.
So "xff" is a "garbage threshold" for how much missing input data you can
tolerate when accumulating your data into "coarse grain", large "steps",
RRAs.

	When setting "heartbeat", you are specifying a requirement on your
samples. Remember that a long "heartbeat" means that you are happy for
multiple PDPs to be estimated from a single sample, which means the
individual PDPs are not really accurate. The nice thing about this though is
that these not-quite-accurate PDPs accumulate accurately. The individual
PDPs are estimated from the average rate over a longer period, hence when
you accumulate these PDPs into a single period, the average rate is correct
for that period. So "heartbeat" is a "garbage threshold" for how much
inaccuracy you can tolerate in your "fine grain", small "steps", RRAs.
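
To make the "accumulate accurately" point concrete, here is a rough Python
sketch. It is not rrdtool code; the function name fill_pdps and the sample
numbers are made up for illustration, and it assumes both samples fall
exactly on step boundaries.

```python
def fill_pdps(t0, v0, t1, v1, step):
    """Spread one counter delta evenly over every PDP between two samples.

    Hypothetical helper: assumes t0 and t1 are multiples of step.
    """
    rate = (v1 - v0) / (t1 - t0)          # average rate over the whole gap
    return [rate] * ((t1 - t0) // step)   # every PDP gets the same estimate


# One sample at t=0 (counter 0) and the next at t=30 (counter 60), step 10.
pdps = fill_pdps(0, 0, 30, 60, 10)

# Each PDP is only an estimate: the real traffic may all have arrived in the
# first 10 seconds. The total over the three PDPs is still the true delta.
total = sum(rate * 10 for rate in pdps)
average = sum(pdps) / len(pdps)
```

The individual entries in pdps may be wrong for their own 10 second slot,
but total and average are exact for the 30 second period as a whole.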

Note that the xff for your RRA is 0. This has no effect since steps=1 for
this RRA, and as I remember it xff only comes into effect when accumulating
multiple PDPs into an RRA.
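
As a rough illustration of xff as a threshold, here is a small Python sketch.
It is not taken from the rrdtool source; the function name
consolidate_average is hypothetical, and the exact comparison (strictly more
than xff * steps unknown PDPs makes the result unknown) is my assumption.

```python
def consolidate_average(pdps, xff):
    """Average a list of PDP rates into one RRA value.

    Unknown PDPs are given as None. The comparison below is an assumed
    reading of xff, not a copy of what rrdtool actually does.
    """
    unknown = sum(1 for p in pdps if p is None)
    if unknown > xff * len(pdps):
        return None                       # too much missing data
    known = [p for p in pdps if p is not None]
    if not known:
        return None                       # nothing to average
    return sum(known) / len(known)


# steps=4, xff=0.5: one unknown PDP out of four is tolerated.
tolerated = consolidate_average([1.0, None, 3.0, 2.0], 0.5)

# steps=1, xff=0: the single PDP passes straight through, known or not.
passthrough_known = consolidate_average([1.0], 0.0)
passthrough_unknown = consolidate_average([None], 0.0)
```

With steps=1 there is only one PDP to look at, so under this reading any
legal xff gives the same answer: a known PDP stays known and an unknown PDP
stays unknown.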

> Now if I do the measure for 20 a bit early or a bit late, I 
> would expect
> this PDP to have an unknown value since the interval for that 
> pdp exceeded
> the heartbeat.
> If it is late, I am getting the expected result :
> 
> rrdtool update test.rrd 997147700:0 997147710:10 997147721:21 
> 997147730:30
> 997147740:40
> 
> rrdtool fetch test.rrd AVERAGE --start 997147710 --end 997147740 :
> tik
> 
> 997147710: 1.0000000000e+00
> 997147720: nan
> 997147730: 1.0000000000e+00
> 997147740: 1.0000000000e+00
[...]

This is fine. The PDP for 997147711 -> 997147720 includes no known values,
and is hence unknown. The PDP for 997147721 -> 997147730 includes only 1 sec
of unknown time, which is less than heartbeat, and hence that PDP is known.

> rrdtool update test.rrd 997147700:0 997147710:10 997147719:19 
> 997147730:30
> 997147740:40
> 
> rrdtool fetch test.rrd AVERAGE --start 997147710 --end 997147740 :
> tik
> 
> 997147710: 1.0000000000e+00
> 997147720: nan
> 997147730: nan
> 997147740: 1.0000000000e+00
[...]

This looks wrong. You may have tripped over a bug in RRD. From my
understanding the last time I looked at RRD, the 997147720: output should
not be nan, since the period 997147711->997147720 has only 1 sec unknown, and
since 1 sec is less than heartbeat, that PDP should be OK.

> rrdtool update test.rrd 997147700:0 997147710:10 997147719:19 
> 997147729:29
> 997147740:40
> 
> rrdtool fetch test.rrd AVERAGE --start 997147710 --end 997147740 :
> tik
> 
> 997147710: 1.0000000000e+00
> 997147720: 1.0000000000e+00
> 997147730: nan
> 997147740: nan

This looks wrong too. 997147711->997147720 has 1 sec unknown, hence PDP OK.
997147721->997147730 has only 1 sec unknown too, so it should be known.
997147731->997147740 is all unknown, so unknown is correct there.

> On the other hand, I can stretch up to 18 seconds some 
> readings without
> affecting anything :
> 
> rrdtool update test.rrd 997147700:0 997147710:10 997147711:11 
> 997147729:29
> 997147730:30 997147740:40
> 
> rrdtool fetch test.rrd AVERAGE --start 997147710 --end 997147740 :
> tik
> 
> 997147710: 1.0000000000e+00
> 997147720: 1.0000000000e+00
> 997147730: 1.0000000000e+00
> 997147740: 1.0000000000e+00

Surprisingly, this is actually correct. 997147711->997147720 has 9 sec
unknown < step so known. 997147721->997147730 also has 9 sec unknown < step
so known. The large unknown period 997147712->997147729 still leaves
enough known values in the PDPs on each side for them both to be known.
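
The reasoning I have applied to the four update sequences above can be
sketched in Python as follows. This models my understanding only, not the
rrdtool source: a gap between updates longer than heartbeat is treated as
entirely unknown, and a PDP is called known when it holds fewer unknown
seconds than heartbeat (which equals step here). The function names are
made up for the example.

```python
def unknown_seconds_per_pdp(update_times, heartbeat, step, first_end, last_end):
    """Count the unknown seconds inside each PDP, keyed by PDP end time."""
    unknown = set()
    for prev, cur in zip(update_times, update_times[1:]):
        if cur - prev > heartbeat:
            # Gap too long: every second in (prev, cur] is unknown.
            unknown.update(range(prev + 1, cur + 1))
    counts = {}
    for end in range(first_end, last_end + 1, step):
        seconds = range(end - step + 1, end + 1)
        counts[end] = sum(1 for s in seconds if s in unknown)
    return counts


def pdp_known(update_times, heartbeat=10, step=10,
              first_end=997147710, last_end=997147740):
    """Assumed rule: known if fewer than `heartbeat` seconds are unknown."""
    counts = unknown_seconds_per_pdp(update_times, heartbeat, step,
                                     first_end, last_end)
    return {end: n < heartbeat for end, n in counts.items()}


B = 997147700  # base timestamp used in the examples above
late_20 = pdp_known([B, B + 10, B + 21, B + 30, B + 40])
early_20 = pdp_known([B, B + 10, B + 19, B + 30, B + 40])
early_20_30 = pdp_known([B, B + 10, B + 19, B + 29, B + 40])
stretched = pdp_known([B, B + 10, B + 11, B + 29, B + 30, B + 40])
```

Under this model late_20 and stretched agree with the fetch output shown,
while early_20 marks only 997147730 as unknown and early_20_30 marks only
997147740 as unknown, which is where the reported output differs.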

I've Cc'd this to the rrd-users list in case someone else can comment on the
presence/absence of a bug. Note that you are floating in the areas of a
possible "off by one" bug, and I recall seeing that one of these was fixed
at some point. What version of rrd are you running?

ABO
