[rrd-users] Re: odd spikes due to early resets

Tue Feb 20 23:36:47 MET 2001

G'day,

> -----Original Message-----
> From:	Clifton Royston [SMTP:cliftonr at lava.net]
> Sent:	Wednesday, February 21, 2001 6:32 AM
> To:	BAARDA, Don
> Cc:	'Tobias Weingartner'; 'Sasha Mikheev'; Matt Ashfield; RRD users
> Subject:	Re: [rrd-users] Re: odd spikes due to early resets
> 
> On Tue, Feb 20, 2001 at 09:51:02AM +1030, BAARDA, Don wrote:
> ...
> > Hence the probability of incorrectly interpreting a reset as a valid
> wrap
> > is;
> > 
> > 	err_probability = (specified_max / measurable_max)
> > 
> > 	where:
> > 
> > 	measurable_max = (counter_max / step)
> > 
> > 	For a typical application of a 32bit counter, 1Mbps interface, and
> > 5min step this works out as;
> > 
> > 	err_probability = ( (10^6 / 8) / ( 2^32 / 300) ) = 0.8% 
> > 
> > 	This is _very_ low!!! 
> 
> I have to disagree with you on this, if I understand what you are
> calculating!
> 
> A probability of nearly 1% in a given sample is a very high probability
> when you are doing samples on thousands of interfaces, thousands of
> times per day.  
> 
	I am not calculating the probability of there ever being a reset
mistaken for a counter wrap. As you say, even a very low probability of
failure per sample means the chance of at least one failure approaches 100%
as the number of samples approaches infinity. However, the "sample" in this
case is not a single RRD input value, but a single reset.

	I am calculating the probability that a single counter reset is
misinterpreted as a counter wrap. Hence a 1% probability means that, on
average, 1 in 100 resets will result in a false value. Note that this is
per reset, not per input value, so unless you are resetting your router 100
times a day, you are unlikely to see a single false value in a given day.
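
	To put numbers on the per-reset argument, here is a minimal sketch. It
assumes resets are independent events and uses the 1% per-reset figure from
the paragraph above; the function name is made up for illustration.

```python
def p_at_least_one_false(p_per_reset: float, resets: int) -> float:
    """Chance that at least one of `resets` independent resets is
    mistaken for a legitimate counter wrap."""
    return 1.0 - (1.0 - p_per_reset) ** resets


# The number of resets, not the number of RRD updates, drives the risk.
for resets in (1, 10, 100):
    print(resets, round(p_at_least_one_false(0.01, resets), 3))
```

With one reset the chance is simply the per-reset probability; it only
becomes likely once the number of resets gets into the hundreds.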

	Whether you can live with this depends on your requirements. The
alternative solution of using DERIVE with min=0 will result in 100% of
legitimate counter wraps being recorded as "Unknown". It depends on which you
consider the greater evil: recording a false value for 1 in 100 resets, or
recording "Unknown" for every single counter wrap.

> It brings the probability of invalid spikes in the data to near
> certainty - which is exactly what many users are complaining about and
> discussing remedies for.
> 
	Note that setting a max means any reading greater than that max is
considered invalid and marked "Unknown". Hence you can never have a "spike"
greater than the max setting.
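
	The difference between the two approaches can be sketched roughly as
follows. This is a simplified model written for this discussion, not rrdtool's
actual code: it handles only a 32bit counter, ignores heartbeat and step
normalisation, and the function and parameter names are invented.

```python
from typing import Optional

COUNTER_MAX_32 = 2 ** 32


def rate(prev: int, curr: int, step: int, ds_type: str,
         ds_min: Optional[float] = None,
         ds_max: Optional[float] = None) -> Optional[float]:
    """Return a per-second rate, or None to stand for "Unknown"."""
    delta = curr - prev
    if ds_type == "COUNTER" and delta < 0:
        # COUNTER assumes a decrease is a wrap and adds the counter range.
        delta += COUNTER_MAX_32
    value = delta / step
    # Anything outside [min, max] is discarded as "Unknown".
    if ds_min is not None and value < ds_min:
        return None
    if ds_max is not None and value > ds_max:
        return None
    return value


STEP = 300           # 5min step
MAX_1MBPS = 125000   # 1Mbps expressed in bytes per second

# A legitimate wrap: COUNTER recovers the rate, DERIVE with min=0 drops it.
wrap = (2 ** 32 - 1_000_000, 2_000_000)
print(rate(*wrap, STEP, "COUNTER", 0, MAX_1MBPS))
print(rate(*wrap, STEP, "DERIVE", 0))

# A reset from a mid-range counter value: far above max, so "Unknown".
print(rate(3_000_000_000, 1000, STEP, "COUNTER", 0, MAX_1MBPS))

# A reset that happened when the counter was close to wrapping anyway:
# the bogus rate falls under max and is accepted as a false value.
print(rate(2 ** 32 - 30_000_000, 1000, STEP, "COUNTER", 0, MAX_1MBPS))
```

Only the last case is the failure being argued about, and even there the
false value cannot exceed max.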

	When I discussed this off list with Alex van den Bogaerdt, he
pointed out that many people have high bandwidth pipes with typically low
traffic, i.e. 100Mbps pipes with <10Mbps traffic. With max set to the
100Mbps line rate, the above calculation equates to just under 80% chance of
a reset being misinterpreted as a wrap, and the resulting value usually
sticks out as an obvious spike above the normal traffic.
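
	For anyone who wants to plug in their own numbers, here is a small
sketch of the calculation. It assumes max is given in bits per second and the
counter counts bytes (as an interface octet counter does); the function name
and defaults are illustrative only. Note that the figures in this thread round
the per-Mbps value down to 0.8%; evaluated without rounding, the formula gives
a little under 0.9% for 1Mbps and a little over 87% for 100Mbps.

```python
def err_probability(max_bps: float, step_s: int = 300,
                    counter_bits: int = 32) -> float:
    """err_probability = specified_max / measurable_max, capped at 1."""
    specified_max = max_bps / 8                  # bytes per second
    measurable_max = 2 ** counter_bits / step_s  # bytes per second
    return min(1.0, specified_max / measurable_max)


for mbps in (1, 10, 100):
    print(f"{mbps:>3} Mbps max: {err_probability(mbps * 1e6):.1%}")
```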

	I've just sent Alex, off list, my summary of our discussions (which
he hasn't agreed with yet), and I'll repeat it here:

	If you cannot tolerate ever mistaking a counter reset for a
legitimate counter wrap, and would prefer "Unknown" for all legitimate
counter wraps and resets, always use DERIVE with min=0. Otherwise, using
COUNTER with a suitable max will return correct values for all legitimate
counter wraps and mark some counter resets as "Unknown", but can mistake
some counter resets for a legitimate counter wrap.

	For a 5min step and 32bit counter, the probability of mistaking a
counter reset for a legitimate wrap is arguably about 0.8% per 1Mbps of
maximum bandwidth. Note that this equates to 80% for 100Mbps interfaces, so
for high bandwidth interfaces and a 32bit counter, DERIVE with min=0 is
probably preferable. If you are using a 64bit counter, just about any max
setting will eliminate the possibility of mistaking a reset for a counter
wrap.
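
	Turning the formula around shows why the counter width matters so
much. The sketch below computes the largest max setting that keeps the
per-reset error under a chosen tolerance; the 1% tolerance is just an assumed
example, not a recommendation, and the function name is made up.

```python
def max_bps_for_tolerance(tolerance: float, step_s: int = 300,
                          counter_bits: int = 32) -> float:
    """Largest max (in bits per second) for which a reset is mistaken
    for a wrap with probability no higher than `tolerance`."""
    measurable_max = 2 ** counter_bits / step_s  # bytes per second
    return tolerance * measurable_max * 8


for bits in (32, 64):
    limit = max_bps_for_tolerance(0.01, counter_bits=bits)
    print(f"{bits}bit counter: max up to {limit:.3e} bps")
```

For a 32bit counter the limit is only a little over 1Mbps, while for a 64bit
counter it is far beyond any real interface speed.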

	ABO
