[rrd-developers] Re: Cricket/RRDtool doesn't like restarting devices

Alex van den Bogaerdt alex at slot.hollandcasino.nl
Wed Jan 12 00:39:15 MET 2000

Bert Driehuis wrote:
> I've had an unrelated cause for spikes that render the charts of the
> history of my Squid server useless. RRDtool (at least in the way it is
> called by Cricket, but I believe this to be generally true) cannot cope
> with a restart of an SNMP agent. COUNTER objects wind up with huge

<IMO mode=humble>
It doesn't need to do so as it is the front-end that should cope with
this.  The front-end (Cricket in this case) should detect the restart
and write an unknown to RRDtool.

> values in this scenario:
> Time	Value	PDP Value
> t	200	0
> t+1	250	50
> t+2	300	50
>  [ agent restarts, counter goes to zero ]
> t+3	5	4.2e9

In addition to the above remarks, RRDtool can offer some protection if
you, or the front-end that asks RRDtool to create the database, tell
RRDtool to limit the counter values that are possible.  Clearly, when
the delta (the increase) is around 50 normally, 4200000000 is absurd.
Read the tutorial, especially the part about my car. It won't do
4200000000 km/h so I set a limit on the database input.
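
To make the arithmetic concrete, here is a minimal Python sketch of the
idea. It is not RRDtool's actual code. It assumes a 32-bit counter, a step
of one time unit, and a made-up upper limit of 1000 per step; the name
counter_rate and its parameters are hypothetical. In a real database the
limit is the max field of the data source definition passed to
"rrdtool create".

```python
from typing import Optional

WRAP_32 = 2 ** 32  # assumed counter width; the sample table implies 32 bits


def counter_rate(prev: int, cur: int, step: float,
                 max_rate: Optional[float] = None) -> Optional[float]:
    """Return the rate between two counter samples, or None for unknown."""
    delta = cur - prev
    if delta < 0:
        # A decrease is interpreted as a counter wrap, so add the modulus.
        delta += WRAP_32
    rate = delta / step
    if max_rate is not None and rate > max_rate:
        # Out-of-range rates become unknown instead of a spike.
        return None
    return rate


samples = [200, 250, 300, 5]  # the agent restarts before the last sample
pairs = list(zip(samples, samples[1:]))
unlimited = [counter_rate(a, b, 1) for a, b in pairs]
limited = [counter_rate(a, b, 1, max_rate=1000) for a, b in pairs]
print(unlimited)  # last entry is 4294967001.0, the spike from the table
print(limited)    # last entry is None, i.e. unknown
```

With the limit in place the restart shows up as a gap in the graph rather
than as a spike that flattens everything else.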

> This can easily be demonstrated by monitoring the value of
> system.sysUpTime.0 for a while (it will show a value of about 100), then
> restarting your snmpd. Squid will expose this behavior more easily than
> a router, as network gear tends to be rarely rebooted (well, Squid
> shouldn't die either, but my test server does every once in a while,
> usually due to pilot error on my side :-)

Indeed, sysUpTime could be monitored.  As you may or may not know, RRDtool
does not do any monitoring.  It is the front-end that performs this task.

> It could be an artifact of something else I've done wrong, but I think
> the code in rrd_update.c to deal with overflow is asking for trouble
> anyway. I've attached a diff that replaces that check with an assignment
> of NaN, and unless people object, ask Tobi to include it in the next
> release.

You may have guessed it already: I object.  Modifying working code to
mask problems in other code is bad practice.  It is generally unnecessary
and undesirable.

I do agree with previous threads on this list that the code could
(perhaps should) be expanded to allow for arbitrary wrapping values.
However, it should handle counter wraps, not resets.
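
As an illustration of what "arbitrary wrapping values" could mean, the
Python sketch below takes the counter modulus as a parameter. The name
wrapped_delta and the modulus values are hypothetical and not part of
RRDtool. Note that such a function cannot tell a wrap from a reset: both
appear as a decrease, which is why reset detection belongs in the
front-end.

```python
def wrapped_delta(prev: int, cur: int, modulus: int) -> int:
    """Increase from prev to cur for a counter that wraps at `modulus`."""
    if not (0 <= prev < modulus and 0 <= cur < modulus):
        raise ValueError("samples must lie in [0, modulus)")
    # Python's % returns a non-negative result for a positive modulus.
    return (cur - prev) % modulus


# Genuine wrap of a 16-bit counter: 65530 -> 4 is an increase of 10.
print(wrapped_delta(65530, 4, 2 ** 16))
# A reset (300 -> 5) looks the same to this function and gives a bogus delta.
print(wrapped_delta(300, 5, 2 ** 32))
```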

> Overflows are fairly rare, in my experience. If dealing with them is
> important, code needs to be added to Cricket to check to see if
> system.sysUpTime.0 has decreased since the previous sample, and in that
> case mark the sample with a tag to indicate that it is a valid sample,
> but should not be used for a comparison with the previous value. This
> would be pretty complicated to do right.


   if (current.sysUpTime < previous.sysUpTime) feed_U_to_RRDtool;

RRDtool receives the unknown value and thus the current interval is
invalid.  Then it receives the correct counter value and the next
interval will be known.  The only thing that needs care right now is
the update time: RRDtool cannot handle two samples with the same time
stamp.  This may be an improvement for the wish list, but it is also
easily worked around by feeding the U value at time NOW-1 and the
current counter value at NOW (this means updates cannot happen every
second.  Who cares?)
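
A front-end could implement this along the following lines. This is a
sketch only: updates_for_sample is a hypothetical helper, not Cricket
code. It merely builds the timestamp:value strings that would be passed to
"rrdtool update", and it assumes integer epoch timestamps and that
sysUpTime is polled together with the counter. The sample numbers are
invented.

```python
from typing import List


def updates_for_sample(now: int, counter: int,
                       uptime: int, prev_uptime: int) -> List[str]:
    """Build the update arguments for one poll of one counter."""
    updates = []
    if uptime < prev_uptime:
        # Agent restarted: feed an unknown at now-1 so the new counter
        # value is never compared with the pre-restart value.
        updates.append(f"{now - 1}:U")
    updates.append(f"{now}:{counter}")
    return updates


# Normal poll: uptime went up, so only the counter value is sent.
print(updates_for_sample(947633955, 300, 30000, 20000))
# Poll after a restart: uptime went down, so U is sent first.
print(updates_for_sample(947634255, 5, 100, 30000))
```

The second call yields two updates one second apart, which is exactly the
NOW-1 / NOW workaround described above.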

There have been discussions on this subject a number of times.  You
may want to pay a visit to the archives if you're interested.

 / alex at slot.hollandcasino.nl                  alex at ergens.op.het.net \
| work                                                         private |
| My employer is capable of speaking therefore I speak only for myself |
