[mrtg] Odd Behavior with mrtg + rrdtool

Tue Jun 3 17:50:47 CEST 2008

On Tue, 2008-06-03 at 10:40 +1200, Steve Shipway wrote:

> 
> A two second interval is extremely short! 8-o
> 
> I would suggest you check the obvious first
> - Are you using SNMPv2?  If not, do so, if possible.

I tried it both with and without SNMPv2. There was no perceptible
difference between the graphs.

> - Are you generating so much test traffic that the SNMP packets are being dropped?
> - With a 2sec interval, this can mean that the interval is smaller than the SNMP timeout or retries time.  Any delay would cause data to be skipped, and possibly interpolated or set to zero (do you have unknaszero set?)  Maybe your MRTG server has slow disks that cannot keep up with the IO stream and it needs to freeze occasionally to flush the output buffer, missing data polls?

I verified with Wireshark that every SNMP request sent out received a
response in a timely fashion (~1ms).

> I am guessing that the odd dips are when the counter wraps around, or rather when the MRTG or RRD code thinks it /might/ have wrapped around.  Setting to SNMPv2 will make this less frequent and less likely, although a 2sec poll is unlikely to be wrapping until some crazy number of gigabits per second.  Maybe the MRTG wrap detection code gets a bit dodgy at these high poll frequencies?

That was my initial reaction, but that doesn't seem to be the case. At
11.6Mbps, even a 32-bit octet counter should roll over only about once
every 49 minutes, not every few seconds! Also, with a constant data
stream, I would expect any dips due to rollover error to occur at a
fairly regular interval, and this was definitely not the case (see
description of symptoms below).
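
As a rough check of the rollover estimate, here is a minimal sketch of
the arithmetic. It assumes "Mbps" means 10^6 bits per second, a constant
traffic rate, and the standard counter widths (32-bit for ifInOctets and
ifOutOctets, 64-bit for the ifHC* counters available with SNMPv2). The
function name is only for illustration.

```python
def seconds_until_wrap(rate_bits_per_s: float, counter_bits: int) -> float:
    """Time for an octet counter to wrap at a constant traffic rate."""
    bytes_per_s = rate_bits_per_s / 8      # octet counters count bytes
    return 2 ** counter_bits / bytes_per_s

rate = 11.6e6  # 11.6 Mbps, the test traffic rate mentioned above

wrap32 = seconds_until_wrap(rate, 32)      # assumed 32-bit counter
wrap64 = seconds_until_wrap(rate, 64)      # assumed 64-bit counter

print(f"32-bit counter wraps after {wrap32 / 60:.1f} minutes")
print(f"64-bit counter wraps after {wrap64 / (3600 * 24 * 365):.0f} years")
```

For a 32-bit counter this comes to roughly 49 minutes, so a wrap cannot
account for dips that show up every few polls at a 1 or 2 second
interval.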

> If using SNMPv2 makes the dips disappear or occur less often, then it is probably a wraparound-detection error.  Similarly, if the dips disappear with lower poll frequencies then it might be because the normalisation routines get upset when the buckets are so small?  I'd need to pore over the code for hours to deduce any possible misbehaviour when the interval is so small.

> Hope this helps,
> 
> Steve

Not being at all familiar with the internals of MRTG, I won't speculate
on the specifics, but what I observed was that whenever a dip appeared
on the graph, MRTG appeared to have sent an 'extra' request inside an
interval.

e.g., with 2 second intervals:

SNMP-GET requests at 0s, 2s, 4s, 6s, 7s, 8s, 10s, etc.
The dip in the graph would correspond to the 6s-7s-8s sending event, and
the graphed rate would dip to 1/2 of the expected rate.

e.g., with 1 second intervals:

SNMP-GET requests at 0.0s, 1.0s, 2.0s, 2.1s, 3.0s, 4.0s, etc.
The dip in the graph would correspond to the 2.0s-2.1s-3.0s sending
event, and the graphed rate would dip to 1/10th of the expected rate.

The router responds to each request in a timely (~1ms) fashion. The
relation of the timing of the errant request to the interval size and
the dip in the graph is what really intrigues me.
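
One way to make that relation concrete is the sketch below. It is only
a hypothesis that happens to match the numbers above, not a statement
about how MRTG or rrdtool actually compute rates: it assumes constant
traffic and assumes that each counter delta is divided by the nominal
interval rather than by the real time since the previous poll. The
function name is made up for illustration.

```python
def apparent_fractions(poll_times, nominal_interval):
    """Fraction of the true rate seen at each poll, assuming constant traffic
    and assuming each counter delta is divided by the nominal interval."""
    return [
        (cur - prev) / nominal_interval
        for prev, cur in zip(poll_times, poll_times[1:])
    ]

# Poll times (in seconds) from the two examples above.
two_second = apparent_fractions([0, 2, 4, 6, 7, 8, 10], nominal_interval=2)
one_second = apparent_fractions([0.0, 1.0, 2.0, 2.1, 3.0, 4.0], nominal_interval=1)

print([round(f, 2) for f in two_second])
print([round(f, 2) for f in one_second])
```

Under those assumptions the lowest fraction is 0.5 in the 2 second case
and 0.1 in the 1 second case, which lines up with the observed dips to
1/2 and 1/10th of the expected rate.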

The factors that affected the dips appeared to be interval length
(longer intervals = fewer dips and proportionally shallower dips), and
number of OID pairs being polled (fewer OID pairs = fewer dips).

Unfortunately, I do not have any data to include with this email, as I
spent the last several days working around this problem with a custom
polling script in Perl. I will take some time later this week or this
weekend to reproduce this problem and send you Wireshark captures and
corresponding graphs from the rrd data to better illustrate this
phenomenon. If there is any other data you think might be useful for me
to include, please let me know. :)

Thank you for your time!

J.Williams