[rrd-users] Incorrect numbers returned when monitoring network stats at one second intervals
Mark Seger
Mark.Seger at hp.com
Thu Jul 26 15:22:29 CEST 2007
Simon Hobson wrote:
> Mark Seger wrote:
>
>
>> I had actually written this as a postscript to another topic and Alex
>> suggested starting another thread, but by then I had already replied.
>> Since there were no follow-up replies I was thinking perhaps this note
>> got lost in the haze and so I'm reposting it as a new one
>>
>
> It didn't, I posted a reply last Wednesday.
>
>
>>> It turns out that unlike most systems counters which get updated quite
>>> frequently, network counters only get updated about once a second but
>>> not exactly once a second! It turns out they get updated every 0.9765
>>> seconds. So consider the output of my collection tool at an interval
>>> of 0.2 seconds. Just note that in the following format, I'm reporting
>>> the aggregate across interfaces while doing a 'ping -f' on one of
>>> them. The rates for the different interfaces are being updated at
>>> different times and so that's why you're seeing the 8M/sec numbers
>>> aligning at .208 while the background traffic on a different interface
>>> is aligning at .409.
>>>
>
> Data snipped. So in summary - the stats are not updated exactly every
> second, and different interfaces are updated at different times.
>
correct
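
A minimal sketch of the effect, assuming a counter that advances only every 0.9765 seconds (the figure above) under steady traffic of 8,000,000 bytes/sec (an illustrative value). The names counter_at and sampled_rates are hypothetical and are not taken from collectl or rrdtool. Polling at exactly 1 second usually sees one counter update per interval, but roughly once every 42 samples it sees two, and the reported rate doubles.

```python
import math

UPDATE_PERIOD = 0.9765   # seconds between counter updates (figure quoted in the thread)
TRUE_RATE = 8_000_000    # bytes/sec of steady traffic (assumed for illustration)
BYTES_PER_UPDATE = round(UPDATE_PERIOD * TRUE_RATE)

def counter_at(t):
    """Counter value visible at time t; it only advances at update ticks."""
    return math.floor(t / UPDATE_PERIOD) * BYTES_PER_UPDATE

def sampled_rates(interval, n_samples):
    """Rates a poller would report: counter delta divided by the sampling interval."""
    rates = []
    prev = counter_at(0.0)
    for i in range(1, n_samples + 1):
        cur = counter_at(i * interval)
        rates.append((cur - prev) / interval)
        prev = cur
    return rates

rates = sampled_rates(1.0, 100)
# Sample numbers whose interval happened to contain two counter updates.
doubled = [i for i, r in enumerate(rates, start=1) if r > 1.5 * TRUE_RATE]
print(sorted(set(rates)), doubled)
```

Only two distinct rates come out of this model: the normal one and exactly twice that, which is the kind of spike being discussed.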
> It's well known that if you sample data asynchronously then you will
> get this sort of effect unless your sample rate is significantly
> different to the data rate. It's somewhat similar to the aliasing
> problem when trying to sample a high frequency analogue signal at too
> low a sample rate - for example. If you want that sort of precision
> then you must synchronise your sampling with the data, or use some
> other means (such as averaging or smoothing) to hide the effect.
>
also agree
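
As a rough illustration of the averaging option mentioned above, here is a sketch of a trailing moving average over the last few samples. The function name moving_average and the window of 5 are assumptions chosen for the example, not something either tool is known to do. It spreads a doubled sample across the window instead of reporting it as a spike, at the cost of hiding real one-second bursts as well.

```python
from collections import deque

def moving_average(rates, window):
    """Average each sample with up to window-1 preceding samples."""
    buf = deque(maxlen=window)   # old samples fall out automatically
    smoothed = []
    for r in rates:
        buf.append(r)
        smoothed.append(sum(buf) / len(buf))
    return smoothed

# One doubled sample in otherwise steady traffic (made-up numbers).
raw = [100, 100, 100, 100, 200, 100, 100, 100, 100]
smoothed = moving_average(raw, 5)
print(smoothed)
```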
> Sampling every second does not occasionally give you an invalid value
> as you suggest - the value it gives is 100% valid, just unexpected!
> Just like a lot of 'amateur statistics' manage to come to invalid
> conclusions with valid data.
I guess I have to differ on your conclusion. When one has a tool that
is reporting bytes/sec and it occasionally reports an invalid number
like 200MB/sec on a 1G link, they at least owe an explanation to their
users why this is the case. When doing detailed analysis of system
performance problems, particularly when using automated processes,
having bogus network numbers is a bad thing. Suppose you're running on
a lightly loaded network and you see a spike of 75MB during one of the
intervals. Was this really 75MB that second or was it the result of
incorrect reporting? Since many people do not monitor at that fine-grained
a level - and believe me, they have no idea how much they're
losing by not doing so - I suspect very few people even notice. I guess
that's why I have a problem with any data sampled at 1 or even 5 minute
intervals - it really doesn't tell me anything about what my system is
really doing.
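
To make the arithmetic concrete: a 1 Gbit/sec link can carry at most 125,000,000 bytes/sec in one direction, so a reported 200MB/sec is impossible, while a 75MB spike is not. The sketch below assumes MB means 10^6 bytes, and the helper name is_plausible is hypothetical. It shows that a simple capacity check catches the first case but cannot tell whether the second is real.

```python
def is_plausible(rate_bytes_per_sec, link_bits_per_sec):
    """True if the reported rate does not exceed the link's raw capacity."""
    max_bytes_per_sec = link_bits_per_sec / 8
    return rate_bytes_per_sec <= max_bytes_per_sec

GIGABIT = 1_000_000_000  # bits/sec

print(is_plausible(200_000_000, GIGABIT))  # 200MB/sec on a 1G link
print(is_plausible(75_000_000, GIGABIT))   # 75MB/sec on a 1G link
```

A check like this can only reject numbers above line rate. A doubled sample on a lightly loaded link passes it, which is why the bogus value is hard to spot after the fact.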
As a slight digression, and since this has been fixed it's really more
illustrative in nature, how many people knew the disk i/o stats were
busted prior to the 2.6.15 kernel? They were NOT counting the bytes
written to disk but rather bytes queued! At the same time it was
counting i/o to disk making the 2 numbers totally inconsistent with each
other. How did I find out? I ran collectl with a sample interval of 1
second and could literally see i/o rates as high as 500MB/sec to a
single disk!
As it turns out, virtually all other linux counters are updated at a
high enough frequency that you can sample them at intervals much shorter
than one second.
-mark