[rrd-users] Incorrect numbers returned when monitoring network stats at one second intervals

Mark Seger Mark.Seger at hp.com
Thu Jul 26 15:22:29 CEST 2007

Simon Hobson wrote:
> Mark Seger wrote:
>> I had actually written this as a postscript to another topic and Alex
>> suggested starting another thread, but by then I had already replied. 
>> Since there were no follow-up replies I was thinking perhaps this note
>> got lost in the haze and so I'm reposting it as a new one
> It didn't, I posted a reply last Wednesday.
>>>  It turns out that unlike most system counters, which get updated quite
>>>  frequently, network counters only get updated about once a second but
>>>  not exactly once a second!  It turns out they get updated every 0.9765
>>>  seconds.  So consider the output of my collection tool at an interval
>>>  of 0.2 seconds.  Just note that in the following format, I'm reporting
>>>  the aggregate across interfaces while doing a 'ping -f' on one of
>>>  them.  The rates for the different interfaces are being updated at
>>>  different times and so that's why you're seeing the 8M/sec numbers
>>>  aligning at .208 while the background traffic on a different interface
>>>  is aligning at .409.
> Data snipped. So in summary - the stats are not updated exactly every 
> second, and different interfaces are updated at different times.
> It's well known that if you sample data asynchronously then you will 
> get this sort of effect unless your sample rate is significantly 
> different to the data rate. It's somewhat similar to the aliasing 
> problem when trying to sample a high frequency analogue signal at too 
> low a sample rate - for example. If you want that sort of precision 
> then you must synchronise your sampling with the data, or use some 
> other means (such as averaging or smoothing) to hide the effect.
I also agree.
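
To illustrate the averaging/smoothing option mentioned above, here is a minimal sketch of a trailing moving average applied to a series of per-second rates. The sample data and the function name moving_average are made up for illustration; this is not how collectl or rrdtool do it, just one simple way to hide a single doubled sample.

```python
def moving_average(values, window):
    """Trailing mean over (up to) the `window` most recent values."""
    out = []
    total = 0
    for i, v in enumerate(values):
        total += v
        if i >= window:
            # Drop the value that just fell out of the window.
            total -= values[i - window]
        out.append(total / min(i + 1, window))
    return out

# Hypothetical rates in MB/s: steady traffic with one doubled sample.
raw = [100, 100, 100, 200, 100, 100, 100, 100]
smooth = moving_average(raw, 4)
print(smooth)
```

The trade-off is the usual one: the spike is spread over the window rather than removed, so a real one-second burst is flattened in exactly the same way.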
> Sampling every second does not occasionally give you an invalid value 
> as you suggest - the value it gives is 100% valid, just unexpected ! 
> Just like a lot of 'amateur statistics' manage to come to invalid 
> conclusions with valid data.
I guess I have to differ on your conclusion.  When one has a tool that 
is reporting bytes/sec and it occasionally reports an invalid number 
like 200MB/sec on a 1G link, they at least owe an explanation to their 
users why this is the case.  When doing detailed analysis of system 
performance problems, particularly when using automated processes, 
having bogus network numbers is a bad thing.  Suppose you're running on 
a lightly loaded network and you see a spike of 75MB during one of the 
intervals.  Was this really 75MB that second or was it the result of 
incorrect reporting?  Since many people do not monitor at that fine-grained 
a level - and believe me, they have no idea how much they're 
losing by not doing so - I suspect very few people even notice.  I guess 
that's why I have a problem with any data sampled at 1 or even 5 minute 
intervals - it really doesn't tell me anything about what my system is 
really doing.
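
As a rough sketch of why such spikes appear, the following model samples a counter once a second while the counter itself only advances every 0.9765 seconds (the figure quoted earlier in this thread). The steady 100 MB/sec traffic level, the variable names, and the use of 0.1 ms integer ticks are all assumptions made for illustration; this is not collectl's code.

```python
# Times are in ticks of 0.1 ms so the arithmetic stays exact.
TICKS_PER_SEC = 10000
UPDATE_PERIOD = 9765        # counter refresh period: 0.9765 s (from the thread)
SAMPLE_PERIOD = 10000       # sampling interval: 1 s
TRUE_RATE = 100_000_000     # assumed constant traffic, bytes/sec

def counter_at(t):
    """Counter value visible at tick t; it only moves on whole update periods."""
    updates = t // UPDATE_PERIOD
    return updates * (TRUE_RATE * UPDATE_PERIOD // TICKS_PER_SEC)

def sampled_rates(n_samples):
    """Bytes/sec as a 1-second sampler would compute them from counter deltas."""
    rates = []
    prev = counter_at(0)
    for i in range(1, n_samples + 1):
        cur = counter_at(i * SAMPLE_PERIOD)
        rates.append((cur - prev) * TICKS_PER_SEC // SAMPLE_PERIOD)
        prev = cur
    return rates

rates = sampled_rates(200)
# Samples whose interval happened to contain two counter updates.
spikes = [i for i, r in enumerate(rates) if r > 1.5 * TRUE_RATE]
print(sorted(set(rates)), spikes)
```

Under these assumptions most samples read about 2% low (97.65 MB/sec), and roughly one sample in 42 contains two counter updates and so reads double (195.3 MB/sec), even though the traffic never changed. The total byte count is still correct; only its attribution to individual seconds is off.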

As a slight digression, and since this has been fixed it's really more 
illustrative in nature, how many people knew the disk i/o stats were 
busted prior to the 2.6.15 kernel?  They were NOT counting the bytes 
written to disk but rather bytes queued!  At the same time it was 
counting i/o to disk making the 2 numbers totally inconsistent with each 
other.  How did I find out?  I ran collectl with a sample interval of 1 
second and could literally see i/o rates as high as 500MB/sec to a 
single disk!

As it turns out, virtually all other Linux counters are updated at a 
high enough frequency that you can sample them at intervals much shorter 
than one second.
