[rrd-users] Incorrect numbers returned when monitoring network stats at one second intervals

Mark Seger Mark.Seger at hp.com
Thu Jul 26 17:53:54 CEST 2007



Simon Hobson wrote:
> Mark Seger wrote:
>
>   
>>> Sampling every second does not occasionally give you an invalid 
>>> value as you suggest - the value it gives is 100% valid, just 
>>> unexpected ! Just like a lot of 'amateur statistics' manage to come 
>>> to invalid conclusions with valid data.
>>>       
>> I guess I have to differ on your conclusion.  When one has a tool 
>> that is reporting bytes/sec and it occasionally reports an invalid 
>> number like 200MB/sec on a 1G link, they at least owe an explanation 
>> to their users why this is the case.
>>     
>
> Which tools ? It's unclear from your previous postings what tools you 
> are using to produce the figures.
>   
this affects any tool that reports statistics from /proc/net/dev, such 
as sar, iostat, etc.  since none of those allow sub-second monitoring, 
all are affected.  collectl, which is the tool I use, allows you to set 
sub-second intervals down to the microsecond, though in practice no 
system can do much better than milliseconds.
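for the curious, here's the basic idea in a few lines of python (an 
untested sketch - the interface name is a placeholder, and the point is 
dividing by the *measured* elapsed time rather than the nominal 
interval, not the parsing):

import time

def read_rx_bytes(iface="eth0"):
    # in /proc/net/dev the receive byte count is the first field
    # after the "iface:" prefix
    with open("/proc/net/dev") as f:
        for line in f:
            if line.strip().startswith(iface + ":"):
                return int(line.split(":", 1)[1].split()[0])
    raise ValueError("interface not found: " + iface)

prev_bytes = read_rx_bytes()
prev_time = time.monotonic()
for _ in range(50):
    time.sleep(0.1)                                # 100ms samples
    cur_bytes = read_rx_bytes()
    cur_time = time.monotonic()
    elapsed = cur_time - prev_time                 # measured, not nominal
    print("%8.3f MB/sec" % ((cur_bytes - prev_bytes) / elapsed / 1e6))
    prev_bytes, prev_time = cur_bytes, cur_time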
> Have you reported to the issue to the package maintainers ?
>   
there is nothing any of them can do about this and probably far too many 
of them are affected to even try.  I was going to report it on the 
kernel.org mailing list, since it's NOT a tool problem, but when I 
looked at the maintainers list there were too many people working on 
network related things and so I wimped out.
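to make the arithmetic concrete: suppose a driver only folds its 
hardware counters into /proc/net/dev every couple of seconds (an 
assumption for illustration - the exact update interval varies by 
driver).  a tool sampling once a second then sees the counter freeze 
and jump:

line_rate = 125e6            # bytes/sec a 1Gb link can actually move
publish_interval = 2.0       # assumed secs between counter updates

# two consecutive 1-second samples: nothing, then two seconds' worth
deltas = [0.0, line_rate * publish_interval]
for d in deltas:
    print("reported: %.0f MB/sec" % (d / 1.0 / 1e6))
# prints 0 MB/sec, then 250 MB/sec - a rate the link can't
# possibly sustain

the deltas are real, but divided against the wrong time base they 
produce rates the hardware can't deliver - which is exactly the kind of 
number users deserve an explanation for.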
>> Since many people do not monitor at that fine grained of a level - 
>> and believe me, they have no idea how much they're losing by not 
>> doing so - I suspect very few people even notice.  I guess that's 
>> why I have a problem with any data sampled at 1 or even 5 minute 
>> intervals - it really doesn't tell me anything about what my system 
>> is really doing.
>>     
>
>
> Personally I cannot see what is useful about such fine grained data 
> (for most people and most systems). Even on what might normally be 
> considered a 'steady' data flow, actual data rates will fluctuate 
> wildly at that level of inspection. Very few network topologies are 
> deterministic - ethernet certainly is not. Transit delays through 
> routers are even less deterministic, not to mention all the other 
> circuits a packet must pass through. Oh yes, did I omit to mention 
> the task scheduler queue, disk i/o queue, network output queue, ... 
> all these things will conspire to give a randomness to your output 
> with a lot of variables - even an ntp update will have an effect as 
> the task wakes up, sends a packet, waits for a response, and updates 
> the status files on disk.
>   
It would seem to me you've never worked on really large systems - for 
example, ever wonder how long it takes to create 1M files?  If you look 
at the i/o patterns at very low rates you can actually see periodic 
stalls (I think they are something like every 18 seconds or so).  If you 
have an application trying to do high-performance networking and it's 
behaving poorly, you can detect network congestion problems.  I've seen 
nfs problems in which the system would periodically hit very high usage 
rates.  This led me to ultimately learn about nfs spin lock issues in 
earlier 2.6 kernels.  There are a lot more examples of this.
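if you want to watch for that kind of stall yourself, something like 
this works (a rough sketch - the field position for sectors written 
assumes the 2.6 /proc/diskstats layout, and "sda" is just a 
placeholder):

import time

def sectors_written(dev="sda"):
    with open("/proc/diskstats") as f:
        for line in f:
            parts = line.split()
            if parts[2] == dev:
                return int(parts[9])   # sectors written, 2.6 layout
    raise ValueError("device not found: " + dev)

prev = sectors_written()
while True:
    time.sleep(0.1)                    # 10 samples/sec
    cur = sectors_written()
    if cur == prev:
        print("%s  stall - no writes this sample"
              % time.strftime("%H:%M:%S"))
    prev = cur

run that while creating your million files and the periodic stalls 
stand right out; at one- or five-minute sampling they average away to 
nothing.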
> I would reasonably expect the output of almost any real-world system 
> to appear pseudo-random !
>   
and you would be surprised to find that is not so.  admittedly a lot of 
counters vary wildly during the course of the day, but hidden inside them 
you'd be amazed at the correlations that can be drawn between 
performance numbers and system/application behavior.  and correlating 
these numbers with what's happening in system logs is yet a whole other 
level of analysis.
-mark


