[rrd-users] Incorrect numbers returned when monitoring network stats at one second intervals
Mark Seger
Mark.Seger at hp.com
Thu Jul 26 17:53:54 CEST 2007
Simon Hobson wrote:
> Mark Seger wrote:
>
>
>>> Sampling every second does not occasionally give you an invalid
>>> value as you suggest - the value it gives is 100% valid, just
>>> unexpected ! Just like a lot of 'amateur statistics' manage to come
>>> to invalid conclusions with valid data.
>>>
>> I guess I have to differ on your conclusion. When one has a tool
>> that is reporting bytes/sec and it occasionally reports an invalid
>> number like 200MB/sec on a 1G link, they at least owe an explanation
>> to their users why this is the case.
>>
>
> Which tools ? It's unclear from your previous postings what tools you
> are using to produce the figures.
>
This affects any tool that reports statistics from /proc/net/dev, such
as sar, iostat, etc. Since none of these allow sub-second monitoring, all
are affected. collectl, which is the tool I use, allows you to set
sub-second intervals down to the microsecond, though in practice no
system can do much better than milliseconds.
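To make the arithmetic concrete, here is a minimal sketch (fabricated numbers and my own helper names, not code from collectl or any other tool) of how dividing a /proc/net/dev counter delta by the nominal interval, rather than the time that actually elapsed between reads, inflates the reported rate:

```python
# A minimal sketch: parse two snapshots of /proc/net/dev and compute a
# receive rate.  The point is the denominator -- divide by the time that
# actually elapsed between reads, not the nominal interval, or a late
# wakeup inflates the rate.

def parse_net_dev(text):
    """Return {interface: rx_bytes} from /proc/net/dev contents."""
    stats = {}
    for line in text.splitlines()[2:]:               # skip the two header lines
        iface, data = line.split(':', 1)
        stats[iface.strip()] = int(data.split()[0])  # first column = rx bytes
    return stats

# Two fabricated snapshots, nominally 1 second apart, but actually read
# 1.25 s apart because the sampler woke up late.
snap_a = ("Inter-|   Receive\n"
          " face |bytes packets\n"
          "  eth0: 1000000000 800000\n")
snap_b = ("Inter-|   Receive\n"
          " face |bytes packets\n"
          "  eth0: 1125000000 900000\n")

a, b = parse_net_dev(snap_a), parse_net_dev(snap_b)
delta = b['eth0'] - a['eth0']         # 125 MB transferred between reads

naive_rate  = delta / 1.0             # assume the nominal 1 s interval
honest_rate = delta / 1.25            # use the measured elapsed time
print(naive_rate, honest_rate)        # 125 MB/s (wrong) vs 100 MB/s
```

The shorter the interval, the larger the relative error a fixed scheduling delay causes, which is one reason 1-second samples occasionally show rates well above what the link can carry.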
> Have you reported the issue to the package maintainers ?
>
There is nothing any of them can do about this, and probably far too many
of them are affected to even try. I was going to report it on the
kernel.org mailing list, since it's NOT a tool problem, but when I
looked at the maintainers list there were too many people working on
network-related things, and so I wimped out.
>> Since many people do not monitor at that fine grained of a level -
>> and believe me, they have no idea how much they're losing by not
>> doing so - I suspect very few people even notice. I guess that's
>> why I have a problem with any data sampled at 1 or even 5 minute
>> intervals - it really doesn't tell me anything about what my system
>> is really doing.
>>
>
>
> Personally I cannot see what is useful about such fine grained data
> (for most people and most systems). Even on what might normally be
> considered a 'steady' data flow, actual data rates will fluctuate
> wildly at that level of inspection. Very few network topologies are
> deterministic - ethernet certainly is not. Transit delays through
> routers are even less deterministic, not to mention all the other
> circuits a packet must pass through. Oh yes, did I omit to mention
> the task scheduler queue, disk i/o queue, network output queue, ...
> all these things will conspire to give a randomness to your output
> with a lot of variables - even an ntp update will have an effect as
> the task wakes up, sends a packet, waits for a response, and updates
> the status files on disk.
>
It would seem to me you've never worked on really large systems. For
example, ever wonder how long it takes to create 1M files? If you look
at the I/O patterns at very fine-grained intervals you can actually see
periodic stalls (I think they occur something like every 18 seconds or
so). If you have an application trying to do high-performance networking
and it's behaving poorly, you can detect network congestion problems.
I've seen NFS problems in which the system periodically hits very high
usage rates; this ultimately led me to learn about NFS spin-lock issues
in earlier 2.6 kernels. There are a lot more examples of this.
> I would reasonably expect the output of almost any real-world system
> to appear pseudo-random !
>
And you would be surprised to find that is not so. Admittedly, a lot of
counters vary wildly during the course of the day, but hidden inside them
you'd be amazed at the correlations that can be drawn between
performance numbers and system/application behavior. And correlating
those numbers with what's happening in system logs is yet a whole other
level of analysis.
-mark
> _______________________________________________
> rrd-users mailing list
> rrd-users at lists.oetiker.ch
> https://lists.oetiker.ch/cgi-bin/listinfo/rrd-users
>