[rrd-users] Incorrect numbers returned when monitoring network stats at one second intervals
Mark Seger
Mark.Seger at hp.com
Thu Jul 26 18:29:19 CEST 2007
Alex van den Bogaerdt wrote:
> On Thu, Jul 26, 2007 at 09:22:29AM -0400, Mark Seger wrote:
>
>
>> I guess I have to differ on your conclusion. When one has a tool that
>> is reporting bytes/sec and it occasionally reports an invalid number
>> like 200MB/sec on a 1G link, they at least owe an explanation to their
>> users why this is the case. When doing detailed analysis of system
>>
>
> I have to agree with Mark here. The numbers aren't wrong, your
> analysis of them is.
>
> The system is reporting counter values, not network rates. You are
> converting those values into rates and while doing so you find that
> there is a limitation.
>
> Don't forget: bytes per second is still an average and an approximation.
> What you are arguing about is equally valid for data measured per
> 1.0000000000000000000 seconds.
>
> What about a frame that is partially transmitted before you take
> a snapshot of the counters, and partially thereafter? What if three
> frames are transmitted in two units of time?
>
> Why would 1 second be good and 300 seconds not? After all, the error
> will increase as the time interval decreases, perhaps a 300 second
> interval is better than a 1 second interval.
>
In general, when doing long term monitoring I take a sample every 10
seconds. That interval is long enough that even when a read occasionally
lands late and returns a higher value, normalizing back to bytes/sec
makes the error far less noticeable and you'll never see a 200MB/sec
spike. My whole problem with a 300 second interval is that it's a
lifetime. You can easily have long periods of inactivity followed by
bursts of saturation. If that's what a typical day on your system looks
like, you could end up seeing an average network load, perhaps with an
occasional spike, that implies your network is doing just fine when in
fact you don't have adequate capacity for times of heavy load. On the
other hand, if your network is continuously being pounded you will see
that.
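To make that concrete, here's a toy sketch (made-up numbers, not real counters) of why you want to divide each counter delta by the time you actually measured, not the interval you asked for:

```python
# Toy snapshots of a cumulative byte counter: (timestamp_sec, bytes).
# Traffic is a steady 1 MB/s, but the third sample arrives 0.5 s late.
samples = [(0.0, 0), (1.0, 1_000_000), (2.5, 2_500_000), (3.5, 3_500_000)]

def naive_rates(snaps, nominal=1.0):
    # Dividing each delta by the *nominal* interval turns scheduling
    # jitter into a phantom spike.
    return [(b1 - b0) / nominal for (_, b0), (_, b1) in zip(snaps, snaps[1:])]

def true_rates(snaps):
    # Dividing by the measured elapsed time between snapshots does not.
    return [(b1 - b0) / (t1 - t0) for (t0, b0), (t1, b1) in zip(snaps, snaps[1:])]

print(naive_rates(samples))  # [1000000.0, 1500000.0, 1000000.0] -- a 50% spike
print(true_rates(samples))   # [1000000.0, 1000000.0, 1000000.0] -- steady
```

Same counters, same traffic; only the divisor changes, and the naive version reports a spike that never happened on the wire.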
> But what about peak values lost? Well, decrease your interval further
> from one second down to milliseconds or beyond, and eventually you
> will find that there is a byte transmitted, or there is not. So,
> the peak value will always be 100% of the link capacity.
>
I'm not sure what you mean. Here's an example of monitoring InfiniBand-
connected storage once every 5 seconds during the write of a 5GB file.
Notice how smooth the I/O rates are:
collectl -sx -oT -i5
waiting for 5 second sample...
# <----------InfiniBand---------->
#Time KBin pktIn KBOut pktOut Errs
12:10:26 20 195 110708 108267 0
12:10:31 18 181 107331 104960 0
12:10:36 18 174 103577 101293 0
And here it is again at a .5 second sample. This time the rates aren't
so smooth. That's because this is a shared system and others are using
the pipe. Even at 5 second sampling we're missing part of the picture,
and at the very least this tells me I could consider the time to write
a 5GB file as a valid test. I can assure you that on a dedicated system
even the sub-second numbers are much smoother:
collectl -sx -oT -i.5
waiting for .5 second sample...
# <----------InfiniBand---------->
#Time KBin pktIn KBOut pktOut Errs
12:11:01 17 168 100824 98597 0
12:11:02 22 205 116926 114352 0
12:11:02 19 194 117393 114796 0
12:11:03 16 130 67217 65747 0
12:11:03 23 250 150934 147594 0
12:11:04 21 180 100623 98408 0
12:11:04 18 174 104349 102148 0
12:11:05 22 238 130249 127382 0
12:11:05 17 168 100623 98400 0
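The smoothing you see at 5 seconds is just the window length doing its job. A quick toy simulation (made-up bursty traffic, nothing to do with the runs above) shows the same effect:

```python
import random

random.seed(1)
# Toy traffic: bytes arriving in bursty 100 ms slots over 60 seconds.
slots = [random.choice([0, 50_000, 200_000]) for _ in range(600)]

def rates(window_slots):
    # Aggregate 100 ms slots into windows and convert to bytes/sec.
    w = window_slots
    return [sum(slots[i:i + w]) / (w * 0.1) for i in range(0, len(slots), w)]

fine = rates(5)      # 0.5 s windows: rates swing with the bursts
coarse = rates(50)   # 5 s windows: the same traffic looks much smoother

spread = lambda r: max(r) - min(r)
print(spread(fine) > spread(coarse))  # True: longer windows average bursts away
```

The total byte count is identical in both cases; only the window over which it's averaged changes, which is exactly the shared-pipe effect above.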
In fact, here's another system on which I'm doing a ping -f. Look at
the traffic at 1 second, 0.1 second and 0.01 second intervals. You can
see some pretty solid numbers at 1 second, a little drift at 0.1,
probably because I have to actually run another process to get the
stats, and the 0.01 second numbers are questionable. The question is
whether these are real stalls or artifacts of the measurement mechanism
itself. In this case I'd suggest the mechanism, but even with its
problems we're NOT seeing any big spikes either.
[root at cag-bl460-03 ~]# collectl -sx -oTm
waiting for 1 second sample...
# <----------InfiniBand---------->
#Time KBin pktIn KBOut pktOut Errs
12:23:47.001 4769 2375 4769 2375 0
12:23:48.001 4719 2350 4719 2350 0
12:23:49.001 4669 2326 4668 2325 0
12:23:50.001 4819 2400 4819 2400 0
[root at cag-bl460-03 ~]# collectl -sx -oTm -i.1
waiting for .1 second sample...
# <----------InfiniBand---------->
#Time KBin pktIn KBOut pktOut Errs
12:23:58.409 4691 2336 4691 2336 0
12:23:58.502 4320 2161 4318 2150 0
12:23:58.609 5161 2570 5161 2570 0
12:23:58.702 4858 2419 4858 2419 0
12:23:58.809 5161 2570 5161 2570 0
12:23:58.902 4318 2150 4318 2150 0
[root at cag-bl460-03 ~]# collectl -sx -oTm -i.01
waiting for .01 second sample...
# <----------InfiniBand---------->
#Time KBin pktIn KBOut pktOut Errs
12:24:15.759 5906 2941 5906 2941 0
12:24:15.778 5284 2631 5284 2631 0
12:24:15.795 2953 1470 2953 1470 0
12:24:15.802 7171 3571 7171 3571 0
12:24:15.819 5906 2941 5906 2941 0
12:24:15.836 5906 2941 5906 2941 0
12:24:15.842 0 0 0 0 0
12:24:15.859 5906 2941 5906 2941 0
12:24:15.876 5906 2941 5906 2941 0
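One way the mechanism itself can manufacture those stalls: if the kernel folds bytes into the counter in chunks (say, under interrupt coalescing) and you sample faster than it updates, a window can land between updates and read zero, with the bytes showing up in the next window. A toy model, with entirely made-up update and chunk sizes:

```python
# Hypothetical: the counter is bumped 20 KB at a time every 4 ms,
# while we sample it every 10 ms -- steady traffic, lumpy counter.
TICK_MS, CHUNK = 4, 20_000

def counter_at(ms):
    # Cumulative bytes visible in the counter at time `ms`.
    return (ms // TICK_MS) * CHUNK

rates = []
for i in range(6):
    t0, t1 = i * 10, (i + 1) * 10                       # 10 ms windows
    rates.append((counter_at(t1) - counter_at(t0)) * 1000 // (t1 - t0))

# The true rate is a constant 5 MB/s (20 KB every 4 ms), but the samples
# flip between 4 MB/s and 6 MB/s purely because of where the windows land.
print(rates)  # [4000000, 6000000, 4000000, 6000000, 4000000, 6000000]
```

Make the update chunks big enough relative to the window and some windows read zero while their neighbors double, which is exactly the shape of the 0.01 second output above.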
> I think what you really want to know is latency and packet loss.
> There is no problem if the network is transmitting at (e.g.) 99 MBps
> on a 100 MBps link. That's why it's there. The problem occurs if
> you want to transmit data but have no capacity (resulting in a delay),
> or cannot reach the destination (e.g. dropped frames).
>
>
> my 2ct.
> Alex
>
> P.S.
> I won't respond to this thread for at least a couple of days. That
> is because of having limited internet access (if at all), not due to
> other reasons.
>
> _______________________________________________
> rrd-users mailing list
> rrd-users at lists.oetiker.ch
> https://lists.oetiker.ch/cgi-bin/listinfo/rrd-users
>