[rrd-users] Incorrect numbers returned when monitoring network stats at one second intervals
Mark Seger
Mark.Seger at hp.com
Thu Jul 26 18:29:19 CEST 2007
Alex van den Bogaerdt wrote:
> On Thu, Jul 26, 2007 at 09:22:29AM -0400, Mark Seger wrote:
>
>
>> I guess I have to differ on your conclusion. When one has a tool that
>> is reporting bytes/sec and it occasionally reports an invalid number
>> like 200MB/sec on a 1G link, they at least owe an explanation to their
>> users why this is the case. When doing detailed analysis of system
>>
>
> I have to agree with Mark here. The numbers aren't wrong, your
> analysis of them is.
>
> The system is reporting counter values, not network rates. You are
> converting those values into rates and while doing so you find that
> there is a limitation.
>
> Don't forget: bytes per second is still an average and an approximation.
> What you are arguing about is equally valid for data measured per
> 1.0000000000000000000 seconds.
>
> What about a frame that is partially transmitted before you take
> a snapshot of the counters, and partially thereafter? What if three
> frames are transmitted in two units of time?
>
> Why would 1 second be good and 300 seconds not? After all, the error
> will increase as the time interval decreases, perhaps a 300 second
> interval is better than a 1 second interval.
>
In general, when doing long term monitoring I take a sample every 10
seconds. That interval is long enough that even when a read occasionally
lands late and returns a higher value, normalizing back to bytes/sec
makes the error far less noticeable and you'll never see a 200MB/sec
spike. My whole problem with a 300 second interval is that it's a
lifetime. You can easily have long periods of inactivity followed by
bursts of saturation. If that's what a typical day on your system looks
like, you could end up seeing an average network load, perhaps with an
occasional spike, that implies your network is doing just fine when in
fact you don't have adequate capacity for times of heavy load. On the
other hand, if your network is continuously being pounded you will see
that.
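To make that concrete, here's a toy sketch (made-up numbers, not real counters) of why you want to divide each counter delta by the time you actually measured, not the interval you asked for:

```python
# Toy snapshots of a cumulative byte counter: (timestamp_sec, bytes).
# Traffic is a steady 1 MB/s, but the third sample arrives 0.5 s late.
samples = [(0.0, 0), (1.0, 1_000_000), (2.5, 2_500_000), (3.5, 3_500_000)]

def naive_rates(snaps, nominal=1.0):
    # Dividing each delta by the *nominal* interval turns scheduling
    # jitter into a phantom spike.
    return [(b1 - b0) / nominal for (_, b0), (_, b1) in zip(snaps, snaps[1:])]

def true_rates(snaps):
    # Dividing by the measured elapsed time between snapshots does not.
    return [(b1 - b0) / (t1 - t0) for (t0, b0), (t1, b1) in zip(snaps, snaps[1:])]

print(naive_rates(samples))  # [1000000.0, 1500000.0, 1000000.0] -- a 50% spike
print(true_rates(samples))   # [1000000.0, 1000000.0, 1000000.0] -- steady
```

Same counters, same traffic; only the divisor changes, and the naive version reports a spike that never happened on the wire.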
> But what about peak values lost? Well, decrease your interval further
> from one second down to milliseconds or beyond, and eventually you
> will find that there is a byte transmitted, or there is not. So,
> the peak value will always be 100% of the link capacity.
>
I'm not sure what you mean. Here's an example of monitoring InfiniBand-
connected storage once every 5 seconds during the write of a 5GB file.
Notice how smooth the I/O rates are:
collectl -sx -oT -i5
waiting for 5 second sample...
# <----------InfiniBand---------->
#Time KBin pktIn KBOut pktOut Errs
12:10:26 20 195 110708 108267 0
12:10:31 18 181 107331 104960 0
12:10:36 18 174 103577 101293 0
And here it is again at a .5 second sample. This time the rates aren't
so smooth. That's because this is a shared system and others are using
the pipe. Even at 5 second sampling we're missing part of the picture,
and at the very least this tells me I could consider the time to write
a 5GB file as a valid test. I can assure you that on a dedicated system
even the sub-second numbers are much smoother:
collectl -sx -oT -i.5
waiting for .5 second sample...
# <----------InfiniBand---------->
#Time KBin pktIn KBOut pktOut Errs
12:11:01 17 168 100824 98597 0
12:11:02 22 205 116926 114352 0
12:11:02 19 194 117393 114796 0
12:11:03 16 130 67217 65747 0
12:11:03 23 250 150934 147594 0
12:11:04 21 180 100623 98408 0
12:11:04 18 174 104349 102148 0
12:11:05 22 238 130249 127382 0
12:11:05 17 168 100623 98400 0
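The smoothing you see at 5 seconds is just the window length doing its job. A quick toy simulation (made-up bursty traffic, nothing to do with the runs above) shows the same effect:

```python
import random

random.seed(1)
# Toy traffic: bytes arriving in bursty 100 ms slots over 60 seconds.
slots = [random.choice([0, 50_000, 200_000]) for _ in range(600)]

def rates(window_slots):
    # Aggregate 100 ms slots into windows and convert to bytes/sec.
    w = window_slots
    return [sum(slots[i:i + w]) / (w * 0.1) for i in range(0, len(slots), w)]

fine = rates(5)      # 0.5 s windows: rates swing with the bursts
coarse = rates(50)   # 5 s windows: the same traffic looks much smoother

spread = lambda r: max(r) - min(r)
print(spread(fine) > spread(coarse))  # True: longer windows average bursts away
```

The total byte count is identical in both cases; only the window over which it's averaged changes, which is exactly the shared-pipe effect above.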
In fact, here's another system on which I'm doing a ping -f. Look at
the traffic at 1 second, 0.1 second and 0.01 second intervals. You can
see some pretty solid numbers at 1 second, a little drift at 0.1,
probably because I have to actually run another process to get the
stats, and the 0.01 second numbers are questionable. The question is
whether these are real stalls or artifacts of the measurement mechanism
itself. In this case I'd suggest the mechanism, but even with its
problems we're NOT seeing any big spikes either.
[root at cag-bl460-03 ~]# collectl -sx -oTm
waiting for 1 second sample...
# <----------InfiniBand---------->
#Time KBin pktIn KBOut pktOut Errs
12:23:47.001 4769 2375 4769 2375 0
12:23:48.001 4719 2350 4719 2350 0
12:23:49.001 4669 2326 4668 2325 0
12:23:50.001 4819 2400 4819 2400 0
[root at cag-bl460-03 ~]# collectl -sx -oTm -i.1
waiting for .1 second sample...
# <----------InfiniBand---------->
#Time KBin pktIn KBOut pktOut Errs
12:23:58.409 4691 2336 4691 2336 0
12:23:58.502 4320 2161 4318 2150 0
12:23:58.609 5161 2570 5161 2570 0
12:23:58.702 4858 2419 4858 2419 0
12:23:58.809 5161 2570 5161 2570 0
12:23:58.902 4318 2150 4318 2150 0
[root at cag-bl460-03 ~]# collectl -sx -oTm -i.01
waiting for .01 second sample...
# <----------InfiniBand---------->
#Time KBin pktIn KBOut pktOut Errs
12:24:15.759 5906 2941 5906 2941 0
12:24:15.778 5284 2631 5284 2631 0
12:24:15.795 2953 1470 2953 1470 0
12:24:15.802 7171 3571 7171 3571 0
12:24:15.819 5906 2941 5906 2941 0
12:24:15.836 5906 2941 5906 2941 0
12:24:15.842 0 0 0 0 0
12:24:15.859 5906 2941 5906 2941 0
12:24:15.876 5906 2941 5906 2941 0
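One way the mechanism itself can manufacture those stalls: if the kernel folds bytes into the counter in chunks (say, under interrupt coalescing) and you sample faster than it updates, a window can land between updates and read zero, with the bytes showing up in the next window. A toy model, with entirely made-up update and chunk sizes:

```python
# Hypothetical: the counter is bumped 20 KB at a time every 4 ms,
# while we sample it every 10 ms -- steady traffic, lumpy counter.
TICK_MS, CHUNK = 4, 20_000

def counter_at(ms):
    # Cumulative bytes visible in the counter at time `ms`.
    return (ms // TICK_MS) * CHUNK

rates = []
for i in range(6):
    t0, t1 = i * 10, (i + 1) * 10                       # 10 ms windows
    rates.append((counter_at(t1) - counter_at(t0)) * 1000 // (t1 - t0))

# The true rate is a constant 5 MB/s (20 KB every 4 ms), but the samples
# flip between 4 MB/s and 6 MB/s purely because of where the windows land.
print(rates)  # [4000000, 6000000, 4000000, 6000000, 4000000, 6000000]
```

Make the update chunks big enough relative to the window and some windows read zero while their neighbors double, which is exactly the shape of the 0.01 second output above.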
> I think what you really want to know is latency and packet loss.
> There is no problem if the network is transmitting at (e.g.) 99 MBps
> on a 100 MBps link. That's why it's there. The problem occurs if
> you want to transmit data but have no capacity (resulting in a delay),
> or cannot reach the destination (e.g. dropped frames).
>
>
> my 2ct.
> Alex
>
> P.S.
> I won't respond to this thread for at least a couple of days. That
> is because of having limited internet access (if at all), not due to
> other reasons.
>
> _______________________________________________
> rrd-users mailing list
> rrd-users at lists.oetiker.ch
> https://lists.oetiker.ch/cgi-bin/listinfo/rrd-users
>