[rrd-users] Incorrect numbers returned when monitoring network stats at one second intervals
Mark Seger
Mark.Seger at hp.com
Thu Jul 26 12:17:59 CEST 2007
I had actually written this as a postscript to another topic and Alex
suggested starting another thread, but by then I had already replied.
Since there were no follow-up replies, I was thinking perhaps this note
got lost in the haze, so I'm reposting it as a new one.
>>> btw - just to toss in an interesting wrinkle: did you know that if
>>> you sample network statistics once a second you will periodically
>>> get an invalid value because of the frequency at which Linux updates
>>> its network counters? The only way I'm able to get accurate network
>>> statistics near that rate is to sample them every 0.9765 seconds. I
>>> can go into more detail if anyone really cares. 8-)
>>>
>>
>> I'm curious ...
>>
> ahh! I knew I'd get someone to ask... The trick is how easily I can
> explain this.
>
> It turns out that unlike most system counters, which get updated quite
> frequently, network counters only get updated about once a second, and
> not exactly once a second: they get updated every 0.9765 seconds. So
> consider the output of my collection tool at an interval of 0.2
> seconds. Just note that in the following format I'm reporting the
> aggregate across interfaces while doing a 'ping -f' on one of them.
> The rates for the different interfaces are being updated at different
> times, and that's why you're seeing the 8M/sec numbers aligning at
> .208 while the background traffic on a different interface is aligning
> at .409.
>
> # <-----------Network---------->
> #Time netKBi pkt-in netKBo pkt-out
> 09:41:14.809 0 0 0 0
> 09:41:15.009 0 0 0 0
> 09:41:15.209 8418 91729 8927 92564
> 09:41:15.409 61 945 2082 1585
> 09:41:15.609 0 0 0 0
> 09:41:15.809 0 0 0 0
> 09:41:16.009 7635 82294 7877 82464
> 09:41:16.209 0 0 0 0
> 09:41:16.409 0 0 0 0
> 09:41:16.609 0 0 0 0
> 09:41:16.809 1 4 1 4
> 09:41:17.009 8228 87659 8252 87639
> 09:41:17.209 0 0 0 0
> 09:41:17.409 94 1380 3042 2320
> 09:41:17.609 0 0 0 0
> 09:41:17.809 0 0 0 0
> 09:41:18.009 8598 92534 8854 92879
> 09:41:18.209 0 0 0 0
> 09:41:18.409 0 0 0 0
> 09:41:18.609 0 0 0 0
>
> Actually, here's a different form of the output by interface and I
> just did a grep on 'eth1':
>
> 09:44:06.408 2 eth1:     0 0     0 0 0 0 0    0    0
> 09:44:06.608 2 eth1: 75304 0 74949 0 0 0 0 7005 7155
> 09:44:06.808 2 eth1:     0 0     0 0 0 0 0    0    0
> 09:44:07.008 2 eth1:     0 0     0 0 0 0 0    0    0
> 09:44:07.208 2 eth1:     0 0     0 0 0 0 0    0    0
> 09:44:07.408 2 eth1:     0 0     0 0 0 0 0    0    0
> 09:44:07.609 2 eth1: 90796 0 91442 0 0 0 0 8407 8841
> 09:44:07.808 2 eth1:     0 0     0 0 0 0 0    0    0
> 09:44:08.008 2 eth1:     0 0     0 0 0 0 0    0    0
> 09:44:08.208 2 eth1:     0 0     0 0 0 0 0    0    0
> 09:44:08.408 2 eth1:     0 0     0 0 0 0 0    0    0
> 09:44:08.608 2 eth1: 80064 0 80599 0 0 0 0 7447 7805
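[A quick aside on the output above: the burst pattern is easy to reproduce with a minimal simulation. The sketch below is not collectl itself; it assumes a hypothetical counter that grows by 100 units at each kernel update (every 0.9765 seconds) and reads it every 0.2 seconds. The per-interval rates come out zero everywhere except the one 0.2-second bucket that happened to straddle an update, which is exactly why a whole second of traffic lands in a single row of the table.]

```python
# Minimal model: a counter the kernel refreshes every 0.9765 s,
# sampled every 0.2 s. The 100-units-per-update figure is made up
# purely for illustration.
UPDATE_INTERVAL = 0.9765

def counter_at(t):
    """Counter value visible at time t: 100 units per completed update."""
    return 100 * int(t / UPDATE_INTERVAL)

times = [i * 0.2 for i in range(25)]            # five seconds of samples
samples = [counter_at(t) for t in times]
rates = [b - a for a, b in zip(samples, samples[1:])]

for t, r in zip(times[1:], rates):
    print(f"{t:5.1f}  {r}")
# Most intervals print 0; each second's traffic shows up entirely in
# the single 0.2 s interval that crossed a kernel update.
```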
>
> Now that that's all said, let's look at another form of output, which
> includes CPU load and disk traffic:
>
> #            <--------CPU--------><-----------Disks-----------><-----------Network---------->
> #Time        cpu sys inter ctxsw  KBRead Reads KBWrit Writes   netKBi pkt-in netKBo pkt-out
> 09:47:00.007  47  27 17012 40654       0     0      0      0     1389  14902   1600   12420
> 09:47:01.007  56  34 18252 47138       0     0      0      0     1474  15850   1389   14339
> 09:47:02.007  56  30 18357 51876       0     0      0      0     1602  17152   1683   12679
> 09:47:03.007  49  29 16554 45260       0     0      0      0     1605  17296   1505   15350
> 09:47:04.007  58  31 18236 47319       0     0     60      6     1480  15918   1555   11205
> 09:47:05.007  58  28 20374 58415       0     0      0      0     1608  17529   1626   16640
> 09:47:06.007  53  30 17347 47579       0     0      0      0     3337  35896   3427   30139
> 09:47:07.007  51  33 16722 45189       0     0      0      0     1503  16109   1078    9882
> 09:47:08.007  52  27 17104 46796       0     0      0      0     1470  15858   1842   14803
> 09:47:09.007  56  29 18046 50817       0     0     12      2     1630  17448   1424   14190
> 09:47:10.007  50  27 18421 50895       0     0      0      0     1644  17739   1940   16003
>
> Look what happened at 09:47:06.007! The network traffic was reported
> at twice the rate it should have been. So what's going on? This is
> really subtle, but consider the case where the network stats are being
> updated every 0.9765 seconds while you sample those numbers every 1
> second. This is much easier to visualize as a picture, so I don't know
> if this will work very well here, but I'll try. Consider network
> counters being written as 100, 200, 300, 400 and 500. If you read
> your first sample before the counter is set to 200, you'll read 100.
> Then you take your next sample before the counter is updated to 300,
> and you read 200. You take your next sample AFTER the counter is
> updated to 400, and so you read 400. That means the counters you read
> once a second were 100, 200 and 400, and the rates you'll report are
> 100 and then 200!!! That's exactly what's happening here, and there
> isn't a tool around that can report the traffic correctly unless you
> sample at 0.9765 seconds, or at a high enough rate that this effect is
> not significant. Is that clear enough?
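[That walkthrough is easy to check numerically. The sketch below uses the same hypothetical 100-units-per-update counter as the earlier aside, nothing read from a real NIC: sample a counter refreshed every 0.9765 s once per second, and the per-second rates come out as a run of 100s with an occasional 200 where two kernel updates fell into one sampling interval.]

```python
UPDATE_INTERVAL = 0.9765   # how often the kernel refreshes the counter
SAMPLE_INTERVAL = 1.0      # how often we read it

def counter_at(t):
    # hypothetical counter: +100 units per completed kernel update
    return 100 * int(t / UPDATE_INTERVAL)

samples = [counter_at(i * SAMPLE_INTERVAL) for i in range(45)]
rates = [b - a for a, b in zip(samples, samples[1:])]

print(rates)
# Mostly 100s, but one interval reports 200: two updates (the jump from
# 4100 to 4300) landed between consecutive 1 s reads, so that second's
# rate is doubled, just like the 09:47:06.007 line above.
```

Running the same loop with a 0.9765 s sampling interval gives uniform rates, which is exactly the workaround described above.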
>
> In any event, now that I've gotten your attention, try downloading
> collectl from http://sourceforge.net/projects/collectl and take it
> for a spin. Try monitoring your network once a second and you'll see
> what I mean. You'll also find you can monitor a lot more than just
> the CPU, disk and network in my example above. You can look at
> memory, NFS, sockets, inodes, Lustre traffic and even InfiniBand.
> More importantly, you can tell it to generate its data in
> 'space-separated' format, which you could then even think about
> loading into rrd to get a broad historical profile of what your
> system was doing. The only thing is that there are potentially
> hundreds of counters (if you sample at the device level), so you
> could need a lot of storage to hold it all. Or I suppose you could
> turn the sampling rate down from 10 seconds to a minute or more if
> that's what you prefer.
>
> -mark
>> _______________________________________________
>> rrd-users mailing list
>> rrd-users at lists.oetiker.ch
>> https://lists.oetiker.ch/cgi-bin/listinfo/rrd-users
>>