[rrd-users] Incorrect numbers returned when monitoring network stats at one second intervals
Mark Seger
Mark.Seger at hp.com
Thu Jul 26 12:17:59 CEST 2007
I had actually written this as a postscript to another topic and Alex
suggested starting another thread, but by then I had already replied.
Since there were no follow-up replies, I was thinking perhaps this note
got lost in the haze, so I'm reposting it as a new one.
>>> btw - just to toss in an interesting wrinkle: did you know that if
>>> you sample network statistics once a second you will periodically
>>> get an invalid value because of the frequency at which Linux updates
>>> its network counters? The only way I'm able to get accurate network
>>> statistics near that rate is to sample them every 0.9765 seconds. I
>>> can go into more detail if anyone really cares. 8-)
>>>
>>
>> I'm curious ...
>>
> ahh! I knew I'd get someone to ask... The trick is how easily I can
> explain this.
>
> It turns out that unlike most system counters, which get updated quite
> frequently, network counters only get updated about once a second, and
> not exactly once a second: they get updated every 0.9765 seconds. So
> consider the output of my collection tool at an interval of 0.2
> seconds. Just note that in the following format I'm reporting the
> aggregate across interfaces while doing a 'ping -f' on one of them.
> The rates for the different interfaces are being updated at different
> times, and that's why you're seeing the 8M/sec numbers aligning at
> .208 while the background traffic on a different interface is aligning
> at .409.
>
> # <-----------Network---------->
> #Time netKBi pkt-in netKBo pkt-out
> 09:41:14.809 0 0 0 0
> 09:41:15.009 0 0 0 0
> 09:41:15.209 8418 91729 8927 92564
> 09:41:15.409 61 945 2082 1585
> 09:41:15.609 0 0 0 0
> 09:41:15.809 0 0 0 0
> 09:41:16.009 7635 82294 7877 82464
> 09:41:16.209 0 0 0 0
> 09:41:16.409 0 0 0 0
> 09:41:16.609 0 0 0 0
> 09:41:16.809 1 4 1 4
> 09:41:17.009 8228 87659 8252 87639
> 09:41:17.209 0 0 0 0
> 09:41:17.409 94 1380 3042 2320
> 09:41:17.609 0 0 0 0
> 09:41:17.809 0 0 0 0
> 09:41:18.009 8598 92534 8854 92879
> 09:41:18.209 0 0 0 0
> 09:41:18.409 0 0 0 0
> 09:41:18.609 0 0 0 0
>
> Actually, here's a different form of the output by interface and I
> just did a grep on 'eth1':
>
> 09:44:06.408 2 eth1:     0 0     0 0 0 0 0    0    0
> 09:44:06.608 2 eth1: 75304 0 74949 0 0 0 0 7005 7155
> 09:44:06.808 2 eth1:     0 0     0 0 0 0 0    0    0
> 09:44:07.008 2 eth1:     0 0     0 0 0 0 0    0    0
> 09:44:07.208 2 eth1:     0 0     0 0 0 0 0    0    0
> 09:44:07.408 2 eth1:     0 0     0 0 0 0 0    0    0
> 09:44:07.609 2 eth1: 90796 0 91442 0 0 0 0 8407 8841
> 09:44:07.808 2 eth1:     0 0     0 0 0 0 0    0    0
> 09:44:08.008 2 eth1:     0 0     0 0 0 0 0    0    0
> 09:44:08.208 2 eth1:     0 0     0 0 0 0 0    0    0
> 09:44:08.408 2 eth1:     0 0     0 0 0 0 0    0    0
> 09:44:08.608 2 eth1: 80064 0 80599 0 0 0 0 7447 7805
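[A quick aside on the output above: the burst pattern is easy to reproduce with a minimal simulation. The sketch below is not collectl itself; it assumes a hypothetical counter that grows by 100 units at each kernel update (every 0.9765 seconds) and reads it every 0.2 seconds. The per-interval rates come out zero everywhere except the one 0.2-second bucket that happened to straddle an update, which is exactly why a whole second of traffic lands in a single row of the table.]

```python
# Minimal model: a counter the kernel refreshes every 0.9765 s,
# sampled every 0.2 s. The 100-units-per-update figure is made up
# purely for illustration.
UPDATE_INTERVAL = 0.9765

def counter_at(t):
    """Counter value visible at time t: 100 units per completed update."""
    return 100 * int(t / UPDATE_INTERVAL)

times = [i * 0.2 for i in range(25)]            # five seconds of samples
samples = [counter_at(t) for t in times]
rates = [b - a for a, b in zip(samples, samples[1:])]

for t, r in zip(times[1:], rates):
    print(f"{t:5.1f}  {r}")
# Most intervals print 0; each second's traffic shows up entirely in
# the single 0.2 s interval that crossed a kernel update.
```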
>
> Now that that's all said, let's look at another form of output, which
> includes CPU load and disk traffic:
>
> #            <--------CPU--------><-----------Disks-----------><-----------Network---------->
> #Time        cpu sys inter ctxsw  KBRead Reads KBWrit Writes   netKBi pkt-in netKBo pkt-out
> 09:47:00.007  47  27 17012 40654       0     0      0      0     1389  14902   1600   12420
> 09:47:01.007  56  34 18252 47138       0     0      0      0     1474  15850   1389   14339
> 09:47:02.007  56  30 18357 51876       0     0      0      0     1602  17152   1683   12679
> 09:47:03.007  49  29 16554 45260       0     0      0      0     1605  17296   1505   15350
> 09:47:04.007  58  31 18236 47319       0     0     60      6     1480  15918   1555   11205
> 09:47:05.007  58  28 20374 58415       0     0      0      0     1608  17529   1626   16640
> 09:47:06.007  53  30 17347 47579       0     0      0      0     3337  35896   3427   30139
> 09:47:07.007  51  33 16722 45189       0     0      0      0     1503  16109   1078    9882
> 09:47:08.007  52  27 17104 46796       0     0      0      0     1470  15858   1842   14803
> 09:47:09.007  56  29 18046 50817       0     0     12      2     1630  17448   1424   14190
> 09:47:10.007  50  27 18421 50895       0     0      0      0     1644  17739   1940   16003
>
> Look what happened at 09:47:06.007! The network traffic was reported
> at twice the rate it should have been. So what's going on? This is
> really subtle, but consider the case where the network stats are being
> updated every 0.9765 seconds while you sample those numbers every 1
> second. This is much easier to visualize as a picture, so I don't know
> if this will work very well here, but I'll try. Consider network
> counters being written as 100, 200, 300, 400 and 500. If you read
> your first sample before the counter is set to 200, you'll read 100.
> Then you take your next sample before the counter is updated to 300,
> and you read 200. You take your next sample AFTER the counter is
> updated to 400, and so you read 400. That means the counters you read
> once a second were 100, 200 and 400, and the rates you'll report are
> 100 and then 200!!! That's exactly what's happening here, and there
> isn't a tool around that can report the traffic correctly unless you
> sample at 0.9765 seconds, or at a high enough rate that this effect is
> not significant. Is that clear enough?
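[That walkthrough is easy to check numerically. The sketch below uses the same hypothetical 100-units-per-update counter as the earlier aside, nothing read from a real NIC: sample a counter refreshed every 0.9765 s once per second, and the per-second rates come out as a run of 100s with an occasional 200 where two kernel updates fell into one sampling interval.]

```python
UPDATE_INTERVAL = 0.9765   # how often the kernel refreshes the counter
SAMPLE_INTERVAL = 1.0      # how often we read it

def counter_at(t):
    # hypothetical counter: +100 units per completed kernel update
    return 100 * int(t / UPDATE_INTERVAL)

samples = [counter_at(i * SAMPLE_INTERVAL) for i in range(45)]
rates = [b - a for a, b in zip(samples, samples[1:])]

print(rates)
# Mostly 100s, but one interval reports 200: two updates (the jump from
# 4100 to 4300) landed between consecutive 1 s reads, so that second's
# rate is doubled, just like the 09:47:06.007 line above.
```

Running the same loop with a 0.9765 s sampling interval gives uniform rates, which is exactly the workaround described above.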
>
> In any event, now that I've gotten your attention, try downloading
> collectl from http://sourceforge.net/projects/collectl and take it
> for a spin. Try monitoring your network once a second and you'll see
> what I mean. You'll also find you can monitor a lot more than just
> the CPU, disk and network in my example above. You can look at
> memory, NFS, sockets, inodes, Lustre traffic and even InfiniBand.
> More importantly, you can tell it to generate its data in
> 'space-separated' format, which you could then even think about
> loading into rrd to get a broad historical profile of what your
> system was doing. The only thing is that there are potentially
> hundreds of counters (if you sample at the device level), so you
> could need a lot of storage to hold it all. Or I suppose you could
> turn the sampling rate down from 10 seconds to a minute or more if
> that's what you prefer.
>
> -mark
>> _______________________________________________
>> rrd-users mailing list
>> rrd-users at lists.oetiker.ch
>> https://lists.oetiker.ch/cgi-bin/listinfo/rrd-users
>>