[rrd-users] trying to understand the relationship between source data, what's in rrd and what gets plotted
Mark Seger
Mark.Seger at hp.com
Wed Jul 25 15:58:02 CEST 2007
Simon Hobson wrote:
> Mark Seger wrote:
>
>
>>> That is because hh:mm:06, hh:mm:16, hh:mm:26 and so on are not a whole
>>> multiple of 10 seconds.
>>>
>>> You have "n*step+offset", not "n*step". This is why normalization is
>>> needed.
>>>
>>>
>>>
>>>
>>>> As I said above it sounds like if I conform my data to align to the time
>>>> boundary conditions rrd requires it should work and if I don't conform it
>>>> won't.
>>>>
>>>>
>>> No. Your step size is wrong, not your input. Change your step size
>>> to 1, 2, 3 or 6 seconds.
>>>
>>>
>> so if I understand what you're suggesting, I should pick a start time and
>> step size such that my data will align accordingly, right? Since I have
>> samples at 00:01:06, 00:01:16, etc. that would mean I should pick a time
>> that lands on a minute boundary and a step of 2, because 00:01:02,
>> 00:01:04, 00:01:06, etc. will still hit all my timestamps. 1 sec would
>> work too but that would be overkill. I don't think 3 or 6 would do it
>> because they would not all align. 00:01:06 would, but you'd never see
>> 00:01:16.
>>
>
> Not quite - FORGET THE MINUTE BOUNDARIES
>
yes, I realize that but kept using it to simplify my examples. I
learned a lot about time boundaries and alignment when I wanted to get
my tool to align to the closest millisecond. 8-)
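To sanity-check the alignment claims being tossed around here, a toy script can test which candidate step sizes evenly divide timestamps taken at :06, :16, :26 past the minute (the timestamps and candidate steps below are made up to match the discussion, not taken from my real data):

```python
# Timestamps in seconds past the hour for samples taken at 00:01:06,
# 00:01:16, 00:01:26, 00:01:36 (a made-up continuation of my data).
timestamps = [66, 76, 86, 96]

# A step size "fits" only when every sample time is a whole multiple of it,
# i.e. falls on n*step seconds since the epoch with no offset.
fitting = [step for step in (1, 2, 3, 5, 6, 10)
           if all(ts % step == 0 for ts in timestamps)]

print(fitting)
```

Only 1 and 2 survive, which is why a 2-second step was the answer.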
> rrdtool uses samples that are a multiple of "step" seconds since the Unix
> epoch - you can easily pick steps whose sample times do not fall on minute
> boundaries (whilst a step of 7 would not be very common, most of its
> sample times would not fall on a minute boundary).
>
> But you are correct that steps of 3 or 6 will not get you 10 second intervals.
>
>
>> so let's say I have 3 samples of 100, 1000 and 100 starting at
>> 00:01:06. Since these are absolute numbers for 10 second intervals,
>> they really represent rates of 10/sec, 100/sec and 10/sec. Am I then
>> correct in assuming that rrd will then normalize it into 15 slots with
>> 20/slot for the first 5, 200 for the next 5 and then 20 for the next 5,
>> all aligned to 00:01:00?
>>
>
> Actually 10/s is 10/s - not 20/s ! 10/s * 2s would get you 20.
>
that's what I was trying to say 8-)
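To make that arithmetic concrete, here's a little sketch of the normalization as I understand it (my mental model of what rrd does with these numbers, not rrdtool's actual code):

```python
# Three 10-second samples of 100, 1000 and 100 packets are rates of
# 10/s, 100/s and 10/s; with a 2-second step, each slot then holds
# rate * 2 packets.
STEP = 2                 # seconds per rrd slot
INTERVAL = 10            # seconds covered by each raw sample
counts = [100, 1000, 100]

slots = []
for count in counts:
    rate = count / INTERVAL            # e.g. 100 pkts / 10 s = 10/s
    per_slot = rate * STEP             # 10/s * 2 s = 20 pkts per slot
    slots += [per_slot] * (INTERVAL // STEP)

print(slots)
```

Reading any of the 20-packet slots back as a rate gives 20 / 2s = 10/sec, and the 200-packet slots give 100/sec, which is the round trip being described below.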
>> so starting at 01:00 the data would look like
>> 20 20 20 20 20 200 200 200 200 200 20 20 20 20 20. If I then wanted
>> to see what the rate is at 01:06, rrd would see a value in that 2 second
>> slot of 20 and treat it as a rate of 10/sec. The same would hold for
>> any of the 200s, which would be reported as 100/sec for the slots they
>> occur in, right?
>>
>> this is certainly a lot closer to what I was looking for and gets back
>> to really clarifying my original question, which was the subject of this
>> thread. I guess the negatives here are you have to be really careful to
>> pick the right time and step size, and if your samples don't land on
>> integral time boundaries all bets are off (what if my samples were at
>> 00:01:06.5, 00:01:12.5, etc.?). It would also make my rrd database 5
>> times bigger, and it's already over 10MB for 1 day's worth of data.
>>
>
> An alternative for handling your historical data might be to simply
> 'lie' about the timestamps! Eg, for your 00:01:06 sample, insert it
> with a timestamp of 00:01:00, 00:01:16 as 00:01:10 and so on. You'll
> have a slight blip as you change to actually collecting the data on
> 10s steps (instead of n*10+6 steps) but it would allow you to graph
> your historical data without going to 2s steps.
>
yes, that would work, but it would also mean one would need to remember that.
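For what it's worth, the remapping being suggested is just a floor to the step boundary; a quick sketch (the rrd filename and counter values here are made up for illustration):

```python
# Snap each historical sample down to the previous 10-second boundary
# before handing it to "rrdtool update", per the suggestion above.
STEP = 10  # rrd step size in seconds

def snap(ts):
    """Round a Unix timestamp down to the previous multiple of STEP."""
    return ts - (ts % STEP)

# (unix_time, counter_value) pairs collected at n*10+6 seconds
samples = [(66, 100), (76, 1100), (86, 1200)]

for ts, value in samples:
    print(f"rrdtool update net.rrd {snap(ts)}:{value}")
```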
Something I just remembered that kind of shoots a hole in this
discussion is sampling drift. The data I collected and was using for my
tests drifted 4 seconds over the course of the day, and I don't think any
solution will exactly address that. Now that I align my data to my
interval (rrd's step), even if it's in milliseconds, this is all moot.
However, the discussion has been very helpful.
>> btw - just to toss in an interesting wrinkle, did you know if you sample
>> network statistics once a second you will periodically get an invalid
>> value because of the frequency at which Linux updates its network
>> counters? The only way I'm able to get accurate network statistics near
>> that rate is to sample them every 0.9765 seconds. I can go into more
>> detail if anyone really cares. 8-)
>>
>
> I'm curious ...
>
ahh! I knew I'd get someone to ask... The trick is how easily I can
explain this.
It turns out that unlike most system counters, which get updated quite
frequently, network counters only get updated about once a second, but
not exactly once a second! It turns out they get updated every 0.9765
seconds. So consider the output of my collection tool at an interval of
0.2 seconds. Just note that in the following format I'm reporting the
aggregate across interfaces while doing a 'ping -f' on one of them. The
rates for the different interfaces are being updated at different times,
and that's why you're seeing the 8MB/sec numbers aligning at .209 while
the background traffic on a different interface is aligning at .409.
# <-----------Network---------->
#Time netKBi pkt-in netKBo pkt-out
09:41:14.809 0 0 0 0
09:41:15.009 0 0 0 0
09:41:15.209 8418 91729 8927 92564
09:41:15.409 61 945 2082 1585
09:41:15.609 0 0 0 0
09:41:15.809 0 0 0 0
09:41:16.009 7635 82294 7877 82464
09:41:16.209 0 0 0 0
09:41:16.409 0 0 0 0
09:41:16.609 0 0 0 0
09:41:16.809 1 4 1 4
09:41:17.009 8228 87659 8252 87639
09:41:17.209 0 0 0 0
09:41:17.409 94 1380 3042 2320
09:41:17.609 0 0 0 0
09:41:17.809 0 0 0 0
09:41:18.009 8598 92534 8854 92879
09:41:18.209 0 0 0 0
09:41:18.409 0 0 0 0
09:41:18.609 0 0 0 0
Actually, here's a different form of the output, by interface, where I just
did a grep on 'eth1':
09:44:06.408 2 eth1: 0 0 0 0 0 0 0 0 0
09:44:06.608 2 eth1: 75304 0 74949 0 0 0 0 7005 7155
09:44:06.808 2 eth1: 0 0 0 0 0 0 0 0 0
09:44:07.008 2 eth1: 0 0 0 0 0 0 0 0 0
09:44:07.208 2 eth1: 0 0 0 0 0 0 0 0 0
09:44:07.408 2 eth1: 0 0 0 0 0 0 0 0 0
09:44:07.609 2 eth1: 90796 0 91442 0 0 0 0 8407 8841
09:44:07.808 2 eth1: 0 0 0 0 0 0 0 0 0
09:44:08.008 2 eth1: 0 0 0 0 0 0 0 0 0
09:44:08.208 2 eth1: 0 0 0 0 0 0 0 0 0
09:44:08.408 2 eth1: 0 0 0 0 0 0 0 0 0
09:44:08.608 2 eth1: 80064 0 80599 0 0 0 0 7447 7805
Now that that's all said, let's look at another form of output which
includes CPU load and disk traffic:
#            <--------CPU--------><-----------Disks-----------><-----------Network---------->
#Time        cpu sys inter ctxsw  KBRead Reads KBWrit Writes   netKBi pkt-in netKBo pkt-out
09:47:00.007  47  27 17012 40654       0     0      0      0     1389  14902   1600   12420
09:47:01.007  56  34 18252 47138       0     0      0      0     1474  15850   1389   14339
09:47:02.007  56  30 18357 51876       0     0      0      0     1602  17152   1683   12679
09:47:03.007  49  29 16554 45260       0     0      0      0     1605  17296   1505   15350
09:47:04.007  58  31 18236 47319       0     0     60      6     1480  15918   1555   11205
09:47:05.007  58  28 20374 58415       0     0      0      0     1608  17529   1626   16640
09:47:06.007  53  30 17347 47579       0     0      0      0     3337  35896   3427   30139
09:47:07.007  51  33 16722 45189       0     0      0      0     1503  16109   1078    9882
09:47:08.007  52  27 17104 46796       0     0      0      0     1470  15858   1842   14803
09:47:09.007  56  29 18046 50817       0     0     12      2     1630  17448   1424   14190
09:47:10.007  50  27 18421 50895       0     0      0      0     1644  17739   1940   16003
Look what happened at 09:47:06.007! The network traffic was reported at
twice the rate it should have been. So what's going on? This is really
subtle, but consider the case where the network stats are being updated
every 0.9765 seconds but you are sampling those numbers every 1 second.
This is much easier to visualize as a line, so I don't know if this will
work very well here or not, but I'll try. Consider the network counters
being written as 100, 200, 300, 400 and 500. If you read your first
sample before the counter is set to 200, you'll read 100. Then you take
your next sample before the counter is updated to 300 and you read
200. You take your next sample AFTER the counter is updated to 400 and
so you read 400. That means the counters you read once a second were
100, 200 and 400, and the rates you'll report are 100/sec and then
200/sec!!! That's exactly what's happening here, and there isn't a tool
around that can report the traffic correctly unless you sample at
0.9765-second intervals or at a high enough rate that this effect will
not be significant. Is that clear enough?
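For anyone who'd rather run it than read it, here's a toy simulation of that effect (not collectl code; the per-update packet count is a made-up constant): a counter bumped every 0.9765 seconds but read every 1.0 second.

```python
# Simulate once-a-second sampling of a kernel counter that is actually
# refreshed every 0.9765 seconds.  Each refresh adds 100 "packets", so
# the true rate is a steady 100/0.9765 ~= 102.4 pkts/sec.
UPDATE_PERIOD = 0.9765   # how often the kernel refreshes the counter
SAMPLE_PERIOD = 1.0      # how often we read it
PER_UPDATE = 100         # packets added per refresh (made-up constant)

def counter_at(t):
    """Counter value visible at wall-clock time t."""
    return PER_UPDATE * int(t / UPDATE_PERIOD)

deltas = []
prev = counter_at(0.0)
for i in range(1, 46):
    cur = counter_at(i * SAMPLE_PERIOD)
    deltas.append(cur - prev)   # what a once-a-second tool would report
    prev = cur

print(deltas)
```

Because the sample period is slightly longer than the update period, roughly one interval in 42 catches two kernel updates, and that interval's delta doubles - the same 2x spike as at 09:47:06.007 above.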
In any event, now that I've gotten your attention, try downloading
collectl from http://sourceforge.net/projects/collectl and taking it for
a spin. Try monitoring your network once a second and see what I mean.
You'll also find you can monitor a lot more than just the CPU, disk and
network in my example above. You can look at memory, NFS, sockets,
inodes, Lustre traffic and even InfiniBand. More importantly, you can
tell it to generate its data in 'space-separated' format, which you could
then even think about loading into rrd to get a broad historical
profile of what your system was doing. The only thing is there are
potentially hundreds of counters (if you sample at the device level), and
so you could need a lot of storage to hold it all. Or I suppose you
could turn the sampling rate down from 10 seconds to a minute or more if
that's what you prefer.
-mark
> _______________________________________________
> rrd-users mailing list
> rrd-users at lists.oetiker.ch
> https://lists.oetiker.ch/cgi-bin/listinfo/rrd-users
>