[rrd-users] trying to understand the relationship between source data, what's in rrd and what gets plotted

Mark Seger Mark.Seger at hp.com
Wed Jul 25 15:58:02 CEST 2007



Simon Hobson wrote:
> Mark Seger wrote:
>
>   
>>>  That is because hh:mm:06, hh:mm:16, hh:mm:26 and so on are not a whole
>>>  multiple of 10 seconds.
>>>
>>>  You have "n*step+offset", not "n*step".  This is why normalization is
>>>  needed.
>>>
>>>>  As I said above it sounds like if I conform my data to align to the time
>>>>  boundary conditions rrd requires it should work and if I don't conform it
>>>>  won't.
>>>
>>>  No.  Your step size is wrong, not your input.  Change your step size
>>>  to 1, 2, 3 or 6 seconds.
>> so if I understand what you're suggesting I should pick a start time and
>> step size such that my data will align accordingly, right?  Since I have
>> samples at 00:01:06, 00:01:16, etc that would mean I should pick a time
>> that lands on a minute boundary and a step of 2 because 00:01:02, 00:01:04,
>> 00:01:06, etc will still hit all my timestamps.  1 sec would work too but
>> that would be overkill.  I don't think 3 or 6 would do it because they
>> would not all align.  00:01:06 would, but you'd never see 00:01:16.
>>     
>
> Not quite - FORGET THE MINUTE BOUNDARIES
>   
yes, I realize that but kept using them to simplify my examples.  I 
learned a lot about time boundaries and alignment when I wanted to get 
my tool to align to the closest millisecond.  8-)
> rrdtool uses samples that are a multiple of "step" seconds since the unix 
> epoch - you can easily pick step sizes whose slots do not fall on minute 
> boundaries (whilst a step of 7 would not be very common, most of its 
> slot boundaries would not fall on a minute boundary).
>
> But you are correct that steps of 3 or 6 will not get you 10 second intervals.
>
>   
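Just to make the epoch-alignment point concrete, here's a quick sketch in
Python (my own illustration, not anything rrdtool ships) that checks a few
of my n*10+6 sample times against candidate step sizes.  The date is made
up; the arithmetic works the same for any day:

import calendar

# samples taken at 00:01:06, 00:01:16, 00:01:26, 00:01:36 UTC on some day
samples = [calendar.timegm((2007, 7, 25, 0, 1, s)) for s in (6, 16, 26, 36)]

for step in (1, 2, 3, 6, 10):
    # a sample sits exactly on a slot boundary only when ts % step == 0
    ok = all(ts % step == 0 for ts in samples)
    print("step %2ds: %s" % (step,
          "every sample on a boundary" if ok else "some samples fall between boundaries"))

With a step of 1 or 2 every sample lands on a boundary; with 3, 6 or 10
only some (or none) do, which is exactly the problem above.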
>> so let's say I have 3 samples of 100, 1000 and 100 starting at
>> 00:01:06.  since these are absolute numbers for 10 second intervals,
>> they really represent rates of 10/sec, 100/sec and 10/sec.  am I then
>> correct in assuming that rrd will then normalize it into 15 slots with
>> 20/slot for the first 5, 200 for the next 5 and then 20 for the next 5,
>> all aligned to 00:01:00.
>>     
>
> Actually 10/s is 10/s - not 20/s !  10/s * 2s would get you 20.
>   
that's what I was trying to say 8-)
>>  so starting at 01:00 the data would look like
>> 20 20 20 20 20 200 200 200 200 200 200 20 20 20 20 20.  If I then wanted
>> to see what the rate is at 01:06, rrd would see a value in that 2 second
>> slot of 20 and treat it as a rate of 10/sec.  the same would hold for
>> any of the 200s which would be reported as 100/sec for the slots they
>> occur in, right?
>>
>> this is certainly a lot closer to what I was looking for and gets back
>> to really clarifying my original question which was the subject of this
>> thread.  I guess the negatives here are you have to be real careful to
>> pick the right time and stepsize and if your samples don't land on
>> integral time boundaries all bets are off (what if my samples were at
>> 00:01:06.5, 00:01:16.5, etc?).  it would also make my rrd database 5
>> times bigger and it's already over 10MB for 1 day's worth of data.
>>     
>
> An alternative for handling your historical data might be to simply 
> 'lie' about the timestamps ! E.g., for your 00:01:06 sample, insert it 
> with a timestamp of 00:01:00, 00:01:16 as 00:01:10 and so on. You'll 
> have a slight blip as you change to actually collecting the data on 
> 10s steps (instead of n*10+6 steps) but it would allow you to graph 
> your historical data without going to 2s steps.
>   
yes, that would work but it would also mean one would need to remember that.
something I just remembered that kind of shoots a hole in this 
discussion is sampling drift.  the data I collected and was using for my 
tests drifted 4 seconds over the course of the day and I don't think any 
solution will exactly address that.  now that I align my data to my 
interval (rrd's step), even if it's in milliseconds, this is all moot.  
however the discussion has been very helpful.
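Just so we're agreed on the arithmetic, here's a little sketch of what I
think we settled on above (my own numbers, not rrdtool output): the
absolute counts per 10-second sample become per-second rates, and with a
2-second step each slot simply repeats the rate that covered it - 10/sec,
not 20/sec, even though 20 events "belong" to each 2-second slot:

counts = [100, 1000, 100]      # absolute counts per 10-second sample
sample_interval = 10           # seconds between my samples
step = 2                       # rrd step size

for i, count in enumerate(counts):
    rate = count / float(sample_interval)        # 10, 100, 10 per second
    for slot in range(sample_interval // step):  # 5 slots per sample
        t = i * sample_interval + slot * step
        print("slot at +%2ds: rate %3.0f/sec (%3.0f events in the slot)"
              % (t, rate, rate * step))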
>> btw - just to toss in an interesting wrinkle did you know if you sample
>> network statistics once a second you will periodically get an invalid
>> value because of the frequency at which linux updates its network
>> counters?  the only way I'm able to get accurate network statistics near
>> that rate is to sample them every 0.9765 seconds.  I can go into more
>> detail if anyone really cares.  8-)
>>     
>
> I'm curious ...
>   
ahh!  I knew I'd get someone to ask...  The trick is how easily I can 
explain this.

It turns out that unlike most system counters, which get updated quite 
frequently, network counters only get updated about once a second - but 
not exactly once a second!  They actually get updated every 0.9765 
seconds.  So consider the output of my collection tool at an interval of 
0.2 seconds.  Just note that in the following format I'm reporting the 
aggregate across interfaces while doing a 'ping -f' on one of them.  The 
rates for the different interfaces are being updated at different times 
and that's why you're seeing the 8M/sec numbers aligning at .208 while 
the background traffic on a different interface is aligning at .409.

#             <-----------Network---------->
#Time         netKBi pkt-in  netKBo pkt-out
09:41:14.809       0      0       0       0
09:41:15.009       0      0       0       0
09:41:15.209    8418  91729    8927   92564
09:41:15.409      61    945    2082    1585
09:41:15.609       0      0       0       0
09:41:15.809       0      0       0       0
09:41:16.009    7635  82294    7877   82464
09:41:16.209       0      0       0       0
09:41:16.409       0      0       0       0
09:41:16.609       0      0       0       0
09:41:16.809       1      4       1       4
09:41:17.009    8228  87659    8252   87639
09:41:17.209       0      0       0       0
09:41:17.409      94   1380    3042    2320
09:41:17.609       0      0       0       0
09:41:17.809       0      0       0       0
09:41:18.009    8598  92534    8854   92879
09:41:18.209       0      0       0       0
09:41:18.409       0      0       0       0
09:41:18.609       0      0       0       0

Actually, here's a different form of the output, broken out by interface; 
I just did a grep on 'eth1':

09:44:06.408    2   eth1:      0      0      0      0      0      0      0      0      0
09:44:06.608    2   eth1:  75304      0  74949      0      0      0      0   7005   7155
09:44:06.808    2   eth1:      0      0      0      0      0      0      0      0      0
09:44:07.008    2   eth1:      0      0      0      0      0      0      0      0      0
09:44:07.208    2   eth1:      0      0      0      0      0      0      0      0      0
09:44:07.408    2   eth1:      0      0      0      0      0      0      0      0      0
09:44:07.609    2   eth1:  90796      0  91442      0      0      0      0   8407   8841
09:44:07.808    2   eth1:      0      0      0      0      0      0      0      0      0
09:44:08.008    2   eth1:      0      0      0      0      0      0      0      0      0
09:44:08.208    2   eth1:      0      0      0      0      0      0      0      0      0
09:44:08.408    2   eth1:      0      0      0      0      0      0      0      0      0
09:44:08.608    2   eth1:  80064      0  80599      0      0      0      0   7447   7805

Now that that's all said, let's look at another form of output which includes 
cpu load and disk traffic:

#             <--------CPU--------><-----------Disks-----------><-----------Network---------->
#Time         cpu sys inter  ctxsw KBRead  Reads  KBWrit Writes netKBi pkt-in  netKBo pkt-out
09:47:00.007   47  27 17012  40654      0      0       0      0   1389  14902    1600   12420
09:47:01.007   56  34 18252  47138      0      0       0      0   1474  15850    1389   14339
09:47:02.007   56  30 18357  51876      0      0       0      0   1602  17152    1683   12679
09:47:03.007   49  29 16554  45260      0      0       0      0   1605  17296    1505   15350
09:47:04.007   58  31 18236  47319      0      0      60      6   1480  15918    1555   11205
09:47:05.007   58  28 20374  58415      0      0       0      0   1608  17529    1626   16640
09:47:06.007   53  30 17347  47579      0      0       0      0   3337  35896    3427   30139
09:47:07.007   51  33 16722  45189      0      0       0      0   1503  16109    1078    9882
09:47:08.007   52  27 17104  46796      0      0       0      0   1470  15858    1842   14803
09:47:09.007   56  29 18046  50817      0      0      12      2   1630  17448    1424   14190
09:47:10.007   50  27 18421  50895      0      0       0      0   1644  17739    1940   16003

Look what happened at 09:47:06.007!  The network traffic was reported at 
twice the rate it should have been.  So what's going on?  This is really 
subtle, but consider the case where the network stats are being updated 
every 0.9765 seconds but you are sampling those numbers every 1 second.  
This is much easier to visualize as a timeline, so I don't know if this 
will work very well here or not, but I'll try.  Consider the network 
counters are being written as 100, 200, 300, 400 and 500.  If you read 
your first sample before the counter is set to 200, you'll read 100.  Then 
you take your next sample before the counter is updated to 300 and you 
read 200.  You take your next sample AFTER the counter is updated to 400 
and so you read 400.  That means the counters you read once a second were 
100, 200 and 400 and the rates you'll report are 100/sec and then 
200/sec!!!  That's exactly what's happening here and there isn't a tool 
around that can report the traffic correctly unless you sample every 
0.9765 seconds or at a high enough rate that this effect will not be 
significant.  Is that clear enough?
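If it helps, here's a tiny Python simulation of the effect (again my own
sketch, not collectl code): the kernel bumps the counter every 0.9765
seconds but we read it every 1.0 seconds, so roughly once every 40-odd
samples two kernel updates land inside a single sampling interval and the
apparent rate doubles:

UPDATE_PERIOD = 0.9765   # how often the kernel refreshes the counter
SAMPLE_PERIOD = 1.0      # how often we read it
PER_UPDATE = 100         # packets added per kernel update (made-up number)

def counter_at(t):
    # value of the counter as we would see it at wall-clock time t
    return int(t / UPDATE_PERIOD) * PER_UPDATE

prev = counter_at(0.0)
for n in range(1, 45):
    cur = counter_at(n * SAMPLE_PERIOD)
    print("sample %2d: counter %5d  apparent rate %3d/s" % (n, cur, cur - prev))
    prev = cur

Run that and you'll see a steady 100/s until sample 42, where the rate
suddenly reads 200/s - the same doubling as the 09:47:06.007 line above.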

In any event, now that I've gotten your attention, try downloading 
collectl from http://sourceforge.net/projects/collectl and taking it for 
a spin.  Try monitoring your network once a second and see what I mean.  
You'll also find you can monitor a lot more than just the cpu, disk and 
network in my example above.  You can look at memory, nfs, sockets, 
inodes, lustre traffic and even infiniband.  More importantly, you can 
tell it to generate its data in 'space-separated' format, which you could 
then even think about loading into rrd to get a broad historical 
profile of what your system was doing.  The only thing is there are 
potentially hundreds of counters (if you sample at the device level) and 
so you could need a lot of storage to hold it all.  Or I suppose you 
could turn the sampling rate down from 10 seconds to a minute or more if 
that's what you prefer.
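And just to close the loop on the rrd side, here's roughly what I had in
mind for getting one space-separated line into rrd.  Everything here is a
placeholder - the rrd file name, the assumption of two data sources and
the column positions are just taken from the sample output above, so
adjust to whatever your output really contains:

import subprocess, time

line = "09:47:00.007 47 27 17012 40654 0 0 0 0 1389 14902 1600 12420"
fields = line.split()

# the line only carries HH:MM:SS.mmm, so borrow today's date for the
# epoch timestamp (an assumption, of course)
h, m, s = fields[0].split(":")
now = time.localtime()
ts = int(time.mktime((now.tm_year, now.tm_mon, now.tm_mday,
                      int(h), int(m), int(float(s)), 0, 0, -1)))

netkbi, netkbo = fields[9], fields[11]   # the netKBi / netKBo columns above

# assumes net.rrd was created with two data sources in this order
subprocess.call(["rrdtool", "update", "net.rrd",
                 "%d:%s:%s" % (ts, netkbi, netkbo)])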

-mark


