[rrd-users] trying to understand the relationship between source data, what's in rrd and what gets plotted

Mark Seger Mark.Seger at hp.com
Wed Jul 25 15:58:02 CEST 2007

```
Simon Hobson wrote:
> Mark Seger wrote:
>
>
>>  > That is because hh:mm:06, hh:mm:16, hh:mm:26 and so on are not a whole
>>
>>>  multiple of 10 seconds.
>>>
>>>  You have "n*step+offset", not "n*step".  This is why normalization is
>>>  needed.
>>>
>>>
>>>
>>>
>>>>  As I said above it sounds like if I conform my data to align to the time
>>>>  boundary conditions rrd requires it should work and if I don't conform it
>>>>  won't.
>>>>
>>>>
>>>  to 1,2,3 or 6 seconds
>>>
>>>
>> so if I understand what you're suggesting I should pick a start time and
>> step size such that my data will align accordingly, right?  Since I have
>> samples at 00:01:06, 01:12, etc that would mean I should pick a time
>> that lands on a minute boundary and a step of 2 because 00:01:02, 1:04,
>> 1:06, etc will still hit all my timestamps.  1 sec would work too but
>> that would be overkill.  I don't think 3 or 6 would do it because they
>> would not all align.  00:01:06 would, but you'd never see 01:16.
>>
>
> Not quite - FORGET THE MINUTE BOUNDARIES
>
yes, I realize that but kept using it to simplify my examples.  I
learned a lot about time boundaries and alignment when I wanted to get
my tool to align to the closest millisecond.  8-)
> rrdtool uses samples that are a multiple of "step" seconds since unix
> epoch - you can easily pick step times which do not fall on minute
> boundaries (whilst 7 would not be very common, most times it would
> not fall on a minute boundary).
>
> But you are correct that steps of 3 or 6 will not get you 10 second intervals.
>
>
>> so let's say I have 3 samples of 100, 1000 and 100 starting at
>> 00:01:06.  since these are absolute numbers for 10 second intervals,
>> they really represent rates of 10/sec, 100/sec and 10/sec.  am I then
>> correct in assuming that rrd will then normalize it into 15 slots with
>> 20/slot for the first 5, 200 for the next 5 and then 20 for the next 5,
>> all aligned to 00:01:00.
>>
>
> Actually 10/s is 10/s - not 20/s !  10/s * 2s would get you 20.
>
that's what I was trying to say 8-)
>>  so starting at 01:00 the data would look like
>> 20 20 20 20 20 200 200 200 200 200 20 20 20 20 20.  If I then wanted
>> to see what the rate is at 01:06, rrd would see a value in that 2 second
>> slot of 20 and treat it as a rate of 10/sec.  the same would hold for
>> any of the 200s which would be reported as 100/sec for the slots they
>> occur in, right?
>>
>> this is certainly a lot closer to what I was looking for and gets back
>> to really clarifying my original question which was the subject of this
>> thread.  I guess the negatives here are you have to be real careful to
>> pick the right time and stepsize and if your samples don't land on
>> integral time boundaries all bets are off (what if my samples were at
>> 00:01:06.5, 00:01:12.5, etc?).  it would also make my rrd database 5
>> times bigger and it's already over 10MB for 1 day's worth of data.
>>
>
> An alternative for handling your historical data might be to simply
> 'lie' about the timestamps ! Eg, for your 00:01:06 sample, insert it
> with a timestamp of 00:01:00, 00:01:16 as 00:01:10 and so on. You'll
> have a slight blip as you change to actually collecting the data on
> 10s steps (instead of n*10+6 steps) but it would allow you to graph
> your historical data without going to 2s steps.
>
yes, that would work but it would also mean one would need to remember that.
something I just remembered that kind of shoots a hole in this
discussion is sampling drift.  the data I collected and was using for my
tests drifted 4 seconds over the course of the day and I don't think any
solution will exactly address that.  now that I align my data to my
interval (rrd's step) even if it's in milliseconds, this is all moot.
however the discussion has been very helpful.
>> btw - just to toss in an interesting wrinkle did you know if you sample
>> network statistics once a second you will periodically get an invalid
>> value because of the frequency at which linux updates its network
>> counters?  the only way I'm able to get accurate network statistics near
>> that rate is to sample them every 0.9765 seconds.  I can go into more
>> detail if anyone really cares.  8-)
>>
>
> I'm curious ...
>
ahh!  I knew I'd get someone to ask...  The trick is how easily can I
explain this.

It turns out that unlike most system counters, which get updated quite
frequently, network counters only get updated about once a second but
not exactly once a second!  It turns out they get updated every 0.9765
seconds.  So consider the output of my collection tool at an interval of
0.2 seconds.  Just note that in the following format, I'm reporting the
aggregate across interfaces while doing a 'ping -f' on one of them.  The
rates for the different interfaces are being updated at different times
and so that's why you're seeing the 8M/sec numbers aligning at .208 while
the background traffic on a different interface is aligning at .409.

#             <-----------Network---------->
#Time         netKBi pkt-in  netKBo pkt-out
09:41:14.809       0      0       0       0
09:41:15.009       0      0       0       0
09:41:15.209    8418  91729    8927   92564
09:41:15.409      61    945    2082    1585
09:41:15.609       0      0       0       0
09:41:15.809       0      0       0       0
09:41:16.009    7635  82294    7877   82464
09:41:16.209       0      0       0       0
09:41:16.409       0      0       0       0
09:41:16.609       0      0       0       0
09:41:16.809       1      4       1       4
09:41:17.009    8228  87659    8252   87639
09:41:17.209       0      0       0       0
09:41:17.409      94   1380    3042    2320
09:41:17.609       0      0       0       0
09:41:17.809       0      0       0       0
09:41:18.009    8598  92534    8854   92879
09:41:18.209       0      0       0       0
09:41:18.409       0      0       0       0
09:41:18.609       0      0       0       0

Actually, here's a different form of the output by interface and I just
did a grep on 'eth1':

09:44:06.408    2   eth1:      0      0      0      0      0      0      0      0      0
09:44:06.608    2   eth1:  75304      0  74949      0      0      0      0   7005   7155
09:44:06.808    2   eth1:      0      0      0      0      0      0      0      0      0
09:44:07.008    2   eth1:      0      0      0      0      0      0      0      0      0
09:44:07.208    2   eth1:      0      0      0      0      0      0      0      0      0
09:44:07.408    2   eth1:      0      0      0      0      0      0      0      0      0
09:44:07.609    2   eth1:  90796      0  91442      0      0      0      0   8407   8841
09:44:07.808    2   eth1:      0      0      0      0      0      0      0      0      0
09:44:08.008    2   eth1:      0      0      0      0      0      0      0      0      0
09:44:08.208    2   eth1:      0      0      0      0      0      0      0      0      0
09:44:08.408    2   eth1:      0      0      0      0      0      0      0      0      0
09:44:08.608    2   eth1:  80064      0  80599      0      0      0      0   7447   7805

now that all said, let's look at another form of output which includes
cpu and disk data along with the network:

#             <--------CPU--------><-----------Disks-----------><-----------Network---------->
                                                                        pkt-in  netKBo pkt-out
09:47:00.007   47  27 17012  40654      0      0       0      0   1389  14902    1600   12420
09:47:01.007   56  34 18252  47138      0      0       0      0   1474  15850    1389   14339
09:47:02.007   56  30 18357  51876      0      0       0      0   1602  17152    1683   12679
09:47:03.007   49  29 16554  45260      0      0       0      0   1605  17296    1505   15350
09:47:04.007   58  31 18236  47319      0      0      60      6   1480  15918    1555   11205
09:47:05.007   58  28 20374  58415      0      0       0      0   1608  17529    1626   16640
09:47:06.007   53  30 17347  47579      0      0       0      0   3337  35896    3427   30139
09:47:07.007   51  33 16722  45189      0      0       0      0   1503  16109    1078    9882
09:47:08.007   52  27 17104  46796      0      0       0      0   1470  15858    1842   14803
09:47:09.007   56  29 18046  50817      0      0      12      2   1630  17448    1424   14190
09:47:10.007   50  27 18421  50895      0      0       0      0   1644  17739    1940   16003

Look what happened at 9:47:06.007!  the network traffic was reported at
twice the rate it should have been.  so what's going on?  this is really
subtle, but consider the case where the network stats are being updated
every 0.9765 seconds but you are sampling those numbers every 1 second.
This is much easier to visualize as a line so I don't know if this will
work very well here or not but I'll try.  consider the network counters
are being written as 100, 200, 300, 400 and 500.  If you read your first
sample before the counter is set to 200, you'll read 100.  Then you take
your next sample before the counter is updated to 300 and you read
200.  You take your next sample AFTER the counter is updated to 400 and
so you read 400.  That means the counters you read once a second were
100, 200 and 400 and the rates you'll report are 100 and then 200!!!
That's exactly what's happening here and there isn't a tool around that
can report the traffic correctly unless you sample at 0.9765 or at a
high enough rate that this effect will not be significant.  Is that
clear enough?

Try downloading collectl from http://sourceforge.net/projects/collectl
and taking it for a spin.  Try monitoring your network once a second and
see what I mean.
You'll also find you can monitor a lot more than just the cpu, disk and
network in my example above.  You can look at memory, nfs, sockets,
inodes, lustre traffic and even infiniband.  more importantly, you can
tell it to generate its data in 'space-separated' format which you could
profile of what your system was doing.  The only thing is there are
potentially hundreds of counters (if you sample at the device level) and
so you could need a lot of storage to hold it all.  Or I suppose you
could turn the sampling rate down from 10 second to a minute or more if
that's what you prefer.

-mark

```
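
The double-counting effect described in the message can be reproduced with a small model. The sketch below is only an illustration and is not how collectl or the kernel works internally: it assumes perfectly steady traffic, a counter that becomes visible in jumps of 100 once every 0.9765 seconds (the refresh period quoted in the post), and a reader whose first sample coincides with a refresh. The names `counter_at` and `sampled_deltas` are made up for this example.

```python
# Times are integer ticks of 0.1 ms so the arithmetic stays exact.
UPDATE_TICKS = 9765   # counter refresh period: 0.9765 s (figure from the post)
INCREMENT = 100       # hypothetical: steady traffic adds 100 per refresh


def counter_at(t_ticks):
    """Value a reader sees at time t; only completed refreshes are visible."""
    return INCREMENT * (t_ticks // UPDATE_TICKS)


def sampled_deltas(sample_ticks, n_samples):
    """Read the counter every sample_ticks and return the per-interval deltas."""
    readings = [counter_at(k * sample_ticks) for k in range(n_samples + 1)]
    return [later - earlier for earlier, later in zip(readings, readings[1:])]


# Poll once a second (10000 ticks) and once per refresh period, 100 times each.
one_second = sampled_deltas(10_000, 100)
matched = sampled_deltas(UPDATE_TICKS, 100)

# Intervals in which two refreshes landed between consecutive reads.
doubled = [i for i, d in enumerate(one_second) if d == 2 * INCREMENT]

print("1 s sampling, intervals showing double traffic:", doubled)
print("0.9765 s sampling, distinct deltas seen:", sorted(set(matched)))
```

In this model the once-a-second reader sees a doubled delta in intervals 41 and 83 out of 100, roughly once every 42 seconds, while the reader that polls at the refresh period always sees 100. With real traffic and a different starting phase the doubled intervals would fall elsewhere, but the 0.0235 second drift per sample that causes them is the same.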