[rrd-users] Input values normalization

Donovan Baarda abo at minkirri.apana.org.au
Thu Feb 20 05:38:14 CET 2014


On 19 February 2014 01:41, Simon Hobson <linux at thehobsons.co.uk> wrote:
[...]
> Then RRD isn't the right tool for you. It's a bit like complaining that
> this oxy-acetylene torch doesn't cut a very neat hole in the car panel I'm
> trying to drill. Wrong tool; if you need a precision hole cut then use a
> tool designed for that.
> If you need precision storage of arbitrary data points then use a
> different database. If you want efficient storage and aggregation, but not
> precision storage of individual readings, then use RRD.

While this statement is true, and is a good way to avoid an ill-informed
argument, it's also a bit of an easy way out.

The timeseries management in RRD is extremely well thought out and
implemented. While it might not do what people *think* they want, it nearly
always does what they *really* want. People who avoid RRD because "it
doesn't do what we want" usually, after discovering many (but typically not
all) of the corner cases, end up re-inventing RRD badly.

Usually people *think* they want precision storage of the raw timeseries
samples. However, any time you want to *do* anything with this raw
timeseries, like graph it or compare/combine it with other timeseries, the
first thing you have to do is normalize it. Sometimes people *think* they
don't need to normalize it before using it, but if they don't, the results
they get are wrong. Usually people don't realize this until they start to
notice some operations occasionally giving results that can't be right,
like percentages >100% or <0%. If they bother to figure out why and fix it,
they end up re-inventing normalization, though they often don't realize it,
because it's hidden inside inefficient interpolation performed on the fly
every time they do anything. Maybe while they are implementing this they
run into the corner cases around missing samples, or counter wraps/resets.
The reality is, if you do it right, the *only* thing the raw timeseries
samples are used for is to create normalized timeseries before other
operations... again and again and again. RRD does the
normalization very efficiently once at sample collection time, and has
sensible handling of missing samples and counter wraps/resets.
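
To make normalization concrete, here's a worked example with made-up
numbers: with step=300, a GAUGE update at t=400 with value 10 followed by
one at t=700 with value 20 means 10 applies to 300..400 and 20 applies to
400..600, so the PDP for the 300..600 step works out to
(100*10 + 200*20)/300 ~= 16.7. Do that interpolation on the fly over raw
samples every time you graph or combine timeseries and you have
re-implemented what RRD does once, at update time.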

There are some things RRD doesn't do well that people sometimes really do
want, like keeping an infinitely long timeseries instead of
"round-robin-ing" over it, but normalization is not really one of them.

To get the most out of RRD you need to think about what you are measuring.
In this particular case, the magnitude of any sampling and normalization
errors depends on what exactly the GAUGE value represents. In the simplest
case a gauge represents the instantaneous magnitude at the time of the
sample. In this case the errors are unknown, as you don't know how high or
low the value could have gone between samples. Other types of gauges might
report the maximum, minimum, or average magnitude since the last sample, or
an exponentially decaying average over time (e.g. loadavg). Regardless of
what the gauge represents, in practice the sample interval and the maximum
rate of change of the gauge work together to limit the error.
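
As an illustration with made-up numbers: a temperature gauge that can
change by at most 0.1 degrees per second, sampled every 60 seconds, can
drift at most 0.1 * 60 = 6 degrees from the previous sample, so halving
the sample interval halves the worst-case error.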

You really should read "The HEARTBEAT and the STEP" section in the
rrdcreate docs and make sure you understand it;

http://oss.oetiker.ch/rrdtool/doc/rrdcreate.en.html

My advice is to set step to the smallest resolution you want to be able to
see, which could be more or less than your update interval, and set
heartbeat to 1.5x your update interval. Then have AVERAGE, MIN, and MAX
RRA's at the steps you want, using an xff of 0.5. You can then generate
nice graphs with a pale min-to-max range and a line at the average. An
example of how to do that is here;

http://minkirri.apana.org.au/~abo/projects/rrdcollect/rrdcollect.cgi

which gives results that look a little bit like this;

http://minkirri.apana.org.au/rrdcollect/sensor-temps-1000d.png
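
For the record, a minimal sketch of that setup, with made-up names and
numbers (a one-minute update interval, so step=60 and heartbeat=90, plus
per-minute and per-hour RRA's);

rrdtool create temps.rrd --step 60 \
  DS:temp:GAUGE:90:U:U \
  RRA:AVERAGE:0.5:1:1440 RRA:MIN:0.5:1:1440 RRA:MAX:0.5:1:1440 \
  RRA:AVERAGE:0.5:60:720 RRA:MIN:0.5:60:720 RRA:MAX:0.5:60:720

rrdtool graph temps.png --start -1d \
  DEF:avg=temps.rrd:temp:AVERAGE \
  DEF:min=temps.rrd:temp:MIN \
  DEF:max=temps.rrd:temp:MAX \
  CDEF:range=max,min,- \
  AREA:min AREA:range#ccccff::STACK LINE1:avg#0000ff:temp

The colourless AREA:min draws an invisible base, and STACK'ing the
min-to-max range on top of it gives the pale band under the average line.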

BTW, a quick trick I just discovered for GAUGE: if your update interval is
much longer than your step and you want PDP's only for your actual samples,
with all the PDP's between samples marked UNKNOWN, do an UNKNOWN update
step seconds before your real update, using something like this;

NOW=$(date +%s)
rrdtool update my.rrd $((NOW - STEP)):U:U:U:...
rrdtool update my.rrd $NOW:v1:v2:v3:...

For example, using step=1, heartbeat=2, and updating like this every
minute, you get 59 UNKNOWNs for all the seconds you didn't update, and 1
known value for the second that you did. This is arguably the correct PDP
sequence for an infrequently updated GAUGE that can vary wildly between
updates, but you will need to use very forgiving xff values on your
aggregating RRA's to get anything other than UNKNOWN in them.
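
To make that concrete, a sketch with made-up names (step=1, heartbeat=2,
and a per-minute AVERAGE RRA whose xff of 0.99 tolerates the 59 UNKNOWNs
out of every 60 PDP's);

rrdtool create sparse.rrd --step 1 \
  DS:val:GAUGE:2:U:U \
  RRA:AVERAGE:0.99:60:1440

and then once a minute;

NOW=$(date +%s)
rrdtool update sparse.rrd $((NOW - 1)):U
rrdtool update sparse.rrd $NOW:$VALUE

where $VALUE is whatever your sampler just read.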


--
Donovan Baarda <abo at minkirri.apana.org.au>