[rrd-users] counting errors in rrd

Fri Mar 4 17:47:57 CET 2011

[I hope I'm not stating the obvious, just ignore me in that case]

> The ultimate goal in what I am doing is to show a graph of the response
> time, with our SLA times marked.  With outages in and out of SLA times
> (which is what I have setup now).

RRDtool treats everything as a rate; keep that in mind when you enter numbers.
Every value, whether you insert it as GAUGE, COUNTER or any other data source
type, ends up as a rate inside RRDtool.
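
To make the "everything is a rate" point concrete, here is a minimal Python sketch of the arithmetic. It is not RRDtool's actual code: the function name to_rate is made up for illustration, only GAUGE and COUNTER are handled, and counter wrap-around is ignored.

```python
def to_rate(ds_type, prev_value, value, prev_time, time):
    """Return the per-second rate implied by one update (illustrative only)."""
    if ds_type == "GAUGE":
        # A GAUGE value is taken as-is: it already is the rate.
        return float(value)
    if ds_type == "COUNTER":
        # A COUNTER is turned into a rate: increase divided by elapsed seconds.
        # (Counter wrap-around is deliberately ignored in this sketch.)
        return (value - prev_value) / (time - prev_time)
    raise ValueError("unsupported data source type: " + ds_type)


# 600 extra counts over 300 seconds is a rate of 2 per second.
print(to_rate("COUNTER", 1000, 1600, 0, 300))
# A gauge reading of 42 stays 42.
print(to_rate("GAUGE", None, 42, 0, 300))
```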

I think you want to enter the duration of an outage in seconds.

Given your goal, I think the high numbers matter most, since they represent
the worst performance. With a MAX RRA, a graph covering a long period, say a
whole year, will still show the longest outage within each 86400-second
(one-day) period.
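
A small Python sketch, with made-up sample data, of why the high numbers need their own consolidation: squeezing one day of 5-minute samples (288 of them) into a single value by averaging nearly hides one bad interval, while taking the maximum keeps it. This only mimics the idea of consolidation; it is not how RRDtool is implemented.

```python
# One day of 5-minute samples: outage seconds recorded for each interval.
# Hypothetical data: the service was fine all day except one 240-second outage.
samples = [0] * 288
samples[100] = 240

# AVERAGE-style consolidation: the outage is diluted over the whole day.
day_average = sum(samples) / len(samples)
# MAX-style consolidation: the longest outage of the day survives.
day_max = max(samples)

print(day_average)
print(day_max)
```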

>  The next step will be to figure out the
> total SLA availability and total availability of the service that I am
> monitoring.  I was hoping to use the flags for this, realizing that it is
> not a true measure of uptime.  Checking every 5 mins, can show a 10 min
> outage, even though it was only down for the moment it took to do the 
> check.

For this, I think it would be best to have uptime as a fraction (or a 
percentage) of total time. Worst performance is the lowest number in this 
case.

When entering uptime every 5 minutes, update it as a percentage or a 
fraction. RRDtool will then do the right thing for you when it is 
normalizing and consolidating. Computing how much of the previous 5 minutes 
your system was up is something you do outside RRDtool in this case.
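
The "outside RRDtool" computation could look roughly like the Python sketch below. The function name uptime_fraction and the list of (down_at, up_at) outage timestamps are assumptions made for illustration; how you actually detect outages is up to your monitoring.

```python
def uptime_fraction(window_start, window_end, outages):
    """Fraction of [window_start, window_end) during which the system was up.

    `outages` is a list of (down_at, up_at) timestamps in seconds; outages
    are assumed not to overlap each other.
    """
    window = window_end - window_start
    down = 0
    for down_at, up_at in outages:
        # Count only the part of the outage that falls inside the window.
        overlap = min(up_at, window_end) - max(down_at, window_start)
        if overlap > 0:
            down += overlap
    return (window - down) / window


# A 30-second outage inside a 300-second window.
print(uptime_fraction(0, 300, [(100, 130)]))
```

The result is the value you would then hand to RRDtool in the regular 5-minute update.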

When entering updates at the specific (and known) time stamps where the
system changes state, write a 1 (or 100) when the system goes from up to
down, and a 0 when it goes from down to up. An update describes the interval
that just ended, so the value is the state the system is leaving. Make sure
to have a high heartbeat setting, or periodically insert the current state
even if there's no change.
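
As a sketch of the state-change approach, the Python below turns a list of transitions into the timestamp:value strings that rrdtool update accepts. The function name and the example timestamps are made up; the point is only that the value written is the state before the transition, because an update covers the time since the previous update.

```python
def transition_updates(transitions):
    """Turn (timestamp, new_state) transitions into 'timestamp:value' strings.

    new_state is "down" or "up". The value written describes the interval
    that just ended, so it is the state *before* the transition.
    """
    updates = []
    for timestamp, new_state in transitions:
        # Going down means the system was up until now: write 1.
        # Coming back up means it was down until now: write 0.
        value = 1 if new_state == "down" else 0
        updates.append("%d:%d" % (timestamp, value))
    return updates


# Hypothetical outage from 1299250000 to 1299250090.
print(transition_updates([(1299250000, "down"), (1299250090, "up")]))
```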

Summarized:
track outages both as a duration in seconds and as an uptime fraction of
total time. Have MIN and MAX (and probably also AVERAGE) RRAs.
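
One possible layout for such an RRD, written as a Python sketch that only builds the rrdtool create argument list and does not run it. The data source names (outage, uptime), the 300-second step, the 600-second heartbeat and the retention periods are all assumptions chosen for illustration; adjust them to your own setup.

```python
def create_args(rrd_path, step=300, heartbeat=600):
    """Build an 'rrdtool create' command line with MIN, MAX and AVERAGE RRAs."""
    args = [
        "rrdtool", "create", rrd_path, "--step", str(step),
        # Outage duration in seconds: no upper limit (U = unknown).
        "DS:outage:GAUGE:%d:0:U" % heartbeat,
        # Uptime as a fraction of total time: between 0 and 1.
        "DS:uptime:GAUGE:%d:0:1" % heartbeat,
    ]
    for cf in ("AVERAGE", "MIN", "MAX"):
        # One row per 5-minute step, kept for one week (7 * 288 rows).
        args.append("RRA:%s:0.5:1:2016" % cf)
        # One row per day (288 steps), kept for one year.
        args.append("RRA:%s:0.5:288:365" % cf)
    return args


print(" ".join(create_args("service.rrd")))
```

With this layout you would graph MAX for the outage duration and MIN for the uptime fraction to see the worst case in each period.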