[rrd-users] showing percentage of service availability per certain period of time

Alex van den Bogaerdt alex at ergens.op.het.net
Thu Sep 13 18:52:19 CEST 2007

On Thu, Sep 13, 2007 at 01:18:43PM +0200, Ladislav Andel wrote:
> Hello,
> I'm trying to figure out how to configure rrd graph tool to get me the 
> percentage of server availability.
> I'm monitoring servers which return round-trip time to rrdtool.
> Sometimes (when the server doesn't respond) I don't get any answer within 
> the given period of time, so the value in the RRD is NaN.
> I would like to calculate for certain period of time e.g. a week how 
> many times the server was not accessible.
> Something like the successful number of tests divided by total amount of 
> tests for the period.
> Could you give me an example of this approach as rrd script?

You have:

* A number when the server responded. You assume the server was
  up during the entire interval.
* NaN when the server did not respond. You assume the server was
  down during the entire interval.

You can know the amount of time involved if you make sure that:
* your start time is a whole multiple of your RRA step size
* your end time is a whole multiple of your RRA step size
  (e.g. if you have six PDPs per CDP, and each PDP is 300
  seconds, both need to be a whole multiple of 6*300=1800 seconds)

Remember that 6*300 (or whatever is appropriate for you). You will
need it again later.

* if you create a graph in the same run, the number of pixels is a
  whole multiple (may be 1) of the number of steps, which
  is (end-start)/RRA_step_size.
* if you don't create a graph, I believe RRDtool will automatically
  do the right thing.
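
The alignment rules above can be checked before calling RRDtool. What
follows is only a minimal sketch in Python; the helper names
floor_to_step and count_steps and the unaligned sample timestamp are
made up for illustration, and the 300 and 6 are the example values
from this message.

```python
def floor_to_step(timestamp: int, step: int) -> int:
    """Round a Unix timestamp down to a whole multiple of step."""
    return timestamp - (timestamp % step)


def count_steps(start: int, end: int, step: int) -> int:
    """Return (end-start)/step, refusing boundaries that are not aligned."""
    if start % step or end % step:
        raise ValueError("start and end must be whole multiples of the RRA step")
    return (end - start) // step


pdp_seconds = 300                        # length of one PDP (example value)
pdps_per_cdp = 6                         # PDPs per CDP (example value)
rra_step = pdp_seconds * pdps_per_cdp    # 6*300 = 1800 seconds

# Hypothetical, deliberately unaligned start time.
start = floor_to_step(1189634999, rra_step)
end = start + 48 * rra_step              # exactly one day later

print(start, end, count_steps(start, end, rra_step))
```

If the number printed last is also your graph width in pixels, each
pixel covers exactly one row of the RRA.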

Then you do a CDEF calculation: if a value is NaN, return 100, else 0.
This is the trick!  For each of the intervals, you tell RRDtool whether
the device was down the entire time (100) or up (0).

Now PRINT or GPRINT the average of all these percentages.
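
As a sketch of what that could look like, the Python below only builds
and prints an "rrdtool graph" command line; it does not run it.  The
file name server.rrd, the DS name rtt, the vnames down and pctdown, and
the function name availability_args are assumptions, so substitute your
own.  The time values in the call are placeholders, aligned to 1800
seconds.  The RPN "rtt,UN,100,0,IF" yields 100 for an unknown value and
0 otherwise.

```python
import shlex


def availability_args(rrd_file, ds_name, start, end, rra_step, width,
                      image="/dev/null"):
    """Build the argument list for an 'rrdtool graph' run that prints % down."""
    return [
        "graph", image,                    # image is discarded by default
        "--start", str(start),
        "--end", str(end),
        "--step", str(rra_step),           # ask for the RRA resolution
        "--width", str(width),
        f"DEF:rtt={rrd_file}:{ds_name}:AVERAGE",
        "CDEF:down=rtt,UN,100,0,IF",       # NaN becomes 100, a number becomes 0
        "VDEF:pctdown=down,AVERAGE",       # average over the whole period
        "PRINT:pctdown:%6.2lf%% down",
    ]


# Hypothetical file and DS names; one day of 1800-second rows.
args = availability_args("server.rrd", "rtt",
                         start=1800 * 1000, end=1800 * 1048,
                         rra_step=1800, width=48)
print(shlex.join(["rrdtool", *args]))
```

To actually run it, pass the list to
subprocess.run(["rrdtool", *args], check=True) on a machine where
rrdtool is installed.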

Let's verify:

A 500-pixel graph; each CDP is 6 PDPs, each PDP is 300 seconds.
That's 500 time slots of 1800 seconds each.

Don't worry about the weird end time; I needed an example which
was easy to compute by hand!

Start time is midnight, 1189634400, which is a whole multiple of
1800.  End time is 10:00 ten days later, 1190534400, which
is also a whole multiple of 1800.  Good.

(end-start)/(6*300) = (1190534400-1189634400)/1800 = 500. Good.

So, 500 periods up (a number) or down (NaN).  Let's say out of
these samples, 25 were NaN.

25 times you've modified a NaN into 100, and 475 times you've
modified a number into zero.  The average of this is
(25*100 + 475*0)/500 = 25*100/500, which is 5.  If you print this
using "%6.2lf%% down" the result will look like "  5.00% down".

Let's check: 25 out of 500 is 5 out of 100, which is indeed 5%.
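
The same verification can be replayed outside RRDtool.  This is just a
sketch of the arithmetic in Python, not of what RRDtool does
internally; the round-trip value 0.042 is invented filler, and Python's
"%6.2f" stands in for RRDtool's "%6.2lf".

```python
import math

rra_step = 6 * 300                       # 1800 seconds per CDP
start, end = 1189634400, 1190534400      # the times used in this example
slots = (end - start) // rra_step        # number of time slots

# 25 slots without an answer (NaN); the rest hold some round-trip time.
samples = [math.nan] * 25 + [0.042] * (slots - 25)

# The CDEF step: NaN becomes 100 (down), a number becomes 0 (up).
down = [100.0 if math.isnan(value) else 0.0 for value in samples]

# The PRINT step: average of all the per-slot percentages.
pct_down = sum(down) / len(down)
line = "%6.2f%% down" % pct_down

print(slots)    # 500
print(line)     # "  5.00% down"
```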

Alex van den Bogaerdt
