[rrd-users] Re: Tracking uptime

Wed Sep 21 12:23:07 MEST 2005

On Tue, Sep 20, 2005 at 10:33:55PM -0400, Gregory (Grisha) Trubetskoy wrote:

> > If possible at all: try to write fractions of 100 (or 1) when you detect
> > downtime somewhere in the interval but not right now.
> 
> Could you elaborate on this last point? You mean that if I have knowledge 
> that half of the interval it was up, then write 50%?

Exactly.  After all, that's what your data means: what percentage of time
was my device/application up.

> I plan on taking samples every minute. Rather, our system works by 
> expecting a heartbeat packet from a server once every 60 seconds, and when 
> no packet arrives for more than 70 seconds, I will record a 0 uptime at 
> that time. It will keep writing 0 every minute until the first packet 
> arrives, then it will write 100%. I will set the -step to 60 seconds as 
> well.

This means the heartbeat packets and the rrdtool database are not in sync.
That is not a problem; in fact it will probably improve accuracy.  Just don't
expect to see 100 or 0 in all time slots when you dump the database.
(Make sure you understand how rrdtool normalizes its input)
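
To make the normalization point concrete, here is a rough sketch in Python.
It is a simplified model, not rrdtool's actual code: it assumes each value
holds for the whole interval since the previous update, and it ignores
rrdtool's heartbeat and unknown handling.  The function name and the sample
timestamps are made up for illustration.

```python
def normalize(updates, step, start=0):
    """Time-weighted average per step slot (simplified GAUGE-style model).

    updates: list of (timestamp, value) sorted by timestamp.  Each value is
    assumed to hold for the interval since the previous update (or `start`).
    Returns {slot_end_time: average} for slots that are completely covered.
    """
    totals = {}
    prev = start
    for ts, value in updates:
        t = prev
        while t < ts:
            slot_end = (t // step + 1) * step   # end of the slot containing t
            seg_end = min(slot_end, ts)         # stop at slot edge or update
            totals[slot_end] = totals.get(slot_end, 0.0) + value * (seg_end - t)
            t = seg_end
        prev = ts
    return {end: total / step
            for end, total in sorted(totals.items())
            if start <= end - step and end <= prev}


# Updates arrive 40 seconds off the slot boundaries.
result = normalize([(40, 100), (100, 0), (160, 100), (220, 100)], step=60)
print({end: round(avg, 1) for end, avg in result.items()})
# {60: 66.7, 120: 33.3, 180: 100.0}
```

Although only 0 and 100 were written, the first two slots hold fractions,
because each slot mixes parts of two update intervals.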

The server sends a pulse when it starts, and then every 60 seconds.
This means you should consider the pulse to be the start of an interval.
(If you don't do it like this: reconsider).

RRDtool works differently.  An update describes the interval that ends at
its timestamp.  Therefore, you should write a 0 to your database as soon as
the first pulse comes in.  At the next pulse you write 100, and you continue
to do so until no pulse is received.  At that moment, you start writing
zeros again.  Should the
device go down (or: should the application stop) then you won't know
exactly when that happened.  It will have happened between 0 and 60
seconds after sending its last pulse.  On average this will be 30 seconds.
You could write 50 into your database with a timestamp where you expected
the pulse to arrive.  That would be a guess, so if you prefer to see
more downtime than actually happened, do write zero!  If you write 0,
you won't ever show more uptime than has actually happened.  If you
write 50, you get it right on average but you might show more uptime
than the real amount.
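
The rules above can be sketched as a small state machine.  This is only an
illustration in Python: the class and method names are hypothetical, and
detecting a missed pulse (your 70 second timeout) is assumed to happen
elsewhere.  Each method returns the (timestamp, value) pair you would hand
to rrdtool.

```python
class UptimeTracker:
    """Decide which value to record for the interval that just ended."""

    def __init__(self, first_miss_value=0):
        # No pulse seen yet, so the server is treated as down.
        self.up = False
        # 0 never overstates uptime; 50 is the "right on average" guess.
        self.first_miss_value = first_miss_value

    def on_pulse(self, ts):
        # The first pulse after downtime closes a "down" interval: write 0.
        # Every following pulse closes an "up" interval: write 100.
        value = 100 if self.up else 0
        self.up = True
        return (ts, value)

    def on_missed_pulse(self, expected_ts):
        # Timestamp the update where the pulse was expected to arrive.
        # Only the first miss may get the guess; later misses are plain 0.
        value = self.first_miss_value if self.up else 0
        self.up = False
        return (expected_ts, value)


tracker = UptimeTracker(first_miss_value=50)
print(tracker.on_pulse(0))            # (0, 0)
print(tracker.on_pulse(60))           # (60, 100)
print(tracker.on_missed_pulse(120))   # (120, 50)
print(tracker.on_missed_pulse(180))   # (180, 0)
print(tracker.on_pulse(200))          # (200, 0)
```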

You don't even need to write zero every 60 seconds.  If you set step to
60 and set heartbeat to a huge number, rrdtool will patiently wait.  As
soon as the first pulse comes in again, you write 0 to rrdtool and it
will fill every timeslot between the last update and the current update.
That is the correct thing to do, as the server was down up to that moment.

You will want to write zero every now and then for two reasons:
-1- if the downtime is more than rrdtool's heartbeat, you'd get unknowns.
-2- if you create a graph from your rrd during downtime, you will see
    unknowns between the last time you updated and time "now".
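
For illustration, a step of 60 with a large heartbeat could be set up
roughly like this.  The sketch only builds the rrdtool command lines in
Python.  The file name, the data source name "uptime", the 86400 second
heartbeat and the 43200-row archive (30 days of minutes) are assumptions
picked for the example, not recommendations.

```python
def create_args(path, step=60, heartbeat=86400, rows=43200):
    """Argument list for `rrdtool create` with one GAUGE data source."""
    return [
        "rrdtool", "create", path,
        "--step", str(step),
        # DS:name:type:heartbeat:min:max
        f"DS:uptime:GAUGE:{heartbeat}:0:100",
        # RRA:CF:xff:steps:rows, one row per step
        f"RRA:AVERAGE:0.5:1:{rows}",
    ]


def update_args(path, ts, value):
    """Argument list for `rrdtool update` with an explicit timestamp."""
    return ["rrdtool", "update", path, f"{ts}:{value}"]


print(" ".join(create_args("uptime.rrd")))
print(" ".join(update_args("uptime.rrd", 1127298187, 0)))
# With rrdtool installed, subprocess.run(args, check=True) would execute
# either list.
```

With a heartbeat of one day, a single 0 written when the server comes back
fills at most a day of slots; anything longer still needs the occasional
zero mentioned in the first reason above.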

> Since I'm looking at uptime over last 30 days or more, I think this should 
> be fairly accurate without having to adjust for fractions of a minute 
> here and there?

Only you can decide that.  Looking at 30 days just means you have 43200
chances to get it wrong (or right).  What matters is the chance per
minute to get it wrong.

Suppose you see a pulse after 20 seconds instead of 60 seconds.  My
guess is that this means downtime of 50% during 20 seconds.  Think a
while about it.  20 is not a mistake (it shouldn't be 30 to get 50%).
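
One way to check that reasoning, as a sketch in Python: assume the early
pulse marks the moment the server came back, and that it went down at some
uniformly distributed moment since the previous pulse.  Both are
assumptions of this example, and the function name is made up.

```python
def average_uptime_percent(gap, samples=1000):
    """Average uptime over the `gap` seconds between two pulses."""
    total = 0.0
    for i in range(samples):
        # Candidate moment of failure, evenly spread over the gap.
        death = (i + 0.5) * gap / samples
        # The server was up from the previous pulse until `death`.
        total += death / gap
    return 100.0 * total / samples


print(round(average_uptime_percent(20), 6))   # 50.0
print(round(average_uptime_percent(45), 6))   # 50.0
```

The average is 50% of the gap whatever the gap is, so writing 50 at the
early pulse's timestamp covers exactly those 20 seconds.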

Alex
