[rrd-developers] rrdcached shutdown

Thu Sep 25 19:11:41 CEST 2008

On Thu, Sep 25, 2008 at 06:25:44PM +0200, Sebastian Harl wrote:
> Imho, the init script should take care of that and wait for the daemon
> to shut down completely. The same applies for the rrdtool plugin of
> collectd as well (in fact, the caching code in collectd is very similar
> to the rrdcached implementation; it was written by the same person after
> all ;-)) - see [1] for the Debian collectd init script that handles long
> shutdown times.

Sebastian et al,

I'm operating under the following assumptions:

  (1) people will run rrdcached because they need to handle more RRDs than
      their hardware would otherwise allow.

  (2) delay in getting bits on disk is an acceptable trade-off for (1) as
      long as no data is lost.

  (3) sites such as these are generally intolerant of data loss.

The delay introduced by a full flush-out at shutdown may be intolerably
long.  In my environment it takes 15+ minutes.  During this time, no new
updates can be accepted.  Information is lost.

If this is due to system shutdown, any other services which have already
been successfully shut down by the normal process are unavailable while we
wait for rrdcached to die.  Also, rrdcached doesn't have any control over
whether the O/S even waits for it to shut down.

> > > RRDs out to disk..  When the daemon starts back up it can re-create its
> > > memory state with the journal.
> 
> I don't think this is a good idea. When shutting down the daemon, I'd
> expect it to finish its job - e.g. I might not want to restart the
> daemon, so I would lose data in that case. I agree that this is probably
> a very uncommon case but I'm sure there are quite a few other examples
> and I don't want to risk data loss even in very uncommon situations.

I think it makes more sense to focus on how quickly the daemon can return
to service with no data loss.  If we need to reboot for some reason, and
the daemon is blocking shutdown for 20 minutes, that's a problem.
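
To make the journal idea concrete, here is a rough sketch of how a
restarted daemon might rebuild its in-memory state.  It is only an
illustration: the "update <file> <value>" line format and the names
journal_append and replay_journal are assumptions made up for this
example, not rrdcached's actual journal format or code.  It also assumes
file names contain no whitespace.

```python
import os


def journal_append(journal, rrd_file, value):
    """Append one pending update and force it to disk before returning."""
    # Hypothetical line format: "update <file> <timestamp>:<value>"
    journal.write(f"update {rrd_file} {value}\n")
    journal.flush()
    os.fsync(journal.fileno())


def replay_journal(path):
    """Rebuild the in-memory cache (file -> pending values) from the journal."""
    cache = {}
    with open(path) as journal:
        for line in journal:
            parts = line.split()
            # Skip anything that is not a complete update record,
            # e.g. a line truncated by a crash.
            if len(parts) != 3 or parts[0] != "update":
                continue
            _, rrd_file, value = parts
            cache.setdefault(rrd_file, []).append(value)
    return cache
```

With something like this, an expedited shutdown only has to make sure the
journal is on disk; the pending updates are recovered by replay_journal
at the next start rather than being written to every RRD before exit.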

> > > What do you think about an expedited shutdown if we are journaling
> > > updates?  We could simply flush the journal and exit.
> > 
> > this would make sense to me ... maybe have different behaviour
> > depending on the signal it gets ?
> 
> That won't work in this case. You cannot catch SIGKILL.

We could catch SIGTERM for expedited shutdown and SIGINT for full-flush
shutdown?  Then, each operator can decide which makes the most sense.
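
As a rough sketch of that split, assuming nothing about how rrdcached
actually handles signals: the handler below only records which kind of
shutdown was requested, and the main loop would act on it later.  The
names EXPEDITED, FULL_FLUSH, state and request_shutdown are invented for
this example.

```python
import signal

# Hypothetical mode names, not part of rrdcached.
EXPEDITED = "expedited"    # flush the journal only, then exit
FULL_FLUSH = "full-flush"  # write every cached update to its RRD, then exit

# Proposed mapping from this thread: SIGTERM -> expedited, SIGINT -> full.
SIGNAL_MODES = {
    signal.SIGTERM: EXPEDITED,
    signal.SIGINT: FULL_FLUSH,
}

# Shared state read by the main loop; None means "keep running".
state = {"mode": None}


def request_shutdown(signum, frame):
    """Signal handler: remember the requested mode and return quickly."""
    state["mode"] = SIGNAL_MODES[signum]


def install_handlers():
    """Register the same handler for both shutdown signals."""
    for signum in SIGNAL_MODES:
        signal.signal(signum, request_shutdown)
```

SIGKILL cannot be part of this table since it cannot be caught, which is
why an init script that escalates to SIGKILL after a timeout would still
depend on the journal for recovery.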

-- 
 kevin brintnall =~ /kbrint at rufus.net/