[rrd-developers] rrd cache work

Mon Sep 1 11:09:42 CEST 2008

On Fri, Aug 29, 2008 at 10:45:32PM +0200, Florian Forster wrote:
> > What's the current status of the project?  This project is of interest
> > to me; I am trying to scale our RRD installation up by at least an
> > order of magnitude.  I could contribute to the development if you are
> > interested..
> 
> If you want to contribute further changes I'd love to include them in
> the patch. If you have questions regarding the code, don't hesitate to
> ask.

octo,

I am working on a few changes...  Once I have completed testing I will
submit the patch back to you and tobi.

-----------------------------------------------------------------

These two pseudo-code fragments are repeated a lot, especially in places
that need to pre-flush their files (e.g. info, last, lastupdate, ...):

  if (! --daemon command line) {
    check environment for $RRDCACHED_ADDRESS
  }

  if (have daemon) {
    connect;
    flush;
    disconnect;
  }

I have combined them into a single call that handles both.  This reduces
the code in the simpler functions quite a bit:

  rrdc_flush_if_daemon(opt_daemon, filename);
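
A minimal sketch of how such a combined call could look, assuming the
two fragments above are simply merged.  The stub_* functions and their
counters are hypothetical stand-ins for the real connect/flush/disconnect
calls, so this only illustrates the control flow and is not the actual
patch:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for the real client calls: they only record
 * what was requested so the control flow can be observed. */
static int  n_connect, n_flush, n_disconnect;
static char last_addr[128];

static int stub_connect(const char *addr)
{
    strncpy(last_addr, addr, sizeof(last_addr) - 1);
    last_addr[sizeof(last_addr) - 1] = '\0';
    n_connect++;
    return 0;
}

static int stub_flush(const char *filename)
{
    (void) filename;
    n_flush++;
    return 0;
}

static void stub_disconnect(void)
{
    n_disconnect++;
}

/* Flush `filename` through the daemon if one is configured.
 * Returns 0 when there is no daemon or when the flush succeeded. */
int rrdc_flush_if_daemon(const char *opt_daemon, const char *filename)
{
    int status;

    assert(filename != NULL);

    /* No --daemon option given: fall back to the environment. */
    if (opt_daemon == NULL)
        opt_daemon = getenv("RRDCACHED_ADDRESS");

    /* No daemon configured at all: nothing to flush. */
    if (opt_daemon == NULL || *opt_daemon == '\0')
        return 0;

    status = stub_connect(opt_daemon);
    if (status != 0)
        return status;

    status = stub_flush(filename);
    stub_disconnect();
    return status;
}
```

A caller such as info or last would then only need the one line shown
above before opening the file.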

-----------------------------------------------------------------

Also, it appears that the update strings are passed directly through the
daemon without modification.  However, when we see an update like
"N:1:3:5:7:9", the time that ultimately gets passed through to _rrd_update
should be based on when the "N:" was received at the daemon, not when it
was flushed out to disk.  Likewise, @-time is broken.

I think the most sensible course of action would be to convert the
relative time strings to absolute time strings when we receive the UPDATE
commands from the client.

I envision re-using get_time_from_reading().  However, I'd move the things
that sanity-check against the RRD (starting with "if (version < 3)") into
parse_ds (which is the only other caller of get_time_from_reading()).
Then, we can use this function in the daemon without prior knowledge of
the RRD's structure.
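
To illustrate the intended effect only (this is not the
get_time_from_reading() refactoring itself), here is a small sketch that
rewrites an "N:" update into an absolute-time update at the moment it is
received.  The function name absolutize_update is hypothetical, and
@-style times are deliberately left out:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Rewrite one update value such as "N:1:3:5" so that it carries the
 * absolute receive time.  Values that do not start with "N:" are copied
 * unchanged.  Returns 0 on success, -1 if `out` is too small. */
int absolutize_update(const char *value, time_t received,
                      char *out, size_t out_size)
{
    int len;

    assert(value != NULL && out != NULL);

    if (strncmp(value, "N:", 2) == 0)
        len = snprintf(out, out_size, "%ld:%s", (long) received, value + 2);
    else
        len = snprintf(out, out_size, "%s", value);

    return (len < 0 || (size_t) len >= out_size) ? -1 : 0;
}
```

With this, the timestamp stored with the queued value is the receive
time, so a later flush no longer changes what "N" meant.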

-----------------------------------------------------------------

In my environment, I have a process that calculates a very large number
of RRD updates (1 per file), and then issues them in a loop with
RRDs::update (Perl).  With the current code, this will create a lot of
unnecessary connect/disconnect cycles.

I noticed that some of the other methods that deal with several RRDs
(e.g. graph, xport) re-use a single connection.  I am wondering if we
should generalize this approach as follows:

 * keep a single (scoped global) cached fd and addr string in rrd_client.c

 * in rrdc_connect(), detect whether we're trying to re-use the cached
   string.  If so, check the cached fd and return.  If it's a new daemon
   address, then close the old connection and replace it.  A single
   daemon is the most common use case.

 * in most functions, no longer explicitly call rrdc_disconnect.

 * install an atexit() handler to disconnect from any remaining cached
   entry.  (what about long lived processes that only use the daemon
   sparingly?)

Let me know what you think on this one.  I'm guessing plenty of other
environments issue a lot of UPDATEs from a single process.
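
The caching scheme in the list above could be sketched roughly like this.
All names (cached_connect, cached_disconnect, the stub_* transport
functions) are hypothetical, the "check cached fd" step is reduced to
testing that a descriptor is present, and the sketch is not thread-safe:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical transport stubs; a real client would open and close a
 * socket here.  They count calls so the reuse can be observed. */
static int n_opens, n_closes;

static int stub_open(const char *addr)
{
    (void) addr;
    return 100 + n_opens++;
}

static void stub_close(int fd)
{
    (void) fd;
    n_closes++;
}

/* File-scoped cache: one descriptor and the address it belongs to. */
static int   cached_fd = -1;
static char *cached_addr = NULL;
static int   atexit_installed = 0;

/* Drop the cached connection, if any.  Safe to call repeatedly. */
void cached_disconnect(void)
{
    if (cached_fd >= 0)
        stub_close(cached_fd);
    cached_fd = -1;
    free(cached_addr);
    cached_addr = NULL;
}

/* Return a descriptor for `addr`, reusing the cached one when the
 * address matches.  Returns -1 on failure. */
int cached_connect(const char *addr)
{
    assert(addr != NULL);

    /* Same daemon as last time and a descriptor is cached: reuse it. */
    if (cached_fd >= 0 && cached_addr != NULL
        && strcmp(cached_addr, addr) == 0)
        return cached_fd;

    /* Different daemon (or nothing cached): drop the old entry first. */
    cached_disconnect();

    cached_addr = malloc(strlen(addr) + 1);
    if (cached_addr == NULL)
        return -1;
    strcpy(cached_addr, addr);

    cached_fd = stub_open(addr);
    if (cached_fd < 0) {
        cached_disconnect();
        return -1;
    }

    /* Close whatever is still cached when the process exits. */
    if (!atexit_installed) {
        atexit(cached_disconnect);
        atexit_installed = 1;
    }
    return cached_fd;
}
```

A loop issuing many updates to the same daemon would then pay for one
connect only; the open question about long-lived processes that rarely
use the daemon is not addressed by this sketch.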

-----------------------------------------------------------------

Lots of updates are going to be queued at (0 mod config_flush_interval),
so their flushes will all come due at the same moment.  This risks
creating the "thundering herd" problem that we're trying to avoid with
delayed updates (albeit less frequently).  When creating a new
cache_item_t in handle_request_update, we should skew the time as follows:

  ci->last_flush_time = now + random() % config_flush_interval;
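
As a standalone illustration of that one-liner, here is a hedged sketch
with the skew pulled out into a helper.  The name skewed_flush_time is
hypothetical; the arithmetic is the same as above, using POSIX random():

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Pick the initial last_flush_time for a new cache entry so that
 * entries created at the same moment do not all come due together.
 * The result lies in [now, now + flush_interval). */
time_t skewed_flush_time(time_t now, int flush_interval)
{
    assert(flush_interval > 0);
    return now + (time_t) (random() % flush_interval);
}
```

Each new entry gets its own offset, so entries created in the same
second are spread across one full flush interval instead of expiring
together.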

-----------------------------------------------------------------

Let me know what you think.

-- 
 kevin brintnall =~ /kbrint at rufus.net/