[rrd-developers] rrdcached + collectd issues

Sat Oct 10 13:07:31 CEST 2009

Hi Thorsten,

On Fri, Oct 09, 2009 at 04:41:55PM -0700, Thorsten von Eicken wrote:
> > This sounds like collectd not sending updates to rrdcached.  If they
> > are not in the journal, then rrdcached has not received them.
> 
> Yes, the question is whether it's collectd's fault or rrdcached's
> fault..

Assuming you're receiving those values via collectd's Network plugin,
this is what's going on:

  * There are two threads handling incoming network traffic. The first
    reads the packets from a network socket and appends them to a linked
    list.

  * The second thread parses the data and dispatches the included data
    to the daemon, resulting in roughly 20 “value lists” per packet.

  * The dispatch thread will call “rrdc_update” within the rrdcached
    plugin, resulting in a single update instruction being sent to
    RRDCacheD. The call returns after a status has been returned by the
    daemon.
    (This is what Kevin meant by the BATCH operation, where this call
    would return immediately without waiting for a status to be
    returned.)

If RRDCacheD takes too long to answer, the dispatch thread will wait
there and not dequeue any more values from that queue of received and
unparsed packets. If this is the case, you should see some (linear?)
memory growth of the collectd process. You can also try to forcibly quit
collectd (kill -9) and immediately restart it. If the data that the RRD
files were lagging behind on is simply lost, this is an indication that
the data was sitting within collectd, waiting to be sent to RRDCacheD.
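
As a rough illustration of why the memory growth would be linear, here is
a toy model of that queue. It is not collectd code: the function name and
the per-second rates are made up for the example, and the queue is assumed
to be unbounded and to start empty.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model (not collectd code): returns the number of packets still
 * queued after `seconds` seconds, given hypothetical arrival and drain
 * rates in packets per second. */
size_t backlog_after(size_t arrival_rate, size_t drain_rate, size_t seconds)
{
  /* Guard against overflowing the counter in this toy model. */
  assert(arrival_rate == 0 || seconds <= SIZE_MAX / arrival_rate);

  size_t queued = 0;

  for (size_t t = 0; t < seconds; t++)
  {
    /* The receive thread appends everything that arrived this second. */
    queued += arrival_rate;
    /* The dispatch thread removes at most drain_rate packets per second. */
    queued -= (queued < drain_rate) ? queued : drain_rate;
  }
  return queued;
}
```

With 3000 packets arriving per second and only 2500 being dispatched, the
model leaves 500 more packets in the queue every second, so the backlog
(and with it the memory use) grows in a straight line. If the drain rate
matches or exceeds the arrival rate, the queue stays empty.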

(It's not yet possible to “watch” the length of this queue directly.
I'll add some measurements to the Network plugin so we can see what's
going on eventually …)

> Yeah, as I mentioned above, I'm very steady at 3k updates received per
> sec.

Are you talking about network packets or separate updates here?
Depending on your data, every packet can contain about 20–30 separate
values, so the difference is significant ;)

> > > I/O is not a problem as I mentioned, it's pure CPU. I've compiled
> > > rrdcached with -pg to get gprof output, but haven't been
> > > successful.

*The* data structure within RRDCacheD that is *supposed* to grow as more
data is to be cached is “cache_tree”. So *the* call that is supposed to
be the limiting factor is this line within “handle_request_update”:

  ci = g_tree_lookup (cache_tree, file);

(I'm talking about normal operation, of course. Replaying the journal is
special.)

> >>   %   cumulative   self              self     total
> >>  time   seconds   seconds    calls   s/call   s/call  name
> >>  55.12     62.39    62.39 280843249     0.00     0.00  buffer_get_field

If the CPU really was busy for 250 minutes (and not stuck doing I/O),
then only about 62.39/15000 (~0.42 %) of the time was spent in
“buffer_get_field”. It might be possible to optimize that function
further, but I don't think it's worth it. The real bottleneck is
probably somewhere else.
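
For reference, the arithmetic behind that estimate can be written down as
two small helpers. The function names are made up for this example; the
inputs are the self time and call count from the gprof line quoted above,
plus the assumed 250 minutes (15000 seconds) of CPU time.

```c
#include <assert.h>

/* Share of the total CPU time spent in one function, in percent. */
double self_time_percent(double self_seconds, double total_cpu_seconds)
{
  assert(total_cpu_seconds > 0.0);
  return 100.0 * self_seconds / total_cpu_seconds;
}

/* Average self time per call, in nanoseconds. */
double self_ns_per_call(double self_seconds, double calls)
{
  assert(calls > 0.0);
  return 1e9 * self_seconds / calls;
}
```

Calling self_time_percent(62.39, 250.0 * 60.0) gives roughly 0.42, and
self_ns_per_call(62.39, 280843249.0) gives a bit over 222 ns per call.
Both numbers only hold if the 250 minutes really were CPU time.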

One possible way to make this faster is to use something like:

  /* strcspn returns a length, so add it to the start of the buffer */
  ptr = buffer + strcspn (buffer, "\\ ");
  if (*ptr == ' ')
    /* Normal case: Field without backslash */
  else
    /* More complex escape sequence handling */
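
A more complete sketch of the same idea could look like the following.
This is not the actual “buffer_get_field” from RRDCacheD: the function
name, its signature, and the escaping rule (a backslash makes the next
character literal) are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not the real buffer_get_field(): copies the first
 * space-separated field of `buffer` into `out`, treating a backslash as
 * "take the next character literally". Returns the number of input bytes
 * consumed, or 0 if the field does not fit into `out`. */
size_t get_field_sketch(const char *buffer, char *out, size_t out_size)
{
  assert(buffer != NULL && out != NULL && out_size > 0);

  /* Fast path: scan to the first space, backslash, or end of string. */
  size_t len = strcspn(buffer, "\\ ");
  if (buffer[len] != '\\')
  {
    if (len >= out_size)
      return 0;
    memcpy(out, buffer, len);
    out[len] = '\0';
    return len + (buffer[len] == ' '); /* also consume the separator */
  }

  /* Slow path: the field contains an escape, so copy byte by byte. */
  size_t in = 0, o = 0;
  while (buffer[in] != '\0' && buffer[in] != ' ')
  {
    if (buffer[in] == '\\' && buffer[in + 1] != '\0')
      in++; /* skip the backslash, keep the escaped character */
    if (o + 1 >= out_size)
      return 0;
    out[o++] = buffer[in++];
  }
  out[o] = '\0';
  return in + (buffer[in] == ' ');
}
```

For a field without a backslash, the whole job is one strcspn call and one
memcpy; only fields that actually contain an escape fall back to the
per-character loop.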

I would be surprised if this was the cause of those performance issues,
though. Looking at the code, it looks like the textbook case for branch
prediction, something modern CPUs are *very* good at …

Overall I have the feeling that the update command is slower than
expected – at least this would explain your issues. It'd be best if you
could try to get some reliable profiling data. Without it, optimization
does not make much sense :/

Regards,
—octo
-- 
Florian octo Forster
Hacker in training
GnuPG: 0x91523C3D
http://verplant.org/