<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">

</head>

<body bgcolor="#ffffff" text="#000000">

Florian Forster wrote:

<blockquote cite="mid:20091010110731.GM11119@verplant.org" type="cite">

  <pre wrap="">Hi Thorsten,

On Fri, Oct 09, 2009 at 04:41:55PM -0700, Thorsten von Eicken wrote:

  </pre>

  <blockquote type="cite">

    <blockquote type="cite">

      <pre wrap="">This sounds like collectd not sending updates to rrdcached.  If they

are not in the journal, then rrdcached has not received them.

      </pre>

    </blockquote>

    <pre wrap="">Yes, the question is whether it's collectd's fault or rrdcached's

fault..

    </pre>

  </blockquote>

  <pre wrap="">

assuming you're receiving those values via collectd's Network plugin,

this is what's going on:

  * There are two threads handling incoming network traffic. The first

    reads the packets from a network socket and appends them to a linked

    list.

  * The second thread parses the data and dispatches the included data

    to the daemon, resulting in roughly 20 &#8220;value lists&#8221; per packet.

  * The dispatch thread will call &#8220;rrdc_update&#8221; within the rrdcached

    plugin, resulting in a single update instruction being sent to

    RRDCacheD. The call returns after a status has been returned by the

    daemon.

    (This is what Kevin meant with the BATCH operation, where this call

    would return immediately without waiting for a status to be

    returned.)

If RRDCacheD takes too long to answer, the dispatch thread will wait

there and not dequeue any more values from that queue of received and

unparsed packets. If this is the case, you should see some (linear?)

memory growth of the collectd process. You can also try to forcibly quit

collectd (kill -9) and immediately restart collectd. If the data the RRD

files were lagging behind on is simply lost, this is an indication of the

data being within collectd and waiting to be sent to RRDCacheD.

(It's not yet possible to &#8220;watch&#8221; the length of this queue directly.

I'll add some measurements to the Network plugin so we can see what's

going on eventually &#8230;)

  </pre>

</blockquote>

Yes, this description fits. When rrdcached hits 100% CPU,

collectd's memory size starts increasing linearly.<br>

<br>
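A minimal sketch of the queueing behavior described in the quoted message, assuming a fixed number of packets arrives per tick and the dispatch side completes a fixed number of synchronous updates per tick. The name simulate_backlog and all rates are made up for illustration; this is not collectd code.<br>
<br>
<pre wrap="">
```python
from collections import deque


def simulate_backlog(ticks, packets_in_per_tick, packets_out_per_tick):
    """Return the receive-queue length after each tick (hypothetical model)."""
    queue = deque()
    lengths = []
    for tick in range(ticks):
        # Receive thread: appends every incoming packet and never blocks.
        for n in range(packets_in_per_tick):
            queue.append((tick, n))
        # Dispatch thread: each update call waits for a status, so only a
        # limited number of packets can be dequeued per tick (assumed rate).
        for _ in range(min(packets_out_per_tick, len(queue))):
            queue.popleft()
        lengths.append(len(queue))
    return lengths


# Dispatcher keeps up: the backlog stays empty.
healthy = simulate_backlog(5, 10, 10)
# Dispatcher stalls on slow replies: the backlog grows every tick.
stalled = simulate_backlog(5, 10, 4)
print(healthy)
print(stalled)
```
</pre>
With equal rates the backlog stays at zero. With the dispatcher handling only 4 of 10 packets per tick it grows by 6 each tick, which would show up as the linear memory growth seen here.<br>
<br>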

I also made progress in diagnosing rrdcached's performance issues. I

had 10 queue threads before (-t 10) and I now reduced it to 2 (-t 2).

It now behaves a lot better, so I suspect there was a lot of lock

contention going on. I don't see any performance impact and in fact a

"FLUSHALL" seems to go faster than with 10 threads. But there are still

some funny effects: the FLUSHALL created a queue of length 23000. It

went down to ~5000 in 3 minutes at which point it saturated the disk

controller cache and proceeded at a somewhat slower pace. Maybe a

single queue thread could do it too. The overall cpu load now swings

between 20-40%, depending on the flush rate (-w 3600 -z 3600 -f 7200).<br>

<br>
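As a rough check of the drain rate implied by the numbers above, here is a small calculation. It only uses the figures from this message, and since the 5000 is approximate the result is approximate too.<br>
<br>
<pre wrap="">
```python
# Figures from the message: queue length right after FLUSHALL, the
# approximate length three minutes later, and the elapsed time in seconds.
start_len = 23000
later_len = 5000
elapsed_s = 3 * 60

drained = start_len - later_len
rate_per_s = drained / elapsed_s
print(drained, rate_per_s)
```
</pre>
That works out to roughly 100 queue entries per second during the first three minutes, before the disk controller cache saturated.<br>
<br>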

BTW: I tried callgrind, haha, it's way too slow.<br>

<br>

Thorsten<br>

<br>

</body>

</html>