<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">

</head>

<body bgcolor="#ffffff" text="#000000">

Long list of observations and thoughts below...<br>

<br>

Florian Forster wrote:

<blockquote cite="mid:20091010110731.GM11119@verplant.org" type="cite">

  <pre wrap="">Hi Thorsten,

On Fri, Oct 09, 2009 at 04:41:55PM -0700, Thorsten von Eicken wrote:

  </pre>

  <blockquote type="cite">

    <blockquote type="cite">

      <pre wrap="">This sounds like collectd not sending updates to rrdcached.  If they

are not in the journal, then rrdcached has not received them.

      </pre>

    </blockquote>

    <pre wrap="">Yes, the question is whether it's collectd's fault or rrdcached's

fault..

    </pre>

  </blockquote>

  <pre wrap="">

If RRDCacheD takes too long to answer, the dispatch thread will wait

there and not dequeue any more values from that queue of received and

unparsed packets. If this is the cache, you should see some (linear?)

memory growth of the collectd process. You can also try to forcibly quit

collectd (kill -9) and immediately restart collectd. If the data RRD

files were lagging behind is simply lost, this is a indication of the

data being within collectd and waiting to be sent to RRDCacheD.

(It's not yet possible to &#8220;watch&#8221; the length of this queue directly.

I'll add some measurements to the Network plugin so we can see what's

going on eventually &#8230;)

  </pre>

</blockquote>

The linear memory growth is very clear. However, there are a number of

things that still bug me:<br>

<br>

&nbsp;- collectd+rrdcached were running steady processing ~25'000 tree nodes

with ~2'500 updates per second (rrdcached's UpdatesReceived stats

counter). I then threw another ~30'000 tree nodes with ~3'000 updates

per second at it (this is all real traffic, not a simulation). Due to

the way we deal with the creation of the required new rrds this caused

very heavy disk activity for a while slowing down collectd and

rrdcached so collectd started buffering for ~15 minutes, during which

time it grew from ~40MB to just under 300MB, all good and expected so

far. It then stayed steady at that size and judging by the rrdcached

UpdatesReceived it must have been able to clear its backlog. Then I

threw yet another 30'000 tree nodes and corresponding updates at it. At

that point, collectd started immediately to grow again linearly to over

600MB. Given that it has more traffic coming at it I expect it to grow

larger buffers than previously, but what bothered me is that it started

to grow immediately. It's as if the previous 250MB of buffers hadn't

been freed (in the malloc sense, I understand that the process size

isn't going to shrink). Could it be that there is a bug?<br>

<br>

&nbsp;- if rrdcached is restarted, collectd doesn't reconnect. I know this

is the case for TCP sockets but I'm pretty sure I observed it using the

unix socket too. This is a problem because restarting collectd loses

the data it has buffered while rrdcached was down.<br>

<br>

&nbsp;- the -z parameter is nice, but not quite there yet. I'm running with

-w 3600 -z 3600 and the situation after the first hour is not pretty

with a ton of flushes followed by a lull and a repeat after another

hour. It takes about 4 hours before everything stabilizes and becomes

smooth. I'm wondering whether it would be difficult to change to an

adaptive rate system, where given a -w 3600 and the current number of

dirty tree nodes rrdcached computes the rate at which it needs to flush

to disk and then does that. If you think about it, within one

collection interval (20s in my case) it would know the total set of

RRDs (tree nodes) and they all would be dirty. In my case it would

periodically compute the ratio (e.g. 25'000 tree nodes to flush over

3600 seconds = 6.9 flushes per second) and would start flushing the

oldest dirty nodes immediately even though they've been dirty for much

less than 3600 seconds. Of course rrdcached would need to re-evaluate

the flush rate periodically, but if it keeps a running counter of dirty

tree nodes that should be pretty easy. All this should put the daemon

into a steady state from the very beginning.<br>

<br>
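As a rough sketch of the rate computation described above (Python; the
function names and the per-tick "carry" bookkeeping are hypothetical and
not taken from rrdcached, which would have to do this in C inside its
flush thread):<br>
<pre wrap="">
```python
def flush_rate(dirty_nodes: int, write_interval_s: float) -> float:
    """Flushes per second needed to write every dirty node once per interval."""
    if not write_interval_s > 0:
        raise ValueError("write interval must be positive")
    return dirty_nodes / write_interval_s


def flushes_due(dirty_nodes, write_interval_s, elapsed_s, carry=0.0):
    """Return (whole flushes to do now, fractional remainder to carry over).

    Meant to be called periodically with the current dirty-node counter,
    so the rate is re-evaluated on every tick.
    """
    budget = carry + flush_rate(dirty_nodes, write_interval_s) * elapsed_s
    whole = int(budget)
    # Never ask for more flushes than there are dirty nodes.
    return min(whole, dirty_nodes), budget - whole
```
</pre>
With the numbers from my setup, flush_rate(25000, 3600) comes out at
roughly 6.9 flushes per second. Carrying the fractional remainder from
one tick to the next keeps the long-run rate on target even though each
tick can only flush a whole number of nodes.<br>
<br>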

&nbsp;- running with 80-90k tree nodes for a while ended up bringing

rrdcached to its knees. What I observe is that over time rrdcached uses

more and more cpu and starts seeing page faults. Eventually, rrdcached

comes to a crawl and neither keeps up with the input (so collectd

starts growing) nor manages to maintain its write-rate. The page faults

are interesting because no swap space is used (it stays at 64k usage,

which is the initial state). The only explanation I've come up with is

that at the point where the "working set" of all the RRDs exceeds the

amount of memory available (I have 8GB) everything starts degrading. At

that point, rrdcached fights against the buffer cache and starts seeing

page faults. Its write threads also slow down because now the disk is

not just being written but also read (I can see that happening). I

assume that once it page-faults the whole process slows down meaning

that not just the queue threads but also the connection threads start

slowing down, which then causes collectd to start buffering data and

grow -- it grew to &gt;2GB for me! That now puts more pressure on

memory and we're in a downward spiral. It's not yet clear to me whether

the disk used for RRDs is maxed out when this process starts

(eventually it does max out), so I don't know whether I'm hitting a

hard disk I/O limit or whether I just spiral into it by successively

reducing the amount of buffer cache available. I suspect it would be

possible to push the system further if the various rrdcached threads

could be decoupled better. Also, being able to put an upper bound on

collectd memory would be smart 'cause it's clear that at some point the

growth becomes self-defeating. It could randomly drop samples when it

hits the limit and that would probably lead to an overall happier

outcome.<br>

<br>
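To illustrate the bounded-buffer idea, here is a minimal sketch (Python;
the class and its fields are hypothetical, and it caps the number of
queued samples as a stand-in for a real memory limit, which is an
assumption on my part and not how collectd works today):<br>
<pre wrap="">
```python
import random
from collections import deque


class BoundedSampleBuffer:
    """FIFO of pending samples that evicts a random entry once it is full."""

    def __init__(self, max_samples, rng=None):
        if not max_samples > 0:
            raise ValueError("max_samples must be positive")
        self.max_samples = max_samples
        self.samples = deque()
        self.dropped = 0  # how many samples were sacrificed so far
        self._rng = rng or random.Random()

    def push(self, sample):
        """Queue a sample, dropping one random queued sample if at the limit."""
        if len(self.samples) >= self.max_samples:
            victim = self._rng.randrange(len(self.samples))
            del self.samples[victim]
            self.dropped += 1
        self.samples.append(sample)

    def pop(self):
        """Return the oldest queued sample, or None if the buffer is empty."""
        return self.samples.popleft() if self.samples else None
```
</pre>
Dropping a random queued sample rather than always the oldest or the
newest spreads the loss across all RRDs instead of punching a hole into
one time range, and the dropped counter makes the loss visible.<br>
<br>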

&nbsp;- I'm wondering how we could overcome the RRD working set issue. Even

with rrdcached and long cache periods (e.g. I use 1 hour) it seems that

the system comes to a crawl if the RRD working set exceeds memory. One

idea that came to mind is to use the caching in rrdcached to convert

the random small writes that are typical for RRDs to more of a

sequential access pattern. If we could tweak the RRD creation and the

cache write-back algorithm such that RRDs are always accessed in the

same order, and we manage to get the RRDs allocated on disk in that

order, then we could use the cache to essentially do one sweep through

the disk per cache flush period (e.g. per hour in my case). Of course

on-demand flushes and other things would interrupt this sweep, but the

bulk of accesses could end up being more or less sequential. I believe

that doing the cache write-back in a specific order is not too

difficult, what I'm not sure of is how to make it such that the RRD

files get allocated on disk in that order too. Any thoughts?<br>

<br>
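The write-back half of this could look roughly like the following sketch
(Python; the function name is hypothetical, and sorting by file path is
only an assumed stand-in for "the order the RRDs sit on disk", which is
exactly the part I don't know how to guarantee):<br>
<pre wrap="">
```python
def sweep_order(dirty_paths, last_flushed=None):
    """Return dirty RRD paths in one fixed global order.

    If last_flushed is given, the sweep resumes just after that path and
    wraps around, so an interruption does not restart it from the top.
    """
    ordered = sorted(dirty_paths)
    if last_flushed is None:
        return ordered
    after = [p for p in ordered if p > last_flushed]
    before = [p for p in ordered if not p > last_flushed]
    return after + before
```
</pre>
For example, with a.rrd, b.rrd and c.rrd all dirty and a.rrd flushed
last, the next sweep visits b.rrd, then c.rrd, then a.rrd again.
On-demand flushes would simply jump the queue and the sweep would carry
on from where it was.<br>
<br>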

Cheers,<br>

Thorsten<br>

</body>

</html>