<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
</head>
<body bgcolor="#ffffff" text="#000000">
Long list of observations and thoughts below...<br>
<br>
Florian Forster wrote:
<blockquote cite="mid:20091010110731.GM11119@verplant.org" type="cite">
<pre wrap="">Hi Thorsten,
On Fri, Oct 09, 2009 at 04:41:55PM -0700, Thorsten von Eicken wrote:
</pre>
<blockquote type="cite">
<blockquote type="cite">
<pre wrap="">This sounds like collectd not sending updates to rrdcached. If they
are not in the journal, then rrdcached has not received them.
</pre>
</blockquote>
<pre wrap="">Yes, the question is whether it's collectd's fault or rrdcached's
fault..
</pre>
</blockquote>
<pre wrap=""><!---->
If RRDCacheD takes too long to answer, the dispatch thread will wait
there and not dequeue any more values from that queue of received and
unparsed packets. If this is the cache, you should see some (linear?)
memory growth of the collectd process. You can also try to forcibly quit
collectd (kill -9) and immediately restart collectd. If the data RRD
files were lagging behind is simply lost, this is a indication of the
data being within collectd and waiting to be sent to RRDCacheD.
(It's not yet possible to “watch” the length of this queue directly.
I'll add some measurements to the Network plugin so we can see what's
going on eventually …)
</pre>
</blockquote>
The linear memory growth is very clear. However, there are a number of
things that still bug me:<br>
<br>
- collectd+rrdcached were running steadily, processing ~25'000 tree
nodes with ~2'500 updates per second (rrdcached's UpdatesReceived stats
counter). I then threw another ~30'000 tree nodes with ~3'000 updates
per second at it (this is all real traffic, not a simulation). Due to
the way we deal with creating the required new RRDs, this caused very
heavy disk activity for a while, slowing down collectd and rrdcached,
so collectd buffered for ~15 minutes, during which time it grew from
~40MB to just under 300MB; all good and expected so far. It then stayed
steady at that size, and judging by rrdcached's UpdatesReceived counter
it must have been able to clear its backlog. Then I threw yet another
30'000 tree nodes and corresponding updates at it. At that point,
collectd immediately started to grow again, linearly, to over 600MB.
Given that it has more traffic coming at it I expect it to grow larger
buffers than before, but what bothers me is that it started to grow
immediately. It's as if the previous 250MB of buffers hadn't been freed
(in the malloc sense; I understand that the process size isn't going to
shrink). Could it be that there is a bug?<br>
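<br>
One way to check this: log glibc's mallinfo() counters from inside the
process. uordblks is the number of bytes actually in use and fordblks is
what has been freed back to malloc but not returned to the kernel. A
quick debugging sketch, not anything collectd does today, just what I'd
bolt on to find out:<br>
<pre>
/* Debugging sketch: log glibc's allocator counters so we can tell
 * "freed but retained by malloc" apart from "still allocated".
 * mallinfo() is glibc-specific; its fields are plain ints. */
#include <malloc.h>
#include <stdio.h>

void log_malloc_stats(void)
{
    struct mallinfo mi = mallinfo();

    printf("malloc: in use=%d bytes, free lists=%d bytes, mmapped=%d bytes\n",
           mi.uordblks, mi.fordblks, mi.hblkhd);
}
</pre>
If uordblks falls back down after the backlog clears while the process
stays near 300MB, the buffers were freed and it's just normal heap
behavior; if it stays high, something really is holding on to them.<br>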
<br>
- if rrdcached is restarted, collectd doesn't reconnect. I know this
is the case for TCP sockets, but I'm pretty sure I observed it with the
unix socket too. This is a problem because restarting collectd loses
the data it had buffered while rrdcached was down.<br>
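<br>
I don't know where exactly in the rrdcached plugin this would have to
go, but conceptually all I'm after is a retry wrapper along these lines
around the librrd client calls; the daemon address and the single-retry
policy here are just made up for the example:<br>
<pre>
/* Sketch of a reconnect-on-error wrapper around the librrd client calls.
 * The address and the single-retry policy are invented for the example;
 * rrdc_connect()/rrdc_update()/rrdc_disconnect() come from rrd_client.h. */
#include <rrd_client.h>

#define DAEMON_ADDR "unix:/var/run/rrdcached.sock"  /* whatever -l was given */

int update_with_reconnect(const char *filename,
                          int values_num, const char * const *values)
{
    int status = rrdc_update(filename, values_num, values);
    if (status == 0)
        return 0;

    /* the daemon probably went away; drop the connection, reconnect, retry */
    rrdc_disconnect();
    if (rrdc_connect(DAEMON_ADDR) != 0)
        return status;      /* still down, report the original failure */

    return rrdc_update(filename, values_num, values);
}
</pre>
With something like that, collectd would hold on to its buffered values
and deliver them once rrdcached comes back, instead of me having to
restart collectd and lose them.<br>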
<br>
- the -z parameter is nice, but not quite there yet. I'm running with
-w 3600 -z 3600, and the situation after the first hour is not pretty:
a ton of flushes followed by a lull, repeating after another hour. It
takes about 4 hours before everything stabilizes and becomes smooth.
I'm wondering whether it would be difficult to change to an adaptive
rate system where, given -w 3600 and the current number of dirty tree
nodes, rrdcached computes the rate at which it needs to flush to disk
and then does just that. If you think about it, within one collection
interval (20s in my case) it would know the total set of RRDs (tree
nodes), and they would all be dirty. In my case it would periodically
compute the ratio (e.g. 25'000 tree nodes to flush over 3600 seconds =
6.9 flushes per second) and would start flushing the oldest dirty
nodes immediately, even though they've been dirty for much less than
3600 seconds. Of course rrdcached would need to re-evaluate the flush
rate periodically, but if it keeps a running counter of dirty tree
nodes that should be pretty easy. All this should put the daemon into a
steady state from the very beginning.<br>
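<br>
To make that concrete, the loop I have in mind is roughly the
following. This is only a sketch; the dirty-node counter and
flush_oldest_node() are placeholders standing in for rrdcached's real
queue, not actual rrdcached code:<br>
<pre>
/* Sketch of an adaptive flush rate: instead of waiting until a node is
 * -w seconds old, spread the writes evenly so that every dirty node
 * still gets flushed once per -w interval. */
#include <stddef.h>
#include <unistd.h>

#define WRITE_TIMEOUT   3600.0  /* -w, in seconds */
#define WAKEUP_INTERVAL 1.0     /* how often the flush thread wakes up */

static size_t dirty_nodes = 25000;  /* placeholder: running dirty-node counter */

static int flush_oldest_node(void)  /* placeholder: write out oldest dirty node */
{
    if (dirty_nodes == 0)
        return -1;
    dirty_nodes--;
    return 0;
}

int main(void)
{
    double debt = 0.0;  /* fractional flushes carried between wakeups */

    for (;;) {
        /* e.g. 25'000 dirty nodes / 3600 s = ~6.9 flushes per second;
         * recomputed every wakeup, so it adapts as nodes come and go */
        double rate = (double)dirty_nodes / WRITE_TIMEOUT;

        debt += rate * WAKEUP_INTERVAL;
        while (debt >= 1.0) {
            if (flush_oldest_node() != 0) {  /* nothing left to flush */
                debt = 0.0;
                break;
            }
            debt -= 1.0;
        }
        sleep((unsigned int)WAKEUP_INTERVAL);
    }
    return 0;
}
</pre>
Right after a batch of new tree nodes shows up, the rate rises
immediately and the oldest dirty nodes start going out right away,
instead of everything becoming due at once an hour later.<br>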
<br>
- running with 80-90k tree nodes for a while ended up bringing
rrdcached to its knees. What I observe is that over time rrdcached uses
more and more cpu and starts seeing page faults. Eventually, rrdcached
comes to a crawl and neither keeps up with the input (so collectd
starts growing) nor manages to maintain its write rate. The page faults
are interesting because no swap space is used (it stays at 64k usage,
which is the initial state). The only explanation I've come up with is
that at the point where the "working set" of all the RRDs exceeds the
amount of memory available (I have 8GB), everything starts degrading.
At that point, rrdcached fights against the buffer cache and starts
seeing page faults. Its write threads also slow down because now the
disk is not just being written but also read (I can see that
happening). I assume that once it page-faults the whole process slows
down, meaning that not just the queue threads but also the connection
threads start slowing down, which then causes collectd to start
buffering data and grow -- it grew to >2GB for me! That puts even more
pressure on memory and we're in a downward spiral. It's not yet clear
to me whether the disk used for RRDs is maxed out when this process
starts (eventually it does max out), so I don't know whether I'm
hitting a hard disk I/O limit or whether I just spiral into it by
successively reducing the amount of buffer cache available. I suspect
it would be possible to push the system further if the various
rrdcached threads could be decoupled better. Also, being able to put an
upper bound on collectd's memory would be smart, because it's clear
that at some point the growth becomes self-defeating. It could randomly
drop samples when it hits the limit, and that would probably lead to an
overall happier outcome.<br>
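<br>
To sketch what I mean by a bound with random drops (the limits and the
linear ramp, basically random early drop, are my invention for the
example, nothing collectd actually has):<br>
<pre>
/* Sketch: decide whether to drop an incoming sample based on the current
 * backlog.  The drop probability ramps up linearly between a soft and a
 * hard limit (random early drop, more or less), so light overload only
 * loses a few samples.  The limits and the function are made up. */
#include <stdlib.h>

#define QUEUE_SOFT_LIMIT 250000  /* start dropping here */
#define QUEUE_HARD_LIMIT 500000  /* drop everything here */

/* Returns 1 if the sample should be dropped, 0 if it should be queued. */
int should_drop_sample(size_t queue_length)
{
    double p;

    if (queue_length <= QUEUE_SOFT_LIMIT)
        return 0;
    if (queue_length >= QUEUE_HARD_LIMIT)
        return 1;

    /* between the limits, drop with a probability proportional to how far
     * over the soft limit we are */
    p = (double)(queue_length - QUEUE_SOFT_LIMIT)
      / (double)(QUEUE_HARD_LIMIT - QUEUE_SOFT_LIMIT);
    return ((double)rand() / (double)RAND_MAX) < p;
}
</pre>
Losing the odd 20s sample barely shows in a graph; growing to 2GB and
tipping the whole box into a downward spiral loses far more in the end.<br>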
<br>
- I'm wondering how we could overcome the RRD working-set issue. Even
with rrdcached and long cache periods (e.g. I use 1 hour) it seems that
the system comes to a crawl once the RRD working set exceeds memory.
One idea that came to mind is to use the caching in rrdcached to
convert the random small writes that are typical for RRDs into more of
a sequential access pattern. If we could tweak the RRD creation and the
cache write-back algorithm such that RRDs are always accessed in the
same order, and we manage to get the RRDs allocated on disk in that
order, then we could use the cache to essentially do one sweep through
the disk per cache flush period (e.g. per hour in my case). Of course
on-demand flushes and other things would interrupt this sweep, but the
bulk of accesses could end up being more or less sequential. I believe
that doing the cache write-back in a specific order is not too
difficult; what I'm not sure of is how to get the RRD files allocated
on disk in that order too. Any thoughts?<br>
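<br>
For the write-back half of this, the sketch below is all I have in
mind: gather the dirty files at the start of a sweep, sort them into a
fixed order (by path name here, assuming the files were created, and
hopefully allocated on disk, in roughly that order) and flush them in
that order. flush_one_file() and the example paths are placeholders,
not real rrdcached code:<br>
<pre>
/* Sketch of one write-back sweep in a fixed order: sort the dirty files
 * by path and flush them in that order, so the disk sees one mostly
 * sequential pass instead of random seeks. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int compare_paths(const void *a, const void *b)
{
    return strcmp(*(const char * const *)a, *(const char * const *)b);
}

static void flush_one_file(const char *path)
{
    /* placeholder: this is where rrdcached would write the cached updates */
    printf("flushing %s\n", path);
}

/* Write back all dirty files in path order, one sweep per flush period. */
static void sweep_dirty_files(const char **paths, size_t n)
{
    size_t i;

    qsort(paths, n, sizeof(paths[0]), compare_paths);
    for (i = 0; i < n; i++)
        flush_one_file(paths[i]);
}

int main(void)
{
    const char *dirty[] = { "/data/rrd/host42/cpu-0.rrd",
                            "/data/rrd/host17/df-root.rrd" };

    sweep_dirty_files(dirty, sizeof(dirty) / sizeof(dirty[0]));
    return 0;
}
</pre>
The sorting half looks easy; getting the files laid out on disk in that
same order is the part I don't see how to do, hence the question.<br>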
<br>
Cheers,<br>
Thorsten<br>
</body>
</html>