[rrd-developers] rrdcached + collectd issues

Tue Oct 13 12:52:15 CEST 2009

Florian Forster wrote:
> Hi Thorsten,
>
> I'm having a bit of a hard time replying to this message because it (and
> the previous one) were sent as HTML-only. Could you maybe switch to
> multipart or plain text messages? Thanks :)
>   
Oops, sorry, will do better.
>> It's as if the previous 250MB of buffers hadn't been freed (in the
>> malloc sense, I understand that the process size isn't going to
>> shrink). Could it be that there is a bug?
>>     
>
> We're talking about the resident segment size (RSS) here, right? Because
> *that* ought to decrease.
>   
Yes, RSS.
>>  - if rrdcached is restarted, collectd doesn't reconnect.
>>     
>
> The collectd plugin calls “rrdc_connect” before each update. The
> semantic of that function is to check whether a valid connection to the
> daemon exists and try to reconnect if necessary. If anything goes wrong
> with sending / receiving data, other functions will simply close /
> invalidate the connection and it is supposed to be opened in the next
> iteration.
>
> If the connection is not reestablished, my guess is that the socket
> descriptor is not properly invalidated. I'll have to look further into
> this though.
>   
I cannot see it reconnect, but maybe I have to wait for a looong time. I 
can troubleshoot if you see it reconnect properly on your box.
>> I'm running with -w 3600 -z 3600 and the situation after the first
>> hour is not pretty with a ton of flushes followed by a lull and a
>> repeat after another hour.
>>     
>
> That's unexpected (at least for me). With those settings I would have
> expected the first hour to be memory only (i.e. no disk activity at all)
> and after that basically uniformly distributed writes for an hour. Two
> hours after start I'd expect a drop in writes which increases for an
> hour and has its peak at three hours after start.
>   
Yes, you are correct. Maybe I was being too picky. Your simulations match 
what I observe.
>> I suspect it would be possible to push the system further if the
>> various rrdcached threads could be decoupled better.
>>     
>
> Do you have anything specific in mind? As far as I can tell the various
> threads are pretty much as decoupled as they can safely be.
>   
I did have something in mind, but now I'm not sure my hypothesis was 
correct...
>> Also, being able to put an upper bound on collectd memory would be
>> smart 'cause it's clear that at some point the growth becomes
>> self-defeating.
>>     
>
> Sounds like a reasonable idea. Any idea which values to drop? The
> oldest, the newest, either (chosen randomly), both?
>   
I've converged on an XFF value of 0.9, 'cause else it's too easy to lose 
a lot of data if there is any flakiness in the collection. So I would 
prefer totally random dropping of values irrespective of age. That'll 
uniformly lower the resolution across the board. Visually imperceptible 
until it starts dropping significant amounts. I'm sure others have 
different ideas.
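
To make the random-drop idea concrete, here is a minimal Python sketch. It 
is not taken from collectd; the class name RandomDropBuffer and its fields 
are hypothetical, and it assumes each buffered value carries its own 
timestamp so that buffer order does not matter.

```python
import random


class RandomDropBuffer:
    """Bounded buffer that drops a uniformly random entry once it is full.

    Hypothetical illustration only; collectd has no such class.
    """

    def __init__(self, capacity, rng=None):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.items = []      # (timestamp, value) tuples, order not preserved
        self.dropped = 0     # number of values discarded so far
        self.rng = rng or random.Random()

    def push(self, item):
        self.items.append(item)
        if len(self.items) > self.capacity:
            # Choose a victim among all entries, including the new one,
            # so that age plays no role in what gets dropped.
            victim = self.rng.randrange(len(self.items))
            # Swap-with-last removal keeps the eviction O(1).
            self.items[victim] = self.items[-1]
            self.items.pop()
            self.dropped += 1


buf = RandomDropBuffer(capacity=100, rng=random.Random(1))
for t in range(1000):
    buf.push((t, float(t)))
print(len(buf.items), buf.dropped)
```

Memory stays bounded at the chosen capacity, and the survivors are spread 
over the whole time range rather than clustered at either end.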
>>  - I'm wondering how we could overcome the RRD working set issue.
>>     
>
> Let's assume every RRD file has only one data source and you have
> 100,000 files. Then the total data cached should be:
>
>     8 Byte * 100,000 files * 3600 / 20 seconds ⇒ 144 MByte
>
> This should be possible *somehow* …
>   
It's easy. The issue is that you need to "transpose" a large matrix: one 
dimension is time, the other the data sources. You write in time order 
and you read in data source order.
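
As a sanity check of the arithmetic and of the "transpose", here is a 
small Python sketch. The function name and parameter names are made up for 
illustration, and the estimate counts only the raw 8-byte values, ignoring 
any per-entry overhead in the cache.

```python
def cache_bytes(files, value_bytes=8, cache_seconds=3600,
                step_seconds=20, ds_per_file=1):
    """Raw payload held in memory for one caching interval."""
    updates_per_file = cache_seconds // step_seconds
    return value_bytes * ds_per_file * files * updates_per_file


print(cache_bytes(100_000))  # 144000000 bytes, i.e. 144 MByte

# Values arrive in time order (all sources at t=0, then all at t=1, ...)
arrivals = [(t, ds) for t in range(3) for ds in ("a", "b", "c")]

# ... but they are written out per data source, so the cache has to
# regroup them: this regrouping is the "transpose".
by_source = {}
for t, ds in arrivals:
    by_source.setdefault(ds, []).append(t)
print(by_source)  # {'a': [0, 1, 2], 'b': [0, 1, 2], 'c': [0, 1, 2]}
```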
>> One idea that came to mind is to use the caching in rrdcached to
>> convert the random small writes that are typical for RRDs to more of a
>> sequential access pattern.
>>     
>
> Well, the problem is that currently RRD files look like this on disk:
>
>   [a0,a1,a2,a3,…,an] [b0,b1,b2,b3,…,bn] [c0,c1,c2,c3,…,cn]
>
> To get a sequential access pattern, we'd have to reorder this to:
>
>    a0,b0,c0 a1,b1,c1 a2,b2,c2 a3,b3,c3 … an,bn,cn 
>
> I think the only way to achieve this is to have all that data in one
> file. The huge problem here is adding new data: If we need to add
> d[0,…,n] to the set above, almost *all* data has to be moved. And we're
> not even touching several RRAs with differing resolutions. I think to
> get this RRDtool / RRDCacheD would have to be turned into something much
> more like a database system and less like a frontend for writing
> separate files.
>   
What I meant is slightly different and doesn't change the current 
RRD file format. If you do one pass updating all your RRDs you end up 
writing 1/Nth of the disk blocks, where N has to do with the RRAs being 
updated vs. the total stored data for an RRD. If you do this pass over 
your RRDs in random order, the disk will do random seeks between 
read-modify-writes with some possible ordering thanks to the elevator 
algorithm and such. Now imagine instead that you could update the RRDs 
in the order in which they're stored on disk. Depending on the cylinder 
size vs. RRD size you'd get away with fewer seeks, and with 
predominantly short seeks. This is not "sequential access" strictly 
speaking, but it should be a whole lot faster than random seeks across 
the entire disk.
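
A rough Python sketch of that ordering idea, under a loud assumption: it 
sorts by (device, inode) number as a cheap stand-in for on-disk position. 
Inode order only approximates physical layout on some filesystems; getting 
real block addresses would need a filesystem-specific extent query (FIEMAP 
on Linux, for example). The function name disk_order is hypothetical and 
nothing like it exists in rrdcached.

```python
import os


def disk_order(paths):
    """Sort paths by (device, inode) as a rough proxy for disk layout.

    Assumption: inode numbers correlate with physical placement, which
    holds only loosely and depends on the filesystem.
    """
    def key(path):
        st = os.stat(path)
        return (st.st_dev, st.st_ino)

    return sorted(paths, key=key)
```

A flush pass would then walk disk_order(dirty_files) instead of visiting 
the files in whatever order they happen to sit in the cache.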

I just restarted everything afresh to get a clean set of data. It's 
already not looking pretty. Here's the set-up:
- /usr/bin/rrdcached -w 3600 -z 3600 -f 7200 -t 2 -b /rrds -B -j /rrds/journal -p /var/run/rrdcached/rrdcached.pid -l 127.0.0.1:3033
- ~55k tree nodes, collected every 20 seconds
- see the rrdcached-1*.png in http://www.voneicken.com/dl/rrd/

What I see:
- the system ran with half the load for 5 minutes at start-up before I 
added the "second half"
- the input is constant (see network rx pkts in last graph in 1c.png)
- rrdcached has ok cpu load for the first 15 minutes, then it really 
ramps up to using over half a cpu
- the connection thread seems to be affected because the "receive 
update" and "journal bytes" rates start to degrade
- note that the journal files are on a separate set of disks from the 
RRDs, and that set of disks is always pretty unloaded
- note that so far we haven't hit the end of the first hour, so no 
flushes to disk yet
- collectd keeps and keeps growing after the first 15 minutes, it's 
clear that the degradation in "receive update" is due to rrdcached and 
collectd has to start buffering (note how the first 15 minutes were nice 
and flat)

Conclusions so far:
- it's interesting that the connection thread can't keep up with 
collectd sending stuff, I hadn't seen that before because I had always 
increased the load after flushes had occurred, so there were more moving 
parts to suspect
- it's also interesting that the connection thread can keep up fine for 
the first 15 minutes, note that the tree depth goes to 19 levels right 
after the full traffic hits, so I don't see a correlation there. This 
also means that it's not string parsing that is the problem as that 
would show up immediately and not with 10 minutes of delay.

As I'm finishing this email, rrdcached has started to flush to disk. 
So far nothing interesting is happening (other than I/O). The connection 
thread performance (or rather lack thereof) is virtually unchanged. I'll 
grab a fresh set of graphs as rrdtool-2*.png

Without being able to run any decent profiler I'm a bit stumped. I tried 
to change main the other day to run the listen loop in a separate thread 
and to run one queue loop in main, so I could get gprof stats for a 
queue loop. That almost worked -- I had trouble getting the program to exit 
cleanly to give me stats. Maybe a similar hack to a connection thread 
could work. Mhh, sounds more difficult since these are forked off the 
listen thread, ughh. Ideas?

Thorsten