[rrd-developers] rrdcached + collectd issues

Fri Oct 9 20:46:26 CEST 2009

I've been trying to get rrdcached and collectd to work together under 
load and am running into a number of issues.
I'm using:
 - collectd 4.7.4
 - rrdtool trunk rev 1889 (had trouble compiling collectd with newer 
versions)
 - rrdcached stats snapshot:
9 Statistics follow
QueueLength: 0
UpdatesReceived: 91123595
FlushesReceived: 83983
UpdatesWritten: 210810
DataSetsWritten: 82474667
TreeNodesNumber: 25925
TreeDepth: 17
JournalBytes: 6846161193
JournalRotate: 4
 - approx 3k updates written to rrdcached per second
 - approx 200-300KB written to journal per second
 - approx 2k-3k data sets written per second
 - rrdcached params: -w 3600 -z 3600 -f 7200 -t 10
 - disk I/O is not an issue
 - rrdcached memory usage is not an issue (grows to 0.8GB then stays 
totally flat), no swapping
 - running collectd, rrdcached, and custom graphing app on same 
dual-core server, verified that flushing for graphing is working properly
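
As a rough cross-check of the snapshot above, the counters can be reduced
to a couple of ratios. This is a minimal Python sketch; the helper name
parse_stats is made up for illustration and it assumes nothing beyond the
"Key: value" layout shown in the snapshot.

```python
# Counters copied from the STATS snapshot above.
SNAPSHOT = """\
QueueLength: 0
UpdatesReceived: 91123595
FlushesReceived: 83983
UpdatesWritten: 210810
DataSetsWritten: 82474667
TreeNodesNumber: 25925
TreeDepth: 17
JournalBytes: 6846161193
JournalRotate: 4
"""


def parse_stats(text):
    """Turn 'Key: value' lines into a dict of ints (hypothetical helper)."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            stats[key.strip()] = int(value)
    return stats


stats = parse_stats(SNAPSHOT)
# How many received updates, and how many data sets, each write covers.
updates_per_write = stats["UpdatesReceived"] / stats["UpdatesWritten"]
datasets_per_write = stats["DataSetsWritten"] / stats["UpdatesWritten"]
print(f"updates received per write: {updates_per_write:.1f}")
print(f"data sets per write:        {datasets_per_write:.1f}")
```

With these numbers each write to disk covers on the order of 400 updates,
so the caching itself appears to be batching as intended.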

The first issue is that over time the data in the rrd files lags behind 
the data arriving at collectd over the network. After 12 hours I see approx 
a 5 minute lag. I've seen it go to >1.5 hrs after a bunch of days. The 
symptoms are that data in the rrd files continues to advance at the 
normal rate (20 second interval in our case) but just in the past. The 
fact that the delay is steady leads me to believe that it's a program 
bug (I've seen delays due to I/O overload in the past and different rrds 
then show different lags and jump forward when they finally get some 
disk attention). I've done some tests looking at the last_updated in the 
rrd and looking at what the rrdcached daemon returns to a PENDING 
command for the same file and I'm sure the flushing works. The daemon 
just doesn't receive newer updates. The journal is also in sync with all 
this. If I restart collectd, then the lag pretty quickly vanishes. So 
either collectd has some queue with a bug, or data is queued in the 
socket between collectd and rrdcached. I get the same delay whether I 
use a unix sock or a tcp sock and the amount of data "queued" is such 
that it's not in system buffers (the rrdcached journal is written at 
200kB/sec and I believe that's the same rate at which rrdcached receives 
data).
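
The argument that the backlog cannot be sitting in system buffers can be
made concrete with a little arithmetic. The sketch below is Python and
rests on two assumptions: that the bytes arriving on the socket roughly
match the ~200kB/sec journal rate, and that the lag grows steadily. The
function names are invented for this example.

```python
JOURNAL_RATE = 200_000    # bytes/s, the ~200kB/sec journal rate quoted above
LAG_12H = 5 * 60          # seconds of lag observed after 12 hours
LAG_DAYS = 1.5 * 3600     # seconds, the >1.5 hr lag seen after some days


def backlog_bytes(lag_seconds, rate=JOURNAL_RATE):
    """Bytes that would have to be queued somewhere to explain the lag."""
    return lag_seconds * rate


def throughput_shortfall(lag_seconds, elapsed_seconds):
    """Fraction by which the consumer falls short of the arrival rate."""
    return lag_seconds / elapsed_seconds


print(f"backlog at 5 min lag:  {backlog_bytes(LAG_12H) / 1e6:.0f} MB")
print(f"backlog at 1.5 hr lag: {backlog_bytes(LAG_DAYS) / 1e9:.2f} GB")
print(f"shortfall over 12 h:   {throughput_shortfall(LAG_12H, 12 * 3600):.2%}")
```

Under those assumptions a 5 minute lag corresponds to tens of MB of queued
data and a 1.5 hour lag to about a GB, while the consumer only needs to be
under one percent too slow to produce the drift.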

The second issue, which may possibly cause the first one, is that the CPU 
consumed by rrdcached is way too high. After running for about an hour 
it consumes a full CPU (~90% user + ~10% system). It could be that 
that's causing the above lag, dunno. I/O is not a problem as I 
mentioned, it's pure CPU. I've compiled rrdcached with -pg to get gprof 
output, but haven't been successful. I commented out 
install_signal_handlers (left the USR2 to be able to terminate 
gracefully) and ran with -g, but the gprof output shows only ~2 minutes 
of CPU time profiled when the daemon accumulated >250mins. Here's the 
top of the output:
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 55.12     62.39    62.39 280843249     0.00     0.00  buffer_get_field
 11.33     75.22    12.83 93607575     0.00     0.00  send_response
 10.32     86.91    11.68 93464852     0.00     0.00  handle_request_update
  5.36     92.97     6.06                             connection_thread_main
  4.03     97.53     4.57 93683555     0.00     0.00  handle_request
  3.46    101.46     3.92 93484712     0.00     0.00  check_file_access
  3.29    105.18     3.72 176583057     0.00     0.00  next_cmd
  1.33    106.69     1.51 93686967     0.00     0.00  find_command
  1.23    108.08     1.40 88419974     0.00     0.00  journal_write
  1.00    109.22     1.14 93672403     0.00     0.00  has_privilege
It looks to me like that's mostly the journal replay stuff and very little 
more. If someone has tips on how to get real profiling output, I'm all ears.
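
The s/call columns in the profile all round to 0.00, so the per-call cost
has to be derived from the self seconds and call counts. A small Python
sketch that does this for a few of the rows above (the function name
per_call_ns is made up for this example):

```python
# A few rows copied from the flat profile above.
FLAT_PROFILE = """\
 55.12     62.39    62.39 280843249     0.00     0.00  buffer_get_field
 11.33     75.22    12.83 93607575     0.00     0.00  send_response
 10.32     86.91    11.68 93464852     0.00     0.00  handle_request_update
  5.36     92.97     6.06                             connection_thread_main
"""


def per_call_ns(profile):
    """Map function name -> self time per call in nanoseconds.

    Rows without a call count (e.g. connection_thread_main) are skipped.
    """
    result = {}
    for line in profile.splitlines():
        cols = line.split()
        if len(cols) != 7:
            continue
        self_seconds, calls, name = float(cols[2]), int(cols[3]), cols[6]
        result[name] = self_seconds / calls * 1e9
    return result


for name, ns in per_call_ns(FLAT_PROFILE).items():
    print(f"{name:24s} {ns:7.1f} ns/call")
```

On these figures buffer_get_field costs a couple of hundred nanoseconds
per call and is called about three times per handle_request_update, so it
is the call volume more than any single call that adds up.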

The journal replay is too slow. When I terminate the daemon it leaves 
several GB of journal files behind. Reading those in takes the better 
part of an hour, during which the daemon is unresponsive. Most of the time 
is in buffer_get_field. (Note: in the most common cases buffer_get_field 
copies each field in-place, character by character. Seems to me that a 
simple if statement could avoid the writes.)
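
The "simple if statement" idea can be sketched as follows. This is not
the rrdcached source. It is a Python illustration of in-place field
splitting where the store is skipped as long as the read and write
positions coincide, which is the case until the first backslash escape
is seen. The name get_field and the escape handling are assumptions made
for this example, and the sketch says nothing about how much time the
check would save in the C code.

```python
SPACE = 0x20
BACKSLASH = 0x5C


def get_field(buf, start):
    """Extract one space-delimited field from buf, unescaping in place.

    buf is a bytearray; returns (field_bytes, position_of_next_field).
    """
    read = write = start
    end = len(buf)
    while read < end:
        ch = buf[read]
        if ch == SPACE:
            read += 1            # consume the delimiter and stop
            break
        if ch == BACKSLASH:
            if read + 1 >= end:
                raise ValueError("dangling escape at end of buffer")
            read += 1            # take the escaped character literally
            ch = buf[read]
        if write != read:        # the "simple if": store only after a shift
            buf[write] = ch
        write += 1
        read += 1
    return bytes(buf[start:write]), read


line = bytearray(b"update foo\\ bar.rrd 1255114000:42")
pos = 0
while pos < len(line):
    field, pos = get_field(line, pos)
    print(field)
```

For fields without escapes, which is the common case described above,
the loop never writes to the buffer at all.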

By the way, I find the unix socket stuff undebuggable. I switched to TCP 
sockets because I can telnet to the socket and find out what the daemon 
is doing. (For example, when I start the daemon and nothing seems to work 
for almost an hour because it's replaying logs, there is no 
information anywhere about what's going on.) I'm saying this because 
everyone recommends the unix sockets for security reasons. It's unusable 
IMHO.
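
What the telnet session does by hand can also be scripted. The following
is a minimal Python sketch of a TCP client for the daemon's text protocol,
assuming the reply starts with a "<status> <message>" line where a
non-negative status gives the number of lines that follow (as in
"9 Statistics follow" above) and a negative status means an error. The
host, the port 42217 and the function names are assumptions to adjust
for the actual -l setting of the daemon.

```python
import socket


def read_response(stream):
    """Read one reply from a text stream and return (message, lines)."""
    header = stream.readline()
    if not header:
        raise ConnectionError("daemon closed the connection")
    status_text, _, message = header.rstrip("\r\n").partition(" ")
    status = int(status_text)
    if status < 0:
        # Negative status: the message describes the error.
        raise RuntimeError(message)
    lines = [stream.readline().rstrip("\r\n") for _ in range(status)]
    return message, lines


def query(command, host="localhost", port=42217, timeout=5.0):
    """Send one command over TCP and return (message, lines).

    host and port are placeholders, not values taken from the setup above.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with sock.makefile("rw", encoding="ascii", newline="\n") as stream:
            stream.write(command + "\n")
            stream.flush()
            return read_response(stream)
```

With a daemon listening on TCP, query("STATS") would return the counters
as a list of "Key: value" strings, and a PENDING command for a given rrd
file can be sent the same way.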

I think this is very close to being an extremely high performance RRD 
monitoring system, but it's not quite there yet. I'd appreciate any 
pointers on what to pursue. I hope that the above descriptions will ring 
some bells in those of you that wrote some of the code. I'm available to 
test things out and collect more info. Unfortunately I have only little 
time to dig into the code myself, sigh.
Thanks,
Thorsten