[rrd-developers] rrdcached + collectd issues
Thorsten von Eicken
tve at voneicken.com
Fri Oct 9 20:46:26 CEST 2009
I've been trying to get rrdcached and collectd to work together under
load and am running into a number of issues.
I'm using:
- collectd 4.7.4
- rrdtool trunk rev 1889 (had trouble compiling collectd with newer
versions)
- rrdcached stats snapshot (see the STATS sketch after this list):
9 Statistics follow
QueueLength: 0
UpdatesReceived: 91123595
FlushesReceived: 83983
UpdatesWritten: 210810
DataSetsWritten: 82474667
TreeNodesNumber: 25925
TreeDepth: 17
JournalBytes: 6846161193
JournalRotate: 4
- approx 3k updates written to rrdcached per second
- approx 200-300KB written to journal per second
- approx 2k-3k data sets written per second
- rrdcached params: -w 3600 -z 3600 -f 7200 -t 10
- disk I/O is not an issue
- rrdcached memory usage is not an issue (grows to 0.8GB then stays
totally flat), no swapping
- running collectd, rrdcached, and a custom graphing app on the same
dual-core server; verified that flushing for graphing works properly
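For reference, the stats snapshot above can be pulled straight from the
daemon's TCP socket. Here is a minimal sketch of such a probe; the
127.0.0.1 address and port 42217 are assumptions about this particular
setup, and error handling is kept to a minimum:

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[4096];
        ssize_t n;
        struct sockaddr_in addr;
        int fd = socket(AF_INET, SOCK_STREAM, 0);

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port   = htons(42217);           /* assumed: the daemon's TCP port */
        inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

        if (fd < 0 || connect(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0) {
            perror("connect");
            return 1;
        }

        write(fd, "STATS\n", 6);
        /* The reply is "<n> Statistics follow" plus n "Name: value" lines;
         * a robust client would parse the count, this sketch prints one read. */
        if ((n = read(fd, buf, sizeof(buf) - 1)) > 0) {
            buf[n] = '\0';
            fputs(buf, stdout);
        }
        close(fd);
        return 0;
    }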
The first issue is that over time the data in the rrd files lags behind the
data arriving at collectd over the network. After 12 hours I see roughly
a 5-minute lag; I've seen it grow to more than 1.5 hours after several days. The
symptom is that data in the rrd files continues to advance at the
normal rate (a 20-second interval in our case), but shifted into the past. The
fact that the delay is steady leads me to believe it's a program
bug (I've seen delays due to I/O overload in the past, and different rrds
then show different lags and jump forward when they finally get some
disk attention). I've done some tests comparing the last_update in the
rrd with what the rrdcached daemon returns to a PENDING
command for the same file, and I'm sure the flushing works. The daemon
just doesn't receive newer updates. The journal is also consistent with all of
this. If I restart collectd, the lag vanishes fairly quickly. So
either collectd has a queue with a bug, or data is queued in the
socket between collectd and rrdcached. I get the same delay whether I
use a unix socket or a TCP socket, and the amount of data "queued" is more
than could fit in system buffers (the rrdcached journal is written at
200 kB/sec, and I believe that's the same rate at which rrdcached receives
data).
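One way to put a number on the lag is to flush a file and then compare its
last_update against the wall clock. A minimal sketch, assuming the
rrd_lastupdate_r() helper that recent librrd exposes (older trees only have
the argc/argv-style rrd_lastupdate()); the file name is just a placeholder:

    /* build with: gcc -o lagcheck lagcheck.c -lrrd */
    #include <rrd.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(int argc, char **argv)
    {
        const char   *file = (argc > 1) ? argv[1] : "example.rrd";  /* placeholder */
        time_t        last_update;
        unsigned long ds_count, i;
        char        **ds_names = NULL, **last_ds = NULL;

        if (rrd_lastupdate_r(file, &last_update, &ds_count, &ds_names, &last_ds) != 0) {
            fprintf(stderr, "rrd_lastupdate_r: %s\n", rrd_get_error());
            return 1;
        }

        printf("%s: last_update is %ld seconds behind the wall clock\n",
               file, (long)(time(NULL) - last_update));

        for (i = 0; i < ds_count; i++) {
            free(ds_names[i]);
            free(last_ds[i]);
        }
        free(ds_names);
        free(last_ds);
        return 0;
    }

Run right after a FLUSH of the same file, the number it prints is
effectively the lag described above.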
The second issue, which may be the cause of the first, is that the CPU
consumed by rrdcached is far too high. After running for about an hour
it consumes a full CPU (~90% user + ~10% system). That may be what's
causing the lag above; I don't know. I/O is not a problem, as I
mentioned; it's pure CPU. I've compiled rrdcached with -pg to get gprof
output, but haven't been successful. I commented out
install_signal_handlers (leaving the USR2 handler so I can terminate
gracefully) and ran with -g, but the gprof output shows only ~2 minutes
of CPU time profiled while the daemon accumulated >250 minutes. Here's the
top of the output:
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 55.12      62.39    62.39  280843249     0.00     0.00  buffer_get_field
 11.33      75.22    12.83   93607575     0.00     0.00  send_response
 10.32      86.91    11.68   93464852     0.00     0.00  handle_request_update
  5.36      92.97     6.06                               connection_thread_main
  4.03      97.53     4.57   93683555     0.00     0.00  handle_request
  3.46     101.46     3.92   93484712     0.00     0.00  check_file_access
  3.29     105.18     3.72  176583057     0.00     0.00  next_cmd
  1.33     106.69     1.51   93686967     0.00     0.00  find_command
  1.23     108.08     1.40   88419974     0.00     0.00  journal_write
  1.00     109.22     1.14   93672403     0.00     0.00  has_privilege
It looks to me like that's mostly the journal replay stuff and very little
else. If someone has tips on how to get real profiling output, I'm all ears.
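In case it helps: my understanding is that gprof samples via ITIMER_PROF,
and that timer isn't armed in threads created with pthread_create(), which
would explain why only the main thread's couple of minutes show up. The
usual workaround is an LD_PRELOAD shim that re-arms the profiling timer in
every new thread. A sketch of that trick (not rrdtool code, and the file
names are made up):

    /* gcc -shared -fPIC -o gprof-helper.so gprof-helper.c -ldl -lpthread
     * LD_PRELOAD=./gprof-helper.so ./rrdcached -g ...                     */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <pthread.h>
    #include <stdlib.h>
    #include <sys/time.h>

    struct wrapper_args {
        void *(*start_routine)(void *);
        void *arg;
        struct itimerval prof_timer;   /* profiling timer copied from the creating thread */
    };

    static void *wrapper_routine(void *data)
    {
        struct wrapper_args a = *(struct wrapper_args *)data;
        free(data);
        setitimer(ITIMER_PROF, &a.prof_timer, NULL);  /* re-arm so this thread gets samples */
        return a.start_routine(a.arg);
    }

    int pthread_create(pthread_t *thread, const pthread_attr_t *attr,
                       void *(*start_routine)(void *), void *arg)
    {
        static int (*real_pthread_create)(pthread_t *, const pthread_attr_t *,
                                          void *(*)(void *), void *) = NULL;
        struct wrapper_args *a = malloc(sizeof(*a));

        if (real_pthread_create == NULL)
            real_pthread_create = dlsym(RTLD_NEXT, "pthread_create");

        a->start_routine = start_routine;
        a->arg = arg;
        getitimer(ITIMER_PROF, &a->prof_timer);       /* inherit the creator's timer settings */

        return real_pthread_create(thread, attr, wrapper_routine, a);
    }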
The journal replay is also too slow. When I terminate the daemon it leaves
several GB of journal files behind. Reading those back in takes the better
part of an hour, during which the daemon is unresponsive. Most of the time
is spent in buffer_get_field. (Note: in the most common case buffer_get_field
copies each field in place, character by character. It seems to me that a
simple if statement could avoid those writes; see the sketch below.)
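To make that last point concrete, this is the kind of change I mean. It's a
sketch of the idea rather than the actual rrd_daemon.c code (the real
buffer_get_field() also tracks the buffer size, and its escaping rules may
differ), so names and structure are illustrative. Until an escape character
shifts the write position behind the read position, every byte is already
where it needs to be and the store is redundant:

    static char *get_field_sketch(char **buffer)
    {
        char *read  = *buffer;
        char *write = *buffer;
        char *field = *buffer;

        while (*read != '\0' && *read != ' ') {
            if (*read == '\\' && read[1] != '\0')
                read++;                 /* drop the escape character */
            if (write != read)          /* the "simple if": copy only once shifted */
                *write = *read;
            write++;
            read++;
        }
        if (*read == ' ')
            read++;                     /* step over the field separator */
        *write = '\0';                  /* may overwrite the separator; read is already past it */
        *buffer = read;                 /* caller continues parsing from here */
        return field;
    }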
By the way, I find the unix socket stuff undebuggable. I switched to TCP
sockets because I can telnet to the socket and find out what the daemon
is doing. (For example, when nothing seems to work for almost an hour
after I start the daemon because it's replaying the journal, there is no
information anywhere about what's going on.) I mention this because
everyone recommends unix sockets for security reasons; as it stands, that's
unusable IMHO.
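For what it's worth, the same STATS probe as the TCP sketch near the top
works against the unix socket too, just with an AF_UNIX address; the socket
path below is an assumption (use whatever -l points at):

    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <sys/un.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[4096];
        ssize_t n;
        struct sockaddr_un addr;
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);

        memset(&addr, 0, sizeof(addr));
        addr.sun_family = AF_UNIX;
        strncpy(addr.sun_path, "/tmp/rrdcached.sock",     /* assumed socket path */
                sizeof(addr.sun_path) - 1);

        if (fd < 0 || connect(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0) {
            perror("connect");
            return 1;
        }
        write(fd, "STATS\n", 6);
        if ((n = read(fd, buf, sizeof(buf) - 1)) > 0) {
            buf[n] = '\0';
            fputs(buf, stdout);
        }
        close(fd);
        return 0;
    }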
I think this is very close to being an extremely high performance RRD
monitoring system, but it's not quite there yet. I'd appreciate any
pointers on what to pursue. I hope that the above descriptions ring
some bells for those of you who wrote some of the code. I'm available to
test things out and collect more info. Unfortunately I have only a little
time to dig into the code myself, sigh.
Thanks,
Thorsten