[rrd-users] RRDcached performance issues

Wed Dec 31 12:20:53 CET 2014

Hello,
I have some questions regarding rrdcached, which is used as part of
our PM system. The system is going through a check/redesign phase, and since
we ran into quite a few performance issues with data polling and writing, we
are trying to find an appropriate way to reduce the number of deployed
rrdcached daemons and to optimise their performance.
In the spirit of "measure before you optimise", and after reading the
available documentation and mailing lists, I have come to the conclusion that
something might be very wrong with our setup. This didn't show earlier because
the daemons were not handling so many tree nodes, and we did not take a closer
look after the initial deployment.
Performance has been monitored for the last few weeks. To provide more
data, here are my observations:
- we have a total of 14 differently configured rrdcached daemons
- together they handle 1.7 million nodes and 3000 updates/writes per second,
which adds up to an average of 500 KB per second of journal data
- individual daemons handle between roughly 3k and 400k nodes
- in all setups the update rate matches the write rate, which ensures
that neither memory use nor write delay keeps growing
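
As a rough sanity check, the totals above can be turned into per-node and per-update figures. This is only a back-of-the-envelope sketch: it assumes the load is spread evenly over all nodes and takes 1 KB as 1000 bytes, neither of which is guaranteed by the measurements. The function names are made up for the example.

```python
# Back-of-the-envelope figures derived from the totals quoted above.
# Assumptions (not measured): updates are spread evenly over all nodes,
# and 1 KB is taken as 1000 bytes.

TOTAL_NODES = 1_700_000          # nodes handled by all 14 daemons together
UPDATES_PER_SEC = 3_000          # combined update rate
JOURNAL_BYTES_PER_SEC = 500_000  # combined journal growth (about 500 KB/s)
SECONDS_PER_DAY = 86_400


def seconds_between_updates(nodes, updates_per_sec):
    """Average time between two updates of the same node."""
    return nodes / updates_per_sec


def journal_bytes_per_update(journal_bytes_per_sec, updates_per_sec):
    """Average journal space consumed by one update."""
    return journal_bytes_per_sec / updates_per_sec


def journal_bytes_per_day(journal_bytes_per_sec):
    """Journal volume written in 24 hours at a constant rate."""
    return journal_bytes_per_sec * SECONDS_PER_DAY


if __name__ == "__main__":
    print("seconds between updates per node:",
          round(seconds_between_updates(TOTAL_NODES, UPDATES_PER_SEC)))
    print("journal bytes per update:",
          round(journal_bytes_per_update(JOURNAL_BYTES_PER_SEC, UPDATES_PER_SEC)))
    print("journal GB per day:",
          journal_bytes_per_day(JOURNAL_BYTES_PER_SEC) / 1e9)
```

Under these assumptions a node is updated a little less often than once every nine minutes on average, one update costs about 167 bytes of journal, and all daemons together write about 43 GB of journal per day.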

So the question that puzzles me is why, for some files (handled by the
larger caches), writes are more than 24h apart regardless of the
configuration. In smaller caches the problem is also present, but less
noticeable. We took one of the larger setups for testing and moved it to a
machine that is not dedicated to data polling procedures.
- VM is an Oracle Solaris 10, 64 GB RAM, 8 vCPUs + SUN QFS
- RRDtool version: 1.4.5
- Daemon configuration:
    rrdcached -l unix:/opt/pmsys/log/rrdcache.sock -b /rrdqfs/RRD
      -m 0777 -s pmsys -w 3600 -z 3600 -f 86400 -t 16
      -p /opt/pmsys/log/rrdcached.pid -j /opt/pmsys/log/journal.log
      -l 192.168.250.29
      -l 192.168.250.29:42218 -l 192.168.250.29:42219 -l 192.168.250.29:42220
      -l 192.168.250.29:42221 -l 192.168.250.29:42222 -l 192.168.250.29:42223
      -l 192.168.250.29:42224 -l 192.168.250.29:42225 -l 192.168.250.29:42226
      -l 192.168.250.29:42227 -l 192.168.250.29:42228 -l 192.168.250.29:42229
      -l 192.168.250.29:42230
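
For reference, here is a small sketch of what the timing options in this configuration imply, going by the rrdcached manual page: -w is the time after which cached values are queued for writing, -z adds a random per-file delay of up to that many seconds, and -f is the interval at which the whole cache is scanned for old values. The figure of 250k files is the one this daemon handles (mentioned further down). The even spread of files over the write interval and the function names are assumptions made for the example.

```python
# Rough timing implied by "-w 3600 -z 3600 -f 86400", as described in the
# rrdcached manual page. Assumption: files are spread evenly over the write
# interval and the write queue never backs up.

WRITE_TIMEOUT = 3600    # -w: seconds before cached values are queued for writing
WRITE_DELAY = 3600      # -z: random extra delay per file, in [0, delay)
FLUSH_INTERVAL = 86400  # -f: how often the whole cache is scanned for old values
FILES = 250_000         # files handled by this daemon


def max_nominal_write_gap(write_timeout, write_delay):
    """Longest expected gap between two writes of one file, ignoring queueing."""
    return write_timeout + write_delay


def required_file_writes_per_sec(files, write_timeout):
    """File writes per second needed to write every file once per -w period."""
    return files / write_timeout


if __name__ == "__main__":
    gap = max_nominal_write_gap(WRITE_TIMEOUT, WRITE_DELAY)
    print("nominal max gap between writes:", gap / 3600, "hours")
    print("file writes/s needed with -w 3600:",
          round(required_file_writes_per_sec(FILES, WRITE_TIMEOUT), 1))
    print("file writes/s needed with -w 600:",
          round(required_file_writes_per_sec(FILES, 600), 1))
```

With these values the nominal gap between two writes of the same file is about 2 hours, and the daemon has to complete roughly 69 file writes per second to keep up. With -w 600 the required rate rises to roughly 417 per second.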

After a clean start (picture at
http://www.pixentral.com/show.php?picture=1Q6MMmPLQ4fMKAPUBuKErBxzVG4x91),
this is what happened:

- polling procedures were not always able to send all updates to rrdcached
during its "stabilisation period" and started dying (this was solved by using
direct telnet communication with the daemon + batch + update instead of
RRDs::update, so we no longer experience data polling difficulties)
- only after 24h did rrdcached finally reach its write rate and stabilise at
2 GB of memory
- this left almost 20 million updates permanently pending, which is impossible
to catch up on because updates and writes run at about the same rate
- on restart, rrdcached tries to load the 2.5 GB journal log, which never
really finishes (the daemon is unresponsive: it is impossible to telnet to it,
it accepts no updates, the journal is not updated, and process memory is around
30 MB after an hour), so I usually end up deleting the journal and restarting
the daemon, thus losing more than 24h of data for 250k files (plus the 2h spent
hoping it would recover)
- with a different configuration (-w 600 -z 300 -t 16) the situation was not
any better
- CPU is poorly utilised, there is no swapping, and disk I/O is not considered
a bottleneck at this point
- one more very important observation: since the rrdcached daemons write to the
same shared filesystem, when one daemon is down or being restarted the others
increase their write rate and then gradually fall back to their update
rate, while the write rate of the restarted one increases (which leads us to
believe that QFS might also be involved, and that rrdcached might be aware
that it has too many pending updates)
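
The direct batch update mentioned in the first item can be sketched as follows. This is a minimal illustration of the rrdcached text protocol (BATCH, one UPDATE per line, a single dot to finish), not the pollers' actual code. The function names, file names and values are made up for the example, and the reply handling assumes the documented "<n> errors" status line.

```python
import socket


def build_batch(updates):
    """Build one BATCH request from (rrd_file, [values]) pairs.

    Each value is a "timestamp:value[:value...]" string. The daemon only
    accepts absolute timestamps, so "N:..." must not be used here.
    """
    lines = ["BATCH"]
    for rrd_file, values in updates:
        lines.append("UPDATE {} {}".format(rrd_file, " ".join(values)))
    lines.append(".")  # a lone dot ends the batch
    return "\n".join(lines) + "\n"


def parse_status(line):
    """Split a status line such as '2 errors' into (code, message)."""
    code, _, message = line.strip().partition(" ")
    return int(code), message


def send_batch(sock_path, updates, timeout=10.0):
    """Send a batch over the daemon's UNIX socket; return the error lines."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        sock.connect(sock_path)
        sock.sendall(build_batch(updates).encode("ascii"))
        with sock.makefile("r", encoding="ascii") as reader:
            reader.readline()  # reply to BATCH itself ("0 Go ahead. ...")
            errors, _ = parse_status(reader.readline())  # "<n> errors"
            # Each error line names the failing command by its position.
            return [reader.readline().rstrip("\n") for _ in range(errors)]


if __name__ == "__main__":
    # Hypothetical file names and values, only to show the wire format.
    print(build_batch([
        ("node1.rrd", ["1420024853:12:34"]),
        ("node2.rrd", ["1420024853:56", "1420025153:57"]),
    ]))
```

Sending many updates in one batch saves a round trip per update, which is presumably why it copes better with a slow daemon than one RRDs::update call per file. A call such as send_batch("/opt/pmsys/log/rrdcache.sock", updates) would use the UNIX socket from the configuration above.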

So what I need are some clues on how to explain this behavior, so that we can
troubleshoot these issues further and determine whether rrdcached is working as
expected in these circumstances. That would allow us to redirect our attention
to other components (QFS).
Any hints would be highly appreciated. I can also provide any additional
information.
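
Update and write rates like the ones discussed above can be derived from the counters returned by the daemon's STATS command. The sketch below only parses a STATS reply and turns two samples into per-second rates. The function names are made up, and the sample numbers in the usage part are invented purely to show the format.

```python
def parse_stats(reply):
    """Parse a STATS reply: '<n> Statistics follow' plus n 'Name: value' lines."""
    lines = reply.strip().splitlines()
    count = int(lines[0].split(" ", 1)[0])
    stats = {}
    for line in lines[1:1 + count]:
        name, _, value = line.partition(":")
        stats[name.strip()] = int(value)
    return stats


def rates(before, after, seconds):
    """Per-second change of every value present in both samples.

    Meaningful for counters such as UpdatesReceived or UpdatesWritten;
    QueueLength is a gauge, so its "rate" only shows whether the queue
    grew or shrank between the two samples.
    """
    return {name: (after[name] - before[name]) / seconds
            for name in after if name in before}


if __name__ == "__main__":
    # Invented sample replies, 60 seconds apart.
    first = parse_stats("3 Statistics follow\nQueueLength: 10\n"
                        "UpdatesReceived: 1000\nUpdatesWritten: 400\n")
    second = parse_stats("3 Statistics follow\nQueueLength: 40\n"
                         "UpdatesReceived: 1600\nUpdatesWritten: 700\n")
    print(rates(first, second, 60))
```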

Gaby
