[rrd-developers] rrdcached performance with >200k nodes

Wed Jan 13 16:52:54 CET 2010

> >> Hello list,
> >>
> >> we've probably reached rrdcached limits in our monitoring system
> >>
> >> We had a very nicely running rrdcached while collecting from about 400 hosts,
> >> about 100k nodes (RRD files).
> >>
> >> We've bumped the number of hosts to about 2000 for interface
> >> traffic, errors, unicast and multicast packets, using a collector of
> >> our own. It batches the RRD updates using rrdcached's BATCH via unix
> >> socket. This collector is able to walk
> >> all the hosts in less than 5 minutes. The number of nodes is about 200k.
> >>
> >> The rrdcached is configured with -w 3600 -z 3600 -f 7200 -t 8. Everything runs
> >> smoothly until the first timeout. Then the Queue value rises up to the
> >> number of nodes
> >> and stays that high. Write rate is very low, disk IO is almost zero.
> >> CPU load from rrdcached gets very high (100-200%).
> >>
> >> The system is FreeBSD 7.2-p4, amd64 with 16GB RAM, RAID10 disk array.
> >> rrdtool 1.4.2.
> >>
> >> Could it be we've reached rrdcached's limits? What can be done about it?

Hi Mirek,

I'm running a very similar setup to yours: FreeBSD 7/amd64, ~270k nodes, 5
minute interval.  I am using '-w 21600 -z 21600 -f 86400', and my
rrdcached is steady at ~1.5G RSS.

Ideally you would cache at least one full page of writes per RRD file.
So, your ideal "-w" timer would be at least:

	(RRD step interval)*(page size)/(RRD row size).
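
As a rough worked example of the formula above (a sketch, not a
measurement): the 4096-byte page size, the 8 bytes per stored value, and
the function and parameter names are my assumptions for illustration, so
substitute the real numbers for your system and RRD layout.

```python
def min_write_timer(step_seconds, page_bytes=4096, ds_count=1, bytes_per_value=8):
    """Smallest -w value (seconds) that caches one full page of rows per RRD file.

    Implements (RRD step interval) * (page size) / (RRD row size).
    Assumption: one row holds ds_count values of bytes_per_value bytes each.
    """
    row_bytes = ds_count * bytes_per_value
    return step_seconds * page_bytes // row_bytes


# 5-minute step, one data source per file
print(min_write_timer(300))              # 153600
# 5-minute step, two data sources per file (e.g. in/out counters)
print(min_write_timer(300, ds_count=2))  # 76800
```

Under those assumptions a single-DS file with a 5 minute step would want
a "-w" of about 153600 seconds (roughly 42.7 hours), which is why a
multi-hour timer is not unreasonable here.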

I'm guessing at least part of your problem is IO limitations.  As Florian
said, with this workload the disk spends most of its time seeking
rather than writing.  (Try watching "gstat".)

As for the CPU, it's possible we have some problem that only shows up
when there is a large queue.  However, I've never run into this.
We'll have to narrow the problem down a little more.

When it's exhibiting this high CPU problem, does it continue to write to
the journal?  Are there an abnormal number of "FLUSH" or "WROTE" entries
at that time?
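
One quick way to answer the journal question is to tally the entries by
type.  This is only a sketch: it assumes each journal line begins with
the command keyword, and the function name is made up for illustration.
Check a few lines of your own journal first and adjust the parsing if
they look different.

```python
from collections import Counter


def count_journal_commands(lines):
    """Tally journal entries by their first word, case-insensitively.

    Assumption: every non-empty line starts with the command keyword.
    """
    counts = Counter()
    for line in lines:
        parts = line.split()
        if parts:
            counts[parts[0].upper()] += 1
    return counts


# Usage (the path is hypothetical; point it at a file in your journal directory):
#   with open("/path/to/journal/file") as fh:
#       print(count_journal_commands(fh).most_common())
```

Comparing the tallies from a quiet period against one taken while the
CPU is pegged should show whether "FLUSH" or "WROTE" entries spike.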

What do you mean by "until the first timeout"?

P.S. I also use these sysctl values, FWIW, YMMV:

vfs.ufs.dirhash_maxmem=16777216	# from 2097152
vfs.hirunningspace=4194304	# from 1048576

-- 
 kevin brintnall =~ /kbrint at rufus.net/