[rrd-users] Scaling rrd tables for best performance

Thu Dec 6 04:36:31 CET 2007

Hi there...

I'm about to start using rrd to measure several aspects of a few hundred 
servers. As an example, every server will have its CPU (idle, kernel, 
user, iowait, etc.), its memory (used real, buffers, cache, unused real, 
swap, total real), its partition space (two DSes per partition: used and 
free) and a few other interesting values monitored. All told, a single 
server will have 25 to 30 different DSes.

I can have each server's information stored in a single "server.rrd" file, like:

server1.rrd, server2.rrd, etc...

Or have it split among several rrd files for the same server, like:

server1_cpu.rrd, server1_mem.rrd, server1_network.rrd, 
server1_generic.rrd, server2_cpu.rrd, etc...
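
To make the two layouts concrete, here is a rough Python sketch that only builds the "rrdtool create" command lines for each option (it does not run them). The DS names, the 300-second step, the 600-second heartbeat, the GAUGE type and the single RRA are all assumptions picked for illustration, and only the cpu and mem groups are shown to keep it short.

```python
# Hypothetical DS groups; the real list would have 25 to 30 DSes per server.
GROUPS = {
    "cpu": ["idle", "kernel", "user", "iowait"],
    "mem": ["used", "buffers", "cache", "unused", "swap", "total"],
}
STEP = 300                     # assumed sampling interval, in seconds
HEARTBEAT = 600                # assumed heartbeat, in seconds
RRA = "RRA:AVERAGE:0.5:1:288"  # assumed archive: one day of 5-minute averages


def ds_spec(name):
    """Build one DS definition (GAUGE is an assumption about the data)."""
    return f"DS:{name}:GAUGE:{HEARTBEAT}:0:U"


def create_cmd(filename, ds_names):
    """Return the argument list for one 'rrdtool create' call."""
    specs = [ds_spec(n) for n in ds_names]
    return ["rrdtool", "create", filename, "--step", str(STEP)] + specs + [RRA]


def single_file_layout(server):
    """Option 1: every DS of a server lives in server.rrd."""
    names = [f"{group}_{ds}" for group, dss in GROUPS.items() for ds in dss]
    return [create_cmd(f"{server}.rrd", names)]


def split_file_layout(server):
    """Option 2: one file per aspect, e.g. server_cpu.rrd, server_mem.rrd."""
    return [create_cmd(f"{server}_{group}.rrd", dss)
            for group, dss in GROUPS.items()]


if __name__ == "__main__":
    for cmd in single_file_layout("server1") + split_file_layout("server1"):
        print(" ".join(cmd))
```

With the first option, one update call per step carries all the values of a server; with the second, each aspect needs its own update call against its own file.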

I'm going to start with around 500-600 servers, but I expect that to 
grow to a few thousand over the next year, and I would like to have 
things scaled for that growth.
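
As a back-of-the-envelope estimate of what that growth means for each layout, the sketch below counts the rrd files and the average number of file updates per second. The 300-second step, the figure of 3000 servers for "a few thousand" and the four files per server in the split layout (cpu, mem, network, generic) are assumptions for illustration only.

```python
def update_load(servers, files_per_server, step_seconds):
    """Return (total rrd files, average file updates per second)."""
    files = servers * files_per_server
    return files, files / step_seconds


if __name__ == "__main__":
    STEP = 300  # assumed step, in seconds
    for servers in (600, 3000):
        for label, files_per_server in (("single file", 1), ("split files", 4)):
            files, rate = update_load(servers, files_per_server, STEP)
            print(f"{servers} servers, {label}: "
                  f"{files} files, {rate:.1f} updates/s")
```

This only counts update calls; it says nothing about how much disk IO each call costs, which is exactly what I'm asking about below.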

I have read (I don't remember where, and the source may not be reliable) 
that rrdtool caches information in memory to speed up the real-time 
calculations, but I don't yet understand how that could work across two 
different measurements, since the "rrdupdate" process is not a daemon 
that stays loaded in memory all the time; it is called at every step. My 
impression is that on every call, rrdupdate copies the data from disk to 
memory, does all the calculations and then flushes the data back to disk 
(causing some disk activity, but that's understandable and desirable if 
you want to make sure that as little data as possible is lost in case of 
a system crash).

Everything I'm saying is pretty much a guess based on the documentation 
(which doesn't go that far into the rrdtool internals, for obvious reasons).

So, my main questions are:

How does rrdtool handle the data IO operations between disk and memory? 
Is my understanding close to how things work, or am I completely wrong 
about it?

Also, which of those two options would give me the best performance for 
real-time monitoring? Having multiple rrd files, one for each aspect of 
each host (that is, DSes from the same server in different files), or 
having a single rrd file with all aspects of a given host (that is, all 
DSes in the same file)?

Regards,
Eduardo M. Bragatto.