[rrd-developers] Re: How to get the most performance when using lots of RRD files

Henrik Stoerner henrik at hswn.dk
Wed Aug 16 13:38:11 MEST 2006


On Wed, Aug 16, 2006 at 07:19:01AM -0400, Richard A Steenbergen wrote:
> On Wed, Aug 16, 2006 at 08:10:09AM +0200, Henrik Stoerner wrote:
> > 
> > However, my main system for this currently has about 20,000 RRD files,
> > all of which are updated every 5 minutes. So that's about 70 updates
> > per second, and I can see that the amount of disk I/O happening on
> > this server will become a performance problem soon, as more systems are
> > added and hence more RRD files need updating.
[snip]
> The situation I was trying to solve involved a constant stream of high 
> resolution data across a large set of records, and relatively infrequent 
> viewing of that data. It sounds like you're trying to do something 
> similar. Honestly if all you care about is databasing it would probably be 
> easier to ditch RRD and use something else or write your own db which is 
> more efficient, but at the end of the day (for me anyways :P) rrdtool does 
> the best job of producing pretty pictures that don't look like they came 
> off of gnuplot or my EKG, and I'm in no mood to become a graphics person 
> and re-invent the wheel.

I would be very sad to drop RRDtool, for those very reasons. It is the
de-facto standard for storing time-series based data on Unix, and there
are so many neat utilities around for working with RRD files.

> So, probably your biggest issue is indeed thrashing the hell out of the 
> disk if you just tried to naively fire off a pile of forks and hope it all 
> works out for the best. [snip]
> Obviously a syscall to exec a shell to run the rrdtool binary every time 
> scales to about nothing, and the API (if you can even call it that, I 
> don't think (argc, argv) counts :P) to rrdtool functions in C really and 
> truly bites. If your application is in C, and you can link directly to the 
> librrd, thats a quick and dirty fix for at least some of the evils.

That is basically what I do.

The fork()/exec() calls have been eliminated, since Hobbit uses a module
which calls directly into the rrdtool library API. So I am calling
the rrd_update() function directly. (Whew - I wouldn't even dare to think
how much more overhead it would add to do the updates via the rrdtool
command-line tool.)
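
For the curious, a direct librrd call looks roughly like the sketch below.
This is only a minimal illustration of the (argc, argv)-style interface,
not the actual Hobbit module code; the helper name and error handling are
my own:

    /* Minimal sketch (not the Hobbit code) of updating an RRD file
     * through librrd instead of exec'ing the rrdtool binary.
     * Assumes rrd.h from RRDtool 1.x and linking with -lrrd. */
    #include <stdio.h>
    #include <unistd.h>     /* optind, opterr */
    #include <rrd.h>

    int update_rrd(const char *filename, const char *values)
    {
        /* librrd entry points take (argc, argv) just like the command
         * line: "update <file> <timestamp:value[:value...]>" */
        char *argv[] = { "update", (char *)filename, (char *)values, NULL };

        optind = 0;         /* librrd parses argv with getopt(), so the
                             * getopt state must be reset between calls */
        opterr = 0;
        rrd_clear_error();

        if (rrd_update(3, argv) != 0) {
            fprintf(stderr, "rrd_update(%s): %s\n",
                    filename, rrd_get_error());
            return -1;
        }
        return 0;
    }

Each incoming datapoint then becomes a single update_rrd("somehost.rrd",
"N:42") call, with no process creation at all.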

> The big daddy of performance suck is then going to be, opening, closing, 
> and seeking the right spot in the files every time.

Exactly.

I can see you've been through many of the same deliberations as I have,
and come to just about the same conclusions. More spindles would help,
but only up to a point. Using RAM disks and keeping a cache of open file
handles is not going to work with the amount of data I have, unfortunately.

Consolidating datapoints into fewer files is a possibility, but at the
cost of making the update code more complex - it is not guaranteed that
all of the data updates will be available simultaneously.
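
To make the complication concrete (my own example - the file name and
data-source names below are made up): with three data sources consolidated
into one RRD, an update that arrives before all values are known has to
mark the missing ones as 'U' (unknown), e.g. via update's --template option:

    /* Illustrative only: one consolidated RRD holding three data
     * sources.  The "mem" value has not arrived yet, so it is passed
     * as 'U' and that sample is effectively lost for that DS. */
    char *argv[] = {
        "update", "/var/lib/rrd/host1.rrd",
        "--template", "cpu:mem:disk",
        "N:12.5:U:87",
        NULL
    };
    rrd_update(5, argv);

So the update code either has to buffer values until a complete set is
available, or accept gaps in some data sources.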

> Or hell you could always just throw more spindles at it or throw a few 
> more $500 linux PCs at it, what do I care. :)

Throwing cheap PCs at the problem is kind of what I was thinking of :-)
I'd like to spread the RRD files across a number of cheap servers,
but in a way that makes it easy to add more servers if it becomes
necessary.
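
Just to sketch one possible placement scheme (nothing I have implemented;
the function below is only an illustration): hashing the host name onto a
server is the obvious approach, but a plain modulo hash moves most hosts
to a different server whenever the server count changes - which is exactly
the "easy to add more servers" problem:

    /* Sketch only: choose which server stores a host's RRD files by
     * hashing the host name (djb2 string hash).  The final
     * "h % nservers" reshuffles most hosts when nservers changes, so a
     * smarter placement scheme would be needed to grow the pool cheaply. */
    unsigned int host_to_server(const char *hostname, unsigned int nservers)
    {
        unsigned int h = 5381;

        for (; *hostname; hostname++)
            h = h * 33 + (unsigned char)*hostname;
        return h % nservers;
    }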

Anyway, thanks for your comments. They reassure me that there isn't some
obvious solution I've missed.


Regards,
Henrik
