[rrd-developers] patch/tuning for very large RRD systems (was "Re: SQL Backend request")

Fri May 25 15:06:38 CEST 2007

Hi Dave 
Could you please outline your RRD structure for 300,000 targets? I assume this is mainly port activity and you have hundreds of ports per switch, or are you referring to some other (more complicated) systems? Also, how many metrics per RRD do you have? How many different templates are you using? What is the physical size of your RRD archive? Do you split your archive between disks, or are all files on the same disks and in a single location? I'm doing a study on an infrastructure with 500,000+ targets for at least four groups of target systems (UNIX, Linux, NT and Network). Because of the infrastructure size I'm planning to use Z/Linux with SuSe 9.0. Any recommendations?

And I have a second group of questions. Because of the large size of your infrastructure (300,000+ is a lot), are you doing any forecasting (aka capacity planning) or any calculations for performance/activity prediction? As an example, based on data collected over the last three months, I would like to produce a capacity planning chart for the next three months. Any idea how I can implement linear/nonlinear regression analysis that ignores extremes and outliers? I can extend this research in ways that would be helpful for the community, but at this time I would like to hear comments on my direction.
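
To make the question concrete, here is a minimal sketch of one outlier-resistant linear fit, the Theil-Sen estimator (median of pairwise slopes). This is only one possible technique, not something RRDtool or MRTG provides; the function names `theil_sen` and `forecast` and the sample data are hypothetical.

```python
from statistics import median


def theil_sen(xs, ys):
    """Fit y = slope * x + intercept using medians, so a few extreme
    samples do not pull the line the way least squares would."""
    slopes = [
        (ys[j] - ys[i]) / (xs[j] - xs[i])
        for i in range(len(xs))
        for j in range(i + 1, len(xs))
        if xs[j] != xs[i]
    ]
    if not slopes:
        raise ValueError("need at least two points with distinct x values")
    slope = median(slopes)
    # The intercept is the median residual once the slope is fixed.
    intercept = median(y - slope * x for x, y in zip(xs, ys))
    return slope, intercept


def forecast(xs, ys, future_xs):
    """Extrapolate the fitted line to future x values (e.g. day numbers)."""
    slope, intercept = theil_sen(xs, ys)
    return [slope * x + intercept for x in future_xs]


# Hypothetical data: a steady trend with one extreme spike on day 5.
days = list(range(10))
usage = [2 * d + 1 for d in days]
usage[5] = 1000
print(theil_sen(days, usage))
print(forecast(days, usage, [12]))
```

The pairwise step grows quadratically with the number of samples, so for three months of five-minute data it would probably make sense to fit daily averages or daily maxima rather than raw samples.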

Thanks
Val Shiro

----- Original Message ----
From: Dave Plonka <plonka at doit.wisc.edu>
To: rrd-developers at lists.oetiker.ch
Cc: Dale Carder <dwcarder at doit.wisc.edu>; Archit Gupta <archit at cs.wisc.edu>
Sent: Wednesday, May 16, 2007 7:43:50 AM
Subject: [rrd-developers] patch/tuning for very large RRD systems (was "Re: SQL Backend request")

Hi Tobi,

This is one of those serendipitous occasions -
I think we can save you perhaps lots of time:

Archit Gupta, Dale Carder, and I have just completed a research project
on RRD performance and scalability.  I believe we may have the largest
single system MRTG w/RRD - over 300,000 targets and RRD files updated
every five minutes.

On Wed, May 16, 2007 at 10:00:12AM +0200, Tobias Oetiker wrote:
> 
> I am still gathering data ... but it seems all to come down to
> 
> * file system cache pollution through other processes
> * and the time you give the system to deal with dirty buffers
> * the block queue size may also play a role.
> * I think we could gain quite a lot by using fadvise in rrdtool to
>   only keep the header portion of the file in cache.
>   will try this later today ...

Over the past month, we've done a posix_fadvise RANDOM patch and
completely evaluated the performance impact.  It's really good
(obviously).  What it does is get both the readahead and page faults
under control, in Linux at least.
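
To illustrate the kind of hint involved, here is a minimal Python sketch of a POSIX_FADV_RANDOM call on an open file.  It is not the patch itself, which applies to rrdtool's C code; the helper names `open_with_random_advice` and `demo` and the scratch file are hypothetical.

```python
import os
import tempfile


def open_with_random_advice(path):
    """Open a file read/write and hint that access will be random,
    which asks the kernel not to read ahead past the pages touched."""
    fd = os.open(path, os.O_RDWR)
    # os.posix_fadvise only exists where the platform provides the call.
    if hasattr(os, "posix_fadvise"):
        try:
            # Offset 0 with length 0 applies the advice to the whole file.
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_RANDOM)
        except OSError:
            # The advice is only a hint, so a refusal is not fatal here.
            pass
    return fd


def demo():
    """Write and read back a few bytes in the middle of a scratch file."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"\0" * 8192)
        path = tmp.name
    fd = open_with_random_advice(path)
    try:
        os.lseek(fd, 4096, os.SEEK_SET)
        os.write(fd, b"sample")
        os.lseek(fd, 4096, os.SEEK_SET)
        return os.read(fd, 6)
    finally:
        os.close(fd)
        os.unlink(path)


print(demo())
```

The hint changes only caching and readahead behaviour; reads and writes through the descriptor return the same data with or without it.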

For others observing performance issues, the pertinent things are:

* What operating system and version are you running?
   (to determine initial file readahead and availability of fadvise syscall)

* What is your hardware, including physical memory?
   (to determine CPU available and max buffer-cache size)

* What is your update interval (rrd step) and RRD file definitions (RRAs)?
  (e.g. typical MRTG, or output of rrdtool info.)

  This determines the page fault characteristics of RRD files
  when buffer-cache is scarce.

* What version of rrdtool?
  (for applying patch)

With that we can determine if it's possible to update a given number
of RRD files, or how to properly (re)size the system.

Dave

-- 
plonka at doit.wisc.edu  http://net.doit.wisc.edu/~plonka/  Madison, WI
