[rrd-developers] patch/tuning for very large RRD systems (was "Re: SQL Backend request")

Fri May 25 17:46:39 CEST 2007

Hi Val,

On Fri, May 25, 2007 at 03:21:22AM -0700, Val Shiro wrote:
>
> Could you please highlight your RRD structure for 300,000 targets? I
> assume this is mainly ports activity and you have hundreds ports per
> switch or you referring to some other systems (more complicated).
> Also, how many metrics per RRD do you have?

Generally, just two (ds0 and ds1, typically inbound and outbound rates)
since 99.9% of our RRD stuff is gathered with MRTG.

Just to be complete: we have some junipoll, and some small custom
collectors running, but those are negligible.

> How many different templates are you using?

Do you mean MRTG cfgmaker templates?  We happen to have 3 (high
capacity interfaces, normal interfaces, and Cisco MSFC interfaces).
This isn't really pertinent to the overall system's performance.
(We only reconfigure once per day.)

> What is physical size of your RRD archive?

We've tested both with standard MRTG files that are 103KB (8 RRAs:
4 AVERAGE, 4 MAX) and with our own production system, in which we
extend RRA 0 (the five-minute averages) to store 1 year or 5 years of
rates, resulting in RRD files of 1.7MB or 8.2MB respectively.  Over 99%
of them are 1.7MB.
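
As a rough sanity check on those sizes, the data area of the extended
RRA can be estimated from the row count.  This is only a sketch: it
assumes each stored value is an 8-byte double, two data sources per
file, and a 300-second step, and it ignores the file header and the
other seven RRAs.  The function name `rra_bytes` is made up for this
illustration.

```python
# Rough estimate of the data area of one RRA.
# Assumptions (for illustration only): 8 bytes per stored value,
# header and remaining RRAs are ignored.
BYTES_PER_VALUE = 8
STEP_SECONDS = 300  # five-minute samples


def rra_bytes(days, ds_count=2, step=STEP_SECONDS):
    """Return the approximate size in bytes of an RRA covering `days`."""
    rows = days * 86400 // step          # one row per step
    return rows * ds_count * BYTES_PER_VALUE


if __name__ == "__main__":
    for label, days in (("1 year", 365), ("5 years", 5 * 365)):
        size = rra_bytes(days)
        print(f"{label}: {size} bytes ({size / 2**20:.2f} MiB)")
```

Under these assumptions the estimate lands in the same range as the
1.7MB and 8.2MB figures above; the remainder would be the header and
the other RRAs.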

The RRD file size is not particularly pertinent to the performance once
unnecessary readahead is suppressed.  Instead, what is pertinent is
the number of hot pages/blocks per RRD file and the fault rate.  Faults
result from page replacements (when buffer-cache is scarce) and from
block boundary crossings within RRAs as time passes.  Both quantities
are determined by the number of RRAs with a given update frequency
and the size of those RRAs.
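
To illustrate the block boundary crossings, here is a back-of-envelope
sketch.  The 4096-byte page size, the 16-byte row (two data sources,
8 bytes each), and the helper names are assumptions for illustration,
not measurements from our system.  It also assumes updates are spread
evenly, so the result is an average, not a worst case.

```python
# How often does an RRA's "current row" move onto a new page?
# Assumed values, for illustration only.
PAGE_BYTES = 4096        # assumed page/block size
ROW_BYTES = 2 * 8        # two data sources, 8-byte value each


def rows_per_page(page_bytes=PAGE_BYTES, row_bytes=ROW_BYTES):
    """Number of RRA rows that fit in one page."""
    return page_bytes // row_bytes


def hours_between_crossings(step_seconds, steps_per_row=1):
    """Average time before one RRA's write position enters a new page."""
    return rows_per_page() * steps_per_row * step_seconds / 3600


def crossings_per_interval(n_files, steps_per_row=1):
    """Average number of files crossing a page boundary in this RRA
    during one step interval, if crossings are spread evenly."""
    return n_files / (rows_per_page() * steps_per_row)


if __name__ == "__main__":
    print(rows_per_page())                 # rows per page
    print(hours_between_crossings(300))    # five-minute RRA
    print(crossings_per_interval(300_000)) # across 300,000 files
```

The point of the sketch is that the per-file fault rate depends on how
fast each RRA advances and on the row size, not on the total length of
the RRA, which is why extending RRA 0 from 1 year to 5 years need not
change the steady-state load much.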

> Are you
> split your archive between disks, or all files on same disks and in
> single location.

They're all on a single file-system of about 750GB in size, on
a logical volume on a RAID-10 of 16 x 2 disks on a fiber-channel
attached SAN (EMC DMX-3).

> I'm doing study on infrastructure with 500,000+
> targets for at least four groups of target systems (UNIX, Linux,
> NT and Network). Because of infrastructure size I'm planning to use
> Z/Linux with SuSe 9.0. Any recommendations?

By targets do you mean MRTG targets, i.e. each has two measurements,
with one RRD file per target?

> And, I have second group of questions. Because of large size of your
> infrastructure (300,000+ is a lot) are you doing any forecasting
> (aka capacity planning), or any calculations for performance/activity
> prediction. As example, based on data that was collected for last
> tree months I would like to produce capacity planning chart for next
> tree months. Any idea how I can implement linear/nonlinear regression
> analysis with ignoring all extremes and outliers. I can extend research
> that will be helpful for community, but at this time would like to
> hear a comments on my direction.

Yes, we can talk separately about some of these ideas...
I'll email you.

Dave

-- 
plonka at doit.wisc.edu  http://net.doit.wisc.edu/~plonka/  Madison, WI