[rrd-developers] patch/tuning for very large RRD systems (was "Re: SQL Backend request")

Dave Plonka plonka at doit.wisc.edu
Thu May 24 20:43:56 CEST 2007


Hi Mark,

On Thu, May 24, 2007 at 12:51:11PM -0400, Mark Plaksin wrote:
> Dave Plonka <plonka at doit.wisc.edu> writes:
> > Archit Gupta, Dale Carder, and I have just completed a research project
> > on RRD performance and scalability.  I believe we may have the largest
> > single system MRTG w/RRD - over 300,000 targets and RRD files updated
> > every five minutes.
> 
> Wow!  Would you describe the hardware you are running that on?  CPU,
> RAM, disk, and anything else you think is relevant?

         Component: Characteristics
  ----------------------------------------------
        Processors: 8 x Intel Xeon @ 2.7GHz
   Processor Cache: 2 MB
            Memory: 16 GB
              Disk: SAN, RAID-10, 16 x 2 disks
  Operating System: Linux 2.6.9
       File System: ext3 and ext2, 4KB blocksize
     I/O Scheduler: Deadline

I've attached a list with our other configuration recommendations.

While our system is certainly generously sized, it is a 3-year-old
machine.  Note, however, that such a configuration couldn't even handle
100K RRD files without the patch that uses fadvise RANDOM to suppress
readahead.  I believe any post-2.6.5 Linux has the posix_fadvise
behavior that the patch leverages.
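
For anyone who wants to experiment before picking up the patched
rrdtool, the essence of the patch is just a posix_fadvise() call on the
RRD file descriptor before the file is accessed.  A minimal sketch of
the technique (an illustration only, not the patch itself):

/* fadv_random.c - sketch: advise the kernel that access to an RRD file
 * will be random, so that it suppresses readahead. */
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int fd, rc;

    if (argc != 2) {
        fprintf(stderr, "usage: %s file.rrd\n", argv[0]);
        return 1;
    }
    fd = open(argv[1], O_RDWR);
    if (fd < 0) {
        perror(argv[1]);
        return 1;
    }
    /* offset 0 and length 0 mean "the whole file" */
    rc = posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
    if (rc != 0)
        fprintf(stderr, "posix_fadvise: %s\n", strerror(rc));

    /* ... read the header and update only the RRA rows needed,
     * without paging in neighboring blocks ... */

    close(fd);
    return 0;
}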

Also, Tobi has integrated the patch in the code he's testing.

> We have about 45k RRDs and our testing so far says the fadvise changes
> are very nice--thanks!  We're also testing local disk (via cciss driver)
> vs SAN storage.  Our current RRD server is pretty crushed io-wise.  So
> far the SAN storage looks like a big win too.

You should be able to use sar to determine that your reads (for
rrdtool) are much lower than your writes and that the CPU is not
spending too much time in the I/O wait state.  These are good
indications that (a) unnecessary readahead has been suppressed and
(b) the buffer-cache is being used effectively.
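
For example, something along these lines (field names vary a bit
between sysstat versions, and the interval/count arguments are just
illustrative):

   sar -b 300 12    # compare bread/s (reads) against bwrtn/s (writes)
   sar -u 300 12    # watch %iowait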

I've also released a command called fincore ("File IN CORE") that you
can use to examine the buffer-cache to determine that the RRD files
(or any files) are cached as expected:

   http://net.doit.wisc.edu/~plonka/fincore/
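
For the curious, checking whether a file's pages are resident boils
down to mapping the file and asking mincore(2) which of its pages are
in core.  A rough C sketch of that idea (not fincore itself, just an
illustration of the mechanism):

/* incore.c - report how many pages of a file are resident in the
 * buffer-cache, via mmap(2) + mincore(2).  Illustrative only. */
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int fd;
    struct stat st;
    long pagesize = sysconf(_SC_PAGESIZE);
    size_t pages, i, resident = 0;
    unsigned char *vec;
    void *map;

    if (argc != 2) {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        return 1;
    }
    fd = open(argv[1], O_RDONLY);
    if (fd < 0 || fstat(fd, &st) != 0) {
        perror(argv[1]);
        return 1;
    }
    if (st.st_size == 0) {
        printf("%s: 0 of 0 pages in core\n", argv[1]);
        return 0;
    }
    pages = (st.st_size + pagesize - 1) / pagesize;

    /* mapping the file does not by itself fault any pages in ... */
    map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    /* ... so mincore() reports what the buffer-cache already holds */
    vec = malloc(pages);
    if (vec == NULL || mincore(map, st.st_size, vec) != 0) {
        perror("mincore");
        return 1;
    }
    for (i = 0; i < pages; i++)
        if (vec[i] & 1)
            resident++;

    printf("%s: %lu of %lu pages in core\n",
           argv[1], (unsigned long)resident, (unsigned long)pages);
    return 0;
}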

Dave

-- 
plonka at doit.wisc.edu  http://net.doit.wisc.edu/~plonka/  Madison, WI
-------------- next part --------------
            Performance Recommendations for RRD and MRTG Systems

 * When building a very large RRD measurement system, dedicate the
   machine to this purpose.  Since RRD is a file-based database,
   it relies on the buffer-cache that is shared across all system
   activity.  Because of RRD's unique file-access characteristics
   and buffering requirements, it is easier to achieve performance
   gains by tuning the system just for RRD.

 * Use an RRDTool that has our fadvise RANDOM patch.  On systems
   that have a fairly aggressive initial readahead (such as Linux),
   this will very likely increase file update performance by reducing
   the page fault rate and the buffer-cache memory required.

 * Avoid file-level backups of RRD files unless the set of RRD files
   fits completely into buffer-cache memory.  File-level backups read
   each modified file completely and sequentially; this can fill the
   buffer-cache and subsequently cause more page faults on RRD updates.
   To the kernel, backup reads are indistinguishable from application
   access, and thus unnecessarily populate the system's buffer-cache
   with content that won't be re-used soon.  (Note that backup programs
   could call fadvise NOREUSE or fadvise DONTNEED to inform the
   operating system that the file content will not be re-used; a sketch
   of such a cache-friendly reader follows this list.)

 * Split MRTG targets into a number of groups and run a separate
   daemon for each.  In our system, we reconfigure daily and run a
   target_splitter script to produce a new set of ``.cfg'' files,
   each with approximately 10,000 targets, one file per MRTG daemon.
   Note that polling performance is also influenced by the SNMP agent
   performance on the network device polled.  So, if the splitting
   results in grouping like targets together based on the model of
   device monitored, there could be quite a disparity in the time to
   complete the MRTG ``poll targets'' phase.

 * Do not create RRD files all at once.  Since each RRA row is only
   8 bytes per data source, a given RRA crosses a file-system block
   boundary only once every few hundred updates.  By staggering the
   file creation (and thus start) times, like RRAs in different files
   will cross block boundaries at different times, distributing the
   page faults that occur on those crossings.  As a network is deployed
   and grows, these RRD file start times would naturally be staggered,
   but this could be quite different when introducing measurement to
   an existing deployed network.

 * Run a caching resolver or a nameserver on the localhost, i.e. the
   MRTG system itself.  This reduces ``poll targets'' latency
   due to host name resolution;  MRTG performs very many DNS name
   resolutions when hostnames are used (rather than IP addresses)
   in target definitions.

 * Configure an appropriate number of forks for each MRTG daemon to
   minimize the time for the ``poll targets'' phase.  On our system,
   4 forks per daemon work well to keep polling in the tens of seconds
   for 10,000 targets.  This might differ for a wide-area network.
   (An example configuration line follows this list.)

 * Place RRD files in a file-system of their own, ideally one
   associated with separate logical volumes or disks.  This gives the
   system administrator flexibility to change mount options or other
   file-system options.  It also isolates the system activity data
   (e.g. as displayed by sar) from unrelated activity.

 * Consider mounting the file-system that contains the RRD files
   with the ``noatime'' and ``nodiratime'' options so that RRD file
   reads do not require an update to the file inode block.  Of course
   the effect of this is that file access times will be inaccurate,
   but often these are not of interest for ``.rrd'' files.  (An example
   fstab entry follows this list.)

 * Consider enabling dir_index on ext file-systems to speed up lookups
   in large directories.  MRTG places all RRD files in the same
   directory, and we've scaled that to hundreds of thousands of files.
   (Example commands follow this list.)
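
The cache-friendly reader mentioned under the backup recommendation:
a minimal sketch of the idea (illustrative only), reading a file
sequentially as a backup would and then dropping the cached pages:

/* drop_cache_read.c - sketch: read a file sequentially, then tell the
 * kernel the pages won't be re-used so they can be evicted rather than
 * displacing the RRD working set. */
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    char buf[65536];
    ssize_t n;
    int fd, rc;

    if (argc != 2) {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        return 1;
    }
    fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
        perror(argv[1]);
        return 1;
    }
    /* read the whole file (a real backup would write it out somewhere) */
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        ;

    /* ask the kernel to drop the (clean) pages we just pulled in */
    rc = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    if (rc != 0)
        fprintf(stderr, "posix_fadvise: %s\n", strerror(rc));

    close(fd);
    return 0;
}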
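
The forks setting mentioned above is the ``Forks'' keyword in the MRTG
global configuration (check the mrtg-reference documentation for your
version), e.g.:

   Forks: 4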
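
The ``noatime''/``nodiratime'' mount options mentioned above can be set
in /etc/fstab; the entry would look something like this (the device and
mount point here are only placeholders):

   /dev/vg0/rrd   /data/rrd   ext3   defaults,noatime,nodiratime   0 2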
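
The dir_index feature mentioned above can be enabled on an existing
ext2/ext3 file-system with tune2fs, and existing directories re-indexed
with e2fsck while the file-system is unmounted (again, the device name
is only a placeholder):

   tune2fs -O dir_index /dev/vg0/rrd
   e2fsck -fD /dev/vg0/rrd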

