[rrd-developers] patch/tuning for very large RRD systems (was "Re: SQL Backend request")

Thu May 31 01:24:43 CEST 2007

Hi Henrik,

On Thu, May 31, 2007 at 12:04:48AM +0200, Henrik Stoerner wrote:
> On Tue, May 29, 2007 at 05:00:18PM -0500, Dave Plonka wrote:
> > I wrote:
> > > However, the general I/O
> > > activity did not go down - in fact, it increased by about 15-20%,
> > > as measured by vmstat.
> > 
> > With some more detail about what you saw, we could find what's
> > really going on.
> 
> I've been doing some more detailed measurements today. The initial 
> result still holds, but it was kind of obscured by the fact that
> prior to changes in the application, the disk system on the box
> was saturated and couldn't keep up with the load - this of course
> makes comparisons impossible.
> 
> So today I've done some measurements while running a version of the
> application that batches updates, which by itself has kept the
> I/O utilisation well below the system capacity. I've used vmstat
> to track overall CPU utilisation, and iostat to monitor I/O.

I use the sar data collector, in root's cron like this:

   # collect and store system activity to daily files:
   35 * * * *  /usr/lib/sa/sa1 300 12

This gets 5 minute averages - like we're used to ;-)
You can use any minute of the hour to start it.

Then you can examine CPU utilization (incl. I/O wait) like this:

   $ sar -u

And this for the disk:

   $ sar -d

Then you can also use ksar to visualize the measurements easily.
   http://ksar.atomique.net
(Be sure to do "LANG=C sar -A > sar-A_localhost.txt" for ksar input,
 it has a bug unless the LANG is forced.)

> The system used for testing has:
> * 40.000 RRD files with 100.000 datasets. Each has 4 RRA with a
>   5 minute/30 minute/4 hours/24 hours averaging function. This
>   is currently a very stable set of files, new files haven't
>   been added during the tests.
> * Update frequency is once per 5 minutes (as with MRTG)
> * 2 CPUs, 1 GB RAM, 2 72 GB SCSI disks (10K rpm) with hardware RAID-1.
> * Local filesystem formatted with ReiserFS, mounted with 
>   "notail,noatime,nodiratime" options.
> 
> I've tried the following scenarios, all with rrdtool 1.0.50 as the
> base version: 
> 
> 1) Stock rrdtool 1.0.49
> 2) rrdtool 1.0.50 just with the
>       fadvise(fileno(*in_file), 0, 0, POSIX_FADV_RANDOM)
>    in rrd_open.c, i.e. Dave's original patch.
> 3) Same as 2), but with the first "0" changed to "4096",
>    i.e. all access after the first 4 KB is random.
> 4) As 3), but with the limit set at 1 KB instead of 4
> 5) As 2) but patched with the change that is in the SVN version
>    of rrdtool 1.2, where some additional POSIX_FADV_DONTNEED
>    calls have been inserted for rrd_fetch, rrd_create etc.
> 
> All tests ran for at least 1 hour. The system is dedicated for this
> purpose, so nothing else would disturb measurements.

OK, but since you have aggregation RRAs your system can't reach a
steady state until you pass each of your aggregation intervals...
You said you use 5 mins, 30 mins, 4 hours, and 24 hours.  So running
the test across zero hours UTC (or 24 hours) is the best test.
This will be long enough to attempt to bring all the hot pages in
each of your RRAs into buffer-cache.
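
As a rough sketch of that reasoning: assuming each RRA's consolidation
boundaries fall on multiples of its interval counted from the Unix epoch
(UTC), the following estimates when every RRA will have crossed a boundary
at least once. The function name and the interval list are made up for
illustration; they are not part of rrdtool.

```python
import time

# 5 min, 30 min, 4 h and 24 h, as in the setup described above
RRA_INTERVALS_S = [300, 1800, 4 * 3600, 24 * 3600]


def earliest_steady_state(start_epoch, rra_intervals_s):
    """Return the epoch second by which every RRA has crossed a boundary.

    Assumption: boundaries are aligned to multiples of each interval
    since the Unix epoch (UTC).
    """
    boundaries = []
    for interval in rra_intervals_s:
        # First multiple of the interval strictly after the start time.
        next_boundary = (start_epoch // interval + 1) * interval
        boundaries.append(next_boundary)
    # The slowest RRA decides how long the test has to run.
    return max(boundaries)


if __name__ == "__main__":
    start = int(time.time())
    done = earliest_steady_state(start, RRA_INTERVALS_S)
    print("run the test for at least", done - start, "more seconds")
```

With a 24-hour RRA in the list, the result is always the next midnight UTC,
which is why a run across zero hours UTC covers all four intervals.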

With the 4 RRAs you have defined, based on an analytical model we've
developed, I think you'll need only 5 buffer-cache pages available
per file to reduce disk reads (for RRD files) to near zero (<.005
page faults per second average).  That's 781MB (40,000 x 4KB x 5),
which is probably a bit less than you typically have available with
1GB of physical memory (if it's dedicated to this purpose).  This might
lead one to suggest doubling the RAM in your system to gain some
level of confidence that you can maintain acceptable performance.
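
The arithmetic behind that figure, as a small sketch. The 5 pages per file
is taken as an input from the estimate above, not derived here, and the
4 KB page size is an assumption about the system; the function name is
just for illustration.

```python
PAGE_SIZE_BYTES = 4096  # assumed 4 KB pages


def hot_set_mib(num_files, hot_pages_per_file, page_size=PAGE_SIZE_BYTES):
    """Buffer-cache needed to hold the hot pages of all RRD files, in MiB."""
    total_bytes = num_files * hot_pages_per_file * page_size
    return total_bytes / (1024 * 1024)


if __name__ == "__main__":
    needed = hot_set_mib(40_000, 5)
    # Compare against 1 GB of physical memory (1024 MiB).
    print(f"hot set: {needed:.2f} MiB, {needed / 1024:.0%} of 1 GiB")
```

Changing the file count or the pages-per-file figure shows quickly how
little headroom is left once the kernel and the application take their share.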

> The attached image shows the vmstat behaviour. The attached iostat*
> files are the iostat data collected. I've grabbed some rough numbers
> from this:
> 
> Test 1) ran from 00:00 - 11:00, and shows an average CPU utilisation 
> of 2-4% (the peak just after midnight and around 06:40 are daily cron jobs).
> iostat data is in iostat-noadvise.txt, showing a utilisation of 6-8% of 
> capacity, with 10 reads/sec and 60 writes/sec.
> 
> Test 2) ran from 11:00 - 13:00. Avg. CPU is higher, 6-8%.
> iostat-fadvise_randomonly.txt shows 15% utilisation, with 25 reads/sec
> and 60 writes/sec.
> 
> Test 3) ran from 13:00 - 14:00. Avg. CPU is a bit higher than 1), but
> less than 2) - probably 4-7% (I should have run this a bit longer).
> iostat-fadvise_randomPost4K.txt is roughly like 2).
> 
> Test 4) ran from 14:30 - 22:00. Avg. CPU is lower than 2) and 3),
> around 4-5%. iostat-fadvise_randomPost1K.txt shows utilisation of
> 10%, with 20 reads/sec and 50 writes/sec.
> 
> Test 5) ran from 22:30 - 00:00. Avg. CPU is the highest, 12-15%.
> iostat-fadvise_randomanddontneed.txt shows utilisation is about 
> 25%, with 150 reads/sec and 60 writes/sec.

Admittedly, I didn't pore over the tabular data, but from the numbers
you mentioned (only tens of reads/writes per second?) and the graph
attached, your I/O wait CPU utilization is so low that I don't see
why any of this would matter.  What was the original performance
problem you observed and what evidence of it do you have in the
system measurements?

> Based on this, the fadvise() additions do not seem to be universally
> good.

That's not necessarily the thing to conclude.  The performance goal
I'm using is not to dial in a specific CPU or I/O utilization, but
rather to increase throughput and overall performance by minimizing the
update times so we can record more stats in a given amount of realtime.

This is best measured by timing the polling and update phases at
the application level.  Perhaps before the fadvise RANDOM you were
missing measurements (due to blocking in I/O wait) and now, without
that blocking, you're getting more total work done - i.e. more
successful measurements polled and recorded within the step interval,
hence the higher CPU or I/O levels.
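
A minimal sketch of what timing at the application level could look like.
The poll() and update() callables are hypothetical stand-ins for whatever
the collector actually does; the 300-second step matches the setup
discussed here.

```python
import time
from contextlib import contextmanager

STEP_SECONDS = 300  # 5-minute step, as in the setup discussed here


@contextmanager
def timed(phase, results):
    """Record the wall-clock duration of one phase into results[phase]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[phase] = time.perf_counter() - start


def run_cycle(poll, update, step=STEP_SECONDS):
    """Run one poll+update cycle and report how long each phase took.

    poll and update are hypothetical callables supplied by the application.
    """
    results = {}
    with timed("poll", results):
        samples = poll()
    with timed("update", results):
        update(samples)
    results["total"] = results["poll"] + results["update"]
    # If a cycle does not fit in the step, measurements are being lost.
    results["fits_in_step"] = results["total"] < step
    return results
```

Logging these numbers once per cycle, before and after a patch, compares
the thing we actually care about: whether all updates finish within the step.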

> At the very least, one should use it so that read-ahead is done 
> on the first 1K of the RRD file, and the rest is flagged with FADV_RANDOM
> (test 3). But the default rrdtool - with no fadvise() - still has 
> fewer reads, and less cpu+disk utilisation.

There's no point (nor even a way) to have readahead smaller than
a block/page size (e.g. 4KB) - readahead happens in units of pages
or blocks.  Under Linux, an initial read at the beginning of the file
triggers the first readahead (if any) and at least a page will be read
regardless of the size of the read that the application requested.
(I.e. if you read the first byte of the file, a 4KB page will be
filled in the buffer-cache with the first 4KB of the file even if
readahead is zero.)
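
To make the page granularity concrete, here is a small sketch. The helper
names are made up, the 4 KB page size is an assumption, and the fadvise
wrapper only illustrates the call being discussed; it is not the rrdtool
patch itself.

```python
import os


def bytes_actually_read(offset, nbytes, page_size=4096):
    """Bytes brought into the cache for one read, ignoring any readahead."""
    if nbytes <= 0:
        return 0
    first_page = offset // page_size
    last_page = (offset + nbytes - 1) // page_size
    # Whole pages are read, however small the request was.
    return (last_page - first_page + 1) * page_size


def advise_random(fd, header_bytes=0):
    """Ask the kernel to treat the file from header_bytes onward as random.

    Note: the Linux man page for posix_fadvise says POSIX_FADV_RANDOM
    affects the whole open file, not just the given region, so the
    offset may make no difference there.
    """
    if hasattr(os, "posix_fadvise"):  # not available on every platform
        os.posix_fadvise(fd, header_bytes, 0, os.POSIX_FADV_RANDOM)


if __name__ == "__main__":
    for size in (1, 1024, 4096):
        print(size, "byte read ->", bytes_actually_read(0, size), "bytes cached")
```

A 1 KB limit and a 4 KB limit both end up covering the same first page, so
the two settings should not differ in how much of the header gets cached.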

> It is indisputable that the DONTNEED additions in test 5) result in the
> buffer cache occupying much less memory. This could be an advantage, if 
> there are applications running that require lots of memory.

Yes, but the goal isn't to have buffer-cache occupying less memory,
it's to keep the hot pages for RRD files in there.  Indeed buffer-cache
is being used most efficiently when it uses all available memory.
This is why we want to use fincore to see how much of the RRD files
are in cache, rather than just looking at the size of the buffer-cache
overall.

If some application requires lots of memory, the system will reclaim
pages from the buffer-cache (this is a lazy management strategy to
avoid work unless needed), and if we don't unnecessarily have a bunch
of readahead pages in there, it's more likely to properly evict/reclaim
the colder pages.

> I can only speculate about why I see these results which differ somewhat
> from David's. One obvious difference between our systems is the memory
> size; but we do share the fact that our set of RRD files does not fit in
> the buffer cache.

Indeed, that's the same.

> Another - perhaps more important - difference is that David uses SAN
> storage, whereas my system has local disks only. I suspect the SAN
> drivers might do read-ahead on their own and buffer the data, but that
> is just me guessing.

That's why minimizing reads is important... it takes the disk out
of the equation, or at least dedicates it to just writes.  I agree
that your disk might have different performance characteristics,
but is it in I/O wait or not?  If not, it's keeping up with the load.
Regardless of the different disk performance, we benefit equally
from the buffer-cache being used effectively to reduce RRD file block
reads on updates to near zero.

> Anyway, it's been interesting to test - and I've ended up with a much
> better performing system. So I'm happy :-)

Ah yes, "This is only temporary, unless it works." ;-)
(one of my favorite random .signatures)

Dave

-- 
plonka at doit.wisc.edu  http://net.doit.wisc.edu/~plonka/  Madison, WI