[rrd-users] Sanity checks on Buffer Cache and I/O at peak aggregation times

Derek Haynes derek.haynes at highgroove.com
Wed Sep 28 19:30:42 CEST 2011


Tobi,

> the 'problem' is that a disk block which is only hit once a day
> will most likely not stay in buffer cache, so as rrdtool goes in to
> update the slow RRAs, it will first have to read the block in order
> to then update it ...

Ah - that makes sense. I noticed Ryan suggested 30-minute RRAs, so it
sounds like, at least at that threshold, it's safe to assume those
blocks are still in the buffer cache. Granted, I'm not familiar enough
with buffer cache behavior, so I'll need to test a few things out.
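
As a quick sanity check (just a sketch, assuming the vmtouch utility is
installed; the file path is a placeholder), I'm planning to look at how
much of an RRD file is actually resident in the page cache:

  # report how many of the file's pages are currently in the page cache
  vmtouch -v /var/rrd/example.rrd

Running that shortly before and after an aggregation boundary should
show whether the rarely-touched RRA blocks get evicted in between.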

> RRAs allow you to save diskspace and increase graphing performance,
> as you may not want to read a year's worth of 1-minute intervals to
> draw a yearly graph ...

Makes sense (I've benchmarked this and you can see the difference).
Somewhat related: the number of rows in an RRA didn't appear to affect
graphing performance when the XPORT command needs no consolidation
(e.g. the time required to fetch a 6-hour interval at 5-minute
resolution from a 5-minute RRA was the same across varying row counts).
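
For reference, the query I timed was essentially this, run against
otherwise identical files whose 5-minute RRA had different row counts
(file path and DS name are placeholders):

  # fetch a 6-hour window at the native 5-minute resolution
  rrdtool xport --start end-6h --end now --step 300 \
    DEF:a=/var/rrd/example.rrd:ds0:AVERAGE XPORT:a:"ds0"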

>> >    3) reformat your rrd filesystems from 4K blocks to 1K blocks.
> at the end of the day it is important to also run tests :-)

Definitely - if anyone else has any experience modifying the block
size, I'd be curious to hear how it worked out.
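
If we do end up testing it, the change itself is simple enough (sketch
only; the device name is a placeholder, and mkfs destroys the existing
filesystem, so this is for a fresh volume):

  # check the current block size
  tune2fs -l /dev/sdb1 | grep 'Block size'

  # recreate the filesystem with 1K blocks
  mkfs.ext3 -b 1024 /dev/sdb1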

Thanks all,

Derek

On Tue, Sep 27, 2011 at 11:53 AM, Tobias Oetiker <tobi at oetiker.ch> wrote:
> Hi Derek,
>
> Today Derek Haynes wrote:
>
>> Ryan,
>>
>> Awesome performance gold nuggets in here. I've put a couple of
>> followups below. Once we've got these running in production, I'd be
>> happy to update the "Tuning RRD" wiki page with the new information.
>>
>> > rrdtool is very fast on updates when not having to consolidate large periods
>> > of time (ie : when it's not hitting disk for blocks to do consolidation
>> > with.)
>>
>> Hmm - I'm not sure I grasp why this would be the case. When I call
>> "rrdtool update" on a file that has an RRA with a large number of
>> steps (ex: one day) and then call "rrdtool info", I'll see something
>> like the following:
>>
>> rra[0].cdp_prep[0].value = NEW VALUE
>> rra[0].cdp_prep[0].unknown_datapoints = previous - 1
>>
>> I can't see why this would make the consolidation expensive - it seems
>> like the math involved to consolidate the RRA is basically the same,
>> regardless of the number of PDPs. In fact, because very few updates
>> are actually appending new rows, it seems like this would be more
>> efficient than a short consolidation period that frequently appends
>> rows.
>>
>> I do see why it's expensive to have 2 RRAs consolidating at the same
>> time (i.e. the 5-minute and 1-day RRAs both consolidate at 00:00 UTC).
>
> the 'problem' is that a disk block which is only hit once a day
> will most likely not stay in buffer cache, so as rrdtool goes in to
> update the slow RRAs, it will first have to read the block in order
> to then update it ...
>
>> >    1) your rrd files have too many RRAs, don't consolidate to a day, or 6
>> > hours or 3 hours, or even an hour.
>>
>> Agree 100% with (a) too many RRAs and (b) wasting resources with unused fields.
>
> RRAs allow you to save diskspace and increase graphing performance,
> as you may not want to read a year's worth of 1-minute intervals to
> draw a yearly graph ...
>
>> >    2) don't bother with rrdcached ( it's slow, adds complication, and just
>> > adds an intermediate buffer (with its own IO) which doesn't serve a useful
>> > purpose )
>>
>> Interesting. We'll be trying out the vm-level optimizations.
>
> the trick behind rrdcached is that it is faster to run multiple
> updates on a single rrd file than to run many against different files
> ... so rrdcached first collects the updates and then applies them to
> the rrd file in one go ... this is especially helpful if your
> buffer cache is not sufficient to keep all active blocks in memory
> ... but even if you have enough space, you can save massively on
> disk writes, since rrdcached will write many updates in quick
> succession into an rrd file, causing the block to be written
> to disk once, whereas in a normal setup the block would be written
> out on every update ...
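>
> a minimal sketch of such a setup (socket path, timers and file paths
> are placeholders):
>
>   # collect updates and write them out in batches
>   rrdcached -l unix:/var/run/rrdcached.sock \
>             -w 1800 -z 900 -j /var/lib/rrdcached/journal
>
>   # clients then send their updates to the daemon instead of the file
>   rrdtool update --daemon unix:/var/run/rrdcached.sock \
>     /var/rrd/example.rrd N:123:456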
>
>> > c) as a point of reference, I have dirty buffers set to 2 hours, with polling
>> > interval for devices at 60 seconds and one server (a bit faster than yours)
>>
>> It sounds like you might have a 60-second step and an RRA with step=1.
>> That would append a new row on every update. I'm guessing that setting
>> dirty_expire_centisecs to 2 hours means those appended rows aren't
>> flushed to disk on every 1-minute update, which saves a lot of writes.
>> We're looking at adding a 1-minute RRA as well.
>
> this is how Ryan works around the problem described above; you
> will have to take precautions so that you do not lose the two hours'
> worth of data if the box crashes ...
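>
> for reference, the unit is centiseconds, so the values discussed here
> work out as follows (a sketch; you would normally persist this in
> /etc/sysctl.conf):
>
>   # 30 minutes = 30 * 60 * 100 centiseconds
>   sysctl -w vm.dirty_expire_centisecs=180000
>   # 2 hours = 2 * 3600 * 100 centiseconds
>   sysctl -w vm.dirty_expire_centisecs=720000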
>
>> >    3) reformat your rrd filesystems from 4K blocks to 1K blocks.
>>
>> For your (1) one-DS-per-RRD-file, (2) 2-hour dirty_expire_centisecs
>> setup, you'll need to write close to 1K, and because your block size is
>> 1K, this makes your writes really efficient.
>>
>> After this write, does this require a new page-in to grab the next hot
>> page for writing RRD rows? If the block size was still 4K, would this
>> page-in only be required every 8 hours instead of every 2 hours since
>> you're reading more data into the buffer cache?
>>
>> In the extreme case of our current setup (20 data sources per RRD
>> file), I believe we'd only fit 6 updates into a 1k block. Given my
>> obvious newbie understanding of the buffer cache and file system, I'm
>> worried it might result in a lot more reads.
>
> at the end of the day it is important to also run tests :-)
>
>> >    7) don't ever use rrdtool exec calls, if you've written your own app use
>> > either the perl (RRDs), python, ruby or C bindings
>>
>> It's for our own app, but we're storing the RRD files on a separate
>> server and connecting using RRD Server. I don't think it's possible to
>> use the bindings in this case, as the files aren't accessible via the
>> file system from our app servers (we've run into issues before with
>> NFS).
>
> you can also feed updates into 'rrdtool -'; this is as fast as using
> the bindings and it works very well over the network ...
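>
> something along these lines (a sketch; host name and paths are
> placeholders):
>
>   # a long-lived rrdtool process on the storage box reads update
>   # commands from stdin, here fed over ssh from an app server
>   ssh rrd-server rrdtool - <<'EOF'
>   update /var/rrd/host1.rrd N:123:456
>   update /var/rrd/host2.rrd N:789:12
>   EOF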
>
> cheers
> tobi
>
>
>> Thanks a ton for the feedback Ryan. I think the root of my issues is a
>> lack of understanding of the buffer cache and file system
>> optimizations.
>>
>> Cheers,
>>
>> Derek
>>
>> On Sun, Sep 25, 2011 at 12:16 PM, Ryan Kubica <kubicaryan at yahoo.com> wrote:
>> >
>> > rrdtool is very fast on updates when not having to consolidate large periods
>> > of time (ie : when it's not hitting disk for blocks to do consolidation
>> > with.)
>> > Definition of very fast on a memory bound ( no disk IO ) system like yours
>> > (depending on the model of CPU): ~48 (forty eight) thousand rrd updates per
>> > second sustained, to separate rrd files (1 datasource per file.)
>> > The trick is to ensure that every block is always cached, so your system
>> > never has to read from disk for the updates it must do:
>> >    1) your rrd files have too many RRAs, don't consolidate to a day, or 6
>> > hours or 3 hours, or even an hour.
>> > a) for every consolidation RRA that is another IOP to write
>> > b) for every consolidation rrdtool must calculate it from the PDPs for its
>> > row (that means the PDPs for a whole day for your 1-day RRA)
>> > c) since you have a lot of RRA consolidation definitions, you get an
>> > incremental increase in write IOPs for each additional consolidation period.
>> > for the 1-day consolidation you have IOPs for:
>> >    1  filesystem metadata
>> >    1  rrd header
>> >   18  RRA updates (min/avg/max of each: 5 minute, 10 minute, 1 hour,
>> >       6 hour, 12 hour, 1 day)
>> > in total: 20 write IOPs per rrd - compounded across 20K rrds - 400,000 write
>> > IOPs
>> > d) having unused datasources defined in an rrd is just a waste of
>> > CPU/IO/disk for something that's not being used.
>> > e) having rrdtool do all those consolidations, even with the blocks cached,
>> > is CPU spent on data that's not very useful compared
>> > to the burden it creates on system resources.
>> >    2) don't bother with rrdcached ( it's slow, adds complication, and just
>> > adds an intermediate buffer (with its own IO) which doesn't serve a useful
>> > purpose )
>> > a) let the Linux VM do its job and just set the dirty_buffer tuning to hold
>> > onto dirty blocks for longer than the default (30 seconds);
>> > set it to something like 15 minutes, or 30 minutes
>> > (vm.dirty_expire_centisecs), which means Linux will only send the
>> > block to disk after it has been dirty for 30 minutes ( and in your case of
>> > 5-minute updates that's 6 rrd updates per disk IO )
>> > b) the *only* downside to large dirty buffers is that if your server crashes
>> > you lose the data in the buffers, but if your server
>> > crashes then you lose data anyway, because it's likely your server isn't
>> > replicated and is probably a single point of failure
>> > anyway.  and hardware failures, power failures, network outages, etc. will
>> > keep your server down and losing data during
>> > the outage regardless.
>> > c) as a point of reference, I have dirty buffers set to 2 hours, with polling
>> > interval for devices at 60 seconds and one server (a bit faster than yours)
>> > can store just a bit over 4 million datasources per minute.  Obviously you
>> > need the disk space for that, but the CPU/memory
>> > will have no problem with this and just a couple decent sized SSDs can keep
>> > up with the meager IOPs/size this entails.
>> >    3) reformat your rrd filesystems from 4K blocks to 1K blocks.
>> > a) 1K blocks hold 128 intervals (so for 5-minute intervals that's ~10 hours
>> > of data in 1 block)
>> > b) that saves a lot of memory in the IO cache (4x)
>> > c) the 1K block is also why I use 2-hour dirty buffers, since 2 hours of
>> > 60-second steps just barely fits into 1 block (120 updates),
>> > so the server is literally writing to disk 120 times less often per
>> > datasource.
>> >    4) ext3, noatime, journal=writeback
>> > a) just do this: the ext3 journal default is 'ordered', and switching to
>> > writeback matters a little less on your SSD (it's a huge difference on a
>> > real drive), but there's little reason not to do it either way; noatime is
>> > obvious (see the fstab sketch after this list).
>> > b) xfs is also a very good filesystem, though on an SSD it's debatable
>> > whether it matters.
>> >    5) don't bother with min/max on consolidated RRA's just buy more disk and
>> > store at 5 minute for 6 months and 30 minute AVERAGE for
>> > 2 years.  More RRA's mean more CPU, more IO, more of everything that will
>> > slow rrdtool down.
>> > a) rrdtool is -crazy fast- at real time consolidation ... there's little to
>> > no point in RRA consolidation unless trying to save on disk space.
>> > that's about ~206K to read for a full 2 years at 30 minutes (and almost no one
>> > is going to look at these anyway, most users will be
>> > looking at hour, day, few day, maybe a week, even less maybe a month)
>> >    6) worth mentioning again: rrdtool is crazy fast
>> > a) in fact it's so fast that on a multi-cpu server rrd updates are limited
>> > by the Linux VM big kernel lock
>> > b) I haven't yet tested a newer kernel, I'm still using RHEL5.
>> >    7) don't ever use rrdtool exec calls, if you've written your own app use
>> > either the perl (RRDs), python, ruby or C bindings
>> >
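>> > For (4), in fstab terms that's roughly (device and mount point are
>> > placeholders):
>> >
>> >   /dev/sdb1  /var/rrd  ext3  noatime,data=writeback  0  2
>> >
>> >   # or make writeback the default journal mode on the filesystem itself
>> >   tune2fs -o journal_data_writeback /dev/sdb1
>> >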
>> > Caveat to some of this is it does depend on what monitoring application you
>> > are using, Cacti, Ganglia, etc ... they aren't all very efficient in their
>> > use of rrdtool so the 'updates per second' are likely lower with most of
>> > them, but in all cases the above will make those applications much more
>> > efficient/performant - and you don't have to modify those apps to get the
>> > gains (except (7) - which I don't know how many of them use exec calls
>> > anymore.)
>> > What you definitely want to do is make sure that your disk read IO goes away
>> > entirely or almost entirely; having it do reads for user requests only, not
>> > updates.
>> > Regards,
>> > -Ryan
>> >
>> > ________________________________
>> > From: Derek Haynes <derek.haynes at gmail.com>
>> > To: rrd-users at lists.oetiker.ch
>> > Sent: Sunday, September 18, 2011 10:01 AM
>> > Subject: [rrd-users] Sanity checks on Buffer Cache and I/O at peak
>> > aggregation times
>> >
>> > Hi all,
>> >
>> > I'm doing some tuning work on our rrdtool setup: we'd like to free up a
>> > bit more disk I/O headroom. I have two questions/assumptions I'd love
>> > to get a sanity check on.
>> >
>> > First, some background on our setup:
>> >
>> > * Hardware
>> > ** 16 2.27 GHZ CPUs
>> > ** 16 GB of RAM
>> > ** Dedicated to RRDTool - no other services are installed on this server.
>> > ** SSD Drive
>> >
>> > * RRDTool Config
>> > ** RRDtool 1.4.5
>> > ** 20k RRD Files, each 539K in size
>> > ** 20 data sources per-file
>> > ** 18 RRAs per file (Min/Max/Average of each: 5 minutes for 6 hours,
>> > 10 minutes for 12 hours, 1 hour for 3 days, 6 hours for one week, 12
>> > hours for 2 weeks, 1 day for 2 years)
>> >
>> > * Performance
>> > ** Typically, almost no reads (showing buffer cache is working)
>> > ** 315 writes/sec | 4 MB/sec
>> >
>> > Second, what we're planning:
>> >
>> > I've load tested several different rrdtool file configurations and
>> > what I saw aligned with the behavior in this paper by David Plonka:
>> > http://www.usenix.org/event/lisa07/tech/full_papers/plonka/plonka_html/.
>> > If I'm summarizing the paper correctly, the two RRD file configuration
>> > parameters that impact I/O activity are (1) the number of data sources and
>> > (2) the number of RRAs. The number of rows in an RRA does not impact
>> > performance.
>> >
>> > * Reducing data sources per-file to 5: Most of our fields are reserved
>> > for later use (since you can't add a new data source to an existing
>> > file via an rrdtool command). Instead, we're going to start with 5 and
>> > manually add new data sources via rrdtool dump / restore. It's rare to
>> > add new data sources to a file.
>> > * Reducing the number of RRAs to 7 (adding LAST of one-minute data for
>> > 60 minutes, extending the 5-minute RRAs to 1 week, and dropping all other
>> > RRAs except daily); a rough create sketch follows below.
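>> >
>> > A rough sketch of such a create line (step, DS names, heartbeats and
>> > row counts are illustrative only):
>> >
>> >   rrdtool create example.rrd --step 60 \
>> >     DS:ds0:GAUGE:120:U:U DS:ds1:GAUGE:120:U:U DS:ds2:GAUGE:120:U:U \
>> >     DS:ds3:GAUGE:120:U:U DS:ds4:GAUGE:120:U:U \
>> >     RRA:LAST:0.5:1:60 \
>> >     RRA:AVERAGE:0.5:5:2016 RRA:MIN:0.5:5:2016 RRA:MAX:0.5:5:2016 \
>> >     RRA:AVERAGE:0.5:1440:730 RRA:MIN:0.5:1440:730 RRA:MAX:0.5:1440:730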
>> >
>> > When needing to graph data over a several day period, I'm planning on
>> > using the --step option with a 60 minute or greater value to limit the
>> > number of data points returned. I haven't tested this yet, but I'm
>> > assuming reading the many more rows of 5-minute storage may be the
>> > primary performance penalty of the new RRD setup.
>> >
>> > And finally, the RRD performance pieces I'm having trouble understanding:
>> >
>> > Memory Sizing ---
>> >
>> > The "Tuning RRD" wiki page states: "To keep the header (4k) and the
>> > active block of at least one RRA (4k) in memory, you need 8k per RRD
>> > file." For our current configuration, we have 20 data sources and 18
>> > RRAs.
>> >
>> > From the illustration on the page, it looks like our DS Header storage
>> > space would be:
>> > 4k per data source * 20 data sources = 80k
>> >
>> > Is my maths legit?
>> >
>> > For the RRA Header, is it (1) one per-RRA independent of the number of
>> > data sources or (2) one per-RRA * the number of data sources?
>> >
>> > If one-per RRA independent of data sources:
>> > 18 RRAs * 4k = 72k
>> >
>> > If one-per RRA for each data source:
>> > 18 RRA * 20 data sources = 360k
>> >
>> > Assuming my data source math checks out, this means either 152k or
>> > 440k must be stored in the buffer cache for the entire header. free -m
>> > shows that we have 5,119 MB cached: with nearly 20k rrd files, it
>> > looks like the header contains one RRA for each data source. Does this
>> > check out?
>> >
>> > High I/O Wait during peak aggregation periods ---
>> >
>> > At 00:00 UTC, we see our peak I/O Wait. At this time, reads increase
>> > to 11 reads/sec (typically we have almost no read activity) and writes
>> > increase to 700 writes/sec (typically around 315 writes/sec). One
>> > observation: even though our read throughput is far less than our
>> > write throughput, this increase in reads seems to have a dramatic
>> > impact on I/O Wait.
>> >
>> > I can see why there is an increase in writes: currently, we have 18
>> > RRAs across 20 data sources that need to append data to the file at
>> > this time. I'm a bit confused why the read activity increases
>> > dramatically:
>> >
>> > * We have 16 GB of RAM
>> > * It looks like < 5 GB of that is used by the buffer cache
>> >
>> > My low-level Linux fu is poor: can anyone give some insight into the
>> > read activity at these peak aggregation times?
>> >
>> > RRDTool is an amazing piece of work: it's a testament to how well it
>> > works that we haven't needed to investigate performance significantly
>> > in the 3+ years we've been using it.
>> >
>> > Cheers,
>> >
>> > Derek
>> >
>> > --
>> > Derek Haynes
>> > Scout Web Monitoring and Reporting ~ http://scoutapp.com
>> > Blog ~ http://blog.scoutapp.com
>> > 415.871.7979
>> >
>> > _______________________________________________
>> > rrd-users mailing list
>> > rrd-users at lists.oetiker.ch
>> > https://lists.oetiker.ch/cgi-bin/listinfo/rrd-users
>> >
>> >
>> >
>>
>>
>>
>>
>
> --
> Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland
> http://it.oetiker.ch tobi at oetiker.ch ++41 62 775 9902 / sb: -9900



-- 
Derek Haynes
Scout Web Monitoring and Reporting ~ http://scoutapp.com
Blog ~ http://blog.scoutapp.com
415.871.7979


