[rrd-developers] rrdcached contention when flushing

Tue Nov 4 17:57:27 CET 2008

> > They all become un-stuck at the same time, maybe 20 seconds later,
> > and then the graphs appear very quickly.
> 
> The FLUSH commands are waiting to be notified that the file 
> has been written out to disk.  They block on 
> pthread_cond_wait() and don't return until the queue thread 
> has written the file out to disk.
> 
> What is happening on your system at that time?  Are there 
> other events which may slow down the I/O?
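
For reference, the blocking behaviour described in the quoted text can be sketched roughly as below. This is only an illustration of the pthread_cond_wait() pattern, not the rrd_daemon.c code: cache_item_t, on_disk, queue_thread(), wait_for_flush() and demo_flush() are all hypothetical names, and the slow disk write is left as a comment.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for a cache entry; NOT the real rrdcached struct. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  flushed;  /* signalled once the file is on disk */
    bool            on_disk;  /* predicate protected by 'lock' */
} cache_item_t;

/* Queue thread: writes the file out, then wakes every waiter. */
static void *queue_thread(void *arg)
{
    cache_item_t *ci = arg;

    /* ... the (possibly slow) write to disk would happen here ... */

    pthread_mutex_lock(&ci->lock);
    ci->on_disk = true;
    pthread_cond_broadcast(&ci->flushed);
    pthread_mutex_unlock(&ci->lock);
    return NULL;
}

/* FLUSH handler: does not return until the queue thread has reported. */
static void wait_for_flush(cache_item_t *ci)
{
    pthread_mutex_lock(&ci->lock);
    while (!ci->on_disk)  /* loop guards against spurious wakeups */
        pthread_cond_wait(&ci->flushed, &ci->lock);
    pthread_mutex_unlock(&ci->lock);
}

/* Runs one writer and one waiter; returns 1 on success, -1 on error. */
static int demo_flush(void)
{
    cache_item_t ci;
    pthread_t tid;

    pthread_mutex_init(&ci.lock, NULL);
    pthread_cond_init(&ci.flushed, NULL);
    ci.on_disk = false;

    if (pthread_create(&tid, NULL, queue_thread, &ci) != 0)
        return -1;
    wait_for_flush(&ci);
    pthread_join(tid, NULL);
    assert(ci.on_disk);

    pthread_cond_destroy(&ci.flushed);
    pthread_mutex_destroy(&ci.lock);
    return 1;
}
```

In a pattern like this, every waiter is released by the same broadcast, so however long the write takes is how long all of the waiting FLUSH callers stay blocked.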

The I/O is not so bad - I'm watching it with iostat -k 1 -x and I also
have a Ganglia metric module for IO which gives me a nice graph.

I've experimented with sysctl; here are the values I'm currently using:

vm.dirty_expire_centisecs = 179971
vm.dirty_writeback_centisecs = 35993
vm.dirty_ratio = 90
vm.dirty_background_ratio = 2
vm.max_map_count = 4000000

If I understand correctly, then vm.dirty_ratio means nothing should
block until 90% of the RAM is taken up by dirty pages.  Given that
mmap() is being used with MAP_SHARED, and I have 8GB of RAM, all the
necessary pages should be staying in RAM.  If you can suggest a more
appropriate strategy for configuring the cache, it would be very
welcome.
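
As a rough check of the numbers above, here is a minimal C sketch of the arithmetic. It assumes, as the paragraph does, that the ratios apply to the full 8GB of RAM; the kernel's own accounting may use a smaller base, so treat the results as upper bounds. The helper names dirty_limit_bytes() and centisecs_to_seconds() are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of dirty pages allowed at a given percentage, under the
 * simplifying ASSUMPTION that the ratio applies to total RAM. */
static uint64_t dirty_limit_bytes(uint64_t ram_bytes, unsigned ratio_percent)
{
    assert(ratio_percent <= 100);
    return ram_bytes * ratio_percent / 100;
}

/* The vm.dirty_*_centisecs values are in hundredths of a second. */
static double centisecs_to_seconds(unsigned long centisecs)
{
    return centisecs / 100.0;
}
```

With 8GB taken as 8 * 2^30 bytes, dirty_limit_bytes() gives about 7.2 GiB for vm.dirty_ratio = 90 and about 164 MiB for vm.dirty_background_ratio = 2. The two timers work out to roughly 30 minutes (179971 centisecs) and roughly 6 minutes (35993 centisecs).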

> I have seen this behavior before on one of my Linux 2.6.x 
> machines.  When it has dirtied too many pages, all I/O on the 
> system pauses until it has flushed the "write-back" pages out 
> to disk.  What kind of system are you running?

RHEL5:
Linux xxx 2.6.18-53.1.13.el5 #1 SMP Mon Feb 11 13:27:27 EST 2008 x86_64
x86_64 x86_64 GNU/Linux

> 
> > I'm using r1621 + the patch adding pthread_cond_init(&ci->flushed, 
> > NULL);
> 
> You should upgrade to at least r1626.  Otherwise, you may 
> notice some files that are getting flushed by the flush 
> process (corresponding to the -f timer) are hanging around in 
> queue forever.  The bug was introduced in r1588, resolved in r1626.
> 
I've now merged in the changes to rrd_daemon.c from r1626, but I still
have the same problem.

There is also a memory leak somewhere (maybe in my striping code, maybe
in rrdcached).  I've tried to start rrdcached with valgrind, but my
large mmap() call fails with EINVAL when using valgrind.

The memory leak could be the cause of the performance issue: the process
grows to several gigabytes and the machine starts swapping, which might
be reducing the amount of RAM available for caching the mmap() pages.
Can you make any suggestions for using valgrind or another tool in this
scenario?