[rrd-developers] Use of madvise / msync kills performance for me

Fri Jun 27 13:13:17 CEST 2008

On Fri, Jun 27, 2008 at 02:58:15AM -0700, Marcus Reid wrote:
>On Fri, Jun 27, 2008 at 10:15:24AM +0200, Bernhard Fischer wrote:
>> On Thu, Jun 26, 2008 at 02:28:11PM -0700, Marcus Reid wrote:
>> >On Thu, Jun 26, 2008 at 07:30:03AM +0200, Tobias Oetiker wrote:
>> >> Hi Marcus,
>> >> 
>> >> Have you tried compiling rrdtool without mmapping? Note that
>> >> removing msync is BAD. Have a look at the manual page.
>> >> 
>> >>        msync()  flushes  changes  made  to  the in-core copy of a
>> >>        file that was mapped into memory using mmap(2) back to disk.
>> >>        Without use of this call there is no guarantee that changes
>> >>        are written back before munmap(2) is called.
>> >
>> >I think things may be different in FreeBSD land.  From the msync
>> >man page:
>> >
>> >     The msync() system call is obsolete since BSD implements a coherent file
>> >     system buffer cache.  However, it may be used to associate dirty VM pages
>> >     with file system buffers and thus cause them to be flushed to physical
>> >     media sooner rather than later.
>> >
>> >> Obviously it will be faster without this call, but then again,
>> >> the price (potential file corruption) might be a bit high.
>> >> 
>> >> Why some of the madvise calls are taking so long is unclear to
>> >> me. You might want to try dropping only the WILLNEED calls and keep
>> >> the RANDOM ones, since they are crucial in preserving cache memory ...
>> >
>> >Yeah, that part seems odd to me as well.  I'll try asking about that on the
>> >right mailing list and find out what some kernel guys think about it.
>> 
>> That's the crucial information, yes.
>> I wouldn't be surprised if there are some loose ends in your kernel (we
>> also tripped a timestamp buglet on linux, fwiw :), but that's obviously
>> pure speculation for now.
>
>Matt Dillon provided some good information on this subject that I would
>like to pass on.  First, a little background..  I discovered that the
>long slow msync() calls only happen on files over a certain size.  The
>file that's slowing things down is 1161 MB long, and msync() calls to a
>file that's 940 MB long are fast.  That's probably a kernel problem that
>could be worth looking into.

Eh, that's quite big, an order of magnitude bigger than my files, at
least.

>
>This is probably an edge case -- I'm updating an rrd file that's over
>a gig in size and I don't know how common that is.
>
>Here's part of Matt's comment, which suggests that maybe we can limit
>the region of the msync() to the part of the file that was known to be
>changed, if that can be determined.
>
>    The msync() is clearly the problem.  There are numerous optimizations
>    in the kernel but msync() is frankly a rather nasty critter even with
>    the optimizations work.  Nobody using msync() in real life ever tries
>    to run it over the entirety of such a large mapping... usually it is
>    just run on explicit sub-ranges that the program wishes to sync.

Exactly. That whole msync() thing just comes from linux+NFS (IIRC) where
the user would otherwise end up with stale data (due to NFS details that
are not really interesting here).
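
A minimal sketch of syncing only an explicit sub-range, as suggested in
the quoted comment. The helper names (page_floor, sync_subrange) are
made up for illustration and are not part of rrdtool; the sketch assumes
the mapping starts at the page-aligned address returned by mmap() and
that the caller already knows which byte range it changed.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Round a byte offset down to the start of its page. */
size_t page_floor(size_t offset, size_t page)
{
    assert(page != 0);
    return offset - offset % page;
}

/* Hypothetical helper: queue write-back for only those pages that
 * cover [offset, offset + len) of a mapping of map_len bytes at base.
 * Returns -1 without calling msync() if the range is empty or does
 * not fit inside the mapping. */
int sync_subrange(void *base, size_t map_len, size_t offset, size_t len)
{
    long ps = sysconf(_SC_PAGESIZE);
    if (ps <= 0 || len == 0 || offset >= map_len || len > map_len - offset)
        return -1;

    /* msync() wants a page-aligned start address; the length does not
     * have to be a multiple of the page size. */
    size_t start = page_floor(offset, (size_t)ps);
    size_t end = offset + len;
    return msync((char *)base + start, end - start, MS_ASYNC);
}
```

With something like this, an update that touched 8 bytes at offset 5000
would ask the kernel to look at a single page instead of walking the
page tables for the whole 1 GB mapping.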
>
>    One reason why msync() is so nasty is that the kernel must physically
>    check the page table(s) to determine whether a page has been marked dirty
>    by the MMU, so it can't just iterate the pages it knows are dirty in
>    the VM object.  It's nasty whether it scans the VM object and iterates
>    the page tables, or scans the page tables and looks up the related VM
>    pages.   The only way to optimize this is to force write-faults by
>    mapping clean pages read-only, in order to track whether a page is
>    actually dirty in real time instead of lazily.  Then msync() would
>    only have to do a ranged-scan of the VM object's dirty-page list
>    and would not have to actually check the page tables for clean pages.
>
>    A secondary effect of the msync() is that it is initiating asynchronous
>    I/O for what sounds like hundreds of VM pages, or even more.  All those
>    pages are locked and busied from the point they are queued to the point
>    the I/O finishes, which for some of the pages can be a very, very long
>    time (into the multiples of seconds).  Pages locked that long will
>    interfere with madvise() calls made after the msync(), and probably
>    even interfere with the following msync().
>
>    It used to be that msync() only synced VM pages to the underlying
>    file, making them consistent with read()'s and write()'s against
>    the underlying file.  Since FreeBSD uses a unified VM page cache
>    this is always true.  However, the Open Group specification now
>    requires that the dirty pages actually be written out to the underlying
>    media... i.e. issue real I/O.  So msync() can't be a NOP if you go by
>    the Open Group specification.

It can't be a NOP but you can (and we do) use MS_ASYNC, as opposed to
synchronous flushing:
"
When MS_ASYNC is specified, msync() shall return immediately once all
the write operations are initiated or queued for servicing; when MS_SYNC
is specified, msync() shall not return until all write operations are
completed as defined for synchronized I/O data integrity completion.
Either MS_ASYNC or MS_SYNC is specified, but not both.
"

So perhaps you just need to improve the sync vs. async behaviour of
msync, but without having looked at the code, I suspect that you handle
sync/async properly anyway and simply burn too much time building up the
list of involved pages.
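
For reference, the MS_ASYNC / MS_SYNC distinction quoted above comes
down to a single flag at the call site. A minimal sketch follows; the
wrapper names (flush_flag, flush_mapping) are hypothetical and not taken
from the rrdtool sources.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Pick exactly one of the two mutually exclusive flags.
 * wait == 0: MS_ASYNC, return once the writes are queued.
 * wait != 0: MS_SYNC, return only after the writes have completed. */
int flush_flag(int wait)
{
    return wait ? MS_SYNC : MS_ASYNC;
}

/* Hypothetical wrapper around msync() for a mapping at addr. */
int flush_mapping(void *addr, size_t len, int wait)
{
    assert(addr != NULL);
    return msync(addr, len, flush_flag(wait));
}
```

Note that even the MS_ASYNC variant still has to find the dirty pages
first, which is the expensive part described in the quoted comment.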

There are a few (obvious) things you could do, just for example:
a) improve your VM so that "msync" does the least work possible to build
   up a queue that is used to flush dirty data out to the disks
b) don't msync in your local copy if you are sure that you don't need it
c) msync in rrd_write()

From a user perspective, a) sounds like the best approach since it
establishes the behaviour that I would expect.
b) and c) are workarounds and both of them have different (esthetic)
problems.
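
To make the "limit the region" idea a bit more concrete, here is a rough
sketch of how a write path could remember what it touched and flush only
that. Everything in it (struct dirty_range, dirty_mark, dirty_flush) is
invented for illustration and is not how rrdtool is actually
structured; it also assumes a single contiguous range is good enough.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical bookkeeping: the smallest byte range [lo, hi) of the
 * mapping that was modified since the last flush. */
struct dirty_range {
    size_t lo;
    size_t hi;   /* lo == hi means nothing is dirty */
};

/* Record that [off, off + len) of the mapping was modified. */
void dirty_mark(struct dirty_range *d, size_t off, size_t len)
{
    if (len == 0)
        return;
    if (d->lo == d->hi) {          /* first write since the last flush */
        d->lo = off;
        d->hi = off + len;
        return;
    }
    if (off < d->lo)
        d->lo = off;
    if (off + len > d->hi)
        d->hi = off + len;
}

/* Queue the recorded range for write-back and reset the tracker.
 * base is the page-aligned address returned by mmap(). */
int dirty_flush(void *base, struct dirty_range *d)
{
    if (d->lo == d->hi)
        return 0;                  /* nothing to do */

    long ps = sysconf(_SC_PAGESIZE);
    assert(ps > 0);

    /* msync() needs a page-aligned start address. */
    size_t start = d->lo - d->lo % (size_t)ps;
    int rc = msync((char *)base + start, d->hi - start, MS_ASYNC);
    d->lo = d->hi = 0;
    return rc;
}
```

If the touched offsets are far apart, one min/max range could still
cover most of the file, so a real implementation might need a short
list of ranges instead.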

HTH + cheers,