[rrd-developers] Use of madvise / msync kills performance for me

Fri Jun 27 11:58:15 CEST 2008

On Fri, Jun 27, 2008 at 10:15:24AM +0200, Bernhard Fischer wrote:
> On Thu, Jun 26, 2008 at 02:28:11PM -0700, Marcus Reid wrote:
> >On Thu, Jun 26, 2008 at 07:30:03AM +0200, Tobias Oetiker wrote:
> >> Hi Marcus,
> >> 
> >> Have you tried compiling rrdtool without mmapping? Note that
> >> removing msync is BAD. Have a look at the manual page.
> >> 
> >>        msync()  flushes  changes  made  to  the in-core copy of a
> >>        file that was mapped into memory using mmap(2) back to disk.
> >>        Without use of this call there is no guarantee that changes
> >>        are written back before munmap(2) is called.
> >
> >I think things may be different in FreeBSD land.  From the msync
> >man page:
> >
> >     The msync() system call is obsolete since BSD implements a coherent file
> >     system buffer cache.  However, it may be used to associate dirty VM pages
> >     with file system buffers and thus cause them to be flushed to physical
> >     media sooner rather than later.
> >
> >> Obviously it will be faster without this call, but then again,
> >> the price (potential file corruption) might be a bit high.
> >> 
> >> Why some of the madvise calls are taking so long is unclear to
> >> me. You might want to try dropping only the WILLNEED calls and not
> >> the RANDOM ones, since RANDOM is crucial in preserving cache memory ...
> >
> >Yeah, that part seems odd to me as well.  I'll try asking about that on the
> >right mailing list and find out what some kernel guys think about it.
> 
> That's the crucial information, yes.
> I wouldn't be surprised if there are some loose ends in your kernel (we
> also tripped a timestamp buglet on linux, fwiw :), but that's obviously
> pure speculation for now.

Matt Dillon provided some good information on this subject that I would
like to pass on.  First, a little background.  I discovered that the
long slow msync() calls only happen on files over a certain size.  The
file that's slowing things down is 1161 MB long, and msync() calls to a
file that's 940 MB long are fast.  That's probably a kernel problem that
could be worth looking into.

This is probably an edge case -- I'm updating an rrd file that's over
a gig in size and I don't know how common that is.

Here's part of Matt's comment, which suggests that maybe we can limit
the region of the msync() to the part of the file that was known to be
changed, if that can be determined.

    The msync() is clearly the problem.  There are numerous optimizations
    in the kernel but msync() is frankly a rather nasty critter even with
    the optimization work.  Nobody using msync() in real life ever tries
    to run it over the entirety of such a large mapping... usually it is
    just run on explicit sub-ranges that the program wishes to sync.

    One reason why msync() is so nasty is that the kernel must physically
    check the page table(s) to determine whether a page has been marked dirty
    by the MMU, so it can't just iterate the pages it knows are dirty in
    the VM object.  It's nasty whether it scans the VM object and iterates
    the page tables, or scans the page tables and looks up the related VM
    pages.   The only way to optimize this is to force write-faults by
    mapping clean pages read-only, in order to track whether a page is
    actually dirty in real time instead of lazily.  Then msync() would
    only have to do a ranged-scan of the VM object's dirty-page list
    and would not have to actually check the page tables for clean pages.

    A secondary effect of the msync() is that it is initiating asynchronous
    I/O for what sounds like hundreds of VM pages, or even more.  All those
    pages are locked and busied from the point they are queued to the point
    the I/O finishes, which for some of the pages can be a very, very long
    time (into the multiples of seconds).  Pages locked that long will
    interfere with madvise() calls made after the msync(), and probably
    even interfere with the following msync().

    It used to be that msync() only synced VM pages to the underlying
    file, making them consistent with read()'s and write()'s against
    the underlying file.  Since FreeBSD uses a unified VM page cache
    this is always true.  However, the Open Group specification now
    requires that the dirty pages actually be written out to the underlying
    media... i.e. issue real I/O.  So msync() can't be a NOP if you go by
    the Open Group specification.

                                        -Matt
                                        Matthew Dillon
                                        <dillon at backplane.com>
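
As a rough illustration of Matt's suggestion, a ranged sync might look
something like the sketch below.  This is not rrdtool code: msync_range()
and demo_update() are hypothetical names, and the sketch assumes that the
mapping base comes straight from mmap() (so it is page-aligned) and that
the caller already knows the offset and length of the bytes it changed.

```c
#define _XOPEN_SOURCE 700   /* for mkstemp(), pread(), MS_ASYNC under strict -std= */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: sync only the pages covering
 * [offset, offset + len) of a mapping that starts at `base`.
 * `base` is assumed to be page-aligned (as returned by mmap()). */
static int msync_range(void *base, size_t offset, size_t len, int flags)
{
    long pagesize = sysconf(_SC_PAGESIZE);
    assert(pagesize > 0);

    if (len == 0)
        return 0;                            /* nothing was modified */

    size_t page  = (size_t)pagesize;
    size_t start = offset - (offset % page); /* round down to a page boundary */
    size_t span  = (offset - start) + len;   /* bytes from the aligned start */

    return msync((char *)base + start, span, flags);
}

/* Usage sketch: map a scratch file, change a few bytes in the middle,
 * and sync just that range instead of the whole mapping. */
static int demo_update(void)
{
    char path[] = "/tmp/msync_demo_XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                            /* file vanishes once fd is closed */

    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t filesize = 16 * page;
    if (ftruncate(fd, (off_t)filesize) != 0) {
        close(fd);
        return -1;
    }

    char *map = mmap(NULL, filesize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    const char data[] = "updated";
    size_t offset = 5 * page + 100;          /* deliberately not page-aligned */
    memcpy(map + offset, data, sizeof data);

    int rc = msync_range(map, offset, sizeof data, MS_ASYNC);

    /* Read the bytes back through the file descriptor as a sanity check. */
    char check[sizeof data];
    if (rc == 0 && pread(fd, check, sizeof check, (off_t)offset) != (ssize_t)sizeof check)
        rc = -1;
    if (rc == 0 && memcmp(check, data, sizeof data) != 0)
        rc = -1;

    munmap(map, filesize);
    close(fd);
    return rc;
}
```

Passing MS_SYNC instead of MS_ASYNC would make the call wait until the
write has completed; which of the two is appropriate depends on how much
durability is wanted.  Whether rrdtool can cheaply determine the changed
range for a given update is the open question.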

Thanks,

Marcus