[rrd-developers] Use of madvise / msync kills performance for me
marcus at blazingdot.com
Fri Jun 27 11:58:15 CEST 2008
On Fri, Jun 27, 2008 at 10:15:24AM +0200, Bernhard Fischer wrote:
> On Thu, Jun 26, 2008 at 02:28:11PM -0700, Marcus Reid wrote:
> >On Thu, Jun 26, 2008 at 07:30:03AM +0200, Tobias Oetiker wrote:
> >> Hi Marcus,
> >> Have you tried compiling rrdtool without mmap? Note that
> >> removing msync is BAD. Have a look at the manual page.
> >> msync() flushes changes made to the in-core copy of a
> >> file that was mapped into memory using mmap(2) back to disk.
> >> Without use of this call there is no guarantee that changes
> >> are written back before munmap(2) is called.
> >I think things may be different in FreeBSD land. From the msync
> >man page:
> > The msync() system call is obsolete since BSD implements a coherent file
> > system buffer cache. However, it may be used to associate dirty VM pages
> > with file system buffers and thus cause them to be flushed to physical
> > media sooner rather than later.
> >> Obviously it will be faster without this call, but then again,
> >> the price (potential file corruption) might be a bit high.
> >> Why some of the madvise calls are taking so long is unclear to
> >> me. You might want to try dropping only the WILLNEED calls and not
> >> the RANDOM ones, since RANDOM is crucial in preserving cache memory ...
> >Yeah, that part seems odd to me as well. I'll try asking about that on the
> >right mailing list and find out what some kernel guys think about it.
> That's the crucial information, yes.
> I wouldn't be surprised if there are some loose ends in your kernel (we
> also tripped a timestamp buglet on linux, fwiw :), but that's obviously
> pure speculation for now.
Matt Dillon provided some good information on this subject that I would
like to pass on. First, a little background: I discovered that the
long, slow msync() calls only happen on files over a certain size. The
file that's slowing things down is 1161 MB long, and msync() calls to a
file that's 940 MB long are fast. That's probably a kernel problem that
could be worth looking into.
This is probably an edge case -- I'm updating an rrd file that's over
a gig in size and I don't know how common that is.
Here's part of Matt's comment, which suggests that maybe we can limit
the region of the msync() to the part of the file that was known to be
changed, if that can be determined.
The msync() is clearly the problem. There are numerous optimizations
in the kernel but msync() is frankly a rather nasty critter even with
the optimization work. Nobody using msync() in real life ever tries
to run it over the entirety of such a large mapping... usually it is
just run on explicit sub-ranges that the program wishes to sync.
One reason why msync() is so nasty is that the kernel must physically
check the page table(s) to determine whether a page has been marked dirty
by the MMU, so it can't just iterate the pages it knows are dirty in
the VM object. It's nasty whether it scans the VM object and iterates
the page tables, or scans the page tables and looks up the related VM
pages. The only way to optimize this is to force write-faults by
mapping clean pages read-only, in order to track whether a page is
actually dirty in real time instead of lazily. Then msync() would
only have to do a ranged-scan of the VM object's dirty-page list
and would not have to actually check the page tables for clean pages.
A secondary effect of the msync() is that it is initiating asynchronous
I/O for what sounds like hundreds of VM pages, or even more. All those
pages are locked and busied from the point they are queued to the point
the I/O finishes, which for some of the pages can be a very, very long
time (into the multiples of seconds). Pages locked that long will
interfere with madvise() calls made after the msync(), and probably
even interfere with the following msync().
It used to be that msync() only synced VM pages to the underlying
file, making them consistent with read()'s and write()'s against
the underlying file. Since FreeBSD uses a unified VM page cache
this is always true. However, the Open Group specification now
requires that the dirty pages actually be written out to the underlying
media... i.e. issue real I/O. So msync() can't be a NOP if you go by
the Open Group specification.
<dillon at backplane.com>
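
To make Matt's suggestion concrete, here is a minimal sketch of syncing
only the sub-range of a mapping that was modified, rather than the whole
file. This is not rrdtool's code: the helper names (align_down_range,
sync_modified_range) are made up for illustration, and it assumes the
caller knows the byte offset and length of what it just changed. The
start address is rounded down to a page boundary because msync()
generally wants a page-aligned address. MS_ASYNC is also an assumption;
use MS_SYNC if the call should block until the write completes.

```c
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: round the start of [offset, offset + len) down to
 * a page boundary and report the span from there to the end of the range. */
static void align_down_range(size_t offset, size_t len, size_t pagesize,
                             size_t *start, size_t *span)
{
    *start = offset - (offset % pagesize);   /* page-aligned start */
    *span = (offset + len) - *start;         /* bytes from start to range end */
}

/* Hypothetical helper: msync() only the pages covering the bytes that were
 * modified, instead of the entire mapping. map_base is the address returned
 * by mmap(); offset and len describe the modified bytes within the mapping. */
static int sync_modified_range(void *map_base, size_t offset, size_t len)
{
    long pagesize = sysconf(_SC_PAGESIZE);
    size_t start, span;

    if (len == 0)
        return 0;                            /* nothing was changed */
    if (pagesize <= 0)
        return -1;                           /* could not query the page size */

    align_down_range(offset, len, (size_t)pagesize, &start, &span);
    return msync((char *)map_base + start, span, MS_ASYNC);
}
```

An update that touches a few hundred bytes would then ask the kernel to
look at one or two pages instead of scanning a mapping of more than a
gigabyte. Whether that actually avoids the slow path on the affected
kernel is something that would need to be measured.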