[rrd-developers] Use of madvise / msync kills performance for me
Bernhard Fischer
rep.dot.nop at gmail.com
Fri Jun 27 13:13:17 CEST 2008
On Fri, Jun 27, 2008 at 02:58:15AM -0700, Marcus Reid wrote:
>On Fri, Jun 27, 2008 at 10:15:24AM +0200, Bernhard Fischer wrote:
>> On Thu, Jun 26, 2008 at 02:28:11PM -0700, Marcus Reid wrote:
>> >On Thu, Jun 26, 2008 at 07:30:03AM +0200, Tobias Oetiker wrote:
>> >> Hi Marcus,
>> >>
>> >> Have you tried compiling rrdtool without mmapping? Note that
>> >> removing msync is BAD. Have a look at the manual page.
>> >>
>> >> msync() flushes changes made to the in-core copy of a
>> >> file that was mapped into memory using mmap(2) back to disk.
>> >> Without use of this call there is no guarantee that changes
>> >> are written back before munmap(2) is called.
>> >
>> >I think things may be different in FreeBSD land. From the msync
>> >man page:
>> >
>> > The msync() system call is obsolete since BSD implements a coherent file
>> > system buffer cache. However, it may be used to associate dirty VM pages
>> > with file system buffers and thus cause them to be flushed to physical
>> > media sooner rather than later.
>> >
>> >> Obviously it will be faster without this call, but then again,
>> >> the price (potential file corruption) might be a bit high.
>> >>
>> >> Why some of the madvise calls are taking so long is unclear to
>> >> me. You might want to try dropping only the WILLNEED calls and not
>> >> the RANDOM ones, since RANDOM is crucial in preserving cache memory ...
>> >
>> >Yeah, that part seems odd to me as well. I'll try asking about that on the
>> >right mailing list and find out what some kernel guys think about it.
>>
>> That's the crucial information, yes.
>> I wouldn't be surprised if there are some loose ends in your kernel (we
>> also tripped a timestamp buglet on linux, fwiw :), but that's obviously
>> pure speculation for now.
>
>Matt Dillon provided some good information on this subject that I would
>like to pass on. First, a little background.. I discovered that the
>long slow msync() calls only happen on files over a certain size. The
>file that's slowing things down is 1161 MB long, and msync() calls to a
>file that's 940 MB long are fast. That's probably a kernel problem that
>could be worth looking into.
Eh, that's quite big, an order of magnitude bigger than my files, at
least.
>
>This is probably an edge case -- I'm updating an rrd file that's over
>a gig in size and I don't know how common that is.
>
>Here's part of Matt's comment, which suggests that maybe we can limit
>the region of the msync() to the part of the file that was known to be
>changed, if that can be determined.
>
> The msync() is clearly the problem. There are numerous optimizations
> in the kernel but msync() is frankly a rather nasty critter even with
> the optimization work. Nobody using msync() in real life ever tries
> to run it over the entirety of such a large mapping... usually it is
> just run on explicit sub-ranges that the program wishes to sync.
Exactly. That whole msync() thing just comes from Linux+NFS (IIRC), where
the user would otherwise end up with stale data (due to NFS details that
are not really interesting here).
>
> One reason why msync() is so nasty is that the kernel must physically
> check the page table(s) to determine whether a page has been marked dirty
> by the MMU, so it can't just iterate the pages it knows are dirty in
> the VM object. It's nasty whether it scans the VM object and iterates
> the page tables, or scans the page tables and looks up the related VM
> pages. The only way to optimize this is to force write-faults by
> mapping clean pages read-only, in order to track whether a page is
> actually dirty in real time instead of lazily. Then msync() would
> only have to do a ranged-scan of the VM object's dirty-page list
> and would not have to actually check the page tables for clean pages.
>
> A secondary effect of the msync() is that it is initiating asynchronous
> I/O for what sounds like hundreds of VM pages, or even more. All those
> pages are locked and busied from the point they are queued to the point
> the I/O finishes, which for some of the pages can be a very, very long
> time (into the multiples of seconds). Pages locked that long will
> interfere with madvise() calls made after the msync(), and probably
> even interfere with the following msync().
>
> It used to be that msync() only synced VM pages to the underlying
> file, making them consistent with read()'s and write()'s against
> the underlying file. Since FreeBSD uses a unified VM page cache
> this is always true. However, the Open Group specification now
> requires that the dirty pages actually be written out to the underlying
> media... i.e. issue real I/O. So msync() can't be a NOP if you go by
> the Open Group specification.
It can't be a NOP but you can (and we do) use MS_ASYNC, as opposed to
synchronous flushing:
"
When MS_ASYNC is specified, msync() shall return immediately once all
the write operations are initiated or queued for servicing; when MS_SYNC
is specified, msync() shall not return until all write operations are
completed as defined for synchronized I/O data integrity completion.
Either MS_ASYNC or MS_SYNC is specified, but not both.
"
So perhaps you just need to improve sync vs. async behaviour of msync,
but without having looked at the code, I suspect that you burn too much
time building up the list of involved pages and handle sync/async
properly anyway.
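For reference, the two modes quoted above differ only in the flag passed to msync(). A small sketch follows; flush_mapping() and its wait_for_io parameter are invented for this example, and the scratch file only exists to make it self-contained:

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical wrapper: wait_for_io selects between the two POSIX modes.
 * MS_ASYNC queues the writes and returns; MS_SYNC returns once they are done. */
static int flush_mapping(void *addr, size_t len, int wait_for_io)
{
    return msync(addr, len, wait_for_io ? MS_SYNC : MS_ASYNC);
}

/* Self-contained check: dirty a small mapping, then flush it in both modes. */
static int flush_mapping_demo(void)
{
    char path[] = "/tmp/flush_mapping_XXXXXX";
    size_t size = 4 * (size_t)sysconf(_SC_PAGESIZE);
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                            /* file disappears once fd is closed */
    if (ftruncate(fd, (off_t)size) != 0) { close(fd); return -1; }

    char *map = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) { close(fd); return -1; }

    memset(map, 'x', size);                       /* dirty every page */
    int queued = flush_mapping(map, size, 0);     /* returns without waiting */
    int done = flush_mapping(map, size, 1);       /* blocks until written */

    munmap(map, size);
    close(fd);
    return (queued == 0 && done == 0) ? 0 : -1;
}
```

How much work the MS_ASYNC call does before it returns is up to the kernel, which is exactly the part that seems to be slow here.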
There are a few (obvious) things you could do, for example:
a) improve your VM so that msync() does the least work possible to build up
   a queue that is used to flush dirty data out to the disks
b) don't msync in your local copy if you are sure that you don't need it
c) msync in rrd_write()
From a user perspective, a) sounds like the best approach since it
establishes the behaviour that I would expect.
b) and c) are workarounds, and both of them have different (esthetic)
problems.
HTH + cheers,