[rrd-developers] Bug is actually in librrd4, backtrace included

Sebastian Harl sh at tokkee.org
Thu Sep 18 14:40:12 CEST 2008


Hi,

(This is a follow-up for Debian bug #498183 - see [1] for details.
Please keep 498183 at bugs.debian.org Cc'ed when replying.)

[1] http://bugs.debian.org/498183

On Sun, Sep 07, 2008 at 09:51:16PM +0100, Jurij Smakov wrote:
> I don't have any problem reproducing it on sparc, so reopening. The 
> segfault occurs in rrd_open() function in librrd4, as following gdb 
> session illustrates (rebuilt rrd with debugging symbols to get it):
[...]
> (gdb) list
> 363             rra_start +=
> 364                 rrd->rra_def[i].row_cnt * rrd->stat_head->ds_cnt *
> 365                 sizeof(rrd_value_t);
> 366         }
> 367     #ifdef USE_MADVISE
> 368         madvise(rrd_file->file_start + dontneed_start,
> 369                 rrd_file->file_len - dontneed_start, MADV_DONTNEED);
> 370     #endif
> 371     #ifdef HAVE_POSIX_FADVISE
> 372         posix_fadvise(rrd_file->fd, dontneed_start,
[...]
> (gdb) print dontneed_start
> $16 = 8192
> (gdb) print rrd_file->file_len
> $17 = 972
> (gdb) print rrd_file->file_len - dontneed_start
> $18 = 4294960076
[...]
> (gdb) n
> 
> Program received signal SIGSEGV, Segmentation fault.
> rrd_dontneed (rrd_file=Cannot access memory at address 0x44) at rrd_open.c:372
> 372         posix_fadvise(rrd_file->fd, dontneed_start,
> Disabling display 5 to avoid infinite recursion.
> 5: i = Cannot access memory at address 0xffffffe8

(See [2] for the full session dump.)

Jurij, thanks a lot for the detailed information - that was very
helpful.

[2] http://bugs.debian.org/498183#25

> I guess that the problem here is passing negative second argument to 
> madvise() which makes it very unhappy and smashes the stack, but I did 
> not grok the code yet to understand what's going on here.

Yes, that seems to be the problem. Roughly, what's going on here is
that we're stepping through all RRAs of the RRD file and marking "cold"
blocks as unused (using madvise() and posix_fadvise()). dontneed_start
is used as an offset into the file in that search.

After we've stepped through all RRAs, the last call to madvise() (which
will then trigger the segfault) marks the remainder of the file as
unused as well. Now, we might have already passed the end of the file as
dontneed_start is increased by multiples of the page size only. E.g.
this happens if the page size is larger than the file size as in this
case.
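
To make the arithmetic concrete, here is a minimal sketch using the values
from the gdb session above. It is not the code from rrd_open.c: both helper
names are hypothetical, the round-up is only one possible way an offset ends
up on a page boundary, and the 32-bit unsigned type is an assumption chosen
because it matches the value gdb printed.

```c
#include <stdint.h>

/* Hypothetical helper: round an offset up to the next page boundary.
 * Illustrates how an offset that only moves in page-sized steps can
 * end up beyond the end of a file smaller than one page. */
static uint32_t round_up_to_page(uint32_t offset, uint32_t page_size)
{
    return ((offset + page_size - 1) / page_size) * page_size;
}

/* Hypothetical helper: the length handed to madvise() when no bounds
 * check is done. With unsigned arithmetic the subtraction wraps around
 * instead of going negative once dontneed_start exceeds file_len. */
static uint32_t remaining_len_unchecked(uint32_t file_len,
                                        uint32_t dontneed_start)
{
    return file_len - dontneed_start;
}
```

With file_len = 972 and dontneed_start = 8192, the unchecked length is
972 - 8192 modulo 2^32 = 4294960076, which is the value shown as $18 in the
session.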

For some reason that I don't know, amd64 and i386 don't seem to care
about that. I was not able to reproduce the problem there, but I could
verify the same situation in the debugger.

The attached patch should solve this issue. I've simply added a check
whether we've already passed the end of the file. Since I do not have
access to a sparc box, I'd like to get some feedback on whether that
really solves the issue. Also, I'd like Tobi (or anyone else involved in
that specific code) to comment on that just to make sure that I did not
miss some important fact. Thanks in advance!
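
The patch itself is only available as an attachment, so the following is
not its content. It is a rough sketch of the kind of check described above,
with a hypothetical helper name (dontneed_tail_len) and the assumption that
the caller simply skips the advise calls when nothing is left to mark.

```c
#include <stddef.h>

/* Hypothetical helper, not taken from the patch: return the number of
 * bytes between dontneed_start and the end of the file, or 0 if the
 * offset has already reached or passed the end of the file. */
static size_t dontneed_tail_len(size_t file_len, size_t dontneed_start)
{
    if (dontneed_start >= file_len)
        return 0;               /* already past the end: nothing to mark */
    return file_len - dontneed_start;
}

/* Intended use (sketch): only issue the final madvise()/posix_fadvise()
 * when the result is non-zero. Skipping the call matters for
 * posix_fadvise(), where a length of 0 means "up to the end of the
 * file" rather than "no bytes". */
```

For the values from the gdb session (file_len = 972, dontneed_start = 8192)
the helper returns 0, so the final call would be skipped.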

Cheers,
Sebastian

-- 
Sebastian "tokkee" Harl +++ GnuPG-ID: 0x8501C7FC +++ http://tokkee.org/

Those who would give up Essential Liberty to purchase a little Temporary
Safety, deserve neither Liberty nor Safety.         -- Benjamin Franklin

-------------- next part --------------
A non-text attachment was scrubbed...
Name: madvise_segfault.patch
Type: text/x-diff
Size: 975 bytes
Desc: not available
Url : http://lists.oetiker.ch/pipermail/rrd-developers/attachments/20080918/6a252486/attachment.bin 