[rrd-developers] rrdcached use corrupting RRD files (trunk)

kevin brintnall kbrint at rufus.net
Fri Oct 22 04:03:56 CEST 2010


On Thu, Oct 21, 2010 at 8:50 PM, Steve Shipway <s.shipway at auckland.ac.nz> wrote:

>  The corrupted file ends up the correct size; however the entire file is
> filled with zeroes (fortunately, we archive our RRD files nightly so I can
> go back and retrieve the last uncorrupted version plus the corrupted
> version)
>

Strange.  On failure, rrd_open() just unmaps and closes the file; that's
about it.  I'm not sure how it could zero the file like that.
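
To find out which files were hit, a scan along these lines might help.  It
is only a rough sketch: it assumes a valid RRD file starts with the 4-byte
cookie "RRD\0", and the names classify_rrd and find_damaged are made up for
illustration.  It only looks at the cookie, so "ok" does not mean the rest
of the file is intact.

```python
import os

# Assumption: a valid RRD file begins with the 4-byte cookie "RRD\0".
RRD_COOKIE = b"RRD\x00"


def classify_rrd(path, chunk_size=1 << 20):
    """Return 'ok', 'empty', 'zeroed' or 'bad-header' for one file."""
    with open(path, "rb") as fh:
        head = fh.read(len(RRD_COOKIE))
        if not head:
            return "empty"
        if head == RRD_COOKIE:
            return "ok"
        # The cookie is missing: check whether every byte is zero.
        fh.seek(0)
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                return "zeroed"
            if chunk.strip(b"\x00"):
                return "bad-header"


def find_damaged(root):
    """Yield (path, status) for every *.rrd under root that is not 'ok'."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if name.endswith(".rrd"):
                path = os.path.join(dirpath, name)
                status = classify_rrd(path)
                if status != "ok":
                    yield path, status
```

Running find_damaged("/u01/rrdtool") after an incident and comparing the
result with the nightly archive would show which files need restoring.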


>
>
> The system is not (normally) memory or process-constrained; there is in
> fact nothing to speak of running apart from apache and the rrdcached
> daemon.  The rrdinfo response is ‘not an RRD file’, since it doesn’t have
> the RRD header.
>
>
>
> It has run fine for a whole week at these rates before the problem hit; so
> that’s why I think it might be a leak in the RRD functions (which would of
> course not show up in a non-daemon situation).  We use the remote update,
> info and (occasionally) create via the TCP socket; plus the info, last,
> flush and fetch via the UNIX socket.
>

My workload is all UPDATE and FLUSH and I'm not seeing any problems.  It's
possible that the newer code (info, create) has a leak that I haven't caught
yet in production.

Could you show me:

 - the output of 'stats' from your daemon
 - "rrdtool info" from an RRD that's typical of your workload
 - the args you're using when starting the rrdcached daemon
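
If it is easier to script, the 'stats' output can be collected over the
daemon's UNIX socket.  The sketch below assumes the usual rrdcached reply
format, where the first line starts with a number giving how many lines
follow (negative on error).  The socket path in the usage note and the
function names are placeholders, not anything taken from your setup.

```python
import socket


def parse_stats(reply):
    """Turn a STATS reply (status line plus 'Key: value' lines) into a dict."""
    lines = reply.splitlines()
    status, _, message = lines[0].partition(" ")
    count = int(status)
    if count < 0:
        raise RuntimeError(message)
    stats = {}
    for line in lines[1:1 + count]:
        key, _, value = line.partition(":")
        stats[key.strip()] = int(value)
    return stats


def fetch_stats(sock_path):
    """Send STATS to the daemon listening on sock_path and parse the reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(sock_path)
        s.sendall(b"STATS\n")
        with s.makefile("r", encoding="ascii") as fh:
            first = fh.readline()
            count = int(first.split(" ", 1)[0])
            rest = [fh.readline() for _ in range(max(count, 0))]
    return parse_stats(first + "".join(rest))
```

For example, fetch_stats("/var/run/rrdcached.sock") (hypothetical path)
returns a dict; logging it every few minutes alongside the process size
shows whether memory grows with the queue or independently of it.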


> The build is the absolute latest, r2136.
>
>
>
> The memory usage of the rrdcached process is definitely increasing; however
> that may also be due to the number of items in the queue?  It is currently
> at 768m virtual, 560m physical (17% usage) which seems somewhat high to me,
> even for 20,000+ RRD files.  Eventually it will hit address-space limits
> (this is a 32bit RHEL5 box with 4G physical memory)
>

My rrdcached runs at around 2GB.  That's with about 350k RRDs and 72 cached
values per RRD, i.e. roughly 6KB per RRD.  Your 560m across 20,000 RRDs
works out to roughly 28KB per RRD, so your memory utilization does look high.
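
To see whether the footprint keeps climbing, the process size can be
sampled periodically.  A minimal sketch, assuming a Linux /proc filesystem
where /proc/<pid>/status reports VmSize and VmRSS in kB; the function names
are made up for illustration.

```python
def parse_vm(status_text):
    """Extract VmSize and VmRSS (in kB) from the text of /proc/<pid>/status."""
    fields = {}
    for line in status_text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("VmSize", "VmRSS"):
            # The value looks like "  786432 kB"; keep only the number.
            fields[key] = int(rest.split()[0])
    return fields


def sample_memory(pid):
    """Read the current virtual and resident size of process pid."""
    with open(f"/proc/{pid}/status") as fh:
        return parse_vm(fh.read())
```

Calling sample_memory() on the daemon's pid from cron every few minutes and
appending the result to a log gives a growth curve.  Steady growth while
the queue length stays flat would point at a leak rather than at cached
values.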


>  Unfortunately I don’t have any of the nice developer tools for tracking
> memory leaks…
>

You could install "valgrind" and run the daemon under it for a while.  The
daemon should be compiled with debugging symbols (-g) and not stripped in
this case.  For example:

% valgrind --leak-check=full --show-reachable=yes rrdcached -args blah blah blah

Then, on exit it will show you what's leaking.

Alternatively, if you can make a script that typifies your workload (perhaps
at a smaller scale) that would help to reproduce the problem.
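
As a starting point, such a script could look roughly like the sketch
below.  It assumes the daemon accepts "UPDATE <file> <timestamp>:<values>"
lines on its UNIX socket and answers each with a status line whose leading
number is the count of extra lines (negative on error).  The function
names, the two-value update, and every path are placeholders; it only
covers UPDATE, not the info/create calls from your workload.

```python
import socket
import time


def make_update(path, timestamp, values):
    """Format one UPDATE command line with an absolute timestamp."""
    fields = ":".join(str(v) for v in values)
    return f"UPDATE {path} {int(timestamp)}:{fields}\n"


def read_reply(fh):
    """Read one reply: a status line plus as many lines as its code says."""
    status_line = fh.readline()
    if not status_line:
        raise ConnectionError("daemon closed the connection")
    code = int(status_line.split(" ", 1)[0])
    extra = [fh.readline() for _ in range(max(code, 0))]
    return code, status_line.rstrip("\n"), extra


def run(sock_path, paths, rate_per_sec, duration_sec):
    """Cycle through paths at a fixed rate; return (sent, failures).

    Keep len(paths) / rate_per_sec above one second, otherwise the same
    file gets two updates with the same timestamp.
    """
    interval = 1.0 / rate_per_sec
    deadline = time.time() + duration_sec
    sent = failures = 0
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(sock_path)
        with s.makefile("rw", encoding="ascii", newline="\n") as fh:
            while time.time() < deadline:
                path = paths[sent % len(paths)]
                # Two dummy counters per update (an assumption about the DS count).
                fh.write(make_update(path, time.time(), (sent, sent)))
                fh.flush()
                code, _message, _extra = read_reply(fh)
                if code < 0:
                    failures += 1
                sent += 1
                time.sleep(interval)
    return sent, failures
```

Running this against a test daemon under valgrind, with a few hundred
pre-created RRDs, should exercise the update path at a fraction of the
production rate.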

-kb



>
>
> Steve
>
>
>  ------------------------------
>
> *Steve Shipway*
>
> ITS Unix Services Design Lead
>
> University of Auckland, New Zealand
>
> Floor 1, 58 Symonds Street, Auckland
>
> *Phone: +64 (0)9 3737599 ext 86487*
>
> *DDI: +64 (0)9 924 6487*
>
> *Mobile: +64 (0)21 753 189*
>
> *Email: s.shipway at auckland.ac.nz*
>
>
>
>
> *From:* kevin brintnall [mailto:kbrint at rufus.net]
> *Sent:* Friday, 22 October 2010 1:40 p.m.
> *To:* Steve Shipway
> *Cc:* rrd-developers at lists.oetiker.ch; rrd-users at lists.oetiker.ch
> *Subject:* Re: [rrd-developers] rrdcached use corrupting RRD files (trunk)
>
>
>
> Sebastian,
>
>
>
> I don't think the problem is specific to rrdcached; it uses normal librrd
> API.  This problem likely affects any RRD access in a memory constrained
> system.
>
>
>
> Is there a lack of memory (or address space if 32-bit) on the system?  Or
> is it running up against per-process limits?
>
>
>
> How does the file end up?  Is it the right size?  What errors do you get
> (e.g. when you run "rrdtool info")?  What architecture are you running on?
> The behavior of mmap() under failure conditions is likely to be OS-specific.
>
>
>
> What revision of trunk?
>
>
>
> Let us know what you find re: memory leak.
>
>
>
> -kb
>
> On Thu, Oct 21, 2010 at 5:07 PM, Steve Shipway <s.shipway at auckland.ac.nz>
> wrote:
>
> I’ve had this happen too often now for it to be a fluke.  OK, so I’m using
> the trunk version of rrdtool 1.4, but (as far as I know) there is nothing in
> there to modify the update code.  We have a high update frequency – approx.
> 20,000 MRTG targets at 5min intervals, which equates to about 70 updates per
> second, and it took about a week for the problem to first hit.
>
>
>
> It seems that something is happening on update, possibly involving memory
> allocation failure, that results in a corrupted file.
>
>
>
> I have some processes that may be reading the file without using the
> rrdcached, but all updates are certainly going this way (no data collection
> is run on this server any more, it all comes over TCP)
>
>
>
> Selected error logs show:
>
>   listen_thread_main: pthread_create failed.
>   queue_thread_main: rrd_update_r (/u01/rrdtool/maildelivery-mx1.rrd) failed with status -1. (mmaping file '/u01/rrdtool/maildelivery-mx1.rrd': Cannot allocate memory)
>
>   *(restarted rrdcached here)*
>
>   replaying from journal: /u01/rrdtool/journal/rrd.journal.1285603416.766523
>   Replayed 61011 entries (0 failures)
>   replaying from journal: /u01/rrdtool/journal/rrd.journal.1285607016.766153
>   Malformed journal entry at line 31024
>   Replayed 31023 entries (1 failures)
>   journal processing complete
>   queue_thread_main: rrd_update_r (/u01/rrdtool/maildelivery-mx1.rrd) failed with status -1. ('/u01/rrdtool/maildelivery-mx1.rrd' is not an RRD file)
>
>
>
> Although there was only one journal failure, there were in fact several RRD
> files corrupted (I suspect the ones which were open at the time of the
> memory failure?) and even more with the rrd_update_r memory allocation
> failure.
>
>
>
> It seems that the memory ran out (memory leak?) and something in
> rrd_update_r was left half-done.  The resulting corrupted RRD file
> doesn’t even load in rrdtool; the header seems to be corrupt.  I don’t (yet)
> understand enough of the mmap code to work out what could be causing this.
> I’m also trying to track the memory usage of the rrdcached process to see if
> it is indeed growing due to a leak.
>
>
>
> I think there are two bugs here – first, the memory leak causing the
> failure, and second, something in the code is not correctly handling a
> memory allocation failure and corrupts the RRD file as a result.
>
>
>
> Has anyone else experienced this?  And, more to the point, any RRD
> developers who understand the MMAP update code want to take a look or give
> some pointers?
>
>
>
> Steve
>
>



-- 
 kevin brintnall =~ /kbrint at rufus.net/