[rrd-developers] [rrd] Re: rrdcached dies when journal reaches 2G

Tobias Oetiker tobi at oetiker.ch
Fri Jul 17 22:34:43 CEST 2009

Hi Kevin,

> >> For large file support, we need to consider the things that link against
> >> librrd.  We can either require all dependents (transitively) to enable
> >> large file support (bad) or we could remove any functions that use off_t
> >> from the public librrd interface.  It looks like all such functions are
> >> already deprecated.
> This doesn't look like a problem...  Also, perhaps 1.4 is a good time to
> remove these functions from the public-facing API altogether??
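The ABI concern behind this is worth spelling out: on 32-bit glibc systems, `off_t` is 32 bits by default and becomes 64 bits when `_FILE_OFFSET_BITS=64` is defined, so any exported function taking `off_t` changes its calling convention with that setting. A minimal sketch of the issue (the helper name is hypothetical, not part of librrd):

```c
/* Sketch: why off_t in a public API ties all dependents to the same
 * large-file setting.  On 32-bit glibc, compiling with
 * -D_FILE_OFFSET_BITS=64 widens off_t from 4 to 8 bytes, changing
 * the ABI of every exported function that takes or returns off_t. */
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical helper -- reports the off_t width this translation
 * unit was compiled with. */
static size_t off_t_width(void)
{
    return sizeof(off_t);
}
```

A caller compiled with a different `_FILE_OFFSET_BITS` value would disagree with the library about this width, which is exactly why the quoted text suggests dropping `off_t` from the public interface instead.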

I don't want to remove API functions with 1.4, since that would
force us to break compatibility with old code, which is otherwise not
necessary for 1.4 ... but 1.5 or 2.0 will be a different matter, as I
plan to redo the file format there and thus have a good opportunity
to do some major house-cleaning ...

> > > In the end, it's probably better to just rotate the journals before 2^31
> > > bytes.  That way, we can support systems that do not have large file
> > > support.  There are some implicit assumptions during journal replay that
> > > I'll have to take a look at.
> >
> > rotation sounds good ... the current rrd code cannot deal with
> > files larger than 2 GB on 32-bit systems reliably anyway ...
> I've started to implement both options...  The large file support is dead
> simple, whereas the split journal support is much more complicated.  There
> are three combinations:
>  (1) go with large file support
>      (1a) OS supports it
>      (1b) OS doesn't support it
>  (2) split journal
> In cases (1b) and (2), we need to watch the journal size as we write to
> it, and roll-over to a new file once the current file gets too large.  For
> (1b), a full tree flush is required.
> The forced flush requirement for (1b) doesn't seem too onerous when we
> consider that a journal limited to 2^31 bytes will nearly always be on a
> system whose per-process memory space is limited to 2^32 bytes (or
> smaller).  Under these conditions, there are only a couple of cases where
> rrdcached will not run out of memory anyway.  With a small -w, the
> forced flush won't have many values to write.  With a small -f, the
> flush would only happen twice as often as normal in the worst case.
> I'm looking for some old OS's without large file support to test with, but
> I'm having a hard time finding one.  Perhaps that's a good sign.  Do you
> have any demographic info on the RRD install base?
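The roll-over logic described above (cases 1b and 2) amounts to tracking bytes written and switching files before the 2^31 cap. A rough sketch, with hypothetical names rather than the actual rrdcached code:

```c
/* Sketch of size-capped journal rotation for cases (1b) and (2).
 * All names here are hypothetical -- not the real rrdcached code. */
#include <stdio.h>
#include <string.h>

#define JOURNAL_MAX 0x7fffffffL  /* stay below 2^31 bytes */

struct journal {
    char path[256];  /* current journal file name */
    long size;       /* bytes written to the current file */
    int  seq;        /* rotation sequence number */
};

/* Would the next entry push the current file past the cap?
 * (written to avoid signed overflow on 32-bit long) */
static int journal_needs_rotate(const struct journal *j, long entry_len)
{
    return entry_len > JOURNAL_MAX - j->size;
}

/* Switch to the next journal file.  In case (1b), this is also
 * where the full tree flush would have to happen, so the old
 * journal can be discarded entirely. */
static void journal_rotate(struct journal *j)
{
    j->seq++;
    j->size = 0;
    snprintf(j->path, sizeof(j->path), "rrd.journal.%08d", j->seq);
    /* real code: fsync + close the old fd, open the new file */
}

/* Account for one journal entry, rotating first if needed. */
static void journal_write(struct journal *j, long entry_len)
{
    if (journal_needs_rotate(j, entry_len))
        journal_rotate(j);
    j->size += entry_len;
    /* real code: write the update line to the journal fd here */
}
```

The point of checking before the write is that a single journal file never crosses 2^31 bytes, so the scheme works even without OS large file support.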

I only have access stats for the web site; this does not really help
regarding OSes ... there are a number of people who run rrdtool on
embedded systems, probably with old Linux kernels ... NetWare is
also a target where I am not sure about large file support, but then
again these systems mostly run old versions of rrdtool, since the
library dependencies of newer versions are much more complex
anyway ...

> I'm leaning strongly towards (1) after seeing the implications on the
> code...

I think a forced flush is a good solution here ...

I have not looked at the code, but do you reset the journal only on
flush, or also when there is no data left in the cache? The former
might cause a long-running copy of rrdcached to generate rather huge
and unwieldy journals ...

A directory tree with bits of journal that can be unlinked once
their content has gone to disk would seem to be a more sustainable
solution ...
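That chunked-journal idea could look roughly like the following sketch (again with hypothetical names): once everything covered by a chunk has been written out to the .rrd files, the chunk file is simply unlinked.

```c
/* Hypothetical sketch of expiring journal chunks once their content
 * has been flushed to the .rrd files.  Not actual rrdcached code. */
#include <stdio.h>
#include <unistd.h>

/* Unlink chunk files 0 .. flushed_seq in the journal directory.
 * Returns the number of chunks actually removed; chunks that were
 * already gone are silently skipped. */
static int journal_expire(const char *dir, int flushed_seq)
{
    char path[512];
    int removed = 0;

    for (int seq = 0; seq <= flushed_seq; seq++) {
        snprintf(path, sizeof(path), "%s/rrd.journal.%08d", dir, seq);
        if (unlink(path) == 0)
            removed++;
    }
    return removed;
}
```

The appeal over a single ever-growing journal is that disk use stays bounded by the unflushed portion of the cache, even in a long-running daemon.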


Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland
http://it.oetiker.ch tobi at oetiker.ch ++41 62 775 9902 / sb: -9900

More information about the rrd-developers mailing list