[rrd-developers] rrdcached performance problem
Thorsten von Eicken
tve at voneicken.com
Fri Oct 30 17:47:43 CET 2009
kevin brintnall wrote:
> On Fri, Oct 30, 2009 at 09:19:30AM -0700, Thorsten von Eicken wrote:
>
>> Thorsten von Eicken wrote:
>>
>>>> 37.1 % of the time it spent in ?handle_request_update? the daemon is
>>>> actually waiting for ?realloc?. This is (to me) very unexpected and a
>>>> schoolbook example of ?measure before you optimize?.
>>>>
>>>> I think we can get rid of this bottleneck by writing a specialized
>>>> version of ?rrd_add_strdup? which reallocates powers of ten. Something
>>>> like:
>>>>
>>>> [...]
>>>>
>>>> It'd be great if you could give the attached patch a try
>>>>
>>> spends all of its time (more than 99 %) in ?realloc?. So in consequence
>>> Test is running, including Kevin's simplification... Thanks for the
>>> help!!!
>>>
>> Things are again looking much better, almost great I should say! The one
>> thing that still makes me a bit uncomfortable is that at the end of the
>> second hour of run-time there was a cpu spike which caused collectd to
>> grow rapidly. (Still using -w 3600 -z 3600 -f 7200, I put a load of ~50k
>> tree nodes right from the start.) You can see the graphs at
>> http://www.voneicken.com/dl/rrd/ look for the rrdcached-6* series. It
>> flattens out nicely after the spike, but it's one of those things that
>> tend to bite sooner or later. I'm not sure what to do about it.
>>
>
> Thorsten,
>
> What's the start time on that daemon? If it was started slightly before
> the graphs (~21:30-21:35), then it's possible that the CPU increase is
> associated with the first flush (-f).
>
It was right at the start of the graphs. If you look at the
"write-data_sets" graph, you can see very clearly how the flushing
starts at 22:38-22:39 and the first hour of flushing ends 23:46-23:47.
The cpu spike starts to build at 23:26-23:28 so it's not on a clean
boundary at all.If you look at the if-packets graph you can see how
inbound traffic is 100% stable throughout the whole run.
> Something is queueing a lot of files; I would only expect that to result
> from the flush process if writing had stopped for a set of your RRDs. For
> example, if you have a large number of RRD files in the journal that
> aren't being re-written afterwards, then the first flush would contain all
> those.
>
> Did you use journal on startup for this test?
>
I cleared the journals before starting rrdcached.
> If you could profile during that "hump" it would be very instructive...
>
Yeah, no kidding. The question is how to reproduce this given the
callgrind slow-down. I may try later next week when I hope to have some
more time.
Thorsten
More information about the rrd-developers
mailing list