[rrd-developers] rrdcached performance problem

Fri Oct 30 17:47:43 CET 2009

kevin brintnall wrote:
> On Fri, Oct 30, 2009 at 09:19:30AM -0700, Thorsten von Eicken wrote:
>   
>> Thorsten von Eicken wrote:
>>     
>>>> 37.1 % of the time it spent in ?handle_request_update? the daemon is
>>>> actually waiting for ?realloc?. This is (to me) very unexpected and a
>>>> schoolbook example of ?measure before you optimize?.
>>>>
>>>> I think we can get rid of this bottleneck by writing a specialized
>>>> version of ?rrd_add_strdup? which reallocates powers of ten. Something
>>>> like:
>>>>
>>>> [...]
>>>>
>>>> It'd be great if you could give the attached patch a try
>>>>         
>>> spends all of its time (more than 99 %) in ?realloc?. So in consequence
>>> Test is running, including Kevin's simplification... Thanks for the 
>>> help!!!
>>>       
>> Things are again looking much better, almost great I should say! The one 
>> thing that still makes me a bit uncomfortable is that at the end of the 
>> second hour of run-time there was a cpu spike which caused collectd to 
>> grow rapidly. (Still using -w 3600 -z 3600 -f 7200, I put a load of ~50k 
>> tree nodes right from the start.) You can see the graphs at 
>> http://www.voneicken.com/dl/rrd/ look for the rrdcached-6* series. It 
>> flattens out nicely after the spike, but it's one of those things that 
>> tend to bite sooner or later. I'm not sure what to do about it.
>>     
>
> Thorsten,
>
> What's the start time on that daemon?  If it was started slightly before
> the graphs (~21:30-21:35), then it's possible that the CPU increase is
> associated with the first flush (-f).
>   
It was right at the start of the graphs. If you look at the 
"write-data_sets" graph, you can see very clearly how the flushing 
starts at 22:38-22:39 and the first hour of flushing ends 23:46-23:47. 
The cpu spike starts to build at 23:26-23:28 so it's not on a clean 
boundary at all.If you look at the if-packets graph you can see how 
inbound traffic is 100% stable throughout the whole run.
> Something is queueing a lot of files; I would only expect that to result
> from the flush process if writing had stopped for a set of your RRDs.  For
> example, if you have a large number of RRD files in the journal that
> aren't being re-written afterwards, then the first flush would contain all
> those.
>
> Did you use journal on startup for this test?
>   
I cleared the journals before starting rrdcached.
> If you could profile during that "hump" it would be very instructive...
>   
Yeah, no kidding. The question is how to reproduce this given the 
callgrind slow-down. I may try later next week when I hope to have some 
more time.

Thorsten