[rrd-developers] Possible race conditions / data loss in rrdcached?

kevin brintnall kbrint at rufus.net
Sun Jul 12 17:23:19 CEST 2009


On Sun, Jul 12, 2009 at 12:15:54PM +0200, Sebastian Harl wrote:
> Hi,
> 
> I've just stumbled across the following (possible) issue. Unfortunately,
> I don't have the time to have a more in-depth look at it, but maybe
> Kevin and / or Florian can comment on that:

Sebastian, your summary is correct.

This unlock allows rrdcached to continue processing other updates as soon
as possible.  stat() may block on a system that is heavily IO constrained
(sometimes for multiple seconds).  If the cache lock were held during that
time, we'd introduce the exact type of IO-bound update delay that
rrdcached is aiming to avoid.

With sub-second updates, the incoming update rate is even higher; thus,
the potential for a single stat() to delay updates is even higher.

For rrdcached to be beneficial, the arrival rate of new nodes must be
_significantly_ smaller than the arrival rate of updates.  Therefore, I
think this is a sensible decision.  If there is strong community feedback
to the contrary, we can re-evaluate.

-- 
 kevin brintnall =~ /kbrint at rufus.net/


> In handle_request_update(), the following code is used:
> 
>   pthread_mutex_lock (&cache_lock);
>   ci = g_tree_lookup (cache_tree, file);
> 
>   if (ci == NULL) /* {{{ */
>   {
> 
> /* I.e. the file is not yet available in the cache => we need to create
>  * a cache item and insert in into the tree. */
> 
>   [...]
>   /* don't hold the lock while we setup; stat(2) might block */
>   pthread_mutex_unlock(&cache_lock);
> 
>   [...] /* Allocate and initialize the cache item. */
> 
>   pthread_mutex_lock(&cache_lock);
>   g_tree_replace (cache_tree, (void *) ci->file, (void *) ci);
> 
> Now, in the mean time, another "UPDATE" command could have been issued
> on the same file, which would have created the cache item as well which
> would then be replaced by the "new" one when calling g_tree_replace(),
> losing the data submitted by the other "UPDATE" command.
> 
> g_tree_replace() will take care of properly freeing the "old" cache
> entry (using free_cache_item()), however that does not take care of
> flushing the file.
> 
> I agree that this describes a very unlikely situation but it is
> semantically wrong and might result in problems that are very hard to
> trace back. Suppose that multiple updates are accumulated into one
> "UPDATE" command before being submitted to rrdcached (for whatever
> reason). Then all data of some UPDATE might be lost. When thinking about
> plans to introduce sub-second resolution some time in the future, this
> might become a real problem.
> 
> Am I missing something? Any comments?
> 
> Cheers,
> Sebastian
> 
> -- 
> Sebastian "tokkee" Harl +++ GnuPG-ID: 0x8501C7FC +++ http://tokkee.org/
> 
> Those who would give up Essential Liberty to purchase a little Temporary
> Safety, deserve neither Liberty nor Safety.         -- Benjamin Franklin
> 



> _______________________________________________
> rrd-developers mailing list
> rrd-developers at lists.oetiker.ch
> https://lists.oetiker.ch/cgi-bin/listinfo/rrd-developers



More information about the rrd-developers mailing list