From steve at steveshipway.org Mon Oct 4 00:18:26 2010
From: steve at steveshipway.org (Steve Shipway)
Date: Mon, 4 Oct 2010 11:18:26 +1300
Subject: [mrtg-developers] MRTG check scheduling while in daemon mode
Message-ID: <000001cb6348$ee9b9580$cbd2c080$@org>

Has anyone on the list any thoughts on the MRTG check scheduler?

Currently (we're considering daemon mode only here), every 5 minutes it
will run ALL the Target checks sequentially, running multiple threads
according to the Forks: directive.  After all checks are finished, it will
sleep until the next 5-minute cycle starts.

This is sub-optimal because:

1) You get a huge burst of CPU usage followed by a period of silence,
   which can make the frontend go slow and messes up monitoring of the
   system's own CPU.

2) If the checks exceed the 5-minute window, you miss a polling cycle and
   need to tune your Forks: upwards by hand.

I would propose an alternative method of scheduling:

1. Rather than specifying a number of forks, make it a MAXIMUM number (a
   bit like when defining threads in Apache).

2. After the initial read of the CFG files, MRTG knows how many Targets
   there are.  Divide the Interval by this to get the interleave.  Then
   start a new check every interleave, starting a new thread if necessary
   and if we've not hit the maximum threads.  (A rough sketch of this
   loop follows below.)
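For illustration, a minimal sketch of how such an interleaved scheduler
might look in daemon mode (this is not existing MRTG code; poll_target(),
the target list and the limits below are invented stand-ins):

  #!/usr/bin/perl
  # Interleaved-scheduler sketch: launch one check every $interleave
  # seconds instead of all checks at once, capped at $max_forks workers.
  use strict;
  use warnings;
  use POSIX ':sys_wait_h';
  use Time::HiRes qw(sleep time);

  my $interval  = 300;                           # the usual 5-minute Interval
  my $max_forks = 8;                             # Forks: treated as a MAXIMUM
  my @targets   = map { "target$_" } 1 .. 100;   # stand-in for the CFG Targets
  my $interleave = $interval / @targets;         # spacing between launches
  my %kids;                                      # pid => target being polled

  sub poll_target {                 # one check: SNMP query + RRD update
      my ($targetname) = @_;
      exit 0;                       # child exits when its check is done
  }

  while (1) {                       # daemon mode: cycle forever
      my $cycle_start = time;
      for my $t (@targets) {
          # reap finished workers without blocking
          while ((my $pid = waitpid(-1, WNOHANG)) > 0) { delete $kids{$pid} }
          # at the fork ceiling, wait for a free slot rather than spawn more
          while (keys %kids >= $max_forks) {
              my $pid = waitpid(-1, 0);
              delete $kids{$pid} if $pid > 0;
          }
          my $pid = fork;
          die "fork failed: $!" unless defined $pid;
          poll_target($t) if $pid == 0;          # child does the poll
          $kids{$pid} = $t;                      # parent records the worker
          sleep $interleave;                     # spread load over the window
      }
      my $left = $cycle_start + $interval - time;
      sleep $left if $left > 0;                  # idle only if we ran early
  }

Note that when the fork ceiling bites, launches are delayed and the cycle
simply stretches past the Interval, rather than silently dropping a whole
polling cycle.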
Benefits would be that it can expand to handle more targets, and it
spreads the load over the window.

Disadvantages would be that it is hard to tell when you're reaching
capacity, and (more importantly) it might be hard to keep the optimisation
MRTG does where a single device is queried once for all its interfaces.

We coded up basically this system here; however, it didn't use MRTG in
daemon mode, which negates a lot of the benefits you can gain from daemon
mode and the new RRD memory-mapped I/O.  I've not yet looked at coding it
directly into the MRTG code.

Anyone have any thoughts?

Steve

_____

Steve Shipway
steve at steveshipway.org
Routers2.cgi web frontend for MRTG/RRD; NagEventLog Nagios agent for
Windows Event Log monitoring; check_vmware plugin for VMWare monitoring
in Nagios and MRTG; and other Open Source projects.
Web: http://www.steveshipway.org/software
Please consider the environment before printing this e-mail

From Niall.oReilly+mrtg-dev at ucd.ie Tue Oct 5 21:03:19 2010
From: Niall.oReilly+mrtg-dev at ucd.ie (Niall.oReilly+mrtg-dev at ucd.ie)
Date: Tue, 05 Oct 2010 20:03:19 +0100
Subject: [mrtg-developers] Fwd: Re: MRTG check scheduling while in daemon mode
Message-ID: <4CAB7677.3050207@ucd.ie>

Sorry, wrong identity first time.

/N

-------- Original Message --------
Subject: Re: [mrtg-developers] MRTG check scheduling while in daemon mode
Date: Tue, 05 Oct 2010 17:49:54 +0100
From: Niall O'Reilly
To: mrtg-developers at lists.oetiker.ch

On 03/10/10 23:18, Steve Shipway wrote:
> Has anyone on the list any thoughts on the MRTG check scheduler?

Not exactly, but rather some comments on the method you suggest.

> Currently (we're considering daemon mode only here), every 5 minutes it
> will run ALL the Target checks sequentially, running multiple threads
> according to the Forks: directive.  After all checks are finished, it
> will sleep until the next 5-minute cycle starts.
>
> This is sub-optimal because:
>
> 1) You get a huge burst of CPU usage followed by a period of silence,
> which can make the frontend go slow and messes up monitoring of the
> system's own CPU.
>
> 2) If the checks exceed the 5-minute window, you miss a polling cycle
> and need to tune your Forks: upwards by hand.

Agreed.

> I would propose an alternative method of scheduling:
>
> 1. Rather than specifying a number of forks, make it a MAXIMUM number
> (a bit like when defining threads in Apache).

Makes sense.

> 2. After the initial read of the CFG files, MRTG knows how many
> Targets there are.  Divide the Interval by this to get the interleave.

I'd be inclined to use a bigger number than the Target count (maybe
twice the count or so), and so create some slack for delays caused by
equipment or network incidents.

> Then start a new check every interleave, starting a new thread if
> necessary and if we've not hit the maximum threads.
>
> Benefits would be that it can expand to handle more targets, and it
> spreads the load over the window.
>
> Disadvantages would be that

Adding deliberate jitter to the probe cycle might disturb the
interpolation of values for the nominal (interval-aligned) probe
instants, or at least give rise to "interesting" aliasing.

> it is hard to tell when you're reaching capacity,

Depends on what is available by way of fork management.  Wouldn't it
make sense to (try to) count elapsed, CPU, and wait (disk and network)
times for each fork, and derive some estimate of remaining headroom?
I have very little idea of the level of difficulty involved in this.

> and (more importantly) it might be hard to keep the optimisation
> MRTG does where a single device is queried once for all its
> interfaces.

Probably less of a problem than it looks at first sight.  The grouping
of Targets MRTG already does could surely be exploited as input to the
interleaving calculation.

> We coded up basically this system here; however, it didn't use MRTG in
> daemon mode, which negates a lot of the benefits you can gain from
> daemon mode

Not only that, but retaining state from run to run may allow Target
'reputation' (based on delays and retries) to be used to tune the
interleaving strategy for the actual environment.  Without daemon mode,
this opportunity would either have to be systematically foregone, or
would require caching to disk.

> and the new RRD memory-mapped I/O.  I've not yet looked at coding it
> directly into the MRTG code.
>
> Anyone have any thoughts?

You did ask!  8-)

I hope this helps.

	Niall O'Reilly

From s.shipway at auckland.ac.nz Wed Oct 6 01:20:11 2010
From: s.shipway at auckland.ac.nz (Steve Shipway)
Date: Tue, 5 Oct 2010 23:20:11 +0000
Subject: [mrtg-developers] Fwd: Re: MRTG check scheduling while in daemon mode
In-Reply-To: <4CAB7677.3050207@ucd.ie>
References: <4CAB7677.3050207@ucd.ie>
Message-ID: <28E447343A85354483BCF7C3E9D5EAA51499C0BC@uxcn10-1.UoA.auckland.ac.nz>

> > 2. After the initial read of the CFG files, MRTG knows how many
> > Targets there are.  Divide the Interval by this to get the interleave.
>
> I'd be inclined to use a bigger number than the Target count
> (maybe twice the count or so), and so create some slack for
> delays caused by equipment or network incidents.

This makes sense; so maybe, rather than taking (Interval / #targets) as
the interleave, use (Interval*0.9 / #targets) to give 10% headroom?
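Expressed as code, the headroom is just a factor on the interleave
(the numbers below are invented for illustration):

  # 10% headroom: finish launching all checks in 90% of the Interval.
  my $interval   = 300;                  # seconds per polling cycle
  my $ntargets   = 500;                  # Targets found in the CFG files
  my $interleave = $interval * 0.9 / $ntargets;
  printf "start a new check every %.3fs\n", $interleave;   # prints 0.540s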
> Adding deliberate jitter to the probe cycle might disturb
> the interpolation of values for the nominal (interval-aligned)
> probe instants, or at least give rise to "interesting" aliasing.

This is true; however, if you keep the same schedule order for the
targets over the interval (i.e. don't recalculate it every cycle), then
you'll still have the 5-minute gap between pollings and so the jitter
will be minimal, provided you don't hit your forks limit.

> > it is hard to tell when you're reaching capacity,
>
> Depends on what is available by way of fork management.
> Wouldn't it make sense to (try to) count elapsed, CPU, and
> wait (disk and network) times for each fork, and derive some
> estimate of remaining headroom?

I'd say you always know your maximum forks (as defined by the Forks:
directive) and how many forks you're currently using (due to running
checks not completing before the next interleave period starts), so you
can identify capacity this way.

> > and (more importantly) it might be hard to keep the optimisation
> > MRTG does where a single device is queried once for all its
> > interfaces.
>
> Probably less of a problem than it looks at first sight.
> The grouping of Targets MRTG already does could surely
> be exploited as input to the interleaving calculation.

Maybe; I know MRTG will bypass subsequent checks to a device if previous
SNMP requests failed.  This might be harder to do in this new method,
because by the time the SNMP timeout hits, you've already kicked off new
threads for the other Targets.  However, since an SNMP thread sitting in
a timeout uses minimal resources, this might not be such an issue, though
it would eat up threads...

> > We coded up basically this system here; however, it didn't use MRTG
> > in daemon mode, which negates a lot of the benefits you can gain
> > from daemon mode
>
> Not only that, but retaining state from run to run may allow
> Target 'reputation' (based on delays and retries) to be used
> to tune the interleaving strategy for the actual environment.
> Without daemon mode, this opportunity would either have to be
> systematically foregone, or would require caching to disk.

Nice idea: if you have an array (preserved between cycles) that holds
the target processing order (with each item separated by the interleave
time), then this could be re-ordered to optimise.  Of course, if you
re-order it too much or too often, then you hit the jitter problem you
mentioned earlier.

Maybe have this array hold targetname/failcount/skipnextcount; then a
failed SNMP poll can cancel the /next/ poll for this device, and a
subsequent fail can cancel the /next two/ polls, and so on...  If you
get a fail for a specific target, then increment failcount for that
target, and set skipnextcount=failcount for all targets on the same
device.  Then at the next cycle, if skipnextcount>0 you decrement
skipnextcount and skip the poll.  If the poll succeeds, you set
failcount=0 for all targetnames on this device.
Such as this pseudocode (note that since the poll is done in a separate
thread, the actual processing is a little more complex; assume poll()
does the check and same_device() returns every target on the same device,
including the one just polled):

  for my $targetname (keys %targetqueue) {
      if ($targetqueue{$targetname}{skipnextcount}) {
          $targetqueue{$targetname}{skipnextcount}--;
          next;                              # penalty cycle: skip the poll
      }
      if (poll($targetname)) {
          # success clears the slate for the whole device
          for my $t (same_device($targetname)) {
              $targetqueue{$t}{skipnextcount} = 0;
              $targetqueue{$t}{failcount}     = 0;
          }
      } else {
          # failure backs off every target on the device a little longer
          my $fails = ++$targetqueue{$targetname}{failcount};
          for my $t (same_device($targetname)) {
              $targetqueue{$t}{skipnextcount} = $fails;
          }
      }
  }

Steve

Steve Shipway
ITS Unix Services Design Lead
University of Auckland, New Zealand
Floor 1, 58 Symonds Street, Auckland
Phone: +64 (0)9 3737599 ext 86487
DDI: +64 (0)9 924 6487
Mobile: +64 (0)21 753 189
Email: s.shipway at auckland.ac.nz
Please consider the environment before printing this e-mail

From Niall.oReilly+mrtg-dev at ucd.ie Wed Oct 6 09:59:23 2010
From: Niall.oReilly+mrtg-dev at ucd.ie (Niall.oReilly+mrtg-dev at ucd.ie)
Date: Wed, 06 Oct 2010 08:59:23 +0100
Subject: [mrtg-developers] Fwd: Re: MRTG check scheduling while in daemon mode
In-Reply-To: <28E447343A85354483BCF7C3E9D5EAA51499C0BC@uxcn10-1.UoA.auckland.ac.nz>
References: <4CAB7677.3050207@ucd.ie> <28E447343A85354483BCF7C3E9D5EAA51499C0BC@uxcn10-1.UoA.auckland.ac.nz>
Message-ID: <4CAC2C5B.7020603@ucd.ie>

On 06/10/10 00:20, Steve Shipway wrote:
> Such as this pseudocode (note that since the poll is done in a
> separate thread, the actual processing is a little more complex)

It's already too complex!  8-)

Having to copy the consequences of success or failure to the related
targets is a symptom of a problem.  The normalization concept from
database theory could be used to simplify things here.  If a device
queue (instead of a target queue) were to drive polling, data relating
to the device would not need to be duplicated.

Setting up the device queue entries, each with its list of attached
targets, would be part of configuration processing.  The set of target
data for each device could be processed and stored in the same thread
in conjunction with polling.  Alternatively, a second poll-free pass
over the device queue, not necessarily in the same thread, could take
care of this.

	Niall
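For concreteness, a minimal sketch of the device-keyed queue Niall
describes (the field names, poll_device() and the example device data
are invented for illustration; this is not MRTG's actual structure):

  # A normalised polling queue: one entry per device, with its Targets
  # attached, so success/failure is recorded once rather than copied.
  my %devicequeue = (
      'switch1.example.com' => {
          community     => 'public',
          failcount     => 0,
          skipnextcount => 0,
          targets       => [
              { name => 'switch1_fa0-1', oid => 'ifInOctets.1' },
              { name => 'switch1_fa0-2', oid => 'ifInOctets.2' },
          ],
      },
  );

  sub poll_device {                 # hypothetical: one query per device
      my ($dev, $targets) = @_;
      return 1;                     # pretend the SNMP query succeeded
  }

  for my $dev (keys %devicequeue) {
      my $d = $devicequeue{$dev};
      next if $d->{skipnextcount} && $d->{skipnextcount}--;
      if (poll_device($dev, $d->{targets})) {
          $d->{failcount} = $d->{skipnextcount} = 0;
      } else {
          $d->{skipnextcount} = ++$d->{failcount};
      }
  }

Every target on the device inherits the back-off implicitly, so the
copying loops in the earlier pseudocode disappear.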
From s.shipway at auckland.ac.nz Mon Oct 18 06:12:23 2010
From: s.shipway at auckland.ac.nz (Steve Shipway)
Date: Mon, 18 Oct 2010 04:12:23 +0000
Subject: [mrtg-developers] Running with rrdcached
Message-ID: <28E447343A85354483BCF7C3E9D5EAA5149A200A@uxcn10-1.UoA.auckland.ac.nz>

I've been running MRTG with rrdcached, and have just had rrdcached lock
up (out of memory - it seems to have a memory leak somewhere).  This
causes MRTG (in daemon mode) to fail all updates.  Fine; I restart
rrdcached and expect it all to work - but it doesn't.  It seems that once
MRTG has failed to connect to rrdcached, it never tries again until you
restart it.

Should MRTG die when a remote update fails?  Or should it re-attempt the
connection to rrdcached?  If so, I'm not sure how, as the RRDs Perl
module seems to be the bit to blame for this.

Thoughts?

Steve

________________________________

Steve Shipway
ITS Unix Services Design Lead
University of Auckland, New Zealand
Floor 1, 58 Symonds Street, Auckland
Phone: +64 (0)9 3737599 ext 86487
DDI: +64 (0)9 924 6487
Mobile: +64 (0)21 753 189
Email: s.shipway at auckland.ac.nz
Please consider the environment before printing this e-mail
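For what it's worth, a retry-and-fallback wrapper around RRDs::update
might look like the sketch below.  It assumes rrdtool 1.4 or later
(where update accepts a --daemon option); the socket address, filename
and update string are invented for illustration:

  use strict;
  use warnings;
  use RRDs;

  # Try the update through rrdcached, retry once, then fall back to a
  # direct write so a wedged daemon doesn't lose samples permanently.
  sub cached_update {
      my ($rrdfile, $daemon, @args) = @_;
      for my $attempt (1 .. 2) {
          RRDs::update($rrdfile, '--daemon', $daemon, @args);
          my $err = RRDs::error;
          return 1 unless $err;
          warn "rrdcached update failed (attempt $attempt): $err\n";
      }
      RRDs::update($rrdfile, @args);       # last resort: write directly
      my $err = RRDs::error;
      warn "direct update failed too: $err\n" if $err;
      return !$err;
  }

  cached_update('/var/mrtg/target.rrd',
                'unix:/var/run/rrdcached.sock', 'N:12345:67890');

Whether a direct write is safe while rrdcached still holds journal data
for the same file is a separate question; the sketch only shows where a
reconnect attempt could live.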