[smokeping-users] Gaps in Graphs
alter3d at alter3d.ca
Mon Nov 12 22:50:33 CET 2007
I believe what actually happens is that if one scan cycle runs over the
time limit, that scan cycle will complete and write data to the RRDs,
but the *next* cycle won't run at all. I haven't actually looked at
the relevant source code, but that would probably be my approach -- you
wouldn't want a huge backlog of processes piling up and causing a
cascade failure. :)
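The behaviour guessed at above can be modelled in a few lines of Python. This is only a sketch of that guess, not something taken from the SmokePing source: the function name `simulate_rounds`, the idea of numbered "slots", and the rule that a new round starts only on a step boundary are all assumptions made for illustration.

```python
def simulate_rounds(durations, step=300):
    """Model the ASSUMED skip-on-overrun behaviour (not SmokePing's real code).

    durations: seconds each polling round takes, in order.
    step:      configured polling interval in seconds.
    Returns (started, skipped): slot numbers that ran and slot numbers missed.
    """
    started, skipped = [], []
    slot = 0
    for took in durations:
        started.append(slot)
        end = slot * step + took
        # First step boundary at or after the end of this round (integer ceil).
        next_slot = max(slot + 1, -(-end // step))
        # Every boundary that passed while the round was still running is lost.
        skipped.extend(range(slot + 1, next_slot))
        slot = next_slot
    return started, skipped


if __name__ == "__main__":
    # Rounds of 298 s, 301 s and 298 s against a 300 s step.
    print(simulate_rounds([298, 301, 298]))
```

Under this model a single 301-second round costs exactly one slot, which would show up as a one-step gap in every graph.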
I'm not sure what flags/options are available on the 1.x series; I know
there is a -d (debug) flag you can use when starting SmokePing that will
give extra information to the console, but to my knowledge there isn't a
way to gather metrics from a 'normal' SmokePing daemon. The simplest
approach would be to have a cron job monitor your logs and send out an
email when that error shows up in the logs, although that's simply a
Maybe Tobi or Niko can suggest a more elegant solution?
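The cron-job idea could look roughly like the Python sketch below. It assumes the warning appears on a single log line in the form quoted further down in this thread ("smokeping took N seconds to complete 1 round of polling. It should complete polling in M seconds."); other versions may word it differently. The function name `find_overruns` and the log path in the usage note are hypothetical.

```python
import re
import sys

# Pattern based on the warning text quoted in this thread; adjust if your
# SmokePing version words the message differently (assumption).
OVERRUN_RE = re.compile(
    r"smokeping took (\d+) seconds to complete 1 round of polling\.\s+"
    r"It should complete polling in (\d+) seconds"
)


def find_overruns(lines):
    """Return a list of (seconds_taken, seconds_allowed) for each warning found."""
    hits = []
    for line in lines:
        match = OVERRUN_RE.search(line)
        if match:
            hits.append((int(match.group(1)), int(match.group(2))))
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_smokeping_log.py <path-to-log>
    with open(sys.argv[1], errors="replace") as handle:
        for took, allowed in find_overruns(handle):
            print(f"SmokePing round took {took}s (limit {allowed}s)")
```

Run from cron against whatever file your syslog writes to (for example `/var/log/syslog`, which is an assumption about your setup). Cron normally mails any output to the address in `MAILTO`, so the script stays silent unless a warning is found. Note that it rescans the whole file on every run, so in practice you would want to track the last position read or scan only a recent slice of the log.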
Scott Moseman wrote:
> Hey Peter,
> Just as I started typing a reply, I saw the following message display
> in my syslog...
> WARNING: smokeping took 301 seconds to complete 1 round of polling. It
> should complete polling in 300 seconds. You may have unresponsive
> devices in your setup.
> I'm going to assume that Smokeping does not write ANY data to the RRDs
> if it cannot complete the polling for EVERY device in the config? Is
> that accurate? Other than seeing these alarms for when it fails, is
> there any way to see how long it's taking for the polls that succeed,
> so I can see how we're doing?
> On Nov 12, 2007 1:58 PM, Peter Kristolaitis <alter3d at alter3d.ca> wrote:
>> Hi Scott;
>> The first thing I would check would be to see if any new devices have been
>> added to your SmokePing config around the time you started having problems.
>> If so, check the SmokePing logs for warnings that look like "Warning:
>> Polling took longer than the interval step." or something similar.
>> What could be happening is that at some point, you had X devices monitored,
>> and they took, for example, 298 seconds to scan. If you added another
>> device, all of a sudden it might take 302 seconds to scan. If you had your
>> scan cycles set to 5 minutes (300 seconds), this means that SmokePing can't
>> complete a round of scanning before another one starts. This could
>> definitely cause the problems you've been seeing.
>> If this is the case, the solution is to either lengthen the scan cycle,
>> remove some hosts, increase concurrency (although I don't think that's
>> supported in 1.x?), or upgrade to the current SmokePing series and implement
>> multiple monitors and/or master/slave.
>> Scott Moseman wrote:
>> We're running Smokeping 1.34. Yes, I'm aware it's old, but it's been
>> deployed forever and it's been working fine. Lately we've been having
>> some weird issues with missed polls. I have attached a sample showing
>> the last 3 hours and it includes 3 missed holes. This happens across
>> every device in the Smokeping config. We have a script that runs
>> every 5 minutes to update the config if there's new entries and part
>> of that process is to verify the process is running (and restart, or
>> start, if necessary). It logs what's going on. When the config
>> updates and Smokeping restarts, there's never a gap. According to my
>> scripts, and looking at the age of the Smokeping process that's
>> running, these gaps were NOT caused by Smokeping having failed
>> execution. Also, I set up a ping tool to monitor the switch, router
>> and an external address every SECOND for a while. There was never a
>> lack of connectivity during these gaps in Smokeping. Is there any
>> means to troubleshoot? I will enable the syslog function to see if it
>> provides any details about what's going on.