[smokeping-users] miniloss example alert creates a lot of alternating alerts

Peter Kristolaitis alter3d at alter3d.ca
Thu Jul 22 13:15:16 CEST 2010


Hi Marc;

The solution to your problem depends a bit on the alerting requirements 
at your site -- for example, do you care if alerts are delayed by one 
or more polling cycles in SmokePing?

My first suggestion would be to define an alert something like this:

+someloss
type = loss
pattern = ==0%, ==0%, ==0%, ==0%, ==0%, >0%, >0%, >0%
comment = Loss detected for last 3 polling cycles

This alert definition will trigger when you have 3 *consecutive* polling 
cycles with some packet loss, preceded by 5 loss-free cycles.  This is 
different from the alert you tried (>0%, *12*, >0%, *12*, >0%, *12*) 
because the "*12*" in your pattern acts as a wildcard... it will match 
ANYTHING.  So your alert pattern basically says "If we've seen >0% three 
times in the last 39 poll cycles, trigger an alert."  Based on the data 
samples you provided, I believe a consecutive model would suit your 
needs better.
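
To see why the consecutive pattern stays quiet on sporadic loss, here is 
a minimal Python sketch of the idea.  It is a simplified model, not 
SmokePing's own matcher: parse_pattern and matches_tail are made-up 
names, equality is written explicitly as ==0% (a bare 0% is accepted 
with the same meaning), wildcards such as *12* are not handled, and the 
"sustained" series is invented for illustration.

```python
import operator
import re

# Comparison operators understood by this sketch; a bare number means "==".
OPS = {"==": operator.eq, "!=": operator.ne, ">": operator.gt,
       "<": operator.lt, ">=": operator.ge, "<=": operator.le}

def parse_pattern(text):
    """Turn '==0%, >0%' into a list of (compare_fn, threshold) pairs."""
    entries = []
    for item in text.split(","):
        m = re.fullmatch(r"\s*(==|!=|>=|<=|>|<)?\s*(\d+(?:\.\d+)?)%?\s*", item)
        if m is None:
            # Wildcards like *12* land here; they are outside this sketch.
            raise ValueError(f"unsupported pattern entry: {item!r}")
        entries.append((OPS[m.group(1) or "=="], float(m.group(2))))
    return entries

def matches_tail(pattern, history):
    """True if the newest len(pattern) samples satisfy the pattern in order."""
    if len(history) < len(pattern):
        return False
    tail = history[-len(pattern):]
    return all(cmp(value, limit) for (cmp, limit), value in zip(pattern, tail))

someloss = parse_pattern("==0%, ==0%, ==0%, ==0%, ==0%, >0%, >0%, >0%")

sporadic = [0] * 21 + [10, 0, 0, 5, 0, 5]  # the 00:35:23 sample quoted below
sustained = [0] * 10 + [5, 10, 5]          # invented: three lossy cycles in a row

print(matches_tail(someloss, sporadic))   # False
print(matches_tail(someloss, sustained))  # True
```

The isolated 10% and 5% readings never line up as three lossy cycles in 
a row, so the pattern does not match the quoted sample.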

If you need to get alerts sooner for actual problems, consider defining  
a second alert as well... something like:

+bigloss
type = loss
pattern = ==0%, ==0%, ==0%, >20%
comment = We have sudden, severe packet loss


If you enable both alerts on your hosts, you will get alerts when you 
have persistent, low-to-moderate (1-20%) loss on the links, but you'll 
get an alert immediately when there are bigger problems (>20% loss).
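
As a rough illustration of how the two alerts complement each other, 
the Python sketch below replays two invented loss series cycle by cycle 
and records when each alert would first match.  This is a toy model of 
the two patterns above, not SmokePing code; ALERTS, fired and 
first_trigger are hypothetical names and the data is made up.

```python
# Each alert is a list of per-cycle checks, oldest first,
# applied to the newest samples of the loss history.
is_zero = lambda v: v == 0
some = lambda v: v > 0
severe = lambda v: v > 20

ALERTS = {
    "someloss": [is_zero] * 5 + [some] * 3,
    "bigloss": [is_zero] * 3 + [severe],
}

def fired(history):
    """Names of alerts whose pattern matches the newest samples of history."""
    names = []
    for name, checks in ALERTS.items():
        tail = history[-len(checks):]
        if len(tail) == len(checks) and all(c(v) for c, v in zip(checks, tail)):
            names.append(name)
    return names

def first_trigger(series):
    """Map each alert name to the 1-based cycle at which it first matches."""
    first = {}
    for cycle in range(1, len(series) + 1):
        for name in fired(series[:cycle]):
            first.setdefault(name, cycle)
    return first

# Invented loss percentages, one value per polling cycle.
moderate = [0, 0, 0, 0, 0, 5, 10, 5]
outage = [0, 0, 0, 0, 0, 60, 80, 100]

print(first_trigger(moderate))  # {'someloss': 8}
print(first_trigger(outage))    # {'bigloss': 6, 'someloss': 8}
```

In this model the moderate series is only reported after its third lossy 
cycle, while the outage is reported on its first lossy cycle.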

I think these rules will probably serve you well as a baseline, but 
don't be afraid to experiment.  I find it usually takes a couple of weeks 
of testing and tweaking to find an optimum set of alerts for any given 
network, simply due to differences in topology, architecture, etc.

- Peter




On 21/07/2010 2:44 AM, Marc Haber wrote:
> Hi,
>
> when a network device is quite busy (for example, when backup of some
> servers connected to this device is going on), it's going to drop some
> packets, resulting in loss data like this:
>
> 00:35:23
>     loss: 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%,
>           0%, 0%, 0%, 0%, 0%, 0%, 0%, 10%, 0%, 0%, 5%, 0%, 5%
> 00:35:52
>     loss: 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%,
>           0%, 0%, 0%, 0%, 0%, 0%, 10%, 0%, 0%, 5%, 0%, 5%, 0%
> 00:48:53
>     loss: 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 5%, 0%,
>           0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 5%, 0%, 0%, 5%
> 00:49:23
>     loss: 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 5%, 0%, 0%,
>           0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 5%, 0%, 0%, 5%, 0%
> 00:49:53
>     loss: 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 5%, 0%, 0%, 0%,
>           0%, 0%, 0%, 0%, 0%, 0%, 0%, 5%, 0%, 0%, 5%, 0%, 10%
> 00:50:23
>     loss: 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 5%, 0%, 0%, 0%, 0%,
>           0%, 0%, 0%, 0%, 0%, 0%, 5%, 0%, 0%, 5%, 0%, 10%, 0%
> 00:53:54
>     loss: 0%, 0%, 5%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 5%,
>           0%, 0%, 5%, 0%, 10%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 5%
> 00:54:24
>     loss: 0%, 5%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 5%, 0%,
>           0%, 5%, 0%, 10%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 5%, 0%
>
> When one has the miniloss alert from the smokeping_config defined,
> this causes the alarm to get raised and cleared multiple times over
> this rather short period of time:
>
> 00:35:23
>     loss: 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%,
>           0%, 0%, 0%, 0%, 0%, 0%, 0%, 10%, 0%, 0%, 5%, 0%, 5%
>     alarm raised
> 00:35:52
>     loss: 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%,
>           0%, 0%, 0%, 0%, 0%, 0%, 10%, 0%, 0%, 5%, 0%, 5%, 0%
>     alarm cleared
> 00:48:53
>     loss: 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 5%, 0%,
>           0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 5%, 0%, 0%, 5%
>     alarm raised
> 00:49:23
>     loss: 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 5%, 0%, 0%,
>           0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 5%, 0%, 0%, 5%, 0%
>     alarm cleared
> 00:49:53
>     loss: 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 5%, 0%, 0%, 0%,
>           0%, 0%, 0%, 0%, 0%, 0%, 0%, 5%, 0%, 0%, 5%, 0%, 10%
>     alarm raised
> 00:50:23
>     loss: 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 5%, 0%, 0%, 0%, 0%,
>           0%, 0%, 0%, 0%, 0%, 0%, 5%, 0%, 0%, 5%, 0%, 10%, 0%
>     alarm cleared
> 00:53:54
>     loss: 0%, 0%, 5%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 5%,
>           0%, 0%, 5%, 0%, 10%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 5%
>     alarm raised
> 00:54:24
>     loss: 0%, 5%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 5%, 0%,
>           0%, 5%, 0%, 10%, 0%, 0%, 0%, 0%, 0%, 0%, 0%, 5%, 0%
>     alarm cleared
>
> I am wondering whether it makes sense to clear the alarm just because
> there is a 0% in the last slot of the data being considered. This
> causes the alarm to flap in the case of occasional packet loss.
>
> I am thinking of either modifying the alarm so it only goes off for
> loss > 5%, like
>
>          +miniloss
>          type = loss
>          # in percent
>          pattern =>5%,*12*,>5%,*12*,>5%
>          comment = detected loss 3 times over the last two hours
>
> or to have it stay raised even if the current loss is 0%, like
>
>          +miniloss
>          type = loss
>          # in percent
>          pattern =>0%,*12*,>0%,*12*,>0%,*12*
>          comment = detected loss 3 times over the last two hours
>
> or
>          +miniloss
>          type = loss
>          # in percent
>          pattern =>0%,*12*,>0%,*12*,>0%,*12*,>=0%
>          comment = detected loss 3 times over the last two hours
>
> I would like to ask the more experienced users how you would act in my
> position. Would you ditch the miniloss alert altogether, would you
> modify it, and if so, how?
>
> Greetings
> Marc
>
>    


