[smokeping-users] Edgetrigger and repeating alert notifications

Wed Feb 24 15:36:04 CET 2016

Thanks Greg. Appreciate it. I've tried to simplify but still can't seem to get things right :(

I changed 'someloss' as suggested to pattern:
<5%, >5%, >5%

# 3 sequential steps (each a batch of 20 pings): the first with less than 5% packet loss (PL), followed by two steps with more than 5% PL each

'hostdown' remains pattern: ==0%,==0%,==0%, ==U

# 4 sequential steps, s1-s3: no loss, s4: unknown (host dead)

I had a server that was alive, with no loss. Then I suddenly dropped all ICMP packets, resulting in two email notifications in the following sequence:

1. Alert "someloss" was raised
Pattern
-------
<5%,>5%,>5%

Data (old --> now)
------------------
loss: 0%, 0%, 100%, 100%
rtt: 0ms, 0ms, U, U

Comment
-------
We've got loss 3 times in a row over the past 15min

2.  Alert "someloss" was cleared (received 5min after first alert above)
Pattern
-------
<5%,>5%,>5%

Data (old --> now)
------------------
loss: 0%, 100%, 100%, 100%
rtt: 0ms, U, U, U

Comment
-------
We've got loss 3 times in a row over the past 15min

I can understand why the first 'someloss' got raised, as the pattern is fulfilled by dropping all ICMP on the target. However, why didn't 'hostdown' fire? Shouldn't 'hostdown' fire, or do we never get there?
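
One possible explanation, sketched in Python: in the alert data quoted above, the loss row reads 100% while only the rtt row shows U, so a loss pattern ending in ==U may never see a U sample. The matcher below is a simplified model written for illustration, not Smokeping's actual code; the function names and the rule that U only matches an explicit ==U element are assumptions.

```python
import re

def parse_pattern(pattern):
    """Split a pattern such as '==0%,==0%, ==U' into (operator, value) pairs."""
    elements = []
    for raw in pattern.split(","):
        m = re.fullmatch(r"\s*(==|>=|<=|>|<)\s*(U|\d+(?:\.\d+)?)%?\s*", raw)
        if m is None:
            raise ValueError(f"unsupported pattern element: {raw!r}")
        op, value = m.groups()
        elements.append((op, value if value == "U" else float(value)))
    return elements

def element_matches(op, threshold, sample):
    """Compare one sample against one pattern element."""
    if threshold == "U" or sample == "U":
        # Assumption: "unknown" only matches an explicit ==U element.
        return op == "==" and threshold == sample
    return {
        "==": sample == threshold,
        ">": sample > threshold,
        "<": sample < threshold,
        ">=": sample >= threshold,
        "<=": sample <= threshold,
    }[op]

def window_matches(pattern, samples):
    """True if the most recent samples satisfy every pattern element in order."""
    elements = parse_pattern(pattern)
    if len(samples) < len(elements):
        return False
    recent = samples[-len(elements):]
    return all(element_matches(op, thr, s) for (op, thr), s in zip(elements, recent))

HOSTDOWN = "==0%,==0%,==0%, ==U"

# Loss values as they appear in the alert emails: a dead host shows 100%, not U.
print(window_matches(HOSTDOWN, [0, 0, 0, 100]))   # False
# Hypothetical history in which the loss value itself were reported as unknown.
print(window_matches(HOSTDOWN, [0, 0, 0, "U"]))   # True
```

Under this model, 'hostdown' would need a final element such as >=100% (or simply >5%) to match the loss values shown in the emails.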

Secondly, I don't understand why alert 1 was cleared 5min later. If the same rule is matching twice, shouldn't edgetrigger just keep it raised, as the device is still down, and not send a second email?
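
A minimal sketch of what may be happening, assuming edgetrigger sends one notification whenever the match state flips (raised on the first match, cleared when the pattern stops matching). The pattern <5%,>5%,>5% is hard-coded here, and the function names are hypothetical, not taken from Smokeping.

```python
def matches(window):
    """Stand-in for the pattern <5%,>5%,>5% applied to three loss values."""
    first, second, third = window
    return first < 5 and second > 5 and third > 5

def edge_events(loss_history, width=3):
    """Return (step, event) pairs for every change of the match state."""
    events = []
    active = False
    for end in range(width, len(loss_history) + 1):
        now = matches(loss_history[end - width:end])
        if now and not active:
            events.append((end, "raised"))
        elif active and not now:
            events.append((end, "cleared"))
        active = now
    return events

# Host is fine for three steps, then every ping is dropped.
history = [0, 0, 0, 100, 100, 100, 100]
print(edge_events(history))  # [(5, 'raised'), (6, 'cleared')]
```

In this model the window 0%, 100%, 100% matches, but one step later the window is 100%, 100%, 100% and the leading <5% element fails. The pattern therefore stops matching even though the host is still down, which would explain the "cleared" email and the silence afterwards.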

After the second 'someloss' cleared, I didn't receive any further emails.

My two alert patterns may be flawed, but in a nutshell, I just want to be able to detect reasonable (0-15%) PL and notify, and also to detect 'hostdown'. Is it possible to differentiate these two and have each notify appropriately, or what are the recommended patterns I should stick to?
Thx Will

On Wednesday, February 24, 2016 6:16 AM, Gregory Sloop <gregs at sloop.net> wrote:

Hi, I've had problems with repeating alert notifications for a target that has been completely down the whole time; it is shut down.

My Alerts pattern section is shown below. I get repeated 'someloss' triggers for this particular node although I've set edgetrigger, so I shouldn't have got a second notification; it seems to just continue.

I've read the Smokeping config manual, but I'm not sure whether my patterns below are wrong or whether I'm toggling the pattern unexpectedly.

I have restarted my Smokeping with the init script. I'm not sure if that means that all states are forgotten and emails will be resent?

The alert emails are all for the same target and look the same. I get:

pattern: >0%,>5%,>=5%

Loss: S, 100%, 100%, 100%
RTT: S, U, U, U

Please advise on this edge trigger and whether I have some faulty configuration below.

Standard Smokeping 2.6.8 Ubuntu 14.04 package

+someloss
type = loss
edgetrigger = yes
pattern = >0%,>5%,>=5%
comment = We've got loss 3 times in a row over the past 15min

+hostdown
type = loss
edgetrigger = yes
pattern = ==0%,==0%,==0%, ==U
comment = host down!

Thank you, William 

1) 
>pattern = >0%,>5%,>=5%
>comment = We've got loss 3 times in a row over the past 15min

That's a sample with *greater* than 0% loss, followed by one that's greater than 5%, and then a third immediately following of >=5%.
It's not 3 samples of >=5% over 15m - like in your desc. I suspect it's just the desc that's wrong, and it is actually as you intend.

Wouldn't this produce continuous matches?

Do you perhaps mean something like <5%,>5%,>5% [I tend to use < and >, not == - since I'd consider it "up" if it were 5% or less, not just zero percent loss.]
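
To illustrate the difference between the two patterns, here is a rough Python sketch that slides each one over the same loss history and lists the steps at which it matches. It is a simplified model covering numeric loss values only, with hypothetical helper names; it is not how Smokeping itself is implemented.

```python
import operator

# Comparison operators supported by this simplified model.
OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge,
       "<=": operator.le, "==": operator.eq}

def parse(pattern):
    """Turn '>0%,>5%,>=5%' into a list of (comparison function, threshold)."""
    checks = []
    for raw in pattern.split(","):
        raw = raw.strip().rstrip("%")
        op = raw[:2] if raw[:2] in OPS else raw[:1]
        checks.append((OPS[op], float(raw[len(op):])))
    return checks

def matching_steps(pattern, history):
    """Return the 1-based steps at which the latest samples satisfy the pattern."""
    checks = parse(pattern)
    n = len(checks)
    steps = []
    for end in range(n, len(history) + 1):
        window = history[end - n:end]
        if all(check(sample, thr) for (check, thr), sample in zip(checks, window)):
            steps.append(end)
    return steps

# Three healthy steps, then the host goes down and stays down.
history = [0, 0, 0, 100, 100, 100, 100, 100]

print(matching_steps(">0%,>5%,>=5%", history))  # [6, 7, 8]
print(matching_steps("<5%,>5%,>5%", history))   # [5]
```

In this model the first pattern keeps matching for as long as the host stays down, while the second matches only once, at the transition from up to down.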

---
2) >I have restarted my Smokeping with the init script. I'm not sure if that means that all states are forgotten and emails will be resent?

I don't believe smokeping keeps state between restarts. [Nagios does, I think.] So, a restart of SP may well generate a new set of alerts, depending on your alert conditions - even though the state hasn't changed.

---
I've always been fairly frustrated at smokeping's alerts. [It's just not that good at alerts - but it does do its core work really well, so I'll live with the alert failings.] What is good at alerts is Nagios.

I know I say this in almost every discussion about SP - but really consider using Nagios to handle alerts. There's a smokeping plug-in for Nagios, and Nagios can monitor a bunch of other things well too.

I use the smokeping alert pipe to run an MTR on the target to document the whole chain and where the problems are, and their severity. And in that case, I don't find it screwing up the edgetrigger operation and generating too many MTRs - so I'm 99% sure edgetrigger is operating as designed. [I'm using the Ubuntu/Debian packaged version too.]
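
For anyone wanting to try the same approach, a hypothetical alert-pipe script could look roughly like the Python sketch below. It is not the script described above. It assumes the piped program receives the positional arguments alert name, target, loss pattern, rtt pattern and hostname, in that order; check the Smokeping configuration documentation for your version before relying on this. It also assumes mtr is installed and on the PATH.

```python
import subprocess
import sys

# Assumed argument order for a program named in an alert's pipe target.
ARG_NAMES = ("alert", "target", "loss_pattern", "rtt_pattern", "hostname")

def parse_alert_args(argv):
    """Map the positional arguments to names; extra arguments are ignored."""
    if len(argv) < len(ARG_NAMES):
        raise ValueError(f"expected at least {len(ARG_NAMES)} arguments, got {len(argv)}")
    return dict(zip(ARG_NAMES, argv))

def build_mtr_command(hostname, cycles=10):
    """Build an mtr invocation that prints a report after a fixed number of cycles."""
    return ["mtr", "--report", "--report-cycles", str(cycles), hostname]

def main(argv):
    alert = parse_alert_args(argv)
    result = subprocess.run(
        build_mtr_command(alert["hostname"]),
        capture_output=True, text=True, check=False,
    )
    # Print the report; a real script might mail it or append it to a log file.
    print(f"Alert {alert['alert']} on {alert['target']}")
    print(result.stdout)

if __name__ == "__main__" and len(sys.argv) >= 6:
    main(sys.argv[1:])
```

Keeping the command construction in its own function makes it possible to check the script without actually running mtr.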

When you have problems with alerts try:
Simpler alert patterns. It's terribly easy to get "tricky" with patterns and then find they didn't work the way you expected. Use greater/less than expressions, rather than equal. Keep them as short and uncomplicated as possible. A simple pattern is a lot easier to troubleshoot.

HTH

-Greg