[rrd-users] trigger an alert?

Thu May 3 23:19:32 CEST 2007

John Conner wrote:
> Thanks a lot, Sven!
> 
> Still fairly new to rrdtool and never used the "updatev" option, gonna 
> check it right now.
> 
> Do you have any documents handy on how you implement this? if you do, 
> could you point me the link?

Sure, here are some quick notes on how to set up aberrant behaviour
detection for a data value. My example is based on actual monitoring
of a network link with fairly strong periodic behaviour; that is,
you can easily identify a repeating (daily) pattern in the traffic
graph.

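This is the rrdtool create command I use. I've added comments to
some of the lines; strip them before running it, because the shell
will not continue a line that has a comment after the backslash: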

rrdtool create network-uplink.rrd \
--start 1166600000 \
--step 120 \ # sample every 2 minutes
DS:pktsin:DERIVE:180:0:4294967295 \    #maintain counters for packets
DS:bytesin:DERIVE:180:0:4294967295 \   #and bytes, inbound and outbound
DS:pktsout:DERIVE:180:0:4294967295 \
DS:bytesout:DERIVE:180:0:4294967295 \
RRA:AVERAGE:0.5:1:840 \   #day graph
RRA:AVERAGE:0.5:15:384 \  #week graph
RRA:AVERAGE:0.5:60:384 \  #month graph
RRA:AVERAGE:0.5:720:400 \ #year graph
RRA:HWPREDICT:1440:0.05:0.0035:720:6 \ #Detailed notes below
RRA:SEASONAL:720:0.01:5 \
RRA:DEVSEASONAL:720:0.01:5 \
RRA:DEVPREDICT:1440:7 \
RRA:FAILURES:1440:5:25:7
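
As a quick sanity check of the "day/week/month/year" comments, the
time span each AVERAGE RRA holds is steps-per-row * rows * step. The
sketch below just does that arithmetic with the numbers from the
command above; the function and dictionary names are made up for
illustration.

```python
STEP = 120  # seconds, from --step 120

# (steps per row, rows) of the four AVERAGE RRAs in the create command
average_rras = {
    "day": (1, 840),
    "week": (15, 384),
    "month": (60, 384),
    "year": (720, 400),
}


def coverage_seconds(steps_per_row, rows, step=STEP):
    """Time span held by an RRA: each row covers steps_per_row * step seconds."""
    return steps_per_row * rows * step


for label, (steps, rows) in average_rras.items():
    secs = coverage_seconds(steps, rows)
    print(f"{label:5s}: {secs} s = {secs / 3600:.0f} h = {secs / 86400:.1f} d")
```

The day RRA comes out at 100800 seconds (28 hours), which is the same
window the graph command requests with --start end-100800.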

About the last RRAs:
- HWPREDICT is set up to use a seasonal period of 720 datapoints.
  720 datapoints with intervals of 2 minutes equals exactly 24 hours.
  I.e., the traffic pattern repeats every day. You might want to use
  an entire week as the seasonal period, depending on your patterns.
- The alpha, beta and gamma values are not all that easy to tune
  properly for your data source, in my opinion. I've chosen fairly
  generic values, based on those found in
  http://cricket.sourceforge.net/aberrant/rrd_hw.htm
- HWPREDICT is given rra-num 6, SEASONAL is given 5, and so on. The
  rra-num argument was not entirely easy to figure out based on the
  documentation, which states "The rra-num argument is the 1-based
  index in the order of RRA creation (that is, the order they appear
  in the create command)." It simply refers to the index number of
  the RRAs, counting from 1 (this includes *all* RRAs, AVERAGE too!).
  In the command above, HWPREDICT itself is RRA number 5, SEASONAL
  is 6 and DEVSEASONAL is 7.
  HWPREDICT should refer to the SEASONAL index, SEASONAL to HWPREDICT,
  DEVSEASONAL to HWPREDICT, DEVPREDICT to DEVSEASONAL and FAILURES
  to DEVSEASONAL.
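
To make the rra-num bookkeeping concrete, here is a small sketch that
derives the numbers from the order of the RRAs in the create command
and from the "who refers to whom" rule just described. The variable
and function names are only illustrative; nothing here talks to
rrdtool.

```python
# RRAs in the order they appear in the create command (1-based when counted)
rras = [
    "AVERAGE", "AVERAGE", "AVERAGE", "AVERAGE",
    "HWPREDICT", "SEASONAL", "DEVSEASONAL", "DEVPREDICT", "FAILURES",
]

# Which RRA each aberrant-behaviour RRA has to point at
points_to = {
    "HWPREDICT": "SEASONAL",
    "SEASONAL": "HWPREDICT",
    "DEVSEASONAL": "HWPREDICT",
    "DEVPREDICT": "DEVSEASONAL",
    "FAILURES": "DEVSEASONAL",
}


def rra_index(cf):
    """1-based position of the first RRA with this consolidation function."""
    return rras.index(cf) + 1


# The rra-num argument each RRA needs on its own line
rra_nums = {cf: rra_index(target) for cf, target in points_to.items()}
for cf, num in rra_nums.items():
    print(f"{cf} (RRA {rra_index(cf)}) takes rra-num {num}")

# Seasonal period: 720 datapoints at 120 seconds each, in hours
seasonal_period_hours = 720 * 120 / 3600
print("seasonal period:", seasonal_period_hours, "hours")
```

The resulting rra-num values (6, 5, 5, 7, 7) are the ones at the end
of the five RRA lines in the create command.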

The next step, which I would have worked out easily had I read the
documentation properly, is to adjust the positive and negative
confidence band factors. The default is 2, which I find a bit too
unforgiving for my scenario. To adjust it to 5, run:

  rrdtool tune network-uplink.rrd --deltapos 5 --deltaneg 5
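
To illustrate what these factors do, here is a rough model in Python.
It is a simplified sketch of the idea, not rrdtool's actual code: the
band is prediction +/- factor * deviation, a value outside the band is
a violation, and (as I read the FAILURES:1440:5:25:7 line) a failure
is flagged once at least 5 of the last 25 points are violations. All
function names are made up.

```python
from collections import deque


def confidence_band(predicted, deviation, deltapos=5.0, deltaneg=5.0):
    """Lower and upper bound around the prediction."""
    return predicted - deltaneg * deviation, predicted + deltapos * deviation


def is_violation(observed, predicted, deviation, deltapos=5.0, deltaneg=5.0):
    """True if the observed value falls outside the confidence band."""
    lower, upper = confidence_band(predicted, deviation, deltapos, deltaneg)
    return observed < lower or observed > upper


def failure_flags(violations, threshold=5, window=25):
    """For each point, flag a failure if the last `window` points
    (including this one) contain at least `threshold` violations."""
    recent = deque(maxlen=window)
    flags = []
    for violated in violations:
        recent.append(bool(violated))
        flags.append(sum(recent) >= threshold)
    return flags


# With prediction 100 and deviation 10, an observed 125 is outside the
# band for the default factor 2, but inside it for factor 5.
print(is_violation(125, 100, 10, deltapos=2, deltaneg=2))
print(is_violation(125, 100, 10))
```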

Here's how I draw the daily graph for the inbound byte counter (the
two inline comments must be removed before running it):

rrdtool graph \
daily.png \
--font LEGEND:7 \
--font UNIT:7 \
--font AXIS:7 \
--base 1024 \
-l 0 -r \
-w 400 \
-h 125 \
--start end-100800 \
-E \
--title "Network traffic, by day" \
--vertical-label "Bytes/sec" \
--x-grid "HOUR:1:DAY:1:HOUR:4:0:%H:%M" \
\
DEF:a_avg=network-uplink.rrd:bytesin:AVERAGE \
DEF:a_pred=network-uplink.rrd:bytesin:HWPREDICT \
DEF:a_dev=network-uplink.rrd:bytesin:DEVPREDICT \
DEF:a_fail=network-uplink.rrd:bytesin:FAILURES \
\
CDEF:a_normavg=a_avg \
CDEF:dev_lower=a_pred,a_dev,5,*,- \  # Note we're using 5 as the scaling factor
CDEF:dev_upper=a_pred,a_dev,5,*,+ \  # when graphing! Same as in the tune command.
CDEF:dev_area=dev_upper,dev_lower,- \
\
VDEF:a_last=a_avg,LAST \
VDEF:a_average=a_avg,AVERAGE \
\
AREA:dev_lower#ffffff \
AREA:dev_area#ccffcc::STACK \
TICK:a_fail#ff9999:1.0 \
LINE1:dev_lower#66ff66 \
LINE1:dev_upper#66ff66 \
LINE3:a_pred#66ff66 \
LINE1:a_normavg#666699 \
COMMENT:"Current\:" GPRINT:a_last:%6.2lf \
COMMENT:"Average\:" GPRINT:a_average:%6.2lf \
COMMENT:"\n" \
COMMENT:"Last update\: `date \"+%Y-%m-%d %H\\:%M\\:%S %Z\"`"\\r

Next, to actually have it report aberrant behaviour in real time,
as opposed to post-mortem, you'll need a wrapper script to run
'rrdtool updatev' and parse the output. There are probably fancy
bindings in perl for this, or some other graceful way of doing it.
My way is a quick python script that parses the output looking for
'FAILURES', and then checks whether the corresponding value is
greater than 0.0.
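
A minimal sketch of such a parser follows. It is not my actual script,
and it assumes updatev prints lines of the form
"[timestamp]RRA[CF][pdp count]DS[name] = value"; check that against
the output of your rrdtool version. The function names and the update
string in the usage note are placeholders.

```python
import re
import subprocess

# Assumed updatev line format: [1166600120]RRA[FAILURES][1]DS[bytesin] = 1.0e+00
LINE_RE = re.compile(
    r"^\[(?P<ts>\d+)\]RRA\[(?P<cf>\w+)\]\[(?P<pdp>\d+)\]"
    r"DS\[(?P<ds>[^\]]+)\]\s*=\s*(?P<value>\S+)$"
)


def find_failures(updatev_output):
    """Return (timestamp, ds_name) for every FAILURES entry greater than 0.0."""
    failures = []
    for line in updatev_output.splitlines():
        match = LINE_RE.match(line.strip())
        if not match or match.group("cf") != "FAILURES":
            continue
        try:
            value = float(match.group("value"))
        except ValueError:
            continue  # not a number we can interpret, skip it
        if value > 0.0:  # NaN compares false, so unknown values are ignored
            failures.append((int(match.group("ts")), match.group("ds")))
    return failures


def update_and_check(rrd_file, update_string):
    """Run 'rrdtool updatev' and return the failures found in its output."""
    result = subprocess.run(
        ["rrdtool", "updatev", rrd_file, update_string],
        capture_output=True, text=True, check=True,
    )
    return find_failures(result.stdout)
```

You would call it as, for example,
update_and_check("network-uplink.rrd", "N:1:2:3:4"), with the four
counter values in the order the DS lines were created, and alert when
the returned list is not empty.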

Well, that's pretty much it. Good luck!

Sven

>>     I use the aberrant behaviour detection in rrdtool and I find
>>     it quite handy. To detect problems, I use the 'rrdtool updatev'
>>     command, which will output FAILURE=1.0 (different syntax), if
>>     it detects failures. FAILURE=0.0 if not. In other words, I parse
>>     the output of the command, and trigger alerts based on it. You
>>     should probably implement a wrapper around the parsing/alarming,
>>     so that you won't get flooded with mails/SMS messages every five
>>     minutes while a deviation is happening.
>> 
>>     Sven
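
The flood protection mentioned in the quoted message could be as
simple as remembering when the last alert went out for each data
source. This is only a sketch under that assumption; the class name,
the one-hour default and the in-memory storage are all my own choices
for illustration.

```python
import time


class AlertThrottle:
    """Allow at most one alert per key every `min_interval` seconds."""

    def __init__(self, min_interval=3600.0, clock=time.monotonic):
        self.min_interval = min_interval
        self.clock = clock          # injectable so it can be tested
        self._last_sent = {}        # key -> time of the last alert

    def should_send(self, key):
        """Return True (and record the time) if an alert may go out now."""
        now = self.clock()
        last = self._last_sent.get(key)
        if last is not None and now - last < self.min_interval:
            return False            # still inside the quiet period
        self._last_sent[key] = now
        return True


throttle = AlertThrottle(min_interval=3600.0)
if throttle.should_send("bytesin"):
    print("would send alert for bytesin")
```

Note that the state lives in memory, so this only helps in a
long-running process. If the wrapper is started fresh from cron on
every update, the last-sent times have to be kept in a file instead.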