[rrd-users] False positives with aberrant behavior detection

Mon Aug 16 18:02:25 CEST 2010

Hi Mike,

On Mon, Aug 16, 2010 at 07:27:14AM -0700, Mike Schilli wrote:
> 
> I'm trying to get aberrant behavior detection to work with rrdtool
> 1.3.8, but can't find a combination of alpha, beta, and gamma that gets
> me proper detection without an unacceptable number of false positives.
> 
> Here's the graph I got so far:
> 
>      http://perlmeister.com/tmp/rrdhelp.jpg

It would be easier to see why it's doing what it does if you plot the
confidence band, i.e., the lines above and below the "hwpredict" value
that an observation must fall outside of to be considered a violation.
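
As a rough sketch of what I mean, here is a small Python helper that
builds the "rrdtool graph" arguments for the band.  The file name
"temp.rrd", the DS name "temp", the AVERAGE RRA, and the colors are
assumptions for illustration; substitute your own.  The delta of 2 is
the usual default scaling of the deviation, so adjust it if you have
tuned yours.

```python
import shlex


def confidence_band_args(rrd, ds, delta=2.0):
    """Build rrdtool-graph arguments that draw the H-W confidence band.

    rrd and ds are placeholders (assumptions); the RRD must already
    contain HWPREDICT and DEVPREDICT RRAs for this data source.
    """
    return [
        f"DEF:obs={rrd}:{ds}:AVERAGE",
        f"DEF:pred={rrd}:{ds}:HWPREDICT",
        f"DEF:dev={rrd}:{ds}:DEVPREDICT",
        # RPN: upper = pred + delta*dev, lower = pred - delta*dev
        f"CDEF:upper=pred,dev,{delta:g},*,+",
        f"CDEF:lower=pred,dev,{delta:g},*,-",
        "LINE2:obs#0000ff:observed",
        "LINE1:pred#00ff00:hwpredict",
        "LINE1:upper#ff0000:upper bound",
        "LINE1:lower#ff0000:lower bound",
    ]


if __name__ == "__main__":
    cmd = ["rrdtool", "graph", "band.png", "--start", "-7d"]
    cmd += confidence_band_args("temp.rrd", "temp")
    # Print a shell-quoted command line to paste into a terminal.
    print(" ".join(shlex.quote(a) for a in cmd))
```

Any observation drawn outside the two red lines counts as a violation;
if nearly every point is outside them, the band is too tight.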

> The data is from a temperature sensor, which has a resolution of .5
> degrees Celsius. The data covers 7 days [1] and the rrdtool commands
> I've used are available at [2]. For this example, I've used alpha=0.5,
> beta=0.5, gamma=0.5, with a seasonal period of 60*24 (one day in
> one-minute steps).
> 
> What I've noticed so far:
> 
> * The green line (rrdtool's prediction) is only available after the 3rd
>    day. What's the reason for that?

Prediction, i.e., the "hwpredict" value, is based on past observations;
the algorithm needs prior data points to predict from, so it takes some
time to bootstrap.  Once the HWPREDICT RRA is populated, though, you
won't have to wait again (as long as you don't have gaps in your data
points/observations).
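
To illustrate why a warm-up is unavoidable, here is a minimal additive
Holt-Winters sketch in Python.  It is a textbook version with a
simplified bootstrap of my own choosing (mean of the first period,
zero slope), not rrdtool's actual initialization, so it won't
reproduce the exact delay you see; the point is only that no
prediction exists until seasonal coefficients are available.

```python
def holt_winters(series, alpha, beta, gamma, period):
    """One-step-ahead additive Holt-Winters predictions.

    Returns a list as long as `series`; entries are None until one full
    seasonal period has been observed (simplified bootstrap, assumed
    here for illustration only).
    """
    preds = [None] * len(series)
    if len(series) <= period:
        return preds

    # Bootstrap: baseline = mean of first period, slope = 0,
    # seasonal coefficient = deviation of each point from that mean.
    first = series[:period]
    a = sum(first) / period
    b = 0.0
    c = [y - a for y in first]

    for t in range(period, len(series)):
        s = c[t % period]          # seasonal coefficient from one period ago
        preds[t] = a + b + s       # prediction made before seeing series[t]
        y = series[t]
        a_new = alpha * (y - s) + (1 - alpha) * (a + b)   # baseline
        b = beta * (a_new - a) + (1 - beta) * b           # slope
        c[t % period] = gamma * (y - a_new) + (1 - gamma) * s
        a = a_new
    return preds
```

With a perfectly periodic input the predictions match the data as soon
as the first period has passed; before that there is nothing to plot.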

> * There's a clear jump in the middle of the graph which goes undetected.

This can happen (by design) if you have the H-W RRD attributes set to
only consider it errant if `n' samples fall outside the expected range
within the configured window of points.  Since this is a very short
duration anomaly (perhaps only one data point), it is not reported
as an error.  That's configurable: see the "threshold" and "window
length" values you set in the FAILURES RRA.  The default is that 7
observations of 9 must be outside the confidence band before a failure
(vs. the prediction) is reported.
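
The sliding-window rule can be sketched in a few lines of Python.  The
function name and the 0/1 input list are made up for illustration;
only the 7-of-9 defaults come from the above.

```python
from collections import deque


def failures(violations, threshold=7, window=9):
    """Flag each step where at least `threshold` of the last `window`
    observations were outside the confidence band.

    `violations` is an iterable of truthy/falsy values, one per step.
    """
    recent = deque(maxlen=window)   # keeps only the newest `window` flags
    out = []
    for v in violations:
        recent.append(bool(v))
        out.append(sum(recent) >= threshold)
    return out
```

A lone spike such as [0, 0, 0, 1, 0, 0, 0] never reaches the threshold,
so no failure is recorded; it takes seven violations within nine
consecutive steps to trip it.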

> * There's a high number of false positives, starting after the spike,
> and continuing until the end of the graph. I've tried various
> combinations of alpha, beta, and gamma to get rid of them but without
> success.

This would be easier to understand if you plot the confidence band.
It looks to me like your band is way too tight.

If you haven't already, I suggest reading Jake Brutlag's original
paper, available online from the LISA 2000 Conference:

   "Aberrant Behavior Detection in Time Series for Network Service Monitoring"
   http://www.usenix.org/events/lisa00/brutlag.html

I've also done some work in which we used this H-W implementation
for evaluation of our method; might be helpful:

   "A Signal Analysis of Network Traffic Anomalies"
   http://pages.cs.wisc.edu/~pb/paper_imw_02.pdf (sample parameters page 11 - 300 second step, IIRC)

   "Traffic Anomaly Detection at Fine Timescales with Bayes Nets"
   http://pages.cs.wisc.edu/~pb/icimp08_final.pdf (sample parameters page 8 - 1 second step)

Note that the HW parameters can be very sensitive to your "step" value.
So, don't expect defaults to work if they were meant for a 300 second
step, and you're using a 60 second step... as usual, it's best to
understand them completely to choose reasonable values.
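
One way to see the step sensitivity: in exponential smoothing, the
most recent n points carry a share 1 - (1 - alpha)^n of the total
weight.  The Python sketch below just rearranges that identity; the
function names and the 95% / 45-minute figures are illustrative
assumptions, not recommendations for your sensor.

```python
import math


def alpha_for(weight, window_seconds, step_seconds):
    """Alpha such that the newest `window_seconds` of data carry
    `weight` (0 < weight < 1) of the total smoothing weight."""
    n = window_seconds / step_seconds          # number of data points
    return 1 - math.exp(math.log(1 - weight) / n)


def window_for(alpha, weight, step_seconds):
    """Seconds of history that carry `weight` of the total for `alpha`."""
    return step_seconds * math.log(1 - weight) / math.log(1 - alpha)


if __name__ == "__main__":
    for step in (300, 60):
        a = alpha_for(0.95, 45 * 60, step)
        print(f"step={step}s: alpha={a:.3f} for 95% weight in 45 minutes")
    w = window_for(0.5, 0.95, 60)
    print(f"alpha=0.5 at a 60s step: 95% of weight in {w:.0f} seconds")
```

The same 45-minute memory needs an alpha of roughly 0.28 at a 300
second step but only about 0.06 at a 60 second step, and alpha=0.5 at
a 60 second step puts 95% of the weight on little more than the last
four minutes.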

Dave

-- 
plonka at cs.wisc.edu  http://net.doit.wisc.edu/~plonka/  Madison, WI