[rrd-users] Getting an overview of many statistics...

Fri May 29 15:14:21 CEST 2015

On Fri, 29 May 2015, Simon Hobson wrote:

> Peter Valdemar Mørch <peter at morch.com> wrote:
>
>> Looking at average and standard deviation is a possibility, but most of my users (and I) really have no good intuitive feeling for what standard deviation really "means".
>
> +1, I don't either

I recommend "Full House", by Stephen Jay Gould, or other essays of his.

Summary of one of his best-known explanations: Why are there no more
.400 hitters in baseball? Has the average quality of batters gone down, has
the average quality of pitchers gone up, or has some change to the rules
made batting harder in general? No, none of those. What has happened is
that the variability of batting has shrunk, so there is less distance
between the very top batters and the rest of the (major league, already a
select group) batters.

Standard deviation is a measure of variability. For roughly normally
distributed data, I think of it as the range around the expected value
within which about 68% of observations fall by random chance alone; a value
well outside that range is more likely to differ from the expected value
because of some real cause.
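
The 68% figure comes from the normal distribution: it is the area under the bell curve within one standard deviation of the mean. A minimal sketch that checks this numerically, assuming the data really are normally distributed (the helper names normal_pdf and prob_within are made up for this illustration):

```perl
use strict;
use warnings;

# Standard normal density at x.
sub normal_pdf {
    my ($x) = @_;
    my $pi = 4 * atan2(1, 1);
    return exp(-$x * $x / 2) / sqrt(2 * $pi);
}

# Probability that a normally distributed value lies within $k standard
# deviations of its mean, integrated with Simpson's rule over [-k, k].
sub prob_within {
    my ($k, $n) = @_;
    $n ||= 1000;                 # number of sub-intervals, must be even
    my $h   = 2 * $k / $n;
    my $sum = normal_pdf(-$k) + normal_pdf($k);
    for my $i (1 .. $n - 1) {
        my $x = -$k + $i * $h;
        $sum += ($i % 2 ? 4 : 2) * normal_pdf($x);
    }
    return $sum * $h / 3;
}

printf "within %d sd: %.4f\n", $_, prob_within($_) for 1, 2, 3;
```

The three results come out at roughly 0.68, 0.95 and 0.997, which is why a value two or more standard deviations out is usually treated as worth a second look.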

If Babe Ruth bats .300 in 1915 and .320 in 1916 (I am making up these
numbers), you would not think it was a big deal, because a .020 difference
in batting average is pretty small compared to the standard deviation of
player batting averages at the time. Whereas if David Ortiz bats .300 in
2015 and .320 in 2016, you might be justified in thinking this is the
result of something he is doing differently, because the .020 difference is
big compared to the standard deviation of player batting averages in 2015.
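
To put numbers on that comparison, here is a small sketch that expresses the same change in batting average as a number of standard deviations. The two standard deviations (.040 and .010) are invented purely for illustration, like the batting averages above, and z_score is a hypothetical helper name:

```perl
use strict;
use warnings;

# How many standard deviations separate an observed value from the
# expected one.
sub z_score {
    my ($observed, $expected, $sd) = @_;
    die "standard deviation must be positive\n" unless $sd > 0;
    return abs($observed - $expected) / $sd;
}

# Hypothetical spread of player batting averages in each era
# (made-up values, not real baseball statistics).
my %sd_of_era = (1915 => 0.040, 2015 => 0.010);

for my $era (sort keys %sd_of_era) {
    my $z = z_score(0.320, 0.300, $sd_of_era{$era});
    printf "%d: the change is %.1f standard deviations\n", $era, $z;
}
```

With these made-up spreads the same change is half a standard deviation in 1915 but two standard deviations in 2015, which is the whole point: the size of a change only means something relative to the variability.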

Anyway, I wanted to respond to the OP with a script I wrote, attached. The 
documentation is very scanty, but you never know when something will be 
useful to someone.
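
For anyone skimming before opening the attachment, the test applied in --read mode can be sketched in a few lines. This is a simplified illustration, not the script itself: the function name aberrance and the sample triples are hypothetical, while the thresholds 2 and 0.1 mirror $eachthreshold and $masterthreshold in the script.

```perl
use strict;
use warnings;

# Each stream is [observed value, predicted value, predicted deviation].
# Returns the proportion of usable streams whose Z score exceeds the
# per-stream threshold, or undef if no stream is usable.
sub aberrance {
    my ($each_threshold, @streams) = @_;
    my ($aberrant, $calm) = (0, 0);
    for my $s (@streams) {
        my ($real, $model, $dev) = @$s;
        # Skip streams with missing data or a zero deviation.
        next unless defined $real && defined $model && $dev;
        my $z = abs($real - $model) / $dev;
        $z > $each_threshold ? $aberrant++ : $calm++;
    }
    my $total = $aberrant + $calm;
    return $total ? $aberrant / $total : undef;
}

# Made-up sample data: only the second stream is far from its prediction.
my @streams = ([10, 10.5, 1], [200, 120, 20], [3, 3, 0.5], [7, 7.2, 0.4]);
my $share = aberrance(2, @streams);
printf "aberrance: %.4f%s\n", $share, $share > 0.1 ? " (alarm)" : "";
```

Counting the proportion of aberrant streams, rather than summing Z scores, keeps one wildly deviant counter from dominating the network-wide figure.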

   - Alex Aminoff
     BaseSpace.net
     National Bureau of Economic Research (nber.org)
-------------- next part --------------
#!/usr/bin/perl

=head1 NAME

sdna -- Statistical Detection of Network Aberrance

=head1 SYNOPSIS

# as a cron job, every 10 minutes

sdna --query --read --quiet

# command line

sdna --grep nonzeroerrors switch1 switch2
sdna --read

=head1 OPTIONS

 --debug   debug
 --query   Queries all targets and saves collected data to RRD files
           in the RRD directory
 --grep <shortcut> Grep mode. Collections of include and exclude regexps are
           hard coded. Implies --query.
 --read    Read RRD files, calculate stats, display most aberrant values
 --quiet   In read mode, display no output unless network aberrance is above a threshold.
 --config <file> Config file to read

=head1 DESCRIPTION

sdna is intended to run periodically from cron. It calls snmpbulkwalk
to collect all SNMP values from each target IP address, and stores
each in a RRD (Round-Robin Database) file.

sdna makes use of RRDTool's Aberrant Behavior Detection
functionality.  For each value, we derive an estimate of how aberrant
that value is, which is basically a Z value, or the number of standard
deviations out from our estimated mean for the value.

Then, we aggregate the aberrances of all the values to get a grand
estimate of how unusual or aberrant the current state of the network
as a whole is. If it is greater than a threshold, we send an alert to an
operator.

sdna can also be used from the command line to produce a list of the
most deviant SNMP variables across the entire network. This might be
used to find which switch port a misbehaving device is on.

A feature of this system is that we try to be agnostic about what each
SNMP variable represents. It does not matter if it is bandwidth or
packet loss or the speed of the link - all we care about is how
different it is from its predicted value based on history. In practice
we cannot be entirely pure about this; see $SKIP_PATTERN.

=head1 SEE ALSO

L<RRDs>,L<rrdtool(1)>,L<snmpbulkwalk(1)>

=head1 AUTHOR

Alex Aminoff, alex_aminoff at alum.mit.edu

=head1 COPYRIGHT

Copyright 2013, shared by National Bureau of Economic Research
and Alexander Aminoff

=cut

use Getopt::Long;
use RRDs;

my %byshortcut = ( 
    nonzeroerrors => [ [ qr/Error/o,1],
		       [ qr/: 0/o,0],
    ],
    );
#my $DIR = '/homes/nber/aminoff/DUMPHERE/nbersnmpdata/';
my $DIR = '/var/db/sdna/';

my $SKIP_PATTERN = qr/(SNMPv2-SMI::mib-2|SNMPv2-SMI::transmission|SNMPv2-MIB::snmp|IP-MIB::ipNetToMediaIfIndex|66\.251\.7|198\.71\.[67])/o;

my $debug = 0;
my $grep = '';
my $eachthreshold = 2; # threshold Z score to be counted as aberrant
my $masterthreshold = .1; # threshold proportion of aberrant tests for alarm
my ($query,$read,$quiet) = (0,0,0);
my $config = '';
my $nofork = 0;

GetOptions('query' => \$query,
	   'read' => \$read,
	   'grep=s' => \$grep,
	   'debug+' => \$debug,
	   'quiet' => \$quiet,
	   'nofork' => \$nofork,
	   'config=s' => \$config,
    );

if (! $query && ! $read && ! $grep) {
    # default operation
    $read=1;
}

if ($debug) {
    print "After cmd line args:\n debug:$debug query:$query read:$read quiet:$quiet grep:$grep \n";
}

if ($config) {
    die "Reading config file not yet implemented";
#'/etc/sdna.conf';
}

my @pats = ();
if ($grep) {
    print "grep is $grep\n" if $debug;
    @pats = @{ $byshortcut{$grep} };
}

# ms3closet ms4store48
#my @t24 = qw/ ms4store24 ms3b ms4bluebox24 ms3core ms323table ms323rack24 ms323rack24netgear /;
my @t24 = qw/ ms4store24 ms3b ms4bluebox24 ms3core ms323table ms323rack24 /;
my @t48 = qw/ms2upper ms2lower ms4bluebox48 /;
# removed for now 2014-03-14
# ms323rack48 /;

my @all = (@t24,@t48);
#my @all = ('ms3core','ms3b');
my @targets = @ARGV;
@targets = @all unless scalar(@targets);

print "targets: " . join(',',@targets) . "\n" if $debug; 

my $tenmin = 60 * 10; # 10 minutes
my $hour = 60 * 60;
my $perday_slices = 6 * 24; # number of 10min intervals in a day;
my $fiveweek_slices = 40 * $perday_slices; # 40 days of 10min slices (nearly six weeks, despite the name)
my $year_hours = 365 * 24;
my $alpha = .06; # half-life is 12 observations, 2 hours
my $beta = .0035; # recommended by man page

$|=1;

if ($query || $grep) {
    my %children = ();
    my %filehandles = ();
    foreach my $target (@targets) {
	my $pid;
	$pid = fork() unless $nofork;
	if($pid) {
	    # we are the parent
	    print "kicked off child process $pid for target $target.\n" if $debug;
	    $children{$pid}=$target;
	} else {
	    # we are the child, or there was no fork
	    print "This is the child process for $target.\n" if $debug && ! $nofork;
	    do_one_target($target);
	    exit 0 unless $nofork;
	}
    }
    unless($nofork) {
	while (my $pid = wait()) {
	    last if $pid == -1;
	    my $sw = $children{$pid};
	    print "finished with child process $pid for target $sw.\n" if $debug;
	    delete $children{$pid};
	    print "Remaining children: " . join(" " , map { $children{$_} } keys %children) . "\n" if $debug;
	}
	print "all query child processes finished.\n" if $debug;
    }
}

if ($read) {
    my $count = 0;
    my $abberrantcount = 0;
    my $calmcount = 0;
    my $fixedcount = 0;
    my $missingcount = 0;
    my %humanreadable = ();

    opendir(DH, $DIR) or die "Cannot open RRD directory $DIR: $!\n";
    while(my $file = readdir(DH)) {
	next if $file =~ m/^\./;
	# now we analyze
	my $real = getrrd($DIR . $file,'AVERAGE');
	my $model = getrrd($DIR . $file,'MHWPREDICT');
	my $deviation = getrrd($DIR . $file,'DEVPREDICT');

	print "read file $file real:$real model:$model dev:$deviation\n" if $debug;

	if (defined($real) && defined($model) && $deviation) {
	    # Statistics
	    my $sigma = abs($real - $model) / $deviation;
	    if ($sigma > $eachthreshold) {
		$abberrantcount += 1;
	    } else {
		$calmcount += 1;
	    }

	    my $key = $file;
	    $key =~ s/\.rrd$//;
	    my $pp = sprintf("%6.1f %-56s = ",$sigma,$key) .
		($real > 1 ? sprintf("%10.1f",$real) : sprintf("%1.4f",$real));

	    $humanreadable{ $pp } = $sigma;
	} elsif (defined($real) && $real == 0) {
	    # real is zero and other stuff not defined.
	    # almost certainly a fixed value
	    $fixedcount += 1;
	} else {
	    # what to do for missing values?
	    $missingcount += 1;
	}

    } # end scanning directory

    unless($calmcount) {
	print "No rrd data found yet.\n";
	exit 0;
    }

    print "Counts: calm:$calmcount abberrant:$abberrantcount fixed:$fixedcount missing:$missingcount\n" if $debug;

# Big stats test
# needs review by theoretical stats expert
    my $abberrance = undef;
    $abberrance = $abberrantcount / ($abberrantcount + $calmcount);
#    my $abberrance_missing = ($abberrantcount + $missingcount)
#	/ ($abberrantcount + $calmcount + $missingcount);

#    $abberrance = $abberrance_missing if  $abberrance_missing > $abberrance;

# so abberrance is just the proportion of abberrant streams in the sample

    if ($abberrance > $masterthreshold) {
	# Alarm
	$quiet = 0;
	print "\n\nAbberrant network condition detected: "
	    . sprintf("%.4f",$abberrance)
	    . " > $masterthreshold \n\n";
    }

    exit 0 if $quiet;

    print "Abberrance: ";
    print defined($abberrance) ? sprintf("%.4f\n",$abberrance) : "undefined\n";

    my @fewer = grep { $humanreadable{$_} > $eachthreshold } keys %humanreadable;
    foreach my $h ( sort { $humanreadable{$a} <=> $humanreadable{$b} } @fewer ) {
	print $h,"\n";
    }
}

sub do_one_target {
    my $target = shift;
    my $tname = sprintf("%-12s",$target);

    my($lines,$created,$wrote,$deleted,$errors)=(0,0,0,0,0);

    my $cmd = "/usr/local/bin/snmpbulkwalk -v 2c -c public $target";
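    open(my $infh,'-|',$cmd) or do {
	# could not start snmpbulkwalk; skip this target
	warn "ERROR: cannot run $cmd: $!\n";
	return;
    };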
  line:
    while(<$infh>) {
	$lines++;
	if(m/$SKIP_PATTERN/) {
	    exit 0 if $lines > 4000;
	    next line;
	}
	if ($grep) {
	    foreach my $row (@pats) {
		my($pat,$inout) = @{ $row };
		if ($inout) {
		    next line unless m/$pat/;
		} else {
		    next line if m/$pat/;
		}
	    }
	    print $tname,$_;
	} elsif ($query) {
	    my($key,$value) = split(/ *= */,$_);
	    my $item = $target . '.' . $key;
	    my $file = $DIR . $item . '.rrd';

	    $value =~ m/(\S+): (\d+\.?\d*)\s*(\S.*)?/ or next line; # ignore funny values
	    my($type,$realvalue,$rest)=($1,$2,$3);
	    next line if $type =~ m/(hex|BITS|STRING|IpAddr)/i;
	    $rest =~ s/^\s+//;
#	    next line if $rest =~ m/\s+/;
	    print "updating/creating $item = $realvalue, rest is $rest\n" if $debug;
	    if ( -e $file && ! -s $file) {
		# zero size!!! Delete it, it got corrupted.
		print "ERROR: Deleted file $file due to zero size\n";
		$deleted++;
		unlink($file);
	    }
	    if( ! -e $file) {
		# initialize RRD
		my $rrdtype = 'GAUGE';
		my($min,$max)=('U','U');
		if ($type =~ m/^counter/i) {
		    # create COUNTER type
		    $rrdtype = 'COUNTER';
		    $min=0;
		}
		if (! $rest || $rest =~ m/\s/) {
		    $rest = 'units';
		}
		print "INFO: Creating RRD of type $rrdtype for $file type $type unit $rest\n";
		my $ret = RRDs::create(
		    $file,
		    '--step',$tenmin,
		    "DS:$rest:$rrdtype:$hour:$min:$max",
		    "RRA:AVERAGE:0.3:1:$fiveweek_slices",
		    "RRA:MAX:0.3:1:$fiveweek_slices",
		    "RRA:AVERAGE:0.3:6:$year_hours",
		    "RRA:MAX:0.3:6:$year_hours",
		    "RRA:MHWPREDICT:$fiveweek_slices:$alpha:$beta:$perday_slices",
		    );
		my $err = RRDs::error;
		if ($err) {
		    print "ERROR creating $file type:$rrdtype ret:$ret error:$err\n";
		    $errors++;
		    next line;
		}
		$created++;
	    }
	    # add to RRD
	    my $ret = RRDs::update($file,"N:$realvalue");
	    my $err = RRDs::error;
	    if ($err =~ /Cannot allocate memory/) {
		# retry once
		sleep 2;
		$ret = RRDs::update($file,"N:$realvalue");
		$err = RRDs::error;
	    }
	    print "ERROR updating $file with value $realvalue ret:$ret err:$err\n" if $err;
	    $err ? $errors++ : $wrote++;
	}
    }
    close($infh);
    print "lines:$lines created:$created wrote:$wrote errors:$errors deleted:$deleted\n" if $debug;
    return;
}

sub getrrd {
    my($file,$CF)=@_;
    # returns the most recent value that is not undefined.
    my ($start,$step,$names,$data) = RRDs::fetch($file,$CF,'-s','-600');
    my $ret = undef;
    for my $line (@$data) {
	my $val = $line->[0];
	$ret = $val if defined($val);
    }
    return $ret;
}