[rrd-users] Getting an overview of many statistics...
Alex Aminoff
aminoff at nber.org
Fri May 29 15:14:21 CEST 2015
On Fri, 29 May 2015, Simon Hobson wrote:
> Peter Valdemar Mørch <peter at morch.com> wrote:
>
>> Looking at average and standard deviation is a possibility, but most of my users (and I) really have no good intuitive feeling for what standard deviation really "means".
>
> +1, I don't either
I recommend "Full House", by Stephen Jay Gould, or other essays of his.
Summary of one of his most well-known explanations: Why are there no more
.400 hitters in baseball? Has the average quality of batters gone down, or
the average quality of pitchers gone up, or has some change to the rules
made batting harder in general? No, none of those. What has happened is
that the variability of batting has shrunk. So there is less distance
between the very top batters and the rest of the (major league, already a
select group) batters.
Standard deviation is a measure of variability; I think of it as the range
around the expected value within which an observed value will fall about
68% of the time by random chance alone, assuming a roughly normal
distribution (as opposed to being different from the expected value
because of some real cause).
If Babe Ruth bats .300 in 1915 and .320 in 1916 (I am making up these
numbers), you would not think it was a big deal, because a .020 difference
in batting average is pretty small compared to the standard deviation of
player batting averages at the time. Whereas if David Ortiz bats .300 in
2015 and .320 in 2016, you might be justified in thinking this is the
result of something he is doing differently, because the .020 difference is
big compared to the standard deviation of player batting averages in 2015.
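To put rough numbers on that, here is a minimal Perl sketch. The two
league-wide standard deviations (.040 for 1915, .010 for 2015) are invented
for illustration, just like the batting averages, and change_in_sd is a
made-up helper name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical league-wide standard deviations of batting average.
# These are assumptions for illustration, not historical figures.
my %league_sd = (
    1915 => 0.040,   # assumed: wide spread between players
    2015 => 0.010,   # assumed: narrow spread between players
);

# Express a change in batting average in units of standard deviation.
sub change_in_sd {
    my ($before, $after, $sd) = @_;
    return abs($after - $before) / $sd;
}

for my $year (sort keys %league_sd) {
    my $z = change_in_sd(0.300, 0.320, $league_sd{$year});
    printf "%d: a .020 change is %.1f standard deviations\n", $year, $z;
}
```

With these assumed spreads, the same .020 change is half a standard
deviation in 1915 and two standard deviations in 2015.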
Anyway, I wanted to respond to the OP with a script I wrote, attached. The
documentation is very scanty, but you never know when something will be
useful to someone.
- Alex Aminoff
BaseSpace.net
National Bureau of Economic Research (nber.org)
-------------- next part --------------
#!/usr/bin/perl
=head1 NAME
sdna -- Statistical Detection of Network Aberrance
=head1 SYNOPSIS
# as a cron job, every 10 minutes
sdna --query --read --quiet
# command line
sdna --grep nonzeroerrors switch1 switch2
sdna --read
=head1 OPTIONS
--debug Debug output
--query Queries all targets and saves collected data to RRD files
in the RRD directory
--grep <shortcut> Grep mode. Collections of include and exclude regexps are
hard coded. Implies --query.
--read Read RRD files, calculate stats, display most aberrant values
--quiet In read mode, display no output unless network aberrance is above a threshold.
--nofork Do not fork a child process per target; query targets one at a time
--config <file> Config file to read (not yet implemented)
=head1 DESCRIPTION
sdna is intended to run periodically from cron. It calls snmpbulkwalk
to collect all SNMP values from each target IP address, and stores
each in a RRD (Round-Robin Database) file.
sdna makes use of RRDTool's Aberrant Behavior Detection
functionality. For each value, we derive an estimate of how aberrant
that value is, which is basically a Z value, or the number of standard
deviations out from our estimated mean for the value.
Then, we aggregate all the aberrances of all the values to get a grand
estimate of how unusual or aberrant the current state of the network
as a whole is. If greater than a threshold, we send an alert to an
operator.
sdna can also be used from the command line to produce a list of the
most deviant SNMP variables across the entire network. This might be
used to find which switch port a misbehaving device is on.
A feature of this system is that we try to be agnostic about what each
SNMP variable represents. It does not matter if it is bandwidth or
packet loss or the speed of the link - all we care about is how
different it is from its predicted value based on history. In practice
we cannot quite be pure about this; see $SKIP_PATTERN.
=head1 SEE ALSO
L<RRDs>,L<rrdtool(1)>,L<snmpbulkwalk(1)>
=head1 AUTHOR
Alex Aminoff, alex_aminoff at alum.mit.edu
=head1 COPYRIGHT
Copyright 2013, shared by National Bureau of Economic Research
and Alexander Aminoff
=cut
use Getopt::Long;
use RRDs;
my %byshortcut = (
nonzeroerrors => [ [ qr/Error/o,1],
[ qr/: 0/o,0],
],
);
#my $DIR = '/homes/nber/aminoff/DUMPHERE/nbersnmpdata/';
my $DIR = '/var/db/sdna/';
my $SKIP_PATTERN = qr/(SNMPv2-SMI::mib-2|SNMPv2-SMI::transmission|SNMPv2-MIB::snmp|IP-MIB::ipNetToMediaIfIndex|66\.251\.7|198\.71\.[67])/o;
my $debug = 0;
my $grep = '';
my $eachthreshold = 2; # threshold Z score to be counted as aberrant
my $masterthreshold = .1; # threshold proportion of aberrant tests for alarm
my ($query,$read,$quiet) = (0,0,0);
my $config = '';
my $nofork = 0;
GetOptions('query' => \$query,
'read' => \$read,
'grep=s' => \$grep,
'debug+' => \$debug,
'quiet' => \$quiet,
'nofork' => \$nofork,
'config=s' => \$config,
);
if (! $query && ! $read && ! $grep) {
# default operation
$read=1;
}
if ($debug) {
print "After cmd line args:\n debug:$debug query:$query read:$read quiet:$quiet grep:$grep \n";
}
if ($config) {
die "Reading config file not yet implemented";
#'/etc/sdna.conf';
}
my @pats = ();
if ($grep) {
print "grep is $grep\n" if $debug;
@pats = @{ $byshortcut{$grep} };
}
# ms3closet ms4store48
#my @t24 = qw/ ms4store24 ms3b ms4bluebox24 ms3core ms323table ms323rack24 ms323rack24netgear /;
my @t24 = qw/ ms4store24 ms3b ms4bluebox24 ms3core ms323table ms323rack24 /;
my @t48 = qw/ms2upper ms2lower ms4bluebox48 /;
# removed for now 2014-03-14
# ms323rack48 /;
my @all = (@t24,@t48);
#my @all = ('ms3core','ms3b');
my @targets = @ARGV;
@targets = @all unless scalar(@targets);
print "targets: " . join(',',@targets) . "\n" if $debug;
my $tenmin = 60 * 10; # 10 minutes
my $hour = 60 * 60;
my $perday_slices = 6 * 24; # number of 10min intervals in a day;
my $fiveweek_slices = 40 * $perday_slices; # 40 days of 10min slices, a little over five weeks
my $year_hours = 365 * 24;
my $alpha = .06; # half-life is about 12 observations (log(.5)/log(1-.06) is roughly 11.2), 2 hours
my $beta = .0035; # recommended by man page
$|=1;
if ($query || $grep) {
my %children = ();
my %filehandles = ();
foreach my $target (@targets) {
my $pid;
$pid = fork() unless $nofork;
if($pid) {
# we are the parent
print "kicked off child process $pid for target $target.\n" if $debug;
$children{$pid}=$target;
} else {
# we are the child, or there was no fork
print "This is the child process for $target.\n" if $debug && ! $nofork;
do_one_target($target);
exit 0 unless $nofork;
}
}
unless($nofork) {
while (my $pid = wait()) {
last if $pid == -1;
my $sw = $children{$pid};
print "finished with child process $pid for target $sw.\n" if $debug;
delete $children{$pid};
print "Remaining children: " . join(" " , map { $children{$_} } keys %children) . "\n" if $debug;
}
print "all query child processes finished.\n" if $debug;
}
}
if ($read) {
my $count = 0;
my $abberrantcount = 0;
my $calmcount = 0;
my $fixedcount = 0;
my $missingcount = 0;
my %humanreadable = ();
opendir(DH, $DIR) or die "Cannot open RRD directory $DIR: $!\n";
while(my $file = readdir(DH)) {
next if $file =~ m/^\./;
# now we analyze
my $real = getrrd($DIR . $file,'AVERAGE');
my $model = getrrd($DIR . $file,'MHWPREDICT');
my $deviation = getrrd($DIR . $file,'DEVPREDICT');
print "read file $file real:$real model:$model dev:$deviation\n" if $debug;
if (defined($real) && defined($model) && $deviation) {
# Statistics
my $sigma = abs($real - $model) / $deviation;
if ($sigma > $eachthreshold) {
$abberrantcount += 1;
} else {
$calmcount += 1;
}
my $key = $file;
$key =~ s/\.rrd$//;
my $pp = sprintf("%6.1f %-56s = ",$sigma,$key) .
($real > 1 ? sprintf("%10.1f",$real) : sprintf("%1.4f",$real));
$humanreadable{ $pp } = $sigma;
} elsif (defined($real) && $real == 0) {
# real is zero and other stuff not defined.
# almost certainly a fixed value
$fixedcount += 1;
} else {
# what to do for missing values?
$missingcount += 1;
}
} # end scanning directory
unless($calmcount + $abberrantcount) {
print "No rrd data found yet.\n";
exit 0;
}
print "Counts: calm:$calmcount abberrant:$abberrantcount fixed:$fixedcount missing:$missingcount\n" if $debug;
# Big stats test
# needs review by theoretical stats expert
my $abberrance = undef;
$abberrance = $abberrantcount / ($abberrantcount + $calmcount);
# my $abberrance_missing = ($abberrantcount + $missingcount)
# / ($abberrantcount + $calmcount + $missingcount);
# $abberrance = $abberrance_missing if $abberrance_missing > $abberrance;
# so abberrance is just the proportion of abberrant streams in the sample
if ($abberrance > $masterthreshold) {
# Alarm
$quiet = 0;
print "\n\nAbberrant network condition detected: "
. sprintf("%.4f",$abberrance)
. " > $masterthreshold \n\n";
}
exit 0 if $quiet;
print "Abberrance: ";
print defined($abberrance) ? sprintf("%.4f\n",$abberrance) : "undefined\n";
my @fewer = grep { $humanreadable{$_} > $eachthreshold } keys %humanreadable;
foreach my $h ( sort { $humanreadable{$a} <=> $humanreadable{$b} } @fewer ) {
print $h,"\n";
}
}
sub do_one_target {
my $target = shift;
my $tname = sprintf("%-12s",$target);
my($lines,$created,$wrote,$deleted,$errors)=(0,0,0,0,0);
my $cmd = "/usr/local/bin/snmpbulkwalk -v 2c -c public $target";
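open(my $infh,'-|',$cmd) or do {
# could not start snmpbulkwalk; report it and give up on this target
print "ERROR: cannot run $cmd: $!\n";
return;
};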
line:
while(<$infh>) {
$lines++;
if(m/$SKIP_PATTERN/) {
exit 0 if $lines > 4000;
next line;
}
if ($grep) {
foreach my $row (@pats) {
my($pat,$inout) = @{ $row };
if ($inout) {
next line unless m/$pat/;
} else {
next line if m/$pat/;
}
}
print $tname,$_;
} elsif ($query) {
my($key,$value) = split(/ *= */,$_,2); # limit 2: keep any '=' inside the value
my $item = $target . '.' . $key;
my $file = $DIR . $item . '.rrd';
$value =~ m/(\S+): (\d+\.?\d*)\s*(\S.*)?/ or next line; # ignore funny values
my($type,$realvalue,$rest)=($1,$2,$3);
next line if $type =~ m/(hex|BITS|STRING|IpAddr)/i;
$rest =~ s/^\s+//;
# next line if $rest =~ m/\s+/;
print "updating/creating $item = $realvalue, rest is $rest\n" if $debug;
if ( -e $file && ! -s $file) {
# zero size!!! Delete it, it got corrupted.
print "ERROR: Deleted file $file due to zero size\n";
$deleted++;
unlink($file);
}
if( ! -e $file) {
# initialize RRD
my $rrdtype = 'GAUGE';
my($min,$max)=('U','U');
if ($type =~ m/^counter/i) {
# create COUNTER type
$rrdtype = 'COUNTER';
$min=0;
}
if (! $rest || $rest =~ m/\s/) {
$rest = 'units';
}
print "INFO: Creating RRD of type $rrdtype for $file type $type unit $rest\n";
my $ret = RRDs::create(
$file,
'--step',$tenmin,
"DS:$rest:$rrdtype:$hour:$min:$max",
"RRA:AVERAGE:0.3:1:$fiveweek_slices",
"RRA:MAX:0.3:1:$fiveweek_slices",
"RRA:AVERAGE:0.3:6:$year_hours",
"RRA:MAX:0.3:6:$year_hours",
"RRA:MHWPREDICT:$fiveweek_slices:$alpha:$beta:$perday_slices",
);
my $err = RRDs::error;
if ($err) {
print "ERROR creating $file type:$rrdtype ret:$ret error:$err\n";
$errors++;
next line;
}
$created++;
}
# add to RRD
my $ret = RRDs::update($file,"N:$realvalue");
my $err = RRDs::error;
if ($err =~ /Cannot allocate memory/) {
# retry once
sleep 2;
$ret = RRDs::update($file,"N:$realvalue");
$err = RRDs::error;
}
print "ERROR updating $file with value $realvalue ret:$ret err:$err\n" if $err;
$err ? $errors++ : $wrote++;
}
}
close($infh);
print "lines:$lines created:$created wrote:$wrote errors:$errors deleted:$deleted\n" if $debug;
return;
}
sub getrrd {
my($file,$CF)=@_;
# returns the most recent value that is not undefined.
my ($start,$step,$names,$data) = RRDs::fetch($file,$CF,'-s','-600');
my $ret = undef;
for my $line (@$data) {
my $val = $line->[0];
$ret = $val if defined($val);
}
return $ret;
}
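As a rough illustration of what read mode computes (this sketch is not part
of the attached script), the snippet below works out a Z score per stream
and then the proportion of streams over the per-stream threshold. The stream
names and readings are invented, and z_score and aberrance are hypothetical
helper names; in sdna itself the three numbers per stream come from the
AVERAGE, MHWPREDICT and DEVPREDICT archives of each RRD file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Invented readings: [ observed value, predicted value, predicted deviation ].
my %streams = (
    'switch1.ifInOctets.1'  => [ 1200, 1000, 150 ],  # z is about 1.3
    'switch1.ifInErrors.1'  => [   50,    5,   9 ],  # z = 5
    'switch2.ifInOctets.7'  => [  980, 1000,  40 ],  # z = 0.5
    'switch2.ifOutOctets.7' => [  500,  500,  25 ],  # z = 0
    'switch2.ifSpeed.7'     => [    0,    0,   0 ],  # fixed value, no deviation
);

my $each_threshold   = 2;    # same value as $eachthreshold in the script
my $master_threshold = 0.1;  # same value as $masterthreshold in the script

# Number of predicted deviations between the observed and predicted value.
sub z_score {
    my ($real, $model, $deviation) = @_;
    return abs($real - $model) / $deviation;
}

# Proportion of usable streams whose Z score exceeds the threshold.
# Streams with a zero or missing deviation are left out of the count.
sub aberrance {
    my ($streams, $threshold) = @_;
    my ($aberrant, $usable) = (0, 0);
    for my $name (keys %$streams) {
        my ($real, $model, $deviation) = @{ $streams->{$name} };
        next unless defined $real && defined $model && $deviation;
        $usable++;
        $aberrant++ if z_score($real, $model, $deviation) > $threshold;
    }
    return $usable ? $aberrant / $usable : undef;
}

my $score = aberrance(\%streams, $each_threshold);
printf "aberrance: %.4f\n", $score;
print "alarm\n" if $score > $master_threshold;
```

Here one of the four usable streams is more than two deviations out, so the
proportion is 0.25 and the 0.1 alarm threshold is exceeded.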