[smokeping-users] Querying rrd's directly.
Gregory Sloop
gregs at sloop.net
Fri May 24 19:45:36 CEST 2019
Tobi - I hate to take more of your time, you've already graced us with this great tool - but I can't move forward to write the OMD/Nagios plug-in I'd like to write until I understand the RRD database for smokeping.
Could you take a brief look and see if you can give me some quick guidance?
I posted about it a while back, and the quoted message below is a follow-up.
In short, I like to handle reporting via OMD/Nagios. Further, I'd like to query the RRDs for loss/latency/jitter over, say, several hours, but use very tight thresholds. [For example, latency is normally 20-30ms; using a four-hour average, I could flag an increase in average latency at only 40ms. Or loss averaging 4% over four hours.] Limits this tight would be nuts to apply to a single sample, or even to a 20-30 minute window, because they'd probably get tripped a lot. But they are useful for me, because many of the locations I poll may have systemic problems that are generally small but persistent. Being alerted to them, while also avoiding "false" alerts, is the goal.
So, my goal is to write an OMD/Nagios plug-in that will query the smokeping RRDs and get averages for RTT/loss/jitter over a long, user-defined time period.
This isn't very doable in the smokeping alerts, and the smokeping OMD/Nagios plug-ins I have seen are pretty lame: they can only query a single [most recent] RRD data point, not averages.
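To make the goal concrete, here is a minimal sketch of the threshold step such a plug-in would need. It assumes the long-window average has already been computed; classify and the numbers used are hypothetical, and only the exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) are the standard Nagios plug-in convention.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Standard Nagios plug-in exit codes.
my %EXIT = (OK => 0, WARNING => 1, CRITICAL => 2, UNKNOWN => 3);

# classify($value, $warn, $crit) returns a state name (hypothetical helper).
# $value is an average over the chosen window; undef means "no data".
sub classify {
    my ($value, $warn, $crit) = @_;
    return 'UNKNOWN'  unless defined $value;
    return 'CRITICAL' if $value >= $crit;
    return 'WARNING'  if $value >= $warn;
    return 'OK';
}

# Made-up example: four-hour average latency of 42 ms,
# warning at 40 ms, critical at 60 ms.
my $state = classify(42, 40, 60);
print "SMOKEPING $state - avg latency 42 ms over 4h\n";
print "exit code would be $EXIT{$state}\n";
# A real plug-in would finish with: exit $EXIT{$state};
```

The same function works for loss or jitter, since it only compares a number against two limits.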
However as I look at the data in the RRD's, I'm pretty confused.
See the quoted message below for my confusion. In short, the RRD "loss" column doesn't seem to agree with the returned results in ping1...pingN, and the colors in a smokeping output graph don't seem to match either. [All three seem to disagree with each other.]
I need to understand why, so my plug-in can work as accurately as possible.
-Greg
It's been a while since I had time to dedicate to this idea - but now I'm part way through it.
Thanks Darren for the offer to look at what I was doing wrong, querying the RRD's. I think I've made some progress, and get what I expect now. [Well, mostly.]
So, when I use the CLI RRD tools - fetch, it returns a header like this:
uptime loss median ping1 ping2 .. ping20
If I look at the matching data it returns, it appears that there's no header for the first column.
This is the epoch time [seconds since 1970-01-01 00:00 UTC]. That's good.
Then Uptime. I assume that is the second column/value. IME, it's always "null" or NaN. That seems good too, though I'm not sure why it's there - but oh, well.
Loss. I'd have thought this is "packet loss", or how many of the fpings [in my case] were returned. But that doesn't seem to be the case.
Median is the median of something - I'd guess it's the "middle" value. [Not the "average" but the actual "median" of the RTTs in this sample.] That seems fine.
And then the rest of the pings all seem reasonable.
So, out of the columns I really have problems with, it's the "loss" column that's just not comprehensible.
If I look at a smokeping graph, the color values give me a rough idea of how many packets were lost. [At least according to the graph.]
In one sample, in a four minute period I see the graph showing ~25% loss the first minute, ~50% the next, 10% the next and 0% in final minute. [step=60s.]
However, if I fetch that data from the RRD [in full resolution] using something like this:
rrdtool fetch /var/lib/smokeping/some.rrd AVERAGE -s -240
...I get a data table like this.
Epoch time   uptime  loss      median    ping1     ping2     ... ping20
---
1557359880:  nan     3.966667  0.001019  0.001010  0.001010  ...
1557359940:  nan     5.300000  0.001021  nan       nan
1557360000:  nan     1.733333  0.001024  0.001000  0.001000
1557360060:  nan     0.000000  0.001026  0.001000  0.001006
The second minute [1557359940] has four ping samples that return NaN - which I assume are lost packets.
But that doesn't match the value in the "loss" column - it's 5.3.
And the graph showed ~50% loss - yet the actual samples show 4/20 [4 NaN, and 16 samples with valid values.] or 20%.
The first and third minutes [1557359880 & 1557360000] have millisecond values for every ping, 1-20 - which seems, to me, to mean there was NO packet loss.
Yet the graph shows ~25% and ~10% respectively.
And even more confusing is the loss column, which shows 3.966667 and 1.733333 respectively.
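The comparison above (NaN ping columns versus the loss column) can be repeated for any row with a few lines of Perl. This is a minimal sketch, assuming the column order shown in the header (uptime, loss, median, then the pings) and assuming, as guessed above, that a NaN ping means a lost packet. inspect_row is a hypothetical helper, and the sample row is made up and shortened to five ping columns.

```perl
use strict;
use warnings;

# Parse one data line of `rrdtool fetch` output ("<epoch>: v1 v2 ...") and
# report the loss column next to the number of ping columns that are NaN.
# Assumed column order: uptime, loss, median, ping1..pingN.
sub inspect_row {
    my ($line) = @_;
    my ($epoch, $rest) = $line =~ /^\s*(\d+):\s*(.*)$/ or return;
    my ($uptime, $loss, $median, @pings) = split ' ', $rest;
    my $nan_pings = grep { /nan/i } @pings;   # count of NaN ping columns
    return {
        epoch     => $epoch,
        loss      => $loss,
        pings     => scalar @pings,
        nan_pings => $nan_pings,
    };
}

# Made-up, shortened row (five ping columns instead of twenty).
my $r = inspect_row(
    "1557359940: nan 5.300000 0.001021 nan nan 0.001000 0.001000 0.001006");
printf "%d: loss column = %s, NaN pings = %d of %d\n",
    $r->{epoch}, $r->{loss}, $r->{nan_pings}, $r->{pings};
```

Feeding every line of a fetch through this makes it easy to see, row by row, whether the loss column and the NaN count ever agree.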
Can someone please explain what is really in that third smokeping column [seemingly labelled "loss"] and how its calculations are done? And why do the graphs, the loss column and the ping value columns simply not agree with each other?
I just really need to understand what's going on, because I don't want to write a plugin that's going to return data/state incorrectly!
TIA
-Greg
So, I know querying the RRD isn't exactly a smokeping problem - but I think it's an appropriate place to start.
I'm attempting to write a Nagios/OMD plugin.
Yes, there is a smokeping plug-in currently, but the problem I'm trying to solve is this...
I've had cases where latency or packet loss goes up, consistently, and I'd like to get alerts.
But I don't want alerts when a single sample gets, say 3% loss, or latency jumps 30%. But if I measured that over say, 20 minutes, or an hour, or four hours - well then I could set limits that would be a lot tighter than I would for a single sample.
For example, if packet loss is greater than 2% for an hour - well we've probably got a problem. Same with latency. It might go up for someone's upload/download - but if it climbs 40% for four hours, then it's a problem we ought to look at.
With the smokeping plugin or Nagios's TCP probe - you can really only look at the result for a single sample [essentially], not an average.
Thus, you end up setting limits that are far outside of what might actually constitute a problem, because you might have that happen for a few minutes - perhaps a few times a day - and you don't want nagios [or smokeping] to alert on all those instances. So, that means you inevitably miss events that are important.
So, I want a smokeping plug-in that can be set to average the last X minutes/hours/whatever of loss/latency/jitter and generate warning/critical events.
So, I need to query the RRD's and pull stats.
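The averaging itself is small. A minimal sketch, assuming the data has the shape RRDs::fetch returns (a reference to an array of rows, each a reference to an array of values, with undef for unknown values); column_average is a hypothetical helper and the sample rows are made up, not real measurements.

```perl
use strict;
use warnings;

# Average one column over all fetched rows, skipping unknown (undef) values.
# Returns undef when the column holds no known values at all.
sub column_average {
    my ($data, $col) = @_;
    my ($sum, $n) = (0, 0);
    for my $row (@$data) {
        my $v = $row->[$col];
        next unless defined $v;   # skip unknown samples
        $sum += $v;
        $n++;
    }
    return $n ? $sum / $n : undef;
}

# Made-up rows with three columns: uptime, loss, median (seconds).
my $sample = [
    [undef, 0,     0.020],
    [undef, 1,     0.030],
    [undef, undef, undef],
    [undef, 2,     0.040],
];
my $avg_median = column_average($sample, 2);
printf "average median RTT: %.1f ms\n", $avg_median * 1000;
```

Skipping undef rather than counting it as zero keeps gaps in the data from dragging the average down; whether that is the right choice for a loss figure is a separate decision.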
Ok, now that I've got you so far [Thanks by the way!] - here's the problem I've got.
[I'm a terrible coder, I have a short attention span, I am even worse at perl, and I hate details! So, be patient with me!]
Code snippet: [I stole this off the web somewhere, I don't recall where...]
---
#!/usr/bin/perl -W
#
#
use lib qw( /usr/lib/arm-linux-gnueabihf/perl5/5.20 ../lib/perl );
use RRDs;
use POSIX qw(strftime);
#start_time is the oldest data-point, and end_time is the newest.
my $cur_time = time(); # set current time
my $end_time = $cur_time - 60; # set end time to 1m ago
my $start_time = $end_time - 600; # set start 10m in the past
my $rrd_res = 60;
my $temp_var = "";
#$f_cur_time = ctime($cur_time);
#$f_end_time = ctime($end_time);
#$f_start_time = ctime($start_time);
#$f_end_time = ctime($end_time);
print "CT: $cur_time \n";
print strftime("%m/%d/%Y %H:%M:%S",localtime($cur_time));
print "\n \n";
print "ET: $end_time \n";
print strftime("%m/%d/%Y %H:%M:%S",localtime($end_time));
print "\n \n";
print "ST: $start_time \n";
print strftime("%m/%d/%Y %H:%M:%S",localtime($start_time));
print "\n \n";
#exit;
# fetch average values from the RRD database between start and end time
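my ($start,$step,$ds_names,$data) =
    RRDs::fetch("/var/lib/smokeping/Some-CPE.rrd", "AVERAGE",
                "-r", "$rrd_res", "-s", "$start_time", "-e", "$end_time");
# stop here if the fetch failed, rather than looping over undefined data
my $err = RRDs::error;
die "RRDs::fetch failed: $err\n" if $err;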
# save fetched values in a 2-dimensional array
my $rows = 0;
my $columns = 0;
my $time_variable = $start;
print "Start: $start : ";
print strftime("%m/%d/%Y %H:%M:%S",localtime($start));
print "\n \n";
print "step: $step \n";
print "start loop \n";
print " --- \n";
foreach my $line (@$data) {
    # column 0 of each saved row holds the row's time
    $vals[$rows][$columns] = $time_variable;
    $temp_var = $time_variable;
    print strftime("%m/%d/%Y %H:%M:%S",localtime($temp_var));
    print "\n";
    # advance by one step and print the new time as well
    $time_variable = $time_variable + $step;
    $temp_var = $time_variable;
    print strftime("%m/%d/%Y %H:%M:%S",localtime($temp_var));
    print "\n";
    # copy the row's values into columns 1..N
    foreach my $val (@$line) {
        print " --- \n";
        print "row: $rows - col: $columns \n";
        print "Val: $val ";
        $vals[$rows][++$columns] = $val;
        print "VC: $vals[$rows][$columns] \n";
        print " --- \n";
    }
    $rows++;
    $columns = 0;
}
exit;
---
I've put in a bunch of print statements so I can try to figure out what's going on. [You can ignore all that...]
There are also some errors in the foreach loop, because it parses more rows than exist in the fetch - but ignore that too. [At least for now. Or you can tell me why - if you like. I'm pretty sure I'll figure it out.]
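One thing that may be worth ruling out: the rrdfetch documentation recommends that, when a resolution is requested with -r, the start and end times be multiples of that resolution. The script above uses time() minus fixed offsets, which usually is not aligned. Whether this has anything to do with the extra rows is not established here. A minimal sketch of the alignment, with aligned_window as a hypothetical helper:

```perl
use strict;
use warnings;

# Round the end of the window down to a multiple of the resolution, then
# step back a whole number of intervals to get the start.
sub aligned_window {
    my ($now, $res, $intervals) = @_;
    my $end   = int($now / $res) * $res;
    my $start = $end - $intervals * $res;
    return ($start, $end);
}

# Ten 60-second intervals ending at the last full minute.
my ($start_time, $end_time) = aligned_window(time(), 60, 10);
print "-s $start_time -e $end_time\n";
```

The two values can be passed to RRDs::fetch in place of the unaligned $start_time and $end_time.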
But what's interesting [at least right now] is that the first two columns have issues.
Column one [or the first returned value from every row] appears to always be null.
And the second always appears to be zero.
[At least in my case, with my RRDs.]
But I'm pretty sure it's the same with any RRD from smokeping.
I may not understand [almost certainly don't] what's going on, but I'd have expected the values in the columns 3-23 to start at 1 and go through 20. [I do 20 samples in this RRD per row.]
So, can someone explain why the first value [column] is always null, and the second is always zero? [These are all full resolution samples, no aggregation has occurred.]
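One way to check which value belongs to which data source is to print each value next to the name that RRDs::fetch returned in $ds_names. A minimal sketch, assuming only the return shapes already used in the script above (an array reference of names, and rows of values with undef for unknown); label_row is a hypothetical helper, and the names and values below are stand-ins for a real fetch result.

```perl
use strict;
use warnings;

# Pair each value in a row with its data-source name.
# Unknown (undef) values are shown as "NaN".
sub label_row {
    my ($names, $row) = @_;
    return map {
        my $v = $row->[$_];
        sprintf "%s=%s", $names->[$_], defined $v ? $v : 'NaN';
    } 0 .. $#$names;
}

# Made-up names and values standing in for $ds_names and one row of @$data.
my $names = [qw(uptime loss median ping1 ping2)];
my $row   = [undef, 0, 0.001026, 0.001, 0.001006];
print join(' ', label_row($names, $row)), "\n";
```

In the script above this would be called as label_row($ds_names, $line) inside the outer loop.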
Thanks for anyone who takes a stab at it.
And if you're reading Tobi, I'd be glad for your input and/or thoughts.
Thanks!
-Greg
--
Gregory Sloop, Principal: Sloop Network & Computer Consulting
Voice: 503.251.0452 x82
EMail: gregs at sloop.net
http://www.sloop.net
---