[smokeping-users] Querying rrd's directly.

Thu Jun 13 19:57:52 CEST 2019

I'm re-posting this in the hope that someone can clarify the smokeping RRD structure/meaning.
If you know the smokeping RRDs, would you please look at it? I don't think it will take a lot of your time, and it would be really helpful to me. [And to the OMD/Naemon/etc. plug-in community, since I plan to release this as a plug-in.]

---
It's been a while since I had time to dedicate to this idea - but now I'm part way through it.
Thanks Darren for the offer to look at what I was doing wrong, querying the RRD's. I think I've made some progress, and get what I expect now. [Well, mostly.]

So, when I use rrdtool fetch from the CLI, it returns a header like this:
uptime loss median ping1 ping2 .. ping20

If I look at the matching data it returns, it appears that there's no header for the first column.

This is the epoch time [seconds since 1 January 1970 UTC]. That's good.
Then uptime. I assume that is the second column/value. IME, it's always "null" or NaN. That seems good too, though I'm not sure why it's there - but oh, well.

Loss. I'd have thought this is "packet loss", or how many of the fpings [in my case] were returned. But that doesn't seem to be the case.
Median is the median of something - I'd guess it's the "middle" value. [Not the "average" but the actual "median" of the RTTs in this sample.] That seems fine.
And then the rest of the pings all seem reasonable.
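
To make that column layout concrete, here is a minimal sketch that splits one data line of the fetch output into named fields. It assumes the unlabelled first column is the timestamp (stored under a made-up key "time") and that unknown values are printed as "nan"; the function name parse_fetch_line is hypothetical, and only two ping columns are used so that the sample line can be one of the visible rows from this post.

```perl
use strict;
use warnings;

# Parse one data line of `rrdtool fetch` text output into a hash.
# @names is the header line split on whitespace; the timestamp column
# has no header, so it is stored under the made-up key "time".
sub parse_fetch_line {
    my ($line, @names) = @_;
    my ($time, @values) = split ' ', $line;
    $time =~ s/:$//;    # "1557360000:" -> "1557360000"
    my %row = (time => $time);
    for my $i (0 .. $#names) {
        my $v = $values[$i];
        # treat "nan" (or "-nan") and missing columns as unknown
        $row{ $names[$i] } = (defined $v && $v !~ /nan/i) ? $v + 0 : undef;
    }
    return \%row;
}

my @names = qw(uptime loss median ping1 ping2);
my $row = parse_fetch_line(
    "1557360000: nan 1.733333 0.001024 0.001000 0.001000", @names);
printf "%s loss=%s median=%s\n", $row->{time}, $row->{loss}, $row->{median};
```

Keeping unknown values as undef rather than 0 matters later: a missing ping should not be averaged in as a zero-latency reply.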

So, out of the columns I really have problems with, it's the "loss" column that's just not comprehensible.

If I look at a smokeping graph, and by the color values in the graph, I can get a rough idea how many packets were lost. [At least according to the graph.]
In one sample, in a four minute period I see the graph showing ~25% loss the first minute, ~50% the next, 10% the next and 0% in final minute. [step=60s.]

However, if I fetch that data from the RRD [in full resolution] using something like this:

rrdtool fetch /var/lib/smokeping/some.rrd AVERAGE -s -240 
...I get a data table like this.

Epoch time   uptime  loss      median    ping1     ping2     ... ping20
---
1557359880:  nan     3.966667  0.001019  0.001010  0.001010  ...
1557359940:  nan     5.300000  0.001021  nan       nan
1557360000:  nan     1.733333  0.001024  0.001000  0.001000
1557360060:  nan     0.000000  0.001026  0.001000  0.001006

The second minute [1557359940] has four ping samples that return NaN - which I assume are lost packets.
But that doesn't match the value in the "loss" column - it's 5.3.
And the graph showed ~50% loss - yet the actual samples show 4/20 [4 NaN, and 16 samples with valid values.] or 20%.

The first and third minutes [1557359880 & 1557360000] have millisecond values for every ping, 1-20 - which seems, to me, to mean there was NO packet loss.
Yet the graph shows ~25% and ~10% respectively.
And even more confusing, yet again, is the loss column - showing 3.966667 and 1.733333 respectively.
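
The comparison I'm making by hand can be sketched in a few lines. Two things here are assumptions, not facts I've confirmed: that a ping slot with no value means a lost packet, and that the "loss" column is a count of lost pings per round of 20 rather than a percentage. The RTT values in the sample row are made up for illustration; only the 16 answered / 4 NaN split and the 5.3 come from the data above.

```perl
use strict;
use warnings;

# Count lost pings in one row: a ping slot with no value (undef here,
# "nan" in the CLI output) is treated as a lost packet.
sub lost_in_row {
    my (@pings) = @_;
    my $lost = grep { !defined } @pings;
    return ($lost, scalar @pings);
}

# ASSUMPTION: the "loss" data source is a count of lost pings per round,
# so percent = loss / pings * 100. Returns undef if it cannot be computed.
sub loss_percent {
    my ($loss, $pings) = @_;
    return undef unless defined $loss && $pings;
    return 100 * $loss / $pings;
}

# Illustrative row (made-up RTTs): 16 answered pings and 4 lost ones.
my @pings = ((0.001) x 16, (undef) x 4);
my ($lost, $total) = lost_in_row(@pings);
printf "counted from ping columns: %d/%d = %.1f%%\n",
    $lost, $total, loss_percent($lost, $total);
printf "from the loss column (5.3): %.1f%%\n", loss_percent(5.3, 20);
```

Under those assumptions the ping columns give 20% and the loss column gives 26.5% for the second minute, so neither lines up with the ~50% I read off the graph.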

Can someone please explain what is really in that third smokeping column [seemingly labelled "loss"] and how its calculations are done? And why do the graphs, the loss column and the ping columns simply not agree with each other?

I just really need to understand what's going on, because I don't want to write a plugin that's going to return data/state incorrectly!

TIA
-Greg

So, I know querying the RRD isn't exactly a smokeping problem - but I think it's an appropriate place to start.

I'm attempting to write a Nagios/OMD plugin.
Yes, there is a smokeping plug-in currently, but the problem I'm trying to solve is this...

I've had cases where latency or packet loss goes up, consistently, and I'd like to get alerts.
But I don't want alerts when a single sample gets, say 3% loss, or latency jumps 30%. But if I measured that over say, 20 minutes, or an hour, or four hours - well then I could set limits that would be a lot tighter than I would for a single sample.

For example, if packet loss is greater than 2% for an hour - well we've probably got a problem. Same with latency. It might go up for someone's upload/download - but if it climbs 40% for four hours, then it's a problem we ought to look at.

With the smokeping plugin or Nagios's TCP probe - you can really only look at the result for a single sample [essentially], not an average. 

Thus, you end up setting limits that are far outside of what might actually constitute a problem, because you might have that happen for a few minutes - perhaps a few times a day - and you don't want Nagios [or smokeping] to alert on all those instances. So, that means you inevitably miss events that are important.

So, I want a smokeping plug-in that can be set to average the last X minutes/hours/whatever of loss/latency/jitter and generate warning/critical events from that average.
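
Roughly, the check I have in mind looks like the sketch below: average the known samples in the window, then compare that average against warning and critical limits and return the usual Nagios plugin exit code. The function names, the sample values and the 2%/5% limits are all made up for illustration; a real plug-in would take them from the RRD and from command-line options.

```perl
use strict;
use warnings;

# Mean of the defined values in a list; undef when nothing is defined.
sub mean_defined {
    my @known = grep { defined } @_;
    return undef unless @known;
    my $sum = 0;
    $sum += $_ for @known;
    return $sum / @known;
}

# Map a window average to a Nagios plugin exit code:
# 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
sub nagios_state {
    my ($avg, $warn, $crit) = @_;
    return 3 unless defined $avg;
    return 2 if $avg >= $crit;
    return 1 if $avg >= $warn;
    return 0;
}

# Hypothetical per-sample loss percentages for the window, with one gap.
my @loss_pct = (0, 5, undef, 0, 10, 0);
my $avg   = mean_defined(@loss_pct);
my $state = nagios_state($avg, 2, 5);    # made-up limits: warn 2%, crit 5%
printf "average loss %.2f%% -> exit code %d\n", $avg, $state;
```

Skipping unknown samples instead of counting them as zero keeps a gap in the data from pulling the average down, and an all-unknown window maps to UNKNOWN rather than OK.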

So, I need to query the RRD's and pull stats.

Ok, now that I've got you so far [Thanks by the way!] - here's the problem I've got.
[I'm a terrible coder, I have a short attention span, I am even worse at Perl, and I hate details! So, be patient with me!]

Code snippet: [I stole this off the web somewhere, I don't recall where...]
---
#!/usr/bin/perl
#
# Fetch the last ten minutes of a smokeping RRD and dump every value.

use strict;
use warnings;
use lib qw( /usr/lib/arm-linux-gnueabihf/perl5/5.20 ../lib/perl );
use RRDs;
use POSIX qw(strftime);

# start_time is the oldest data-point, and end_time is the newest.
my $cur_time   = time();              # set current time
my $end_time   = $cur_time - 60;      # set end time to 1m ago
my $start_time = $end_time - 600;     # set start 10m in the past
my $rrd_res    = 60;

# format an epoch time the same way everywhere
sub fmt_time { return strftime("%m/%d/%Y %H:%M:%S", localtime(shift)); }

print "CT: $cur_time \n", fmt_time($cur_time), "\n \n";
print "ET: $end_time \n", fmt_time($end_time), "\n \n";
print "ST: $start_time \n", fmt_time($start_time), "\n \n";

# fetch average values from the RRD database between start and end time
my ($start, $step, $ds_names, $data) =
  RRDs::fetch("/var/lib/smokeping/Some-CPE.rrd", "AVERAGE",
              "-r", $rrd_res, "-s", $start_time, "-e", $end_time);
my $err = RRDs::error;
die "RRDs::fetch failed: $err\n" if $err;

# save fetched values in a 2-dimensional array
my @vals;
my $rows = 0;
my $time_variable = $start;

print "Start: $start : ", fmt_time($start), "\n \n";
print "step: $step \n";

print "start loop \n";
print " --- \n";
foreach my $line (@$data) {
    my $columns = 0;
    $vals[$rows][$columns] = $time_variable;

    # print the row's time before and after advancing by one step
    print fmt_time($time_variable), "\n";
    $time_variable += $step;
    print fmt_time($time_variable), "\n";

    foreach my $val (@$line) {
        # RRDs::fetch returns undef for unknown (NaN) values; show them as empty
        my $shown = defined $val ? $val : "";
        print " --- \n";
        print "row: $rows - col: $columns \n";
        print "Val: $shown ";
        $vals[$rows][++$columns] = $val;
        print "VC: $shown \n";
        print " --- \n";
    }
    $rows++;
}

exit;
---

I've put in a bunch of print statements so I can try to figure out what's going on. [You can ignore all that...]
There are also some errors in the for loop, because it parses more rows than exist in the fetch - but ignore that too. [At least for now. Or you can tell me why - if you like. I'm pretty sure I'll figure it out.]

But what's interesting [at least right now] is that the first two columns have issues.
Column one [or the first returned value from every row] appears to always be null.
And the second always appears to be zero.
[At least in my case, with my RRDs.]
But I'm pretty sure it's the same with any RRD from smokeping.

I may not understand [almost certainly don't] what's going on, but I'd have expected the values in the columns 3-23 to start at 1 and go through 20. [I do 20 samples in this RRD per row.]

So, can someone explain why the first value [column] is always null, and the second is always zero? [These are all full resolution samples, no aggregation has occurred.]
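
One thing that should make the columns easier to reason about is keying each value by the data-source name that RRDs::fetch hands back in $ds_names, instead of by column number. The sketch below is only an illustration: rows_by_name is a made-up helper, the "time" key is my own, the fetch result is a hand-built stand-in using two visible rows from the table earlier in this post [with only two ping columns], and it assumes, as my script does, that the first row belongs to $start.

```perl
use strict;
use warnings;

# Turn the four values returned by RRDs::fetch into a list of hashes,
# one per row, keyed by data-source name plus a made-up "time" key.
sub rows_by_name {
    my ($start, $step, $ds_names, $data) = @_;
    my @rows;
    my $t = $start;    # ASSUMPTION: the first row is stamped with $start
    for my $line (@$data) {
        my %row = (time => $t);
        @row{@$ds_names} = @$line;    # hash slice: name -> value, undef = unknown
        push @rows, \%row;
        $t += $step;
    }
    return \@rows;
}

# Hand-built stand-in for a fetch result (not fetched from a real RRD).
my @names = qw(uptime loss median ping1 ping2);
my @data  = (
    [ undef, 1.733333, 0.001024, 0.001000, 0.001000 ],
    [ undef, 0.000000, 0.001026, 0.001000, 0.001006 ],
);
for my $row (@{ rows_by_name(1557360000, 60, \@names, \@data) }) {
    my $uptime = defined $row->{uptime} ? $row->{uptime} : "NaN";
    print "$row->{time}: uptime=$uptime loss=$row->{loss} median=$row->{median}\n";
}
```

With the names attached, an always-unknown first value and a zero second value can at least be read against the names of the data sources they belong to.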

Thanks for anyone who takes a stab at it.
And if you're reading Tobi, I'd be glad for your input and/or thoughts.

Thanks!
-Greg