<html><head><title>Re: [smokeping-users] Querying rrd's directly.</title>

</head>

<body>

<span style=" font-family:'Courier New'; font-size: 9pt;">Tobi - I hate to take more of your time, you've already graced us with this great tool - but I can't move forward to write the OMD/Nagios plug-in I'd like to write until I understand the RRD database for smokeping.<br>

<br>

Could you take a brief look and see if you can give me some quick guidance?<br>

<br>

I posted about it a while back, and the quoted message below is a follow-up. <br>

<br>

In short, I like to handle reporting via OMD/Nagios. Further, I'd like to query the RRDs for loss/latency/jitter over, say, several hours, but use very tight thresholds. [For example, latency is normally 20-30ms; using a four-hour average, I could flag an increase in average latency at only 40ms. Or an average loss of 4% over four hours.] Limits this tight would be nuts to apply to a window of even 20-30 minutes, because they'd probably get tripped a lot. But they are useful for me, because many of the locations I poll may have systemic problems that are generally small but persistent. Being alerted to them while avoiding "false" alerts is the goal.<br>

<br>

So, my goal is to write an OMD/Nagios plug-in that will query the smokeping RRDs and get averages for RTT/loss/jitter over a long, user-defined time period. <br>

<br>

This isn't very doable with the smokeping alerts, and the smokeping OMD/Nagios plug-ins I have seen are pretty lame: they can only query a single [most recent] RRD data point, not averages.<br>

<br>

However, as I look at the data in the RRDs, I'm pretty confused.<br>

<br>

See the quoted message below for my confusion. In short, the RRD "loss" column doesn't seem to agree with the returned results in ping1...pingN, and the colors in a smokeping output graph don't seem to match either. [All three seem to disagree with each other.]<br>

<br>

I need to understand why, so my plug-in can work as accurately as possible.<br>

<br>

-Greg<br>

<br>

</span><table>

<tr>

<td width=2 bgcolor= #0000ff><br>

</td>

<td><span style=" font-family:'courier new'; font-size: 9pt;">It's been a while since I had time to dedicate to this idea - but now I'm part way through it.<br>

Thanks Darren for the offer to look at what I was doing wrong, querying the RRD's. I think I've made some progress, and get what I expect now. [Well, mostly.]<br>

<br>

So, when I use the RRD CLI tool's fetch command, it returns a header like this:<br>

uptime loss median ping1 ping2 .. ping20<br>

<br>

If I look at the matching data it returns, it appears that there's no header for the first column.<br>

<br>

This is the epoch time [seconds since 1970-01-01 UTC]. That's good.<br>

Then Uptime. I assume that is the second column/value. IME, it's always "null" or NaN. That seems good too, though I'm not sure why it's there - but oh, well.<br>

<br>

**Loss. I'd have thought this is "packet loss", or how many of the fpings [in my case] were returned. But that doesn't seem to be the case.<br>

Median is the median of something - I'd guess it's the "middle" value. [Not the "average" but the actual "median" of the RTTs in this sample.] That seems fine.<br>

And then the rest of the pings all seem reasonable.<br>

<br>

So, of all the columns, the one I really have problems with is "loss": it's just not comprehensible.<br>

<br>

If I look at a smokeping graph, the color values give me a rough idea of how many packets were lost. [At least according to the graph.]<br>

In one sample, over a four-minute period, the graph shows ~25% loss in the first minute, ~50% in the next, ~10% in the next and 0% in the final minute. [step=60s.]<br>

<br>

However, if I fetch that data from the RRD [in full resolution] using something like this:<br>

<br>

rrdtool fetch /var/lib/smokeping/some.rrd AVERAGE -s -240 <br>

...I get a data table like this.<br>

<br>

Epoch time        uptime        loss                median                ping1                ping2 ... ping20<br>

---<br>

1557359880:        nan        3.966667        0.001019        0.001010        0.001010 ... <br>

1557359940:        nan        5.300000        0.001021        nan                nan<br>

1557360000:        nan        1.733333        0.001024        0.001000        0.001000<br>

1557360060:        nan        0.000000        0.001026        0.001000        0.001006<br>

<br>

The second minute [1557359940] has four ping samples that return NaN, which I assume are lost packets.<br>

But that doesn't match the value in the "loss" column - it's 5.3.<br>

And the graph showed ~50% loss - yet the actual samples show 4/20 [4 NaN, and 16 samples with valid values.] or 20%.<br>
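A minimal sketch of the per-row count described above, in Perl. It assumes a row shaped like the ones RRDs::fetch returns (values in the order uptime, loss, median, ping1..ping20, with unknown values as undef) and that an undefined ping slot means a lost ping; both are assumptions, and row_loss_percent and the sample row are made up for illustration.<br>

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: percentage of ping slots in one fetched row that are
# undefined. Assumes the column order uptime, loss, median, ping1..pingN and
# that an undefined slot means a lost ping (assumptions, not confirmed).
sub row_loss_percent {
    my ($row, $pings) = @_;
    my @slots = @{$row}[ 3 .. 3 + $pings - 1 ];    # ping1..pingN
    my $lost  = grep { !defined $_ } @slots;       # count of undefined slots
    return 100 * $lost / $pings;
}

# Made-up row for illustration: 4 undefined slots and 16 answered pings.
my @row = (undef, 5.3, 0.001021, (undef) x 4, (0.001) x 16);
printf "%.1f%%\n", row_loss_percent(\@row, 20);
```

For the made-up row this prints 20.0%, matching the 4/20 figure; it says nothing about how the "loss" column itself is computed.<br>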

<br>

The first and third minutes [1557359880 and 1557360000] have millisecond-scale values for every ping, 1-20, which seems to me to mean there was NO packet loss.<br>

Yet the graph shows ~25% and ~10% respectively.<br>

And even more confusing, yet again, is the loss column - showing 3.96667 and 1.73333 respectively.<br>

<br>

Can someone please explain what is really in that third smokeping column [seemingly labelled "loss"] and how its calculations are done? And why do the graphs, the loss column and the returned ping value columns simply not agree with each other?<br>

<br>

I just really need to understand what's going on, because I don't want to write a plugin that's going to return data/state incorrectly!<br>

<br>

TIA<br>

-Greg<br>

<br>

<br>

</span><table>

<tr>

<td width=2 bgcolor= #0000ff><br>

</td>

<td><span style=" font-family:'courier new'; font-size: 9pt;">So, I know querying the RRD isn't exactly a smokeping problem - but I think it's an appropriate place to start.<br>

<br>

I'm attempting to write a Nagios/OMD plugin.<br>

Yes, there is a smokeping plug-in currently, but the problem I'm trying to solve is this...<br>

<br>

I've had cases where latency or packet loss goes up, consistently, and I'd like to get alerts.<br>

But I don't want alerts when a single sample gets, say, 3% loss, or latency jumps 30%. If I measured that over, say, 20 minutes, or an hour, or four hours, then I could set limits that would be a lot tighter than I would for a single sample.<br>

<br>

For example, if packet loss is greater than 2% for an hour - well we've probably got a problem. Same with latency. It might go up for someone's upload/download - but if it climbs 40% for four hours, then it's a problem we ought to look at.<br>

<br>

With the smokeping plugin or Nagios's TCP probe - you can really only look at the result for a single sample [essentially], not an average. <br>

<br>

Thus, you end up setting limits that are far outside of what might actually constitute a problem, because you might have that happen for a few minutes - perhaps a few times a day - and you don't want Nagios [or smokeping] to alert on all those instances. So, that means you inevitably miss events that are important.<br>

<br>

So, I want a smokeping plug-in that can be set to average the last X minutes/hours/whatever of loss/latency/jitter and generate warning/critical events.<br>
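The averaging idea could be sketched roughly like this. It assumes rows shaped like RRDs::fetch output (uptime, loss, median, ...; unknown values as undef); column_average, nagios_state, the sample rows and the 40/60 ms thresholds are all hypothetical. The exit codes 0/1/2/3 are the usual Nagios plugin convention.<br>

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Average the defined values of one column across fetched rows.
# Returns undef when no row has a value for that column.
sub column_average {
    my ($rows, $col) = @_;
    my @vals = grep { defined $_ } map { $_->[$col] } @$rows;
    return undef unless @vals;
    my $sum = 0;
    $sum += $_ for @vals;
    return $sum / @vals;
}

# Map an average onto the standard Nagios plugin exit codes
# (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).
sub nagios_state {
    my ($avg, $warn, $crit) = @_;
    return 3 unless defined $avg;
    return 2 if $avg >= $crit;
    return 1 if $avg >= $warn;
    return 0;
}

# Made-up rows (uptime, loss, median in seconds) for illustration only.
my @rows = (
    [ undef, 0,     0.030 ],
    [ undef, 1,     0.045 ],
    [ undef, undef, undef ],    # a row with no data is skipped
    [ undef, 0,     0.051 ],
);

my $avg    = column_average(\@rows, 2);              # column 2 = median
my $avg_ms = defined $avg ? $avg * 1000 : undef;     # seconds to ms
my $state  = nagios_state($avg_ms, 40, 60);          # hypothetical thresholds in ms
printf "median RTT avg: %s ms, state %d\n",
    defined $avg_ms ? sprintf('%.1f', $avg_ms) : 'n/a', $state;
```

A real plug-in would finish with exit $state so Nagios picks up the result.<br>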

<br>

So, I need to query the RRD's and pull stats.<br>

<br>

Ok, now that I've got you so far [Thanks by the way!] - here's the problem I've got.<br>

[I'm a terrible coder, I have a short attention span, I am even worse at perl, and I hate details! So, be patient with me!]<br>

<br>

Code snippet: [I stole this off the web somewhere, I don't recall where...]<br>

---<br>

#!/usr/bin/perl -W<br>

# <br>

# <br>

<br>

use lib qw( /usr/lib/arm-linux-gnueabihf/perl5/5.20 ../lib/perl );<br>

use RRDs;<br>

use POSIX qw(strftime);<br>

<br>

#start_time is the oldest data-point, and end_time is the newest.<br>

my $cur_time = time();                # set current time<br>

my $end_time = $cur_time - 60;     # set end time to 1m ago<br>

my $start_time = $end_time - 600; # set start 10m in the past<br>

my $rrd_res = 60;<br>

my $temp_var = "";<br>

<br>

#$f_cur_time = ctime($cur_time);<br>

#$f_end_time = ctime($end_time);<br>

#$f_start_time = ctime($start_time);<br>

#$f_end_time = ctime($end_time);<br>

<br>

print "CT: $cur_time \n";<br>

print strftime("%m/%d/%Y %H:%M:%S",localtime($cur_time));<br>

print "\n \n";<br>

<br>

print "ET: $end_time \n";<br>

print strftime("%m/%d/%Y %H:%M:%S",localtime($end_time));<br>

print "\n \n";<br>

<br>

print "ST: $start_time \n";<br>

print strftime("%m/%d/%Y %H:%M:%S",localtime($start_time));<br>

print "\n \n";<br>

<br>

#exit;<br>

# fetch average values from the RRD database between start and end time<br>

my ($start,$step,$ds_names,$data) =<br>

  RRDs::fetch("/var/lib/smokeping/Some-CPE.rrd", "AVERAGE",<br>

              "-r", "$rrd_res", "-s", "$start_time", "-e", "$end_time");<br>

<br>

# save fetched values in a 2-dimensional array<br>

my $rows = 0;<br>

my $columns = 0;<br>

my $time_variable = $start;<br>

<br>

print "Start: $start : ";<br>

print strftime("%m/%d/%Y %H:%M:%S",localtime($start));<br>

print "\n \n";<br>

print "step: $step \n";<br>

<br>

print "start loop \n";<br>

print " --- \n";<br>

foreach my $line (@$data) {<br>

$vals[$rows][$columns] = $time_variable;<br>

$temp_var = $time_variable;<br>

print strftime("%m/%d/%Y %H:%M:%S",localtime($temp_var));<br>

print "\n";  <br>

<br>

$time_variable = $time_variable + $step;<br>

$temp_var = $time_variable;<br>

print strftime("%m/%d/%Y %H:%M:%S",localtime($temp_var));<br>

print "\n";  <br>

<br>

foreach my $val (@$line) {<br>

                      print " --- \n";<br>

                       print "row: $rows - col: $columns \n";<br>

                       print "Val: $val ";<br>

                        $vals[$rows][++$columns] = $val;<br>

                       print "VC: $vals[$rows][$columns] \n";<br>

                       print " --- \n";<br>

                       }<br>

$rows++;<br>

$columns = 0;<br>

}<br>

<br>

exit;<br>

---<br>

<br>

I've put in a bunch of print statements so I can try to figure out what's going on. [You can ignore all that...]<br>

There are also some errors in the for loop, because it parses more rows than exist in the fetch - but ignore that too. [At least for now. Or you can tell me why - if you like. I'm pretty sure I'll figure it out.]<br>

<br>

But what's interesting [at least right now] is that the first two columns have issues.<br>

Column one [or the first returned value from every row] appears to always be null.<br>

And the second always appears to be zero.<br>

[At least in my case, with my RRDs.]<br>

But I'm pretty sure it's the same with any RRD from smokeping.<br>

<br>

I may not understand [almost certainly don't] what's going on, but I'd have expected the values in the columns 3-23 to start at 1 and go through 20. [I do 20 samples in this RRD per row.]<br>

<br>

So, can someone explain why the first value [column] is always null, and the second is always zero? [These are all full resolution samples, no aggregation has occurred.]<br>

<br>

Thanks for anyone who takes a stab at it.<br>

And if you're reading Tobi, I'd be glad for your input and/or thoughts.<br>

<br>

Thanks!<br>

-Greg<br>

</td>

</tr>

</table>

<br><br>

<br>

<span style=" font-family:'arial'; font-size: 9pt; color: #c0c0c0;"><i>-- <br>

Gregory Sloop, Principal: Sloop Network & Computer Consulting<br>

Voice: 503.251.0452 x82<br>

EMail: </i></span><a style=" font-family:'arial'; font-size: 9pt;" href="mailto:gregs@sloop.net">gregs@sloop.net</a><br>

<a style=" font-family:'arial'; font-size: 9pt;" href="http://www.sloop.net">http://www.sloop.net</a><br>

<span style=" font-family:'arial'; font-size: 9pt; color: #c0c0c0;"><i>---</i></span></td>

</tr>

</table>

</body></html>