[rrd-developers] Unexpected behaviour of PREDICT and PREDICTSIGMA

Sun Apr 27 08:44:47 CEST 2014

Hi Steve!

I would ask on the list which behavior is used most often and decide based on that. We could even drop the "short" (negative count) form altogether.

In principle, though, I would keep 0 as a special case, which should probably return NaN.
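
The two conventions for the negative-count form can be made concrete by expanding `<base>,-<n>` into the explicit shift list it stands for. This is a hypothetical helper (`expand_shifts` is not part of rrdtool): `include_zero=True` models the behaviour Steve describes in the quoted message (shifts 0 to (n-1)*base, so n*base is never reached), `include_zero=False` models the alternative of shifts base to n*base, and a count of 0 yields no shifts at all, which a PREDICT implementation would presumably turn into NaN.

```python
from typing import List


def expand_shifts(base: int, count: int, include_zero: bool) -> List[int]:
    """Hypothetical expansion of the PREDICT form '<base>,-<count>' into explicit shifts.

    count is the absolute value of the negative count on the RPN stack.
    include_zero=True  -> 0, base, 2*base, ..., (count-1)*base
    include_zero=False -> base, 2*base, ..., count*base
    """
    if count <= 0:
        # Proposed special case: no shifts at all, so the prediction would be NaN.
        return []
    first = 0 if include_zero else 1
    return [base * k for k in range(first, first + count)]


# With include_zero=True, '<s>,-1' is the single 0-shift whatever s is,
# which is the redundancy Steve points out.
print(expand_shifts(86400, 3, include_zero=True))   # [0, 86400, 172800]
print(expand_shifts(86400, 3, include_zero=False))  # [86400, 172800, 259200]
print(expand_shifts(12345, 1, include_zero=True))   # [0]
```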

As for HW RRAs: my concern is that with a HW RRA you can only select individual data sources, you have to wait for some time before you get any data, and you even have to modify the RRD file to add it. IMO that makes it neither very practical nor intuitive.

The CDEF approach, by contrast, gives an immediate answer when you want to see what a prediction looks like and whether it makes sense to apply it. That is especially true if you have 500k data points and only rarely look at them with prediction: then putting the cost into the rendering is cheaper, from a CPU and disk perspective, than running HW every 5 minutes for all 500k data points.

I also have to admit that I have never looked at HW in more detail, because of those limitations and because I do not fully understand the mathematics and the reasoning behind it, or how to make use of the data and display the certainty/uncertainty values.

That is why I started implementing PREDICT. We previously ran it outside of rrdtool as a separate script that created and filled a prediction RRD file. That had one advantage, namely that the results stayed the same when switching between resolutions, but again at the cost of disk space and of computations every 5 minutes (though only for a subset of the data).

As for ROL/ROR, I am not sure exactly what you would like to do, but if it is just shifting the data by x seconds into the future or past, then that is possible. Even then, reproducing the shift plus window-average functionality would probably take 146 RPN arguments (48 (=1800/300*8) times "x,shift,ror" with different shifts, plus "48,average") to achieve the same thing as "86400,-8,1800,x,PREDICT", so it would be even less efficient.
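
The 146-argument estimate can be reproduced by generating the long form. In this sketch `ROR` stands for the hypothetical time-shift operator described above (it does not exist in rrdtool), the 48 shifts are assumed to be days 1 to 8 combined with six 300 s offsets inside the window, and the final average is written with rrdtool's `AVG` operator for "average"; only the token count matters here, not the exact shift values.

```python
from typing import List


def shifted_average_tokens(base: int, count: int, window: int, step: int,
                           var: str = "x") -> List[str]:
    """Build long-hand RPN for a shifted window average, one 'x,shift,ROR' per sample."""
    samples_per_window = window // step           # 1800 // 300 = 6
    shifts = [day * base + k * step               # assumed layout of the 48 shifts
              for day in range(1, count + 1)
              for k in range(samples_per_window)]
    tokens: List[str] = []
    for shift in shifts:
        tokens += [var, str(shift), "ROR"]        # hypothetical time-shift operator
    tokens += [str(len(shifts)), "AVG"]           # average all shifted copies
    return tokens


long_form = shifted_average_tokens(86400, 8, 1800, 300)
short_form = "86400,-8,1800,x,PREDICT".split(",")
print(len(long_form), len(short_form))  # 146 5
```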

In general, this would make it easy to compare a week's traffic. But, if I remember correctly, you can already achieve the same thing with DEF, where you can shift the data, although that would be even more complex in terms of the number of arguments to rrdtool graph. On the other hand, ROL/ROR could be used on CDEFs and not only on the raw data, so you could add up DEFs first and only then apply the shifts.

As an afterthought: using a count of 1 does not make much sense except as a direct implementation of ROL/ROR (as I understand it), with or without some averaging.

So your "86400,x,ror" can be written as "86400,1,300,x,PREDICT" (I am not sure about the 300 for the window; you might need to replace it with 1). PREDICTSIGMA will give NaN here!
Similarly, a running average over the past hour is "0,1,3600,x,PREDICT"
(at least if I remember correctly how the calculations are done).
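
Both examples can be sanity-checked against a simplified model of PREDICT and PREDICTSIGMA. This is not rrdtool's implementation: the sketch assumes a trailing window (the `window/step` samples ending at `t - shift`, at least one sample), that NaN and out-of-range samples are skipped, and that sigma is the sample standard deviation, which is undefined (NaN) for a single value. `predict` is a hypothetical name.

```python
import math
import statistics
from typing import List, Sequence


def predict(series: Sequence[float], step: int, shifts: Sequence[int],
            window: int, t: int, sigma: bool = False) -> float:
    """Simplified PREDICT/PREDICTSIGMA at sample index t of a series sampled every `step` seconds."""
    per_window = max(1, window // step)        # assumed: window rounded down to whole steps
    values: List[float] = []
    for shift in shifts:
        newest = t - shift // step             # newest sample inside this shifted window
        for i in range(newest - per_window + 1, newest + 1):
            if 0 <= i < len(series) and not math.isnan(series[i]):
                values.append(series[i])       # NaN-safe: unknown samples are skipped
    if sigma:
        return statistics.stdev(values) if len(values) >= 2 else math.nan
    return sum(values) / len(values) if values else math.nan


data = [float(i) for i in range(400)]          # 300 s samples, value = index
print(predict(data, 300, [86400], 300, 300))   # 12.0: the sample one day (288 steps) earlier
print(predict(data, 300, [0], 3600, 300))      # 294.5: mean of the last 12 samples
print(predict(data, 300, [86400], 300, 300, sigma=True))  # nan: one sample has no sigma
```

Under this model a window of 1 and a window of 300 give the same result, because the window is clamped to one step; whether rrdtool behaves the same way is exactly the open question about the window argument.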

So it is very flexible if you get creative. Maybe we should add these examples to the documentation once we have verified that they work as expected.

We might also create some RPN aliases for these, to shorten the required arguments and make them easier to read.

Once you have decided on the final format of the negative count form of PREDICT, I might create a patch that also does the percentile calculations.

One concern I am starting to have, though, is applying this to much lower-resolution data, say when graphing a year with percentile predictions. The numbers will then change dramatically when you switch resolutions, because you no longer have 48 data points from which to calculate the percentile (at roughly 2% resolution) but perhaps only 8 values (at roughly 14% resolution), depending on the exact RRA definitions.
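
The 2% and 14% figures follow if the percentile is computed by linear interpolation between sorted samples (rank = p/100 * (n-1)), because neighbouring samples are then 100/(n-1) percent apart: about 2.1% for 48 values and about 14.3% for 8. A sketch under that assumption; the convention an actual PREDICT percentile patch would use is not fixed here, and `percentile`/`rank_step` are hypothetical names.

```python
import math
from typing import Iterable


def percentile(values: Iterable[float], p: float) -> float:
    """Linear-interpolation percentile (rank = p/100 * (n-1)), skipping NaN; p in [0, 100]."""
    data = sorted(v for v in values if not math.isnan(v))
    if not data:
        return math.nan
    rank = p / 100 * (len(data) - 1)
    lo = math.floor(rank)
    hi = min(lo + 1, len(data) - 1)
    return data[lo] + (data[hi] - data[lo]) * (rank - lo)


def rank_step(n: int) -> float:
    """Percentile distance between two neighbouring sorted samples, in percent."""
    return 100 / (n - 1)


print(round(rank_step(48), 1), round(rank_step(8), 1))  # 2.1 14.3
```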

From a mathematical perspective, I am also concerned whether the averaging done by the RRDtool consolidation function and a percentile computation on top of it really play well together and give sensible results.

I fear this could lead to misinterpretation of the data by people without deep training in statistics, and I myself am not educated enough to say whether there is a real risk (beyond a hunch that it could produce unexpected artifacts).
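
A small synthetic example shows one artifact of the suspected kind: consolidating with an average before taking a percentile flattens short spikes, so the 95th percentile of the consolidated rows can be far below that of the raw samples. The data is made up for illustration, and `consolidate_average` is a simplified stand-in for an AVERAGE RRA (no xff or unknown-value handling).

```python
import statistics
from typing import List, Sequence


def consolidate_average(samples: Sequence[float], factor: int) -> List[float]:
    """Average each run of `factor` samples, like an AVERAGE RRA with `factor` steps per row."""
    return [sum(samples[i:i + factor]) / factor
            for i in range(0, len(samples) - factor + 1, factor)]


def p95(values: Sequence[float]) -> float:
    # method="inclusive" interpolates on rank p*(n-1) between the sorted values.
    return statistics.quantiles(values, n=100, method="inclusive")[94]


raw = [100.0 if i % 12 == 0 else 10.0 for i in range(48)]  # 300 s steps, one spike per hour
half_hourly = consolidate_average(raw, 6)                    # 8 rows of 1800 s
print(p95(raw), p95(half_hourly))  # 100.0 25.0
```

The same traffic thus reports a 95th percentile of 100 at full resolution but 25 after consolidation, which is the kind of resolution-dependent jump that could mislead a reader of the graph.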

But for short time ranges at the highest resolution it should not be an issue.

Ciao, Martin

Sent from my iPad

> On 26.04.2014, at 23:27, Steve Shipway <s.shipway at auckland.ac.nz> wrote:
> 
> Thanks for the clarification... I'm still not convinced that the 0-shift should be included when using a negative count, but that's likely personal preference and, as you say, the explicit list is always available anyway.
> 
> Maybe it should also allow 0 to mean a single 0-shift, i.e.
>  0,x,PREDICT 
> to be the same as 
>  0,1,x,PREDICT
> 
> Of course, with the current negative form not including n, this is also the same as
>  s,-1,x,PREDICT 
> for any value of s, which is a bit pointless and the main reason I felt something was amiss.
> 
> I have a paragraph written up on this behaviour for the RRDTool book I'm working on (yes! it lives! like a zombie it pulls itself from the 2-year-old grave...) so I'll email this to Tobi for inclusion in the online manual if he wants.
> 
> As for predict_median and predict_percentile -- as you say, they could be very CPU-expensive, and if you're going that far you'd likely just set up a few HW RRAs instead. Still, it never hurts to have more tools in your toolbox...
> 
> If we're adding new operations to the RPN, my choice would be ROL and ROR (rotate the top 3 stack items) and some date calculations (given an epoch time, extract the day of week, hour of day, etc. in the local timezone). Maybe I'll download the latest dev snapshot and see what I can do.
> 
> Steve
> 
> Steve Shipway
> University of Auckland ITS
> UNIX Systems Design Lead
> s.shipway at auckland.ac.nz
> Ph: +64 9 373 7599 ext 86487
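
Steve's request for date calculations amounts to splitting an epoch time into local calendar fields. A sketch of what such operators would compute, assuming a fixed UTC offset for simplicity (real local-time handling with DST would need a timezone database, e.g. Python's `zoneinfo`) and C's `tm_wday` convention of 0 for Sunday; `epoch_fields` is a hypothetical helper, not an rrdtool operator. The example timestamp is the date of Martin's reply, Sun Apr 27 08:44:47 CEST (UTC+2) 2014.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict


def epoch_fields(ts: int, utc_offset_hours: float) -> Dict[str, int]:
    """Hypothetical helper: day of week, hour and minute of an epoch time at a fixed UTC offset."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    local = datetime.fromtimestamp(ts, tz=tz)
    return {
        "day_of_week": local.isoweekday() % 7,  # 0 = Sunday ... 6 = Saturday, as in C's tm_wday
        "hour": local.hour,
        "minute": local.minute,
    }


print(epoch_fields(1398581087, 2))   # {'day_of_week': 0, 'hour': 8, 'minute': 44}
print(epoch_fields(1398581087, 12))  # {'day_of_week': 0, 'hour': 18, 'minute': 44}
```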