[rrd-users] Re: percentile calculations

Simon Hobson linux at thehobsons.co.uk
Tue Sep 5 20:56:37 MEST 2006

Alex van den Bogaerdt wrote:

>> However, when I'm displaying the graph for the current month, the
>> PERCENT function is using all the unknown future values in the
>> calculation, causing it to be incorrect.

Seems obvious enough to me !

>You seem to know about your "unknown" data.  That means it isn't
>as unknown as the name suggests...

Just because you know that it's unknown doesn't make it any less 
unknown in value - without wanting to sound like a politician talking 
crap about unknown unknowns and known unknowns !

>> As a very simplified example, say I'm 10 days into a month (with 20 days
>> remaining) and the values so far look like this:
>> 1,2,3,4,5,6,7,8,9,10
>> The 90th percentile should be 9
>according to what/who ?

Common usage ?

Seems obvious enough to me that 9 is the value that 90% of the 
values in the list are less than or equal to. Isn't that what a 
percentile is about ? OK, it's a bit coarse with so few samples.

>> I have looked through the documentation and can't find any mechanism
>> which would allow me to restrict the PERCENT function to a specific date
>> range (to exclude values in the future), or exclude NaN values.
>Why graph values in the future, you know this won't include useful data.

But it may well produce useful graphs ! For some reason accountants 
seem to like pigeonholing numbers into arbitrary calendar units 
unrelated to what's actually going on in a business. One example that 
comes easily to mind is an accountant who wants the sales figure for 
the current month graphing - not the last 30 days, but the current 
calendar month. Unless you have found a working crystal ball, at any 
point before the end of the month you will have unknown values in the 

If the above samples (ie 1 .. 10) were values for the 1st through 
10th of the month, then the right place to draw the line would be at 
9 - ignoring unknown values for the 11th through the end of the month 
(the 28th, 30th or 31st). If you assumed zero for future samples then 
the line would incorrectly end up at 7. Similarly your average would 
end up at a little under 2 instead of 5.5 ! I don't think any 
accountant would accept 1.8 as an average of sales so far this month 
from those numbers.
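
To make the arithmetic above concrete, here is a rough sketch in plain 
Python (not rrdtool). It assumes a "nearest rank" definition of the 
percentile, which is one common definition and not necessarily the one 
PERCENT uses internally; the function names are made up for the 
illustration.

```python
import math

def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile over the known values; NaN entries are skipped.

    Assumption: pct is an integer in 0..100 and "nearest rank" is the
    definition wanted. This is not taken from rrdtool.
    """
    known = sorted(v for v in values if not math.isnan(v))
    if not known:
        return math.nan
    rank = max(1, -(-pct * len(known) // 100))  # integer ceil(pct * n / 100)
    return known[rank - 1]

def mean_known(values):
    """Average of the known values only; NaN entries are skipped."""
    known = [v for v in values if not math.isnan(v)]
    return sum(known) / len(known) if known else math.nan

# Days 1..10 have values 1..10; the remaining 20 days are still in the future.
so_far = [float(d) for d in range(1, 11)]
month_unknown = so_far + [math.nan] * 20   # future left as unknown
month_zeroed = so_far + [0.0] * 20         # future replaced by zero

print(nearest_rank_percentile(month_unknown, 90))  # 9.0
print(nearest_rank_percentile(month_zeroed, 90))   # 7.0
print(mean_known(month_unknown))                   # 5.5
print(mean_known(month_zeroed))                    # about 1.83
```

Skipping the unknowns gives the 9 and 5.5 you would expect; padding with 
zeros drags the percentile down to 7 and the average to under 2.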

Changing things a little, suppose there are 10 units of sales on day 
1, would you accept a figure of 0.33 units/day as the average sales so 
far this month or would you expect 10 ?

As another example, my ISP will give me graphs of my bandwidth usage 
over a billing period. On 10th of each month it starts out with a 
nearly empty graph and it fills up until the 9th of the following 
month. What's useful for me is not an average calculated as "total to 
date / 30" but "total to date / days so far".

OK, it would probably be equally (possibly more) useful to show "last 
30 days", but current billing period is what we get.
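
As a small illustration of dividing by the days elapsed so far instead 
of by the length of the period, something like the following would do. 
It is only a sketch in plain Python; the function name and the dates 
are invented for the example, and it counts the start day and today as 
both belonging to the period.

```python
from datetime import date

def average_per_elapsed_day(total_so_far, period_start, today):
    """Average per day over the days elapsed so far (both end dates counted)."""
    days_so_far = (today - period_start).days + 1
    if days_so_far < 1:
        raise ValueError("today is before the start of the period")
    return total_so_far / days_so_far

# Hypothetical figures: 10 units sold on day 1 of a 30-day month.
start = date(2006, 9, 1)
print(average_per_elapsed_day(10, start, date(2006, 9, 1)))  # 10.0
print(10 / 30)  # fixed divisor of 30: about 0.33

# Hypothetical billing period starting on the 10th, checked on the 19th
# with a running total of 55: ten days have elapsed.
print(average_per_elapsed_day(55, date(2006, 9, 10), date(2006, 9, 19)))  # 5.5
```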

>try changing unknown into some known value, like zero or a very large
>negative number

I fail to see how that will help - it will just further skew the data.

The answer would appear to be to do the calculations over the range 
"1st of month" to "today" whilst plotting them on an X axis from "1st 
of month" to "end of month" - is that easy to do ?
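
Whether rrdtool itself can do this I don't know, but the idea is easy 
to show outside it. The sketch below is plain Python with made-up 
names and sample data: it builds an X axis covering the whole calendar 
month, while handing back only the samples dated up to today for any 
calculation.

```python
import calendar
from datetime import date, timedelta

def month_axis(today):
    """All dates of today's calendar month, for the X axis."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    first = today.replace(day=1)
    return [first + timedelta(days=i) for i in range(days_in_month)]

def split_window(samples, today):
    """Return (axis, to_date): full-month axis, samples only up to today."""
    axis = month_axis(today)
    to_date = [samples[d] for d in axis if d <= today and d in samples]
    return axis, to_date

# Hypothetical daily samples keyed by date: 1..10 for the 1st to the 10th.
today = date(2006, 9, 10)
samples = {date(2006, 9, d): float(d) for d in range(1, 11)}

axis, to_date = split_window(samples, today)
print(len(axis), len(to_date))      # 30 10
print(sum(to_date) / len(to_date))  # 5.5
```

The graph would then be drawn against axis, with the average or 
percentile line computed from to_date only.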

