# [rrd-users] Re: percentile calculations

Simon Hobson linux at thehobsons.co.uk
Tue Sep 5 20:56:37 MEST 2006

Alex van den Bogaerdt wrote:

>> However, when I'm displaying the graph for the current month, the
>> PERCENT function is using all the unknown future values in the
>> calculation
>
>sure
>
>>               causing it to be incorrect.
>
>Why?

Seems obvious enough to me !

>You seem to know about your "unknown" data.  That means it isn't
>as unknown as the name suggests...

Just because you know that it's unknown doesn't make it any less
unknown in value - without wanting to sound like a politician talking
crap about unknown unknowns and known unknowns !

>> As a very simplified example, say I'm 10 days into a month (with 20 days
>> remaining) and the values so far look like this:
>>
>> 1,2,3,4,5,6,7,8,9,10
>>
>> The 90th percentile should be 9
>
>according to what/who ?

Common usage ?

It seems clear enough to me that 9 is the value that 90% of the
values in the list are less than or equal to. Isn't that what a
percentile is about? OK, it's a bit coarse with so few samples.

>> I have looked through the documentation and can't find any mechanism
>> which would allow me to restrict the PERCENT function to a specific date
>> range (to exclude values in the future), or exclude NaN values.
>
>Why graph values in the future, you know this won't include useful data.

But it may well produce useful graphs! For some reason accountants
seem to like pigeonholing numbers into arbitrary calendar units
unrelated to what's actually going on in a business. One example that
comes easily to mind is an accountant who wants the sales figures for
the current month graphed: not the last 30 days, but the current
calendar month. Unless you have found a working crystal ball, at any
point before the end of the month you will have unknown values in the
graph.

If the above samples (i.e. 1 .. 10) were values for the 1st through
10th of the month, then the right place to draw the line would be at
9, ignoring the unknown values for the 11th through the 28th, 30th or
31st. If you assumed zero for future samples in a 30-day month, then
the line would incorrectly end up at 7. Similarly your average would
end up at a little under 2 (55/30, about 1.8) instead of 5.5! I don't
think any accountant would accept 1.8 as an average of sales so far
this month from those numbers.
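
To make that arithmetic concrete, here is a rough sketch in Python (not rrdtool syntax) comparing the two treatments of the unknown days. It assumes a 30-day month and the simple "value that 90% of the samples are at or below" reading of a percentile used above; the helper names are made up for illustration.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples <= it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

def known(values):
    """Drop unknown (NaN) samples before calculating."""
    return [v for v in values if not math.isnan(v)]

def unknown_as_zero(values):
    """Pretend every unknown (NaN) sample was zero."""
    return [0.0 if math.isnan(v) else v for v in values]

# Days 1..10 have values 1..10; days 11..30 are still unknown.
month = [float(d) for d in range(1, 11)] + [math.nan] * 20

ignoring = known(month)
zeroed = unknown_as_zero(month)

print(percentile(ignoring, 90), sum(ignoring) / len(ignoring))        # 9.0 5.5
print(percentile(zeroed, 90), round(sum(zeroed) / len(zeroed), 2))    # 7.0 1.83
```

Dropping the unknowns gives the 9 and 5.5 I'd expect; counting them as zero drags the figures down to 7 and about 1.8.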

Changing things a little, suppose there are 10 units of sales on day
1. Would you accept a figure of 0.33 units/day as the average sales so
far this month, or would you expect 10?

As another example, my ISP will give me graphs of my bandwidth usage
over a billing period. On 10th of each month it starts out with a
nearly empty graph and it fills up until the 9th of the following
month. What's useful for me is not an average calculated as "total to
date / 30" but "total to date / days so far".

OK, it would probably be equally (possibly more) useful to show "last
30 days", but current billing period is what we get.

>try changing unknown into some known value, like zero or a very large
>negative number

I fail to see how that will help - it will just further skew the data.

The answer would appear to be to do the calculations over the range
"1st of month" to "today" whilst plotting them on an X axis from "1st
of month" to "end of month" - is that easy to do ?
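
In case it helps to pin down what I'm asking for, here is a small Python sketch of the intent. It is not rrdtool syntax and says nothing about how rrdtool would do it; the function name and the dict-of-dates input format are assumptions made purely for illustration. The X axis covers the whole calendar month, while the statistics only see the samples from the 1st up to today.

```python
import calendar
import datetime
import math

def month_to_date(samples, today):
    """Split one calendar month into a full-month plot series and a stats window.

    samples: dict mapping datetime.date -> value (hypothetical input format).
    Returns (axis, plotted, window).
    """
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    # X axis: every day from the 1st to the end of the month.
    axis = [today.replace(day=d) for d in range(1, days_in_month + 1)]
    # Plotted series: unknown days stay unknown (NaN) so the graph shows a gap.
    plotted = [samples.get(day, math.nan) for day in axis]
    # Statistics window: only known samples from the 1st up to and including today.
    window = [samples[day] for day in axis if day <= today and day in samples]
    return axis, plotted, window

today = datetime.date(2006, 9, 10)
samples = {datetime.date(2006, 9, d): float(d) for d in range(1, 11)}

axis, plotted, window = month_to_date(samples, today)
average_to_date = sum(window) / len(window)
print(len(axis), len(window), average_to_date)  # 30 10 5.5
```

The graph still runs to the end of the month, but the average (and any percentile) is worked out from the ten known days only.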
