[rrd-users] Data Mining: Correlation Engine

Martin Sperl rrdtool at martin.sperl.org
Tue Nov 11 08:36:20 CET 2008


Hi!

Actually I have been talking to Tobi regarding this quite recently, as 
this again came up during one of our projects.
An example question there was: what is the maximum number of transactions
we can reach on a specific server hardware (correlating CPU usage and
TPS)? This actually works quite well, and we have been able to
differentiate between hardware generations quite easily...

So I have proposed to Tobi to contribute the following over the next few
months:
* creating a graphing facility in rrd that graphs not a time series but a
scatter plot of 2 data sources (CDEFs should also be able to act as
data sources!)
* simple VDEF functions to calculate some "simple" correlations (e.g.
linear/polynomial fits), and then using CDEFs to calculate and present
this graph...
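
To illustrate the kind of "simple" correlation meant here, the following is
a minimal sketch in plain Python, outside of rrdtool. It fits a straight
line through paired samples and reports the Pearson correlation
coefficient. The function name linear_fit and the sample values are
made up for illustration; they do not come from rrdtool or from the
project mentioned above.

```python
from math import sqrt

def linear_fit(xs, ys):
    """Ordinary least-squares line y = slope*x + intercept, plus Pearson r."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long series with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("a constant series has no defined correlation")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r

# Hypothetical paired samples taken at the same timestamps:
# CPU usage in percent and transactions per second (TPS).
cpu = [10.0, 20.0, 30.0, 40.0, 50.0]
tps = [105.0, 198.0, 310.0, 402.0, 495.0]

# Model CPU as a linear function of TPS, then solve for the TPS
# at which the fitted line reaches 100% CPU.
slope, intercept, r = linear_fit(tps, cpu)
max_tps = (100.0 - intercept) / slope
print("r = %.4f, estimated TPS at 100%% CPU = %.0f" % (r, max_tps))
```

The extrapolation to 100% CPU assumes the relation stays linear up to
saturation, which real hardware often does not do, so the result is only
a rough upper bound.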

I believe that based on this one can write an easy framework for
correlating different data and then presenting it. Still, IMHO the most
important thing for these to work is to have visualization. Actually, my
approach is first to create a website that presents a matrix of
correlation graphs for different data sources. This way we can find out
visually what is significant...

But for me the next task is adding a least-squares fit engine for
polynomials and sums of sines to rrdtool, so that we can create, out of
the box, a prediction for the question: "from what we know now, can we
predict the value in 6 months' time?". This is actually much more
important for our performance project to start from...
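
As a rough sketch of such a fit (plain Python, not rrdtool code), the
example below fits a polynomial plus sine/cosine pairs by solving the
normal equations. It assumes the periods of the sines are known in
advance (e.g. a weekly cycle), which keeps the problem linear; fitting
unknown periods would need a nonlinear method. All names (solve, basis,
fit, predict) and the synthetic data are invented for illustration.

```python
from math import sin, cos, pi

def solve(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: basis not independent on this data")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def basis(t, degree, periods):
    """Basis values at time t: 1, t, ..., t^degree, then sin/cos per period."""
    row = [t ** k for k in range(degree + 1)]
    for p in periods:
        w = 2 * pi / p
        row += [sin(w * t), cos(w * t)]
    return row

def fit(ts, ys, degree, periods):
    """Least-squares coefficients via the normal equations (A^T A) c = A^T y."""
    rows = [basis(t, degree, periods) for t in ts]
    n = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    aty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(n)]
    return solve(ata, aty)

def predict(coeffs, t, degree, periods):
    """Evaluate the fitted model at time t."""
    return sum(c * b for c, b in zip(coeffs, basis(t, degree, periods)))

# Synthetic data: 60 daily samples with a linear trend and a weekly cycle.
ts = [float(d) for d in range(60)]
ys = [50.0 + 0.2 * t + 5.0 * sin(2 * pi * t / 7.0) for t in ts]

coeffs = fit(ts, ys, degree=1, periods=[7.0])
forecast = predict(coeffs, 180.0, degree=1, periods=[7.0])  # about 6 months out
print("forecast for day 180: %.2f" % forecast)
```

The normal equations become badly conditioned for higher polynomial
degrees or large time values (such as raw epoch timestamps), so rescaling
the time axis first is advisable in practice.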

Ciao,
         Martin

fcocquyt wrote:
> Hello,
> First off, big thanks to Tobias for creating RRDTool - the basis for a lot
> of great sysadmin'ing ;)
>
> I searched the forums without an answer - has anyone looked at a data mining
> engine for RRDTool data?
> An example application would be computing the correlation of different
> datasources in the set of all datasources (eg cacti installation).
> In his talk today he outlined the roadmap; with RRDcached the
> distributed model seems to be on its way, lending itself to the
> (background compute) data mining approach...
> To my way of thinking much of the untapped value of RRDtool datasets rests
> in the analysis across the rrd files (eg wow, our online transactions
> (sales) drop off dramatically with our backend DB latency - if we upgrade
> our DB for {fixed cost} we can generate much more revenue).
>
> Anyone else see value in such a data mining engine for RRDTool?
>
> thanks,
> Fletcher
>   


