[rrd-developers] Re: How to get the most performance when using lots of RRD files

Martin Sperl rrdtool at martin.sperl.org
Sun Aug 20 11:11:42 MEST 2006


Hi!
I remember making similar observations some time ago, so I have
already written an SQL backend for RRD (look for the libdbi patch),
and it works quite well for us: more than 60000 data sources added
every 5 minutes, resulting in currently 100M rows of data in the
format (time stamp, data source-id, value). There is much less IO
overhead in adding data to the database than in updating an rrd file -
and you can also spread the load across different machines easily...
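
Just to illustrate the layout (the table and column names here are my
own examples, not something the patch mandates):

  CREATE TABLE data (
    ds_id INT UNSIGNED NOT NULL,   -- data source id
    ts    INT UNSIGNED NOT NULL,   -- unix time stamp
    value DOUBLE,                  -- NULL = unknown sample
    PRIMARY KEY (ds_id, ts)
  );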

The performance observation I have made is that with this huge table,
graphing one data source takes some time for the initial data fetch
(the index has to be read first); graphing it again afterwards is very
fast. This is naturally a matter of OS and DB caching, which in turn
depends on the memory size of the server... So you will have the
memory side of the problem either way.

With MySQL there is also a table-locking issue: MyISAM tables have no
row-level locking, and using InnoDB brings a performance penalty and
increased size for data storage. The way around this is to have 2 (or
more) tables: one (short-term) for entering data and a second one for
"historic read-only" data, to which data is moved regularly to keep
the short-term table small. This also allows using a different table
type for each of these tables (InnoDB for short-term and MyISAM for
long-term), as sketched below.
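
A sketch of such a setup, reusing the example table from above (the
one-day cutoff is just an example):

  -- two tables with the same layout as "data" above:
  CREATE TABLE data_short LIKE data;
  ALTER TABLE data_short ENGINE=InnoDB;   -- row-level locking for inserts
  CREATE TABLE data_long LIKE data;
  ALTER TABLE data_long ENGINE=MyISAM;    -- compact, read-mostly storage

  -- move everything older than one day; run this periodically:
  SET @cutoff = UNIX_TIMESTAMP() - 86400;
  INSERT INTO data_long SELECT * FROM data_short WHERE ts < @cutoff;
  DELETE FROM data_short WHERE ts < @cutoff;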

Regarding keeping min, value and max in one table row: I believe this
introduces more disk-space overhead than it is worth. I think the data
can be reduced just as easily by collapsing a run of samples - say 30
values - into 3 values: Min(Series), Max(Series) and AVG(calc), where
the third value has to be calculated as

   AVG(calc) = 3*AVG(Series) - MIN(Series) - MAX(Series)

so that the average over the three stored values equals the average of
the raw series. This way you get min/max/average back with the normal
"SQL consolidation functions" and can also keep averaging over more
series and still get sensible results. There is also an extension to 6
values for the case where you store absolute counters in the database
but want to graph only the "delta/derived" values. This is much more
efficient when adding data, as there is no need to calculate the delta
at insert time, and no need to keep the last absolute value stored
somewhere for later reference to calculate the delta, which makes the
whole setup much easier...
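
As a sketch (again using my example tables from above), reducing the
5-minute samples to one 3-row group per 9000 seconds would look like:

  SELECT ds_id,
         ts - ts % 9000                          AS interval_start,
         MIN(value)                              AS v_min,
         MAX(value)                              AS v_max,
         3*AVG(value) - MIN(value) - MAX(value)  AS v_avgcalc
  FROM   data_short
  GROUP  BY ds_id, ts - ts % 9000;   -- 9000s = 30 samples * 300s

Since (v_min + v_max + v_avgcalc)/3 = AVG(Series), a plain SELECT
MIN(value), MAX(value), AVG(value) over the three stored rows of each
interval returns the correct min, max and average again.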

The SQL backend is also written in such a way that you can use almost
any table structure that fits your preferred data layout.

The advantage of having one table with several value fields (time
stamp, value1, value2, value3, ...) is that the database index will be
much smaller, but such a setup is only helpful for specific
applications. The advantage of (time stamp, data source-id, value) is
that it is generic and can be applied to any kind of data source - at
the cost of a bigger index...
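
For comparison with the generic layout above, such an
application-specific table could look like this (the columns are made
up for illustration):

  -- one row per time stamp: smaller index, but a fixed set of sources
  CREATE TABLE data_wide (
    ts     INT UNSIGNED NOT NULL PRIMARY KEY,  -- unix time stamp
    value1 DOUBLE,   -- e.g. inbound traffic
    value2 DOUBLE,   -- e.g. outbound traffic
    value3 DOUBLE    -- e.g. error count
  );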

For storing values in the database it is IMHO also much more efficient
not to call rrdtool to store the data, but to insert the data directly
into the database from your script, as you will normally have some
additional data to fetch from the database anyway. I assume that cacti
is doing something like this...
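
E.g. one multi-row INSERT per poller cycle instead of one rrdtool call
per data source (the ids and values here are made up):

  INSERT INTO data_short (ds_id, ts, value) VALUES
    (17, 1156068600, 1234.5),
    (18, 1156068600, 42.0),
    (19, 1156068600, 0.25);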

Ciao,
          Martin

P.S.: I am currently updating my sql/libdbi patch to fix some
performance issues and to allow the use of several tables instead of
one. I will send an announcement when the updated patch is ready -
should be fairly soon... The SQL patch also includes a mode for
predicting future data together with a sigma. One of our applications
uses this to show whether the current web traffic is "within normal
bounds of operation"... (This does not need a special setup like the
Holt-Winters forecasting!) There is also the idea of using FFT or sine
least-squares fits for other kinds of prediction instead of the
current "shift and average" mode of operation, which already works
very well...

Richard A Steenbergen wrote:
> On Fri, Aug 18, 2006 at 07:05:58AM +0200, Tobias Oetiker wrote:
>   
>> Hi Richard,
>>
>>     
>>> The current design of rrdtool is based around scripts calling tools which
>>> do a transaction using a single .rrd file, and then quit.
>>>       
>> if you have lots of data I guess you would NOT use the cli but
>> rather the perl module ... but besides this ....
>>     
>
> Which then calls the CLI, yes? Using the perl module is one way to manage 
> complexity, writing your own interface to call rrd functions is another. 
> Perl is not a good solution for every problem. :)
>
>   
>>> Note that I'm not suggesting we all run out and start moving our graphing
>>> DBs to SQL, but the necessary architecture to scale to large data sets is
>>> abundantly clear thanks to all those people who spend lots of time and
>>> energy developing databases.
>>>       
>> Have you actually run tests with databases on this ? are they
>> faster when you update hundreds of thousands of different 'data
>>     
>
> Are intelligent buffered writes to a structured db multiplexed by a 
> persistent server process more efficient than starting a new process which 
> blocks while it does open, lock, write, close, and exit, for every 
> transaction? Absolutely.
>
>   
>> * ds table
>>   ds-id, name, type, min, max
>>
>> * data table
>>   ds-id, timestamp, value
>>     
>
> Pretty much. There are advantages to having a persistent poller here too, 
> so you can cache the ds id's and just fire off a batch of updates every 
> time your poller cycle hits without needing to query ds status. The same 
> goes for handling counters or absolutes if you want to store data as 
> native rates: you'd want to minimize db transactions, though you could 
> also accomplish this with db server-side functions.
>
>   
>> or would you create a different table for each 'datasource'?
>>     
>
> This is a Bad Idea (tm), and one of the fundamental mistakes that RTG 
> makes. Using table names to index data is not what relational databases 
> were meant to do, and takes you right back to the same problem you have 
> today. :)
>
>   


