[rrd-developers] Re: How to get the most performance when using lots of RRD files
Martin Sperl
rrdtool at martin.sperl.org
Sun Aug 20 11:11:42 MEST 2006
Hi!
I remember having made similar observations some time ago, so I have
already written a SQL backend to RRD (look for a libdbi patch) and it
works quite fine for us with more than 60000 data sources added every 5
minutes, currently resulting in 100M rows of data in the format
(timestamp, data-source id, value). There is much less I/O overhead
involved in adding data to the database than in adding to an rrd file,
and you can easily spread the load over different machines...
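
A minimal sketch of such a narrow table and of adding one polling cycle
in a single transaction. It uses Python's sqlite3 module only so that it
runs anywhere; it is not the libdbi patch, and the table name, column
names and helper functions are assumptions made for illustration.

```python
import sqlite3
import time


def open_store(path=":memory:"):
    """Create the narrow (ts, ds_id, value) table; all names are hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS data ("
        " ts INTEGER NOT NULL,"
        " ds_id INTEGER NOT NULL,"
        " value REAL,"
        " PRIMARY KEY (ds_id, ts))"
    )
    return conn


def add_samples(conn, ts, samples):
    """Insert one polling cycle; samples is an iterable of (ds_id, value)."""
    rows = [(ts, ds_id, value) for ds_id, value in samples]
    with conn:  # one transaction: commit on success, roll back on error
        conn.executemany(
            "INSERT INTO data (ts, ds_id, value) VALUES (?, ?, ?)", rows
        )
    return len(rows)


conn = open_store()
now = int(time.time()) // 300 * 300  # align to the 5-minute step
inserted = add_samples(conn, now, [(1, 10.5), (2, 3.0), (3, 42.0)])
row_count = conn.execute("SELECT COUNT(*) FROM data").fetchone()[0]
```

Batching a whole cycle into one transaction is the point here: the
per-row cost is an index update, not an open/lock/write/close of a file.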
The performance observation I have made is that with this huge table
graphing one data source takes some time to fetch the data initially
(the index has to be read,...) and then graphing again works very fast.
But this naturally depends on OS and DB caching, which in turn depend
on the memory size of the server... So you will have the memory side of
the problem anyway.
Also with MySQL there is a table-locking issue: MyISAM tables have no
row-level locking, and InnoDB carries a performance penalty and needs
more space for data storage. The way around this is to have 2 (or
more) tables:
one (short-term) for entering data and a second one for historic,
read-only data, to which data needs to be moved regularly to keep the
short-term table small. This also allows using a different table type
for each of these tables (InnoDB for short-term and MyISAM for
long-term).
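
The regular move from the short-term to the long-term table could look
roughly like the sketch below. It again uses sqlite3 as a stand-in, so
the InnoDB/MyISAM choice is not shown; the table names data_short and
data_long and the function archive_older_than are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
for name in ("data_short", "data_long"):  # hypothetical table names
    conn.execute(
        f"CREATE TABLE {name} (ts INTEGER, ds_id INTEGER, value REAL,"
        " PRIMARY KEY (ds_id, ts))"
    )


def archive_older_than(conn, cutoff_ts):
    """Move rows with ts < cutoff_ts from data_short to data_long."""
    with conn:  # copy and delete in one transaction
        conn.execute(
            "INSERT INTO data_long (ts, ds_id, value)"
            " SELECT ts, ds_id, value FROM data_short WHERE ts < ?",
            (cutoff_ts,),
        )
        cur = conn.execute(
            "DELETE FROM data_short WHERE ts < ?", (cutoff_ts,)
        )
    return cur.rowcount  # number of rows removed from the short-term table


with conn:
    conn.executemany(
        "INSERT INTO data_short VALUES (?, ?, ?)",
        [(300, 1, 1.0), (600, 1, 2.0), (900, 1, 3.0)],
    )

moved = archive_older_than(conn, 900)
short_rows = conn.execute("SELECT COUNT(*) FROM data_short").fetchone()[0]
long_rows = conn.execute("SELECT COUNT(*) FROM data_long").fetchone()[0]
```

A graphing query would then have to read from both tables (for example
with a UNION ALL) whenever its time range crosses the cutoff.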
Regarding keeping min, value, max in one row of the table, I believe
this will introduce more disk-space overhead than it is worth. I think
that reducing data can be done just as easily by reducing a series of,
say, 30 values to 3 values. These 3 values would be
MIN(Series), MAX(Series), AVG(calc), where the average has to be
calculated like this: AVG(calc)=3*AVG(Series)-MIN(Series)-MAX(Series).
This way you get min/max/average by using the normal
"SQL consolidation functions" and can also start averaging over more
series and still get sensible results. There is also an extension to 6
values if you need to store absolute counters in the database and want
to graph only the "delta/derived" values. This is much more efficient
when adding data, as there is no need to calculate the delta at insert
time and no need to keep the last absolute value stored somewhere
for later reference to calculate the delta, which makes the whole setup
much easier...
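
The reason the formula works is that the three stored rows average to
(MIN + MAX + AVG(calc)) / 3 = AVG(Series). A small sketch of the
reduction, checked with the ordinary SQL aggregates; the function name
reduce_series, the table layout and the sample data are assumptions.

```python
import sqlite3


def reduce_series(values):
    """Reduce a series to [min, max, calc] with calc = 3*avg - min - max."""
    lo, hi = min(values), max(values)
    avg = sum(values) / len(values)
    return [lo, hi, 3 * avg - lo - hi]


series = [float(v) for v in range(1, 31)]  # 30 samples: 1.0 .. 30.0
reduced = reduce_series(series)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (ts INTEGER, ds_id INTEGER, value REAL)")
with conn:
    conn.executemany(
        "INSERT INTO data VALUES (?, 1, ?)",
        [(i, v) for i, v in enumerate(reduced)],
    )

# The plain aggregates over the 3 stored rows recover min, max and average.
agg = conn.execute(
    "SELECT MIN(value), MAX(value), AVG(value) FROM data WHERE ds_id = 1"
).fetchone()
```

One caveat with this sketch: for a strongly skewed series (say 29 zeros
and a single 10) the calculated value falls outside the range between
min and max, so MIN or MAX over the reduced rows would no longer match
the original series. The average is still recovered.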
Also the SQL backend is written in such a way that you can use almost
any kind of table structure that fits your preferred data layout.
The advantage of having one table with several value fields (timestamp,
value1, value2, value3, ...) is that the database index will be much
smaller, but such a setup is only helpful for specific applications.
The advantage of using (timestamp, data-source id, value) is that it is
generic and can be applied to any kind of data source, at the cost of a
bigger index...
For storing values in the database it is IMHO also much more efficient
not to call rrdtool for storing the data, but adding the data directly
to the database from your script, as you will normally always have some
additional data that needs to be fetched from the database anyway. I
assume that cacti is always doing something like this...
Ciao,
Martin
P.S.: I am currently updating my sql/libdbi patch to fix some performance
issues and also to allow the use of several tables instead of one. I
will send an announcement when this updated patch is ready - should be
fairly soon... The SQL patch also includes a mode for predicting future
data together with a sigma. This is used for one of our applications to
show if the current web traffic is "within normal bounds of
operation"... (This does not need a special setup like the Holt-Winters
forecasting!) For this there is also the idea to use FFT or
sine-least-square-fits to get other kinds of prediction instead of the
current "shift and average" mode of operation, which works very well...
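
The mail does not spell out how "shift and average" works, so the
following is only one plausible reading, not the patch's actual code:
shift the series back by whole periods (for example days), average the
samples found at the same offset, and use their standard deviation as
sigma. The function names, the period and the sample data are all
assumptions.

```python
import math


def shift_and_average(history, period, shifts):
    """Predict the next sample from the samples k*period steps earlier.

    history: values at a fixed step, oldest first.
    Returns (mean, sigma) over the `shifts` shifted samples.
    """
    n = len(history)  # index of the sample to be predicted
    if n < shifts * period:
        raise ValueError("not enough history for the requested shifts")
    past = [history[n - k * period] for k in range(1, shifts + 1)]
    mean = sum(past) / len(past)
    sigma = math.sqrt(sum((v - mean) ** 2 for v in past) / len(past))
    return mean, sigma


def within_bounds(value, mean, sigma, width=2.0):
    """True if value lies within width * sigma of the prediction."""
    return abs(value - mean) <= width * sigma


# Three periods of 4 samples each, slowly rising from period to period.
history = [10, 20, 30, 20, 12, 22, 32, 22, 14, 24, 34, 24]
mean, sigma = shift_and_average(history, period=4, shifts=3)
ok = within_bounds(13, mean, sigma)
bad = within_bounds(30, mean, sigma)
```

Here the prediction for the next sample is built from the values 14, 12
and 10 found one, two and three periods earlier.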
Richard A Steenbergen wrote:
> On Fri, Aug 18, 2006 at 07:05:58AM +0200, Tobias Oetiker wrote:
>
>> Hi Richard,
>>
>>
>>> The current design of rrdtool is based around scripts calling tools which
>>> do a transaction using a single .rrd file, and then quit.
>>>
>> if you have lots of data I guess you would NOT use the cli but
>> rather the perl module ... but besides this ....
>>
>
> Which then calls the CLI, yes? Using the perl module is one way to manage
> complexity, writing your own interface to call rrd functions is another.
> Perl is not a good solution for every problem. :)
>
>
>>> Note that I'm not suggesting we all run out and start moving our graphing
>>> DBs to SQL, but the necessary architecture to scale to large data sets is
>>> abundantly clear thanks to all those people who spend lots of time and
>>> energy developing databases.
>>>
>> Have you actually run tests with databases on this ? are they
>> faster when you update hundreds of thousands of different 'data
>>
>
> Are intelligent buffered writes to a structured db multiplexed by a
> persistent server process more efficient than starting a new process which
> blocks while it does open, lock, write, close, and exit, for every
> transaction? Absolutely.
>
>
>> * ds table
>> ds-id, name, type, min, max
>>
>> * data table
>> ds-id, timestamp, value
>>
>
> Pretty much. There are advantages to having a persistent poller here too,
> so you can cache the ds id's and just fire off a batch of updates every
> time your poller cycle hits without needing to query ds status. Same thing
> for handling counters or absolutes if you want to store data as native
> rates, you'd want to minimize db transactions, though you could also
> accomplish this with db server-side functions.
>
>
>> or would you create a different table for each 'datasource' ?
>>
>
> This is a Bad Idea (tm), and one of the fundamental mistakes that RTG
> makes. Using table names to index data is not what relational databases
> were meant to do, and takes you right back to the same problem you have
> today. :)
>
>