[rrd-developers] Re: How to get the most performance when using lots of RRD files

Sun Aug 20 00:25:34 MEST 2006

On Fri, Aug 18, 2006 at 07:05:58AM +0200, Tobias Oetiker wrote:
> Hi Richard,
> 
> > The current design of rrdtool is based around scripts calling tools which
> > do a transaction using a single .rrd file, and then quit.
> 
> if you have lots of data I guess you would NOT use the cli but
> rather the perl module ... but besides this ....

Which then calls the CLI, yes? Using the perl module is one way to manage 
complexity, writing your own interface to call rrd functions is another. 
Perl is not a good solution for every problem. :)

> > Note that I'm not suggesting we all run out and start moving our graphing
> > DBs to SQL, but the necessary architecture to scale to large data sets is
> > abundantly clear thanks to all those people who spend lots of time and
> > energy developing databases.
> 
> Have you actually run tests with databases on this ? are they
> faster when you update hundreds of thousands of different 'data

Are intelligent buffered writes to a structured db multiplexed by a 
persistent server process more efficient than starting a new process which 
blocks while it does open, lock, write, close, and exit, for every 
transaction? Absolutely.
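
To make the buffering idea concrete, here is a minimal sketch of the kind of 
write buffer a persistent process could keep in memory. Every name in it 
(update_buffer, buffer_add, buffer_flush, count_flush, BATCH_MAX) is 
hypothetical and not part of rrdtool; the flush callback stands in for 
whatever single write or db transaction would really be issued per batch.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

#define BATCH_MAX 4	/* tiny on purpose; a real server would use far more */

/* One queued update (hypothetical layout). */
struct pending_update {
	int	ds_id;
	time_t	timestamp;
	double	value;
};

/* Called once per batch; stands in for one write or one db transaction. */
typedef void (*flush_fn)(const struct pending_update *batch, size_t n, void *ctx);

struct update_buffer {
	struct pending_update items[BATCH_MAX];
	size_t	 count;
	flush_fn flush;
	void	*ctx;
};

/* Example callback that only counts how many flushes and rows it saw. */
struct flush_stats {
	size_t calls;
	size_t rows;
};

void count_flush(const struct pending_update *batch, size_t n, void *ctx)
{
	struct flush_stats *s = ctx;

	(void)batch;
	s->calls++;
	s->rows += n;
}

/* Hand everything queued so far to the callback in a single call. */
void buffer_flush(struct update_buffer *b)
{
	assert(b != NULL && b->flush != NULL);
	if (b->count > 0) {
		b->flush(b->items, b->count, b->ctx);
		b->count = 0;
	}
}

/* Queue one update, flushing first if the buffer is already full. */
void buffer_add(struct update_buffer *b, int ds_id, time_t ts, double value)
{
	assert(b != NULL);
	if (b->count == BATCH_MAX)
		buffer_flush(b);
	b->items[b->count].ds_id = ds_id;
	b->items[b->count].timestamp = ts;
	b->items[b->count].value = value;
	b->count++;
}
```

The point is only that the cost of a flush is paid once per batch instead of 
once per update; a real server would also flush on a timer and on shutdown.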

> * ds table
>   ds-id, name, type, min, max
> 
> * data table
>   ds-id, timestamp, value

Pretty much. There are advantages to having a persistent poller here too: 
you can cache the ds ids and just fire off a batch of updates every time 
your poller cycle hits, without needing to query ds status. The same goes 
for handling counters or absolutes if you want to store data as native 
rates. You'd want to minimize db transactions, though you could also 
accomplish this with db server-side functions.
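
As a rough illustration of the poller-side approach, the sketch below keeps 
the cached ds id and the previous counter reading in memory and turns each 
new reading into a per-second rate, ready to be batched into the data table. 
The struct and function names are made up for this example, and it assumes a 
32-bit counter that wraps at most once between polls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical per-datasource state held by the persistent poller. */
struct ds_state {
	int	 ds_id;		/* cached id from the ds table */
	uint32_t last_value;	/* previous raw counter reading */
	time_t	 last_time;	/* when that reading was taken */
	int	 primed;	/* 0 until the first sample has been seen */
};

/* One row for the data table: ds-id, timestamp, value. */
struct data_row {
	int	ds_id;
	time_t	timestamp;
	double	value;
};

/*
 * Convert a raw counter reading into a rate. Returns 1 and fills *row when
 * a rate could be computed, 0 for the first sample or when time has not
 * advanced. The state is updated either way.
 */
int counter_to_rate(struct ds_state *st, uint32_t raw, time_t now,
    struct data_row *row)
{
	int produced = 0;

	assert(st != NULL && row != NULL);
	if (st->primed && difftime(now, st->last_time) > 0.0) {
		/* Unsigned subtraction absorbs a single 32-bit wrap. */
		uint32_t delta = (uint32_t)(raw - st->last_value);

		row->ds_id = st->ds_id;
		row->timestamp = now;
		row->value = (double)delta / difftime(now, st->last_time);
		produced = 1;
	}
	st->last_value = raw;
	st->last_time = now;
	st->primed = 1;
	return produced;
}
```

Because the previous reading lives in the poller, no query is needed before 
each insert; the same arithmetic could just as well live in a server-side 
function if you'd rather store raw counters.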

> or would you create a different table for each 'datasource' ?

This is a Bad Idea (tm), and one of the fundamental mistakes that RTG 
makes. Using table names to index data is not what relational databases 
were meant to do, and takes you right back to the same problem you have 
today. :)

> well the rrd_update example is nice, but how would you go for
> something like rrd_create, or rrd_graph ?

Well rrd_update() is probably the most important in terms of reducing 
overhead and needing a good C API, so the fact that it is the simplest is 
a bonus feature. :) But as far as other functions go... Really all you 
need to do is look to how you implement these things yourself, and then 
organize it so that users can do the same.

For example, let's take graphing... What are the logical steps involved? 
You need to load data from an rrd file, then you need to define any 
necessary cdef expressions using that data, and the elements you want to 
graph, and then you render that graph based on a bunch of parameters (some 
of which are required and fixed into the API, some of which are optional, 
and some of which are just behavioral flags).

Here is an example that I just pulled out of my ass. I don't know enough 
about the internal implementation of rrdtool to say if this is exactly how 
it should be done or not, so there may be plenty of modifications or 
optimizations available, but this is an "artist's concept" of what use of a 
proper C API might look like (error checking omitted for simplicity of 
course :P):

/* Handles for the (hypothetical) API objects used below. */
struct RRD_DB *rrd_db;
struct RRD_DEF *rrd_def;
struct RRD_CDEF *rrd_cdef;
struct RRD_GRAPH *rrd_graph;
time_t start, end;

/* Optional graph settings, passed as key/value pairs. */
struct RRD_GRAPH_CFG rrd_graph_cfg[] = {
	{ RRD_GRAPH_CFG_TITLE,		"Some title"			},
	{ RRD_GRAPH_CFG_FONT,		"/somepath/somefont.ttf"	},
	/* etc etc */
};

/* Load one data source, then define a cdef that multiplies it by 8. */
rrd_db = rrd_open("/somepath/somefile.rrd");
rrd_def = rrd_def_load(rrd_db, "ifInOctets", RRD_CF_AVERAGE);
rrd_cdef = rrd_cdef_create("%s,8,*", rrd_def);

/* Describe the graph: size, config, flags, and the elements to draw. */
rrd_graph = rrd_graph_new(640, 480, rrd_graph_cfg);
rrd_graph_config_flags(rrd_graph, RRD_CONFIG_FLAGS_LAZY | RRD_CONFIG_FLAGS_RIGID);
rrd_graph_element_add(rrd_graph, RRD_GRAPH_ELEMENT_LINE1, rrd_def, "#777777", "Legend");
rrd_graph_element_add(rrd_graph, RRD_GRAPH_ELEMENT_AREA, rrd_cdef, "#00aabb", "Blah");

/* Render the last hour. */
start = rrd_time_expression("-1h");
end = rrd_time_expression("now");
rrd_graph_render_file(rrd_graph, "/somepath/somefile.png", RRD_GRAPH_TYPE_PNG, start, end);

/* Render the last 24 hours from the same graph definition. */
start = rrd_time_expression("-24h");
rrd_graph_render_file(rrd_graph, "/somepath/somefile2.png", RRD_GRAPH_TYPE_PNG, start, end);

Graphing the same thing but across different time ranges seems like a 
pretty common operation to me (more so than reuse of pretty much anything 
else), so more than likely you'd want to optimize for that case. I would 
think that you'd probably want the "dynamic calculation" elements like 
cdefs and vdefs to stay symbolic representations of what to do with the 
real data from defs all the way until you do the render, so you only need 
to do the calculations on the specific datapoints you're graphing and not 
the entire DS.
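
A minimal sketch of that "stay symbolic until render" idea, assuming a cdef 
is just an RPN string with "x" standing in for the def's value: nothing is 
computed until a window of samples is actually rendered, and then only for 
that window. The function names and the "x" placeholder are inventions for 
this example, not rrdtool's real RPN implementation, and assert() stands in 
for proper error handling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define RPN_STACK_MAX 16

/*
 * Evaluate an RPN expression such as "x,8,*" for a single sample x.
 * Supports numeric literals, the placeholder "x", and + - * /.
 */
double rpn_eval(const char *expr, double x)
{
	double stack[RPN_STACK_MAX];
	size_t depth = 0;
	char buf[128];
	char *tok;

	assert(expr != NULL && strlen(expr) < sizeof buf);
	strcpy(buf, expr);	/* strtok modifies its input, so work on a copy */

	for (tok = strtok(buf, ","); tok != NULL; tok = strtok(NULL, ",")) {
		if (strcmp(tok, "x") == 0) {
			assert(depth < RPN_STACK_MAX);
			stack[depth++] = x;
		} else if (strlen(tok) == 1 && strchr("+-*/", tok[0]) != NULL) {
			double a, b;

			assert(depth >= 2);
			b = stack[--depth];
			a = stack[--depth];
			switch (tok[0]) {
			case '+': stack[depth++] = a + b; break;
			case '-': stack[depth++] = a - b; break;
			case '*': stack[depth++] = a * b; break;
			default:  stack[depth++] = a / b; break;
			}
		} else {
			assert(depth < RPN_STACK_MAX);
			stack[depth++] = strtod(tok, NULL);
		}
	}
	assert(depth == 1);
	return stack[0];
}

/*
 * Apply the expression only to samples[first..last] (inclusive), i.e. the
 * window being rendered. out must have room for last - first + 1 values.
 * Returns the number of values written.
 */
size_t cdef_render_window(const char *expr, const double *samples,
    size_t first, size_t last, double *out)
{
	size_t i, n = 0;

	assert(samples != NULL && out != NULL && first <= last);
	for (i = first; i <= last; i++)
		out[n++] = rpn_eval(expr, samples[i]);
	return n;
}
```

Re-parsing the string for every sample is wasteful; a real implementation 
would presumably compile the expression once and reuse it per render, but 
the shape of the API would be the same.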

> > Unfortunately I'm involved in about a billion projects right now
> > [...]
> 
> there you go .. and so it ends ...  most of the time

Well, by which I mean I don't have the free time to completely rewrite this 
myself, but I can certainly do my part to help. :)

-- 
Richard A Steenbergen <ras at e-gerbil.net>       http://www.e-gerbil.net/ras
GPG Key ID: 0xF8B12CBC (7535 7F59 8204 ED1F CC1C 53AF 4C41 5ECA F8B1 2CBC)
