[rrd-users] large dataset considerations

Thu Oct 24 13:11:06 CEST 2013

S Ahmed wrote:
> Is this tool used in any large-scale deployments?

> What is considered a large database size?

> Scenario: Say you want to store time series information in a SaaS application.
> I'm guessing there is some sort of threshold where it makes sense to partition your data e.g. by
> a group of customers in order to scale out the usage

I'm not clear on what the question is, and I suspect you haven't really thought about how RRD stores data.

Typically, you'd create an RRD file for a distinct set of related data that can be updated together - but if you have data that is not related or varies in number of items, then you'd put that data in a number of separate RRD files.

For example, take system monitoring.
Within a system, data such as CPU load, number of processes, RAM in use/free and so on are a distinct set - so you might put those in one RRD file. For data such as network I/O, you can put sent and received data into one RRD file, but you'd create a separate RRD file for each interface since the number of interfaces is variable - ditto things like disk space & I/O where the number of disks varies so you'd create one RRD file per filesystem, and one RRD file per physical disk.
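To make that layout concrete, here is a rough Python sketch that maps one system's inventory to a list of RRD files. The directory structure, the file names and the function names are my own invention for illustration; RRD itself doesn't care how you name or arrange the files.

```python
from pathlib import Path


def _safe(mount_point):
    """Turn a mount point into a file-name fragment ("/" becomes "root")."""
    return mount_point.strip("/").replace("/", "_") or "root"


def rrd_files_for_system(base_dir, hostname, interfaces, filesystems, disks):
    """Return the RRD file paths for one monitored system.

    Layout and naming are hypothetical: one file for the fixed set of
    system-wide values, then one file per variable item.
    """
    host_dir = Path(base_dir) / hostname
    # CPU load, process count and RAM are updated together: one file.
    paths = [host_dir / "system.rrd"]
    # Sent and received share a file, but each interface gets its own.
    paths += [host_dir / f"net-{name}.rrd" for name in interfaces]
    # One file per filesystem (space) and one per physical disk (I/O).
    paths += [host_dir / f"fs-{_safe(name)}.rrd" for name in filesystems]
    paths += [host_dir / f"disk-{name}.rrd" for name in disks]
    return paths
```

Adding an interface or a disk then just means one more file, with no change to the files that already exist.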

Extending that, if you were monitoring multiple systems, the number of systems is typically variable - so you'd create a set of RRD files for each system. If you want combined stats, then one option is to do as I've just done for mail queue information: collect from each server separately (in my case using the cache daemon to put all the files in one place), and have a separate program that periodically queries the set of files, combines the data, and updates a separate RRD file.
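The combining step might look something like the sketch below. It is not my actual program, just the general idea: it assumes the per-server values have already been read out of the individual files (for example with rrdtool fetch) into plain Python lists, and it produces the "timestamp:value" strings that rrdtool update accepts. The function names and the data shape are assumptions.

```python
def combine_samples(per_server):
    """Sum values per timestamp across servers.

    per_server maps a server name to a list of (timestamp, value) pairs.
    A value of None means that server had no data for that timestamp.
    """
    totals = {}
    for samples in per_server.values():
        for timestamp, value in samples:
            if value is None:
                continue  # unknown on this server, skip rather than count as 0
            totals[timestamp] = totals.get(timestamp, 0.0) + value
    return sorted(totals.items())


def to_update_args(combined):
    """Format combined samples as 'timestamp:value' strings."""
    return [f"{timestamp}:{value}" for timestamp, value in combined]
```

Note the choice made here: a timestamp where only some servers reported still gets a (partial) total, and a timestamp where none reported is left out entirely. Whether that is what you want depends on what the combined figure is used for.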

Now when it comes to scalability, RRD doesn't really impose any great limits itself. There is no central daemon managing things (unless you choose to use the cache daemon but that's not really the same thing) - just different programs that independently update RRD files, and read RRD files to generate information for users.
So it comes down to: do you have the disk space to store the data you want to store, the CPU capacity to run the collection and output programs, and the disk I/O to handle it? Unless you have a very large dataset that doesn't break down into logical chunks, you have the option of spreading the storage and/or processing across multiple systems.
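For the disk space part, a back-of-envelope estimate is easy because an RRD file is a fixed size from the moment it is created. The sketch below assumes each stored value takes 8 bytes (a double) and ignores the file header and per-archive bookkeeping, so treat the result as a lower bound rather than an exact figure. The function names and the example numbers are made up.

```python
BYTES_PER_VALUE = 8  # assumption: each stored value is an 8-byte double


def estimate_rrd_bytes(num_data_sources, rra_rows):
    """Approximate data size of one RRD file, header overhead ignored.

    rra_rows lists the number of rows in each archive (RRA); every row
    holds one value per data source.
    """
    return BYTES_PER_VALUE * num_data_sources * sum(rra_rows)


def estimate_total_bytes(num_files, num_data_sources, rra_rows):
    """Approximate data size for many files that share the same layout."""
    return num_files * estimate_rrd_bytes(num_data_sources, rra_rows)
```

For instance, estimate_total_bytes(1000, 2, [288, 2016]) covers a thousand two-value files, each with a 288-row and a 2016-row archive. Since the files never grow, the number you get at creation time is the number you keep.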

If that doesn't answer the question, then perhaps you could be a bit more specific about what the question is.