[rrd-developers] Improving RRD tool scalability
sasha at avalon-net.co.il
Mon Mar 3 20:13:23 MET 2003
Recently we have encountered some "interesting" problems while using an
RRDtool derivative for large-scale data collection.
Our RRDtool is based on rrdtool 1.0.28 with several modifications: bug
fixes, export to a database, percentile, STDDEV, and moving-average
functions, the ability to evaluate RPN without producing graphs, and
millisecond resolution. The rrd_update function, however, is essentially
the same.
22000 interfaces. Each interface has 10 datasources (in/out
octets, packets, errors, discards, plus availability and queue length).
Each interface is stored in a separate RRD file. The RRD files have
custom resolutions: most have a 180 s step, and a third of them have a
30 s step. Each file takes about 4 MB of disk space, ~100 GB total.
The system runs on a Sun V880 with four UltraSPARC III CPUs, 8 GB RAM,
and a 6-disk IBM storage array.
Data collection is done with our own frontend, a major rework of
Cricket with lots of cool stuff. The data collection can be done either
with several processes (~20) or with a smaller number of processes (3-5)
using SNMP slaves. The usual turnaround time is 120-300 seconds for all
the interfaces.
You can get our version of rrdtool at http://percival.sourceforge.net
At about 17K interfaces we found that collection had slowed to a crawl:
the turnaround time for all interfaces became much more than the required
300 s. Further investigation revealed that we were spending most of the
time waiting for disk, even after the usual Solaris tuning (we verified
the DNLC cache, inode cache, etc.). All to no avail.
After a source review we found the following problems:
- For every update we have to open and then close the file.
- We have to read the metadata (static header plus dynamic part) on
every update.
- rrdtool uses buffered stdio functions, although there is absolutely no
benefit from the buffering since the I/O is random. Also, Solaris does
not support more than 255 open files via stdio functions.
- The number of seeks and writes per update can be drastically reduced.
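The per-update open/close cost can be avoided by keeping descriptors open
for the life of the collector. A minimal sketch of such a descriptor
cache (the names and the fixed-size hash table are ours for illustration,
not from the actual RRDtool source):

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define CACHE_SLOTS 4096

struct fd_slot {
    char path[256];
    int  fd;
    int  used;
};

static struct fd_slot cache[CACHE_SLOTS];

static unsigned hash_path(const char *p)
{
    unsigned h = 5381;
    while (*p)
        h = h * 33u + (unsigned char)*p++;
    return h % CACHE_SLOTS;
}

/* Return an open descriptor for path, opening the file only on first
 * use; later updates of the same RRD skip open()/close() entirely. */
int fd_cache_get(const char *path)
{
    struct fd_slot *s = &cache[hash_path(path)];

    if (s->used && strcmp(s->path, path) == 0)
        return s->fd;                 /* cache hit: no syscalls at all */
    if (s->used)
        close(s->fd);                 /* evict a colliding entry */
    s->fd = open(path, O_RDWR);
    if (s->fd < 0) {
        s->used = 0;
        return -1;
    }
    strncpy(s->path, path, sizeof s->path - 1);
    s->path[sizeof s->path - 1] = '\0';
    s->used = 1;
    return s->fd;
}
```

Note that the descriptor limit must be raised (ulimit -n / rlim_fd_max on
Solaris) for this to work with tens of thousands of files, which is also
why unbuffered descriptors are needed instead of Solaris stdio.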
We tried and tested several approaches on both Solaris and Linux. In
every case the file is opened only once and closed upon collector exit.
- Improved read/write: metadata are read once upon file open, and
pwrite() is used to write data back to the file. We verified that
pwrite() is faster than an lseek()/write() pair.
- Improved read/write with the metadata mmap'ed. We managed to get this
working on Linux only. Performance-wise this solution is about 20-30%
faster than the pure pwrite() variant.
- Fully mmap'ed file. This proved to be the worst possible idea. Again,
this was tried on Linux only. A possible reason is that msync() syncs a
full page, which is 4 KB, while pwrite() can write a single 512-byte
sector. This was confirmed in testing.
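The single-syscall write can be sketched as follows (our own naming; the
offsets are illustrative, not the actual RRD file layout):

```c
#include <sys/types.h>
#include <unistd.h>

/* Write one 8-byte data point at its slot in the archive region.
 * pwrite() combines the seek and the write into one system call and
 * leaves the shared file offset untouched, unlike lseek()+write(). */
ssize_t write_slot(int fd, off_t data_start, long slot, double value)
{
    off_t off = data_start + slot * (off_t)sizeof value;
    return pwrite(fd, &value, sizeof value, off);
}
```

Besides halving the syscall count, pwrite() never disturbs the file
offset, so several threads can safely update different slots of the same
descriptor.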
In the end we upgraded RRDtool, raised the number of available file
descriptors, and the problem magically went away. Our estimate is that
we can handle about 30-40K interfaces on the same hardware.
The bottom line is that RRDtool produces a lot of random I/O, and the
collection time is bounded by the average disk seek time multiplied by
the number of interfaces. Our modification reduced the number of seeks
several times over, but it did not remove that bound. In my opinion,
further advances in speed will require modification of the RRD file
format itself.
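A back-of-envelope calculation shows why collection is seek-bound. The
seeks-per-update and average-seek figures below are our assumptions for
illustration, not measurements from this setup:

```c
/* Lower bound on one collection pass if every seek is served by a
 * single spindle; seeks_per_update and avg_seek_ms are assumed values. */
double seek_bound_seconds(long interfaces, int seeks_per_update,
                          double avg_seek_ms)
{
    return interfaces * seeks_per_update * avg_seek_ms / 1000.0;
}
/* seek_bound_seconds(22000, 3, 8.0) gives 528 s, well over the 300 s
 * budget for one disk; spreading the I/O over 6 spindles brings the
 * bound under 90 s, which is why reducing seeks per update matters. */
```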
Also, I am very surprised that SNMP collection/CPU usage did not become
a bottleneck before the disk did; according to the RTG article it was
supposed to be a major one. On the other hand, our collector has about
the same performance as RTG even though it is written in Perl.
P.S. Note that our archive size is about 10-20 times bigger than the
MRTG default because we store more data at higher precision.
Avalon Net Ltd, CTO