[rrd-users] Fine grained performance data
Mark Seger
Mark.Seger at hp.com
Thu Jul 19 22:58:08 CEST 2007
I'm pretty new to rrdtool but have been performing some experiments that
have pretty exciting results, at least for me. Thanks to William Owen
for talking me into using RRDs instead of RRDp.
Some time back I wrote a very flexible data collection/reporting tool
which can log and/or display hundreds of performance counters with <0.1%
cpu overhead. I recently released it as Open Source at
http://sourceforge.net/projects/collectl. It can even generate output
in a form easily plotted with gnuplot! However, my main reason for
mentioning it here is that I think there could be some real benefit in
building an rrdtool interface into it. I have been doing some
experiments to see if it makes sense, focusing on rrdtool performance
overhead, and I'd also like to get some sense of whether the rrdtool
community might find such an effort useful.
I know a lot of people tend to collect data at 5 minute intervals, which
is in fact the rrdtool default, but the default I use for my tool is one
sample every 10 seconds, since I believe you need that level of
granularity to see what is really going on. When run interactively, my
tool takes a sample once per second. It will also run at sub-second levels
of granularity but I don't think rrdtool supports those kinds of times,
does it? Anyhow, I was very pleased to see that rrdtool could load this
amount of data in under a second, which means it adds virtually no extra
overhead, and what you end up with is a wealth of system data and all the
rrdtool tools at your disposal for plotting it. Here's what collectl reports
when I loaded the database with rrdtool at 15:37:02:
#         <--------CPU--------><-----------Disks-----------><-----------Network---------->
#Time     cpu sys inter  ctxsw KBRead  Reads  KBWrit Writes netKBi pkt-in  netKBo pkt-out
15:37:00    0   0  1116    315      0      0       0      0      1     13       0       1
15:37:01    0   0  1260    379      0      0       0      0     18    140       0       1
15:37:02   13   6  1233    509      0      0       0      0      4     55       9      76
15:37:03    0   0  1145    345      0      0     192      4      1     22       0       1
15:37:04    0   0  1111    300      0      0       0      0      1      9       0       1
As you can see there is a burst of cpu and then nothing, clearly something
you'd never see with a 5 minute sample or even a 5 second one, though a 5
second one would probably show 2-3% load since it's an average! You can
also see a jump in interrupts, context switches, and disk traffic. I
assume the network load is just background noise.
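
To make the averaging effect concrete, here is a small Python sketch. It
uses only the per-second cpu column from the output above; the function
name and the assumption that the system is otherwise idle for the rest of
a longer interval are illustrative, not something collectl or rrdtool does.

```python
# Per-second cpu% readings copied from the collectl output above
# (15:37:00 through 15:37:04).
cpu_per_second = [0, 0, 13, 0, 0]


def average_over(samples, interval_seconds):
    """Average load a coarser collector would report for one interval.

    Assumption: every second of the interval that is not listed in
    `samples` was idle (0% cpu).
    """
    return sum(samples) / interval_seconds


print(f"1-second peak:    {max(cpu_per_second)}%")                        # 13%
print(f"5-second average: {average_over(cpu_per_second, 5):.1f}%")        # 2.6%
print(f"5-minute average: {average_over(cpu_per_second, 300):.2f}%")      # 0.04%
```

Under that idle assumption the 13% burst shrinks to 2.6% in a 5 second
average and to about 0.04% in a 5 minute one.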
If you'd like to see some of the other types of formats collectl can
display have a look at http://collectl.sourceforge.net or just install
the rpm and start playing with it. I'm always looking for feedback on
how to make it better.
As I said, I'm pretty new to rrdtool and would think those with more
knowledge could build some pretty impressive plots. Any opinions?
Perhaps others are already doing a flavor of this, but I'd suspect
without the same breadth of data collectl can supply, which includes
cpu, disk, memory, Lustre, network, socket, TCP, inode, NFS and even
high speed interconnects like InfiniBand. And that's just the summary
data! It can also report on individual cpus, disks, networks as well as
slabs and processes. There may even be one or two things I forgot. 8-)
In any event, if I were to move forward I'd at least like to make sure
I'm doing it in the most efficient manner, and that's why I'm asking
on this list.
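
For readers who want something concrete to comment on, here is a rough
Python sketch of the kind of rrdtool commands a 10 second collection
interval implies. It only builds the command lines and does not run them.
The file name, data source names, timestamps, heartbeat and retention are
all hypothetical choices for illustration, not what was actually used in
the experiments described above.

```python
STEP = 10  # seconds between samples, matching collectl's default interval
COUNTERS = ["cpu", "sys", "inter", "ctxsw"]  # hypothetical data source names


def create_command(path, start):
    """Build an `rrdtool create` command with one GAUGE per counter."""
    cmd = ["rrdtool", "create", path, "--step", str(STEP), "--start", str(start)]
    # Heartbeat of 2*STEP (an assumption): tolerate a sample arriving late.
    cmd += [f"DS:{name}:GAUGE:{2 * STEP}:0:U" for name in COUNTERS]
    # Keep one day of unconsolidated samples (an assumption): 86400/10 rows.
    cmd.append(f"RRA:AVERAGE:0.5:1:{86400 // STEP}")
    return cmd


def update_command(path, timestamp, values):
    """Build an `rrdtool update` command for a single sample."""
    sample = f"{timestamp}:" + ":".join(str(v) for v in values)
    return ["rrdtool", "update", path, sample]


# Arbitrary example epoch timestamps; the values are the first four
# columns of the 15:37:02 row in the collectl output.
print(" ".join(create_command("collectl.rrd", 1184859410)))
print(" ".join(update_command("collectl.rrd", 1184859420, [13, 6, 1233, 509])))
```

Passing many samples to a single update call rather than one call per
sample is one obvious thing to try when loading in bulk, though whether
it matters at this data volume is an open question.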
-mark