[rrd-users] Fine grained performance data

Mark Seger Mark.Seger at hp.com
Thu Jul 19 22:58:08 CEST 2007


I'm pretty new to rrdtool but have been performing some experiments that 
have produced pretty exciting results, at least for me.  Thanks to William 
Owen for talking me into using RRDs instead of RRDp.

Some time back I wrote a very flexible data collection/reporting tool 
which can log and/or display hundreds of performance counters with <0.1% 
cpu overhead.  I recently released it as Open Source at 
http://sourceforge.net/projects/collectl.  It can even generate output 
in a form easily plotted with gnuplot!  However, my main reason for 
mentioning it here is that I think there could be some real benefit in 
building an rrdtool interface into it.  I have been doing some 
experiments to see whether that makes sense, focusing on rrdtool's 
performance overhead, and I'd also like to get a sense of whether the 
rrdtool community would find such an effort useful.

I know a lot of people tend to collect data at 5 minute intervals, which 
is in fact the rrdtool default, but my tool's default is one sample every 
10 seconds, since I believe you need that level of granularity to see 
what is really going on; when run interactively it takes a sample once a 
second.  It will also run at sub-second intervals, but I don't think 
rrdtool supports steps that small, does it?  Anyhow, I was very pleased 
to see that rrdtool could load this amount of data in under a second, 
which means it adds virtually no extra overhead, and what you end up with 
is a wealth of system data plus all of rrdtool's tools at your disposal 
for plotting it.  Here's what collectl reported while I loaded the 
database with rrdtool at 15:37:02:

#         <--------CPU--------><-----------Disks-----------><-----------Network---------->
#Time     cpu sys inter  ctxsw KBRead  Reads  KBWrit Writes netKBi pkt-in  netKBo pkt-out
15:37:00    0   0  1116    315      0      0       0      0      1     13       0       1
15:37:01    0   0  1260    379      0      0       0      0     18    140       0       1
15:37:02   13   6  1233    509      0      0       0      0      4     55       9      76
15:37:03    0   0  1145    345      0      0     192      4      1     22       0       1
15:37:04    0   0  1111    300      0      0       0      0      1      9       0       1

As you can see, there is a burst of cpu and then nothing, which you'd 
clearly never see with a 5 minute sample or even a 5 second one; a 5 
second sample would probably show only 2-3% load since it's an average!  
You can also see a jump in interrupts, context switches, and disk 
traffic.  I assume the network load is just background noise.
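The averaging effect is easy to check with the cpu column from the table 
above: one 13% second surrounded by idle seconds averages to 2.6% over 5 
seconds and to almost nothing over 5 minutes.  A minimal Python sketch; 
the helper name interval_average is made up for illustration, and it 
assumes (hypothetically) that the rest of the 5 minute interval was idle:

```python
def interval_average(samples):
    """Mean of the per-second readings that fall inside one coarser interval,
    i.e. what a single sample covering the whole interval would report."""
    return sum(samples) / len(samples)

# cpu column from the 1-second table above (15:37:00 .. 15:37:04)
burst = [0, 0, 13, 0, 0]
five_sec = interval_average(burst)
# hypothetical assumption: the other 295 seconds of a 5 minute interval were idle
five_min = interval_average(burst + [0] * 295)

print(f"1 s peak: {max(burst)}%  5 s avg: {five_sec:.1f}%  5 min avg: {five_min:.2f}%")
```

The 13% spike is only visible at 1-second resolution; at 5 seconds it 
shrinks to 2.6%, and at 5 minutes it is well under a tenth of a percent.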

If you'd like to see some of the other formats collectl can display, 
have a look at http://collectl.sourceforge.net or just install the rpm 
and start playing with it.  I'm always looking for feedback on how to 
make it better.

As I said, I'm pretty new to rrdtool and would think those with more 
knowledge could build some pretty impressive plots.  Any opinions?  
Perhaps others are already doing a flavor of this, but I suspect not 
with the same breadth of data collectl can supply, which includes cpu, 
disk, memory, lustre, network, socket, tcp, inode, nfs and even high 
speed interconnects like infiniband.  And that's just the summary data!  
It can also report on individual cpus, disks and networks, as well as 
slabs and processes.  There may even be one or two things I forgot.  8-)

In any event, if I were to move forward I'd at least like to make sure 
I'm doing it in the most efficient manner, and that's why I'm asking on 
this list.
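One way to keep the rrdtool side cheap might be to create each RRD once, 
with the collection interval as its step, and then hand rrdtool many 
samples per update call rather than one.  The sketch below only builds 
the argument lists for the `rrdtool create` and `rrdtool update` command 
lines.  The helper names (create_args, row_to_sample, update_args), the 
GAUGE data sources, the heartbeat of twice the step, the single 
unconsolidated AVERAGE RRA, the supplied date and the UTC timestamps are 
all illustrative assumptions, not collectl's actual interface.

```python
from datetime import datetime, timezone

def create_args(path, step, retention_secs, ds_names, start):
    """Hypothetical helper: argv for `rrdtool create` with one data source
    per counter and one AVERAGE RRA that keeps every raw sample."""
    heartbeat = 2 * step             # assumption: tolerate one missed sample
    rows = retention_secs // step    # one row per primary data point
    args = ["rrdtool", "create", path, "--start", str(start), "--step", str(step)]
    for name in ds_names:
        # DS:name:GAUGE:heartbeat:min:max. GAUGE stores values as given,
        # assuming the collector already reports per-interval rates.
        args.append(f"DS:{name}:GAUGE:{heartbeat}:0:U")
    # RRA:AVERAGE:xff:steps:rows with steps=1, i.e. no consolidation
    args.append(f"RRA:AVERAGE:0.5:1:{rows}")
    return args

def row_to_sample(line, day):
    """Hypothetical helper: turn a row like '15:37:02   13   6 ...' into
    (epoch_seconds, [values]).  Rows carry only a time of day, so the
    caller supplies the date; UTC is assumed for simplicity."""
    fields = line.split()
    when = datetime.strptime(f"{day} {fields[0]}", "%Y-%m-%d %H:%M:%S")
    epoch = int(when.replace(tzinfo=timezone.utc).timestamp())
    return epoch, [int(v) for v in fields[1:]]

def update_args(path, samples):
    """Hypothetical helper: argv for ONE `rrdtool update` call carrying
    many timestamp:value[:value...] arguments, instead of one call per sample."""
    args = ["rrdtool", "update", path]
    for epoch, values in samples:
        args.append(":".join([str(epoch)] + [str(v) for v in values]))
    return args

if __name__ == "__main__":
    day = "2007-07-19"  # hypothetical: the rows themselves carry no date
    samples = [row_to_sample(r, day) for r in (
        "15:37:02   13   6  1233    509",
        "15:37:03    0   0  1145    345",
    )]
    first = samples[0][0]
    # step of 1 second to match these rows; collectl's default would be 10
    print(create_args("cpu.rrd", 1, 3600, ["cpu", "sys", "inter", "ctxsw"], first - 1))
    print(update_args("cpu.rrd", samples))
```

On a machine with rrdtool installed, each list could be passed to 
subprocess.run(args, check=True).  With the Perl RRDs module, the same 
strings without the leading "rrdtool create"/"rrdtool update" should map 
onto RRDs::create and RRDs::update.  rrdtool rejects updates that are not 
later than the file's last update, which is why --start is set just 
before the first sample.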

-mark
