[rrd-users] rrd and collectl

Wed Jul 23 15:20:27 CEST 2008

Thanks for the pointer.  This is an interesting discussion, but at least 
as I'm reading it, it talks to the rrd db update problem and I'm talking 
about the source of the data.  My main exposure to rrd is via ganglia 
and some experiments I did with loading collectl data a while back, and it 
seems to me that the main role of rrd (or maybe it's just ganglia) is to 
get a sense of the overall health of the system as opposed to trying to 
collect enough data to meaningfully track down a complex system 
problem.  The reason I say this is in response to your statement that 
ganglia collects something like 30 variables.  Maybe it's just the way 
we're counting, but when I look at cpu data I see 7 different timers and 
if you have a dual-socket quad core machine that's 56 variables right 
there, or are you just counting them as 7?  In any event, collectl's 
role in life is to collect as much as possible while staying within 
<0.1% cpu load.  As a result, collectl knows about a whole lot of things 
many other tools don't, in particular Infiniband, Quadrics and Lustre.  
But it also includes other less common measurements as well such as 
interrupts by cpu, tcp counters and even reports nfs data more 
rationally than nfsstat.  I've never actually counted the number of 
variables but they're in the hundreds.

The other thing is monitoring frequency - I don't know what a typical 
data collection interval for ganglia/rrd is but when run as a daemon, 
collectl's default monitoring interval is 10 seconds, though in some 
cases I've run it at 1 second with no noticeable system load.  It also 
collects process and slab statistics every 60 seconds to keep within its 
performance envelope.

As I said in my original email, I would suspect trying to send this much 
data to rrd from hundreds or thousands of nodes would overwhelm it even 
with the accelerator discussed in that thread you pointed me to.

But that brings me back to the overall problem statement.  Given that 
ganglia is looking at an overall cluster picture and not lower level 
details, it probably shouldn't care about all the low-level data 
collectl can collect, BUT there is data collectl knows about that other 
tools don't and so coming up with a mechanism to pick and choose the 
data you want is in my opinion the only way to go.  I know of at least 
one ganglia user who sources Infiniband and Lustre data from collectl 
and so that's why I thought I'd bring up a collectl to rrd interface on 
this list.  If you think this discussion is more appropriate on a 
ganglia list just point me to it and we can move it there.
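
To make the pick-and-choose idea a bit more concrete, here is a rough 
Python sketch of what a filter between collectl and rrd might do: keep 
only the counters you care about and average them down to a coarser 
interval before passing them upstream.  The metric names, the sample 
layout and the downsample() helper are all made up for illustration; 
none of this is collectl's or rrd's actual interface.

```python
from statistics import mean

# Hypothetical metric names, chosen only for illustration.
WANTED = {"cpu.user", "ib.rx_kb"}


def downsample(samples, wanted, factor):
    """Average each run of `factor` consecutive samples, keeping only `wanted` keys.

    `samples` is a list of dicts that each carry a "ts" timestamp plus one
    numeric value per metric. A trailing partial window is dropped.
    """
    out = []
    for i in range(0, len(samples) - factor + 1, factor):
        window = samples[i:i + factor]
        row = {"ts": window[-1]["ts"]}  # stamp the row with the window's last sample
        for key in wanted:
            row[key] = mean(s[key] for s in window)
        out.append(row)
    return out


# Six 10-second samples collapse into a single 60-second row.
local = [
    {"ts": t * 10, "cpu.user": t + 1, "ib.rx_kb": 100, "slab.total": 5}
    for t in range(6)
]
upstream = downsample(local, WANTED, factor=6)
```

The full-resolution samples stay on the node; only the reduced rows 
would go to whatever feeds rrd.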

Just one last thing I'd like to point out, and this is from a system 
diagnostic perspective.  It also can get somewhat contentious, but I'd 
claim a centralized monitor is not the way to do system problem 
diagnosis for a couple of reasons, the first being the volume of data 
involved - to say a tad more, collectl generates as much as 2-10MB/day 
of compressed data.  If you uncompress it you're in the 20-50MB range.  
That's a ton of data for each system to be sending upstream but I would 
also claim it's vital for any diagnostic work.
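
As a back-of-the-envelope check on those volumes, the small Python 
sketch below scales the per-node figures above to a whole cluster.  
The function name and the 1000-node cluster size are just assumptions 
for the example; the per-node ranges are the ones I quoted.

```python
def daily_volume_mb(nodes, per_node_mb):
    """Return the (low, high) total MB/day for `nodes` systems.

    `per_node_mb` is a (low, high) range of MB/day for one node.
    """
    low, high = per_node_mb
    return nodes * low, nodes * high


# Per-node ranges from the text above; 1000 nodes is an assumed cluster size.
compressed = daily_volume_mb(1000, (2, 10))     # compressed data
uncompressed = daily_volume_mb(1000, (20, 50))  # after uncompressing
```

For 1000 nodes that works out to roughly 2-10GB/day compressed, or 
20-50GB/day uncompressed, arriving at one central collector.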

The other point is when a system is in distress, one often sees 
networking problems as well.  If you're dependent on a sick network to 
get time-critical data to a central manager for logging, you're going to 
lose that very data you so desperately need.  I admit keeping it 
locally can be problematic if you're trying to track down a problem 
involving a lot of systems, but that's where the 2-tier approach can be 
so powerful.  You can still use the rrd data to help point you in the 
right direction and then use the node-specific data to actually dive 
deeper into what was happening at the time in question.

enough rambling...
-mark

Bernard Li wrote:
> Hi Mark!
>
> On Tue, Jul 22, 2008 at 5:40 AM, Mark Seger <Mark.Seger at hp.com> wrote:
>
>   
>> I had posted a note to this list some time ago about collectl
>> http://collectl.sourceforge.net/ and its use as a source of data for rrd
>> and while I have had a few notes of mild interest I thought I'd try
>> again since collectl is becoming better known and is even now part of
>> fedora.  As with all monitoring situations, everyone has different needs
>> and I've always tried to address as many as possible with collectl.
>>
>> For example, rrd recognizes the importance of finer granularity for more
>> recent data but I doubt it could handle what collectl produces -
>> hundreds of samples every 10 seconds or even more frequently, down to
>> fractions of seconds if you prefer.  Maybe for a couple of nodes, but
>> hundreds or thousands?  But collectl has a number of mechanisms to deal
>> with a lot of different situations and perhaps the answer to this
>> situation is to have collectl save all its data locally and only pass a
>> subset (perhaps at a different frequency) up to rrd.  Then someone could
>> use rrd to monitor a cluster and if a problem arises dive deeper into
>> collectl's local data?  Just a thought.  Then again someone who knows
>> rrd better might have a better solution.
>>
>> In any event, collectl has the ability to pass results over a socket or
>> even write a current snapshot to a small file that another tool can pick
>> up at its convenience.  In fact, collectl can even write its output in
>> rrd format if that's what someone is looking for, but as I said I fear
>> there may just be too much data.
>>
>> So if anyone does have any interest, check out what collectl can collect
>> and if there is any interest in using it to feed rrd, let's talk...
>>     
>
> I think this ongoing thread in rrd-developers might interest you:
>
> http://www.mail-archive.com/rrd-developers@lists.oetiker.ch/msg02284.html
>
> Cheers,
>
> Bernard
>