[rrd-users] rrd and collectl
Mark.Seger at hp.com
Wed Jul 23 15:20:27 CEST 2008
Thanks for the pointer. This is an interesting discussion, but at least
as I'm reading it, it talks to the rrd db update problem and I'm talking
about the source of the data. My main exposure to rrd is via ganglia
and some experiments I did with loading collectl data awhile back and it
seems to me that the main role of rrd (or maybe it's just ganglia) is to
get a sense of the overall health of the system as opposed to trying to
collect enough data to meaningfully track down a complex system
problem. The reason I say this is in response to your statement that
ganglia collects something like 30 variables. Maybe it's just the way
we're counting, but when I look at cpu data I see 7 different timers and
if you have a dual-socket quad-core machine that's 56 variables right
there, or are you just counting them as 7? In any event, collectl's
role in life is to collect as much as possible while staying within
<0.1% cpu load. As a result, collectl knows about a whole lot of things
many other tools don't, in particular Infiniband, Quadrics and Lustre.
But it also includes other less common measurements, such as
interrupts by cpu, tcp counters and even reports nfs data more
rationally than nfsstat. I've never actually counted the number of
variables but they're in the hundreds.
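That per-core counting is easy to sanity-check with a little arithmetic. A minimal sketch, assuming the 7 timers are the classic /proc/stat fields (the names here are just illustrative):

```python
# Back-of-the-envelope counting of per-CPU counters.
# The 7 timers are assumed to be the classic /proc/stat fields;
# the names are illustrative, not taken from collectl itself.
TIMERS = ["user", "nice", "system", "idle", "iowait", "irq", "softirq"]

def cpu_variables(sockets: int, cores_per_socket: int) -> int:
    """Total per-core CPU counters on one machine."""
    return sockets * cores_per_socket * len(TIMERS)

print(cpu_variables(2, 4))  # dual-socket quad-core -> 56 counters
```

The point being that "one metric" like cpu time fans out quickly once you count per-core.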
The other thing is monitoring frequency - I don't know what a typical
data collection interval for ganglia/rrd is but when run as a daemon,
collectl's default monitoring interval is 10 seconds, though in some
cases I've run it at 1 second with no noticeable system load. It also
collects process and slab statistics every 60 seconds to keep within its
cpu budget.
As I said in my original email, I would suspect trying to send this much
data to rrd from hundreds or thousands of nodes would overwhelm it even
with the accelerator discussed in that thread you pointed me to.
But that brings me back to the overall problem statement. Given that
ganglia is looking at an overall cluster picture and not lower level
details, it probably shouldn't care about all the low-level data
collectl can collect, BUT there is data collectl knows about that other
tools don't and so coming up with a mechanism to pick and choose the
data you want is in my opinion the only way to go. I know of at least
one ganglia user who sources Infiniband and Lustre data from collectl
and so that's why I thought I'd bring up a collectl to rrd interface on
this list. If you think this discussion is more appropriate on a
ganglia list just point me to it and we can move it there.
Just one last thing I'd like to point out, and this is from a system
diagnostic perspective. It can also get somewhat contentious, but I'd
claim a centralized monitor is not the way to do system problem
diagnosis for a couple of reasons, the first being the volume of data
involved - to put a number on it, collectl generates as much as 2-10MB/day
of compressed data. If you uncompress it you're in the 20-50MB range.
That's a ton of data for each system to be sending upstream but I would
also claim it's vital for any diagnostic work.
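To see why that volume is a problem centrally, multiply it out across a cluster. A quick sketch using the per-node numbers above (the 1000-node figure is just an example):

```python
# Rough upstream-volume estimate for centralizing per-node data.
# Per-node rates (2-10 MB/day compressed) come from the text;
# the cluster size is a hypothetical example.
def daily_volume_gb(nodes: int, mb_per_node: float) -> float:
    """Aggregate daily volume in GB for a cluster."""
    return nodes * mb_per_node / 1024.0

# A 1000-node cluster at the high end of the compressed range:
print(round(daily_volume_gb(1000, 10), 1))  # ~9.8 GB/day
```

And that's the compressed figure - uncompressed you're pushing several times that upstream every day.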
The other point is when a system is in distress, one often sees
networking problems as well. If you're dependent on a sick network to
get time-critical data to a central manager for logging, you're going to
lose that very data you so desperately need. I admit keeping it
locally can be problematic if you're trying to track down a problem
involving a lot of systems, but that's where the 2-tier approach can be
so powerful. You can still use the rrd data to help point you in the
right direction and then use the node-specific data to actually dive
deeper into what was happening at the time in question.
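The two-tier idea above can be sketched in a few lines: keep every sample locally, and only forward every Nth one upstream to rrd. The sample layout and the rrd filename here are made up for illustration; only the "rrdtool update file time:v1:v2..." command syntax is the real rrdtool interface.

```python
# Sketch of the two-tier approach: full-rate data stays local,
# a downsampled subset is forwarded upstream to rrd.
def downsample(samples, every_nth):
    """Yield every Nth (timestamp, values) sample for upstream forwarding."""
    for i, sample in enumerate(samples):
        if i % every_nth == 0:
            yield sample

def rrd_update_cmd(rrd_file, timestamp, values):
    """Build an 'rrdtool update' command line for one forwarded sample."""
    return "rrdtool update %s %d:%s" % (
        rrd_file, timestamp, ":".join(str(v) for v in values))

# 10-second local samples, forwarded upstream once a minute (every 6th);
# node42.rrd and the values are hypothetical:
samples = [(1216800000 + 10 * i, (i, i * 2)) for i in range(12)]
for ts, vals in downsample(samples, 6):
    print(rrd_update_cmd("node42.rrd", ts, vals))
```

The upstream rrd then only sees one sample a minute per node, while the full 10-second history stays on the node for the deep dive.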
Bernard Li wrote:
> Hi Mark!
> On Tue, Jul 22, 2008 at 5:40 AM, Mark Seger <Mark.Seger at hp.com> wrote:
>> I had posted a note to this list some time ago about collectl
>> http://collectl.sourceforge.net/ and its use as a source of data for rrd
>> and while I have had a few notes of mild interest I thought I'd try
>> again since collectl is becoming better known and is even now part of
>> fedora. As with all monitoring situations, everyone has different needs
>> and I've always tried to address as many as possible with collectl.
>> For example, rrd recognizes the importance of finer granularity for more
>> recent data but I doubt it could handle what collectl produces -
>> hundreds of samples every 10 seconds or even more frequently, down to
>> fractions of seconds if you prefer. Maybe for a couple of nodes, but
>> hundreds or thousands? But collectl has a number of mechanisms to deal
>> with a lot of different situations and perhaps the answer to this
>> situation is to have collectl save all its data locally and only pass a
>> subset (perhaps at a different frequency) up to rrd. Then someone could
>> use rrd to monitor a cluster and if a problem arises dive deeper into
>> collectl's local data? Just a thought. Then again someone who knows
>> rrd better might have a better solution.
>> In any event, collectl has the ability to pass results over a socket or
>> even write a current snapshot to a small file that another tool can pick
>> up at its convenience. In fact, collectl can even write its output in
>> rrd format if that's what someone is looking for, but as I said I fear
>> there may just be too much data.
>> So if anyone does have any interest, check out what collectl can collect
>> and if there is any interest in using it to feed rrd, let's talk...
> I think this ongoing thread in rrd-developers might interest you: