[rrd-users] Re: VISIONARY: New tool to consider to create: a "DynamicData Set" tool...
Jakob Ilves
jakob.ilves at oracle.com
Thu Nov 23 09:56:10 MET 2000
Hello!
(oops, yet another VERY LONG follow-up! Well, I hope you find it worth reading.)
"BAARDA, Don" wrote:
> G'day,
>
> Sounds like the biggest problem is you are going to have bucket-loads of
> data...storage may be cheap, but searching through and analysing that data
> is a headache. Are you really sure you want to record byte counts for every
That's the problem. Information storage is cheap, but information processing
performance isn't, and it's the latter that gives us a bit of a... challenge :-).
> single src-dest pair, on every port, for every 5 minutes/hour/day/year
> whatever? And then you want to highlight the busy ones? Fun :-)
No, I don't want to see all of that. Some of that data is worth keeping in detail,
some isn't. The really fun part is figuring out which data belongs in which of
these two categories and how to make that tunable for the user of the tool.
You can trim the data
* before you even store it anywhere, by sorting "low contributors" into their
  own category.
* when you consolidate 6 * 5 min samples into 1 * 30 min sample, as some
  contributors will lower their relative contribution to the 30 min total and
  thus become "low contributors" (and again when 4 x 30 min is consolidated
  into 1 x 120 min, etc.); a small sketch of this follows below.
* when you graph the data or otherwise process it for report generation,
  depending on the size/precision/format of the graph/presentation.
I suppose there are other opportunities and methods as well for reducing the
amount of data involved.
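To make the second trimming opportunity above a bit more concrete, here is a
minimal sketch in plain Python (the names and the 1% threshold are just made up
for illustration) of consolidating 6 * 5 min per-host byte counts into one
30 min sample while folding the low contributors into a single "other" bucket:

    def consolidate(samples, threshold=0.01):
        """samples: list of dicts mapping host -> byte count, one dict per
        5 min interval.  Returns one dict for the 30 min interval, with hosts
        contributing less than `threshold` of the total folded into 'other'."""
        totals = {}
        for sample in samples:
            for host, count in sample.items():
                totals[host] = totals.get(host, 0) + count
        grand_total = sum(totals.values()) or 1
        consolidated = {}
        for host, count in totals.items():
            if count / grand_total < threshold:
                consolidated["other"] = consolidated.get("other", 0) + count
            else:
                consolidated[host] = count
        return consolidated

The same function could be reused for the 30 min -> 120 min step and so on; each
pass gets another chance to demote contributors that have become insignificant
at the coarser resolution.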
The right tuning/trimming could probably reduce the storage and performance
requirements by a factor of 100 or even 1000 compared to the "store all and
everything" approach.
> -----Original Message-----
> From: Chris Snell [SMTP:chris at bikeworld.com]
> Sent: Thursday, November 23, 2000 1:05 AM
> To: Jakob Ilves
> Cc: rrd-users at list.ee.ethz.ch
> Subject: [rrd-users] Re: VISIONARY: New tool to consider to create: a "DynamicData Set" tool...
>
>
>
> On Wed, 22 Nov 2000, Jakob Ilves wrote:
>
> > Well, if we extend the scope from not just destinations for traffic, I want
> > the tool to produce graphs with the same information as those produced by
> > the Netmetrix product, but better (of course ;-). Netmetrix provides you
> > with statistics for a link such as graphs showing the distribution of:
> >
> > * protocol usage during the day
> > * top talkers during the day
> > * top listeners during the day
> > * top conversation during the day.
> >
>
> I think I'd go with a mix of Tobi's suggestions and your suggestions.
> I'd
> use a SQL database to store such things as total bytes in/out for a
> particular destination. More specifically, I'd store these totals for
> each day. I think it would be easier to determine "top talkers for
> today" by making a SQL call than it would be to query 10,000+ RRD
> files. For the actual data measurements, I'd grab them every 30 seconds
> and store them in individual RRDs, as Tobi suggested. Yes, it's a lot of
> files and disk usage but, hey, disks are cheap.
>
> I was going to suggest the same thing, but why not whack the whole
> RRD database into the SQL database? That way it's all contained in the one
Makes sense. Especially as each sample has an unpredictable size (you don't know
how many hosts or whatever are involved at each individual sample interval).
> place, and you don't have the filesystem overheads. You would add whatever
> data you needed to search on (ie talkers for last 5 mins/hrs/days/whatever),
> and then you could pull the RRD database out to create graphs or more
Well, in my vision the tool (regardless of whether it is RRDtool with extra
features or a separate tool) handles all the data storage and graphing itself.
> detailed analysis. This way you would get the SQL searching and storage
> features for the huge number of DS's, and you would get RRD's efficient
> storage/accumulation over time. If you find you need to do searches on
> fields you didn't originally cater for, at least you have _all_ the data in
> the RRDs, so you can do slow exhaustive searches. If the data you want to
> search on can be incrementaly updated by your data collector, you can then
> add those fields and initialise them from the existing RRD data for future
> searches.
The advantage of using the RRD format as an intermediate format for storing the
data is that the tools are already there. Performance-wise I'm not sure, as it
forces the tool to align the storage to the hosts generating traffic: there will
be one file per host. It might be better to have a file per sample, or per group
of samples, containing all the host/datacount pairs for that sample group. But
still, all consolidation and resampling would be done in the same manner as in
RRDtool.
Details of how I'm thinking: you have a root directory for the data collection,
call it "wanlink.dds". Below this, you have four directories: 5m, 30m, 120m and
1440m. The wanlink.dds/5m directory contains files with names such as 974972700,
which is simply the seconds-since-1970 timestamp for the data in the file, each
file holding a group of 6 * 5 min samples. To make up 48 hours, 96 such files
are required. In the wanlink.dds/30m directory there is again a bunch of files,
all with names like 974972700 (the timestamp), each containing 4 * 30 min
samples. For two weeks of data in this 30m dir, you need 168 files. Similarly,
the directory wanlink.dds/120m contains files each holding 12 * 120 min samples;
two months' worth requires approx 60 files. Finally, the directory
wanlink.dds/1440m contains files with, say, 7 * 1440 min samples each; 104 such
files make up two years of data.
These directories require a total of 96 + 168 + 60 + 104 = 428 files... Hm,
quite a few just for ONE link.
If we instead go for the 5m dir containing files with 24 * 5 min samples each
(24 files), the 30m dir containing 24 * 30 min samples per file (28 files), the
120m dir containing 86 * 120 min samples per file (requiring 8 files) and the
1440m dir containing 30 * 1440 min samples per file (requiring 24 files), we get
just 84 files.
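Just to illustrate the naming scheme (a sketch only, assuming group files are
named after the epoch-second timestamp of their first sample and aligned to
group boundaries, with POSIX-style paths), mapping a sample timestamp to the
group file that should hold it is a matter of rounding down to the start of its
group:

    import os

    def group_file(root, resolution_min, samples_per_file, timestamp):
        # A sample taken at `timestamp` lands in the file named after the start
        # of its group; with 6 * 5 min samples per file, the 5m files are
        # spaced 1800 seconds apart.
        group_seconds = resolution_min * 60 * samples_per_file
        group_start = (timestamp // group_seconds) * group_seconds
        return os.path.join(root, "%dm" % resolution_min, str(group_start))

    # Example usage: which 5m file holds the sample taken "now"?
    # path = group_file("wanlink.dds", 5, 6, int(time.time()))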
With the first suggestion (the one with 428 files), consolidation of 5 min data
into 30 min data simply means processing the file in wanlink.dds/5m containing
those 6 * 5 min samples and then using the results to update the proper file in
wanlink.dds/30m .
Of course, an interface to an SQL database of your choice (personally, I
recommend Oracle for obvious reasons) might provide a better solution.
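Along the lines of Chris' suggestion above, the SQL side could be as simple as a
table of running per-destination totals that the collector updates as samples
arrive, so that "top talkers for today" is one query rather than a scan over
thousands of files. A sketch only (SQLite here purely for brevity; the table and
column names are invented):

    import sqlite3

    db = sqlite3.connect("wanlink.db")
    db.execute("""CREATE TABLE IF NOT EXISTS daily_totals (
                      day        TEXT,
                      host       TEXT,
                      bytes_out  INTEGER,
                      bytes_in   INTEGER,
                      PRIMARY KEY (day, host))""")

    def add_sample(day, host, bytes_out, bytes_in):
        # The collector keeps the running totals up to date as samples arrive;
        # the detailed per-sample data stays in the DDS/RRD-style files.
        db.execute("""INSERT INTO daily_totals VALUES (?, ?, ?, ?)
                      ON CONFLICT(day, host) DO UPDATE SET
                          bytes_out = bytes_out + excluded.bytes_out,
                          bytes_in  = bytes_in  + excluded.bytes_in""",
                   (day, host, bytes_out, bytes_in))
        db.commit()

    def top_talkers(day, n=10):
        # "Top talkers for today" becomes a single indexed query.
        return db.execute("""SELECT host, bytes_out FROM daily_totals
                             WHERE day = ? ORDER BY bytes_out DESC LIMIT ?""",
                          (day, n)).fetchall()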
Anyway, my gut feeling is that RRDtool as it looks right now perhaps should not
be included in this new "DDStool", but that it definitely should serve as a
model to follow as closely as reasonable. For instance, DDStool should implement
the concepts of "DS" (data sources) and "CF" (consolidation functions), as well
as a counterpart to the "RRA".
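Roughly, the difference from RRDtool would be that each consolidated data point
is a whole set of (contributor, value) pairs instead of a single number. A rough
sketch of the vocabulary (Python again, all names hypothetical):

    from dataclasses import dataclass, field
    from typing import Callable, Dict, List

    @dataclass
    class DataSource:                   # counterpart of an RRD "DS"
        name: str
        heartbeat: int                  # seconds

    # Counterpart of a "CF": consolidates several per-host dicts into one,
    # e.g. the consolidate() sketch earlier in this mail.
    ConsolidationFn = Callable[[List[Dict[str, float]]], Dict[str, float]]

    @dataclass
    class DynamicArchive:               # counterpart of an RRD "RRA"
        cf: ConsolidationFn
        steps: int                      # primary data points per consolidated point
        rows: int                       # number of consolidated points to keep
        data: List[Dict[str, float]] = field(default_factory=list)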
>
> > I'd roll my own data collector that could store the collected data in the
> > RRDs but would also update the SQL tables with the current "total" counts.
> >
> I'd add that it needs to pull the RRD's out of the database to
> update them, then shove them back in.
Unless the tool itself takes care of writing the data into the SQL database.
>
> ABO
Best regards, and I hope I didn't put anyone to sleep...
/IlvJa
--
(Jakob Ilves) <jakob.ilves at oracle.com>
{Oracle Global IT, Network Management Group}
[Office as well as mobile phone: +46/8/477 3666 | Fax: +46/8/477 3572]
- Intranet Home Page: http://jilves.se.oracle.com -