[mrtg] Using MRTG in BIG setup

Fri Jan 28 09:35:20 MET 2005

Hello friends,

Since I have not found the following design for using MRTG in a big setup described anywhere, I would like to contribute it, hoping it will be useful to some of you.

We had to collect and report data from about 150 boxes (around 4000 interfaces) with MRTG/rrd on a single dual-PIII (512 MB) Linux server, and tried several techniques.

First let me summarize the pros and cons of what is available:

1. mrtg as a daemon (RunAsDaemon option): a good scheme, because mrtg loads into memory once and polls the devices one after another, which is kind to system resources. However, if half of the network is down, the polling timeouts can add up so much that MRTG cannot poll the devices that are up within the 5-minute interval, and data goes missing in the graphs.

2. mrtg forking (Forks option): As the reference manual says, "For situations with high latency or a great number of devices this will speed things up considerably". That's absolutely true, and we used this scheme for 2 years or so, with the Forks option set as high as 70. If many boxes are down, polling of the other boxes is not affected much, and we did not experience data losses in the graphs.
However, this scheme starts many processes at once, causing a high load average, strain on the system cache, heavy disk activity and so on. The situation was so painful that if you tried to log on to the server while mrtg was polling, you waited tens of seconds for a prompt. The crontabs were changed so that mrtg was started at different offsets within the 5-minute interval for different groups of boxes; it helped, but the load average remained high.

These are system stats (taken from yearly, daily graphs):
Load avg: 3.5-4.0, during polling 7
Memory cached: 150M (very low)
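
For illustration, the staggered crontab arrangement mentioned in point 2 might have looked roughly like the output of the sketch below. This is only an assumption about the layout: the function name stagger_crontab and the group config files (group0.cfg, group1.cfg, ...) are hypothetical and not taken from our real setup.

```shell
#!/bin/sh
# Sketch: print staggered crontab lines for N groups of boxes.
# The group config file names (group0.cfg, ...) are hypothetical.
stagger_crontab() {
    ngroups=$1
    i=0
    while [ "$i" -lt "$ngroups" ]; do
        offset=$((i % 5))          # start minute within the 5-minute cycle
        echo "$offset-59/5 * * * * mrtg /var/mrtg/cfg/group$i.cfg"
        i=$((i + 1))
    done
}
```

Calling "stagger_crontab 5" prints five lines whose minute fields run from 0-59/5 to 4-59/5, so each group starts one minute after the previous one. All Forks processes of a group still start at the same moment, which is why the load spikes only got smaller and did not disappear.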

PROPOSED SCHEME for big setups: MRTG & Nagios

Nagios (www.nagios.org) is a network monitoring program which periodically runs service and host checks and has many other useful features.

The main advantage is that it schedules periodic service checks in such a way that the average system load is as small as possible. Any program which does that is suitable for the proposed setup.

The idea is to run mrtg for each box as a service check (ie. one config file per box). The result is constant load on the system, no load spikes, no heavy disk operations, no intense swapping etc. 

Remark: One can argue that mrtg has to parse the config file for a particular box every time it runs. That's true, but in a normal setup (single/forking) it is the same. Only in the "RunAsDaemon" setup are the config files parsed just once.

These are system stats with the MRTG&Nagios integration:
Load average: 0.6 !!!
Memory cached: 380M

I was really astonished at how small the load average is for the same number of boxes/interfaces polled. Thanks to Nagios, an mrtg process is now started about every 2 secs. Since the memory usage of mrtg for one box is small, more memory is left for the cache, so perl, mrtg and the libs are served from the cache rather than read from disk. Disk operations on the rrd files are spread evenly over time, so filesystem activity is never intensive.

I'm sure it won't be a problem to poll 15000 interfaces with the same setup; unfortunately I don't have that many boxes...
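
The 2-second figure is just the 5-minute check interval divided by the number of boxes. A minimal sketch of the arithmetic, using the approximate box count given earlier and assuming the checks are spread perfectly evenly:

```shell
#!/bin/sh
# Sketch: average spacing between mrtg runs when the checks are spread evenly.
BOXES=150            # approximate number of boxes, one service check each
INTERVAL=300         # check interval in seconds (5 minutes)
SPACING=$((INTERVAL / BOXES))
echo "one mrtg process about every ${SPACING}s"
```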

If you have read this far and would like to try it, here are the setup instructions and scripts:

1. install mrtg, rrdtool as per documentation (http://people.ee.ethz.ch/~oetiker/webtools/mrtg/)
2. use cfgmaker to periodically update the configuration of the boxes - one cfg file per box
3. install Nagios as per documentation (www.nagios.org); nagios-plugins are not necessary for this setup but are very useful for other things
4. install Dan Bernstein's daemontools (http://cr.yp.to) - I recommend installing them on every server; they make life much easier
5. mkdir /var/log/mrtg && chown mrtg:mrtg /var/log/mrtg
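
Step 2 can be scripted. The following is a minimal sketch, not the script we actually run: the box list file (one "box-name community" pair per line), the function name update_cfgs and the choice of WorkDir are assumptions made for illustration. The cfgmaker options used (--global, --output) are the standard ones.

```shell
#!/bin/sh
# Sketch: regenerate one MRTG config file per box with cfgmaker.
# The box list format ("box-name community" per line) is a hypothetical convention.
CFGMAKER=${CFGMAKER:-cfgmaker}
CFG_DIR=${CFG_DIR:-/var/mrtg/cfg}
RRD_DIR=${RRD_DIR:-/var/mrtg/rrds}

update_cfgs() {
    boxlist=$1
    while read -r box community; do
        [ -n "$box" ] || continue                 # skip empty lines
        # Write to a temporary file first so that a failed run keeps the old cfg.
        if "$CFGMAKER" --global "WorkDir: $RRD_DIR" \
                       --global "LogFormat: rrdtool" \
                       --output "$CFG_DIR/$box.cfg.new" \
                       "$community@$box" < /dev/null; then
            mv "$CFG_DIR/$box.cfg.new" "$CFG_DIR/$box.cfg"
        else
            echo "cfgmaker failed for $box" >&2
            rm -f "$CFG_DIR/$box.cfg.new"
        fi
    done < "$boxlist"
}
```

It could be run from cron once a day, e.g. "update_cfgs /var/mrtg/boxes.list" (the list path is again hypothetical). Keeping the old cfg file on failure means a box that is temporarily unreachable does not lose its config.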

6. This is my filesystem organization for Nagios & mrtg:

/var/mrtg/cfg holds cfg files in <<box-name>>.cfg naming
/var/mrtg/cgi-bin: 14all.cgi for graphing
/var/mrtg/html: html files
/var/mrtg/nagios/ - nagios installation (I use separate nagios only for mrtg, ie. install nagios with ./configure --prefix=/var/mrtg --nagios-user=mrtg)
/var/mrtg/rrds/ rrd files

7. Nagios config (/var/mrtg/nagios/etc):
default setup, delete all sample hosts, hostgroups etc.

cat checkcommands.cfg:
# 'check-true' command definition
define command{
        command_name    check-true
        command_line    /bin/true
        }

define command{
        command_name    mrtg
        command_line    $USER1$/mrtg $ARG1$
}

edit hosts.cfg and add:
define host{
        use                     generic-host            ; Name of host template to use

        host_name               mrtg
        alias                   MRTG Nagios Server
        address                 127.0.0.1
        check_command           check-true
        max_check_attempts      10
        notification_interval   120
        notification_period     24x7
        notification_options    d,u,r
        }

edit services.cfg and add:
define service{
        use                             generic-service         ; Name of service template to use
        name                            mrtg-service            ; Name of this service template

        host_name                       mrtg
        is_volatile                     0
        check_period                    24x7
        max_check_attempts              3
        normal_check_interval           5
        retry_check_interval            1
        contact_groups                  mrtg-admins
        notification_interval           120
        notification_period             24x7
        notification_options            w,u,c,r
        register                        0       ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
        }

individual boxes to be polled are added to services.cfg in this way:
Example: box name box1 will have mrtg configuration in /var/mrtg/cfg/box1.cfg and its entry in /var/mrtg/nagios/etc/services.cfg will be:

define service{
        use                             mrtg-service            ; Name of service template to use
        service_description             MRTG - Servers          ; must be unique among the services of one host
        normal_check_interval           5
        check_command                   mrtg!box1!
        }
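
With 150 boxes you will not want to write these entries by hand. A minimal sketch that prints one service definition per cfg file follows; the function names and the "MRTG - <box>" description are my assumptions here (Nagios needs a distinct service_description for every service on the same host, so the box name is used to keep them unique).

```shell
#!/bin/sh
# Sketch: print one Nagios service definition per MRTG config file.
CFG_DIR=${CFG_DIR:-/var/mrtg/cfg}

# Print the service definition for a single box.
service_entry() {
    box=$1
    cat <<EOF
define service{
        use                             mrtg-service
        service_description             MRTG - $box
        check_command                   mrtg!$box
        }
EOF
}

# Print definitions for every <box>.cfg found in CFG_DIR.
all_service_entries() {
    for cfg in "$CFG_DIR"/*.cfg; do
        [ -e "$cfg" ] || continue             # directory contains no cfg files
        service_entry "$(basename "$cfg" .cfg)"
        echo
    done
}
```

The output of all_service_entries can be appended to services.cfg, or written to a separate file that nagios.cfg includes with a cfg_file= line; Nagios has to be reloaded afterwards.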

8. mrtg service check script (put it in /var/mrtg/nagios/libexec):
cat /var/mrtg/nagios/libexec/mrtg
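#!/bin/sh
# Nagios service check: run mrtg for one box and log its output via multilog.

SYS=$1
LOG_FILES=${2:-10}      # number of multilog files to keep (default 10)
SIZE=99999              # maximum size of one multilog file in bytes

if [ -z "$SYS" ]; then
        echo "Usage: $0 system [nLOG_FILES]"
        exit 1
fi

LOCK=/tmp/mrtg.lock.$SYS
STATUS=/tmp/mrtg.status.$SYS.$$
LOG=/var/log/mrtg/$SYS
CFG=/var/mrtg/cfg/$SYS.cfg

if [ ! -d "$LOG" ]; then
        mkdir -p "$LOG"
fi

# $? after a pipeline is the exit status of multilog, not of mrtg,
# so the exit status of setlock/mrtg is saved to a file inside the subshell.
(echo "INFO: Start $SYS"; \
/command/setlock -n "$LOCK" mrtg "$CFG"; \
echo $? > "$STATUS"; \
echo "INFO: Finished $SYS"; \
) 2>&1 | /command/multilog t s$SIZE n$LOG_FILES "$LOG"

RC=$(cat "$STATUS" 2>/dev/null)
rm -f "$STATUS"

# setlock -n exits non-zero without running mrtg if the lock is already held
if [ "$RC" = "0" ]; then
        echo MRTG OK
else
        echo MRTG Error
        exit 1          # Nagios treats exit code 1 as WARNING
fi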

9. that's all; you can check outputs from mrtg processes in /var/log/mrtg/<<box-name>>/ dirs
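
To find the boxes worth looking at without opening 150 log directories, something like the following sketch can help. It assumes that mrtg prints nothing on a clean run, so any line other than the script's own "INFO:" lines comes from mrtg or setlock; the function name problem_boxes is hypothetical.

```shell
#!/bin/sh
# Sketch: list boxes whose current multilog file contains non-INFO lines.
LOG_ROOT=${LOG_ROOT:-/var/log/mrtg}

problem_boxes() {
    for dir in "$LOG_ROOT"/*/; do
        [ -f "${dir}current" ] || continue    # multilog writes to "current"
        # Anything other than the check script's INFO lines came from mrtg/setlock.
        if grep -v "INFO: " "${dir}current" | grep -q .; then
            basename "$dir"
        fi
    done
}
```

The timestamps written by multilog's t option are in TAI64N format; "tai64nlocal < /var/log/mrtg/box1/current" (tai64nlocal is part of daemontools) converts them to readable local time.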

I would very much appreciate any feedback or experience reports if you try this setup :))

Best regards,
Tomas Zeman
SysAdmin & NMS-Developer
