[mrtg] Re: Please, I really need help with this

Tue Mar 26 11:17:38 MET 2002

It really sounds like a question for the developers 
list.  It makes sense that the -1 gets converted to 
a +1 since MRTG does not deal with negative numbers 
at all.  Have you tried delaying the threshold script 
from kicking off until you have something like 2 or 
3 consecutive polls of no response?  I have no idea 
if the threshold stuff can handle it.  I wrote a 
script for my box that scrubs the inbox to determine 
how many missed polls on a certain device I have.  
Then I send my real e-mail address a note saying that 
the device is having a problem.  Would that be a 
solution?
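
A minimal sketch of the "wait for 2 or 3 consecutive misses" idea, assuming a small helper that is called once per poll and decides whether an alert is due. This is not part of MRTG; the function name, the JSON state file and the default limit of 3 are all made up for illustration.

```python
import json


def record_poll(state_path, device, ok, limit=3):
    """Track consecutive missed polls per device (hypothetical helper).

    Returns True only on the poll where the miss count reaches `limit`,
    so a single lost SNMP reply does not page anyone.
    """
    try:
        with open(state_path) as fh:
            misses = json.load(fh)
    except (FileNotFoundError, json.JSONDecodeError):
        misses = {}  # no usable state yet: start counting from zero

    # A good poll resets the counter; a bad one increments it.
    count = 0 if ok else misses.get(device, 0) + 1
    misses[device] = count

    with open(state_path, "w") as fh:
        json.dump(misses, fh)

    return count == limit
```

Because the function returns True only when the count equals the limit, a long outage produces one alert rather than one per poll.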

You could also use an external system that interfaces 
to MRTG (not ideal) to watch the MRTG stuff rather 
than relying on the built-in threshold stuff.  
Something like ganglia, gossips, Big Brother, or Mon.

Paul

>>> <robert.harrowfield at axon.co.nz> 03/26/02 00:34 AM >>>

Please, please, please... if no one can help, do you have any suggestions of
who I may be able to talk to?

The issue I am having occurs both when an external script doesn't return data
and when an SNMP query times out. This might be due to traffic congestion, a
host being down, or a WAN link issue. Instead of just repeating the last known
entry in the log, MRTG/rateup appears to be putting 1's into the logfile.

From reading the mrtg script, it seems that rateup is passed a '-1' if the
data mrtg sees isn't a valid number. It then appears that rateup is writing
this into the log file as a value of '1', rather than the last known number.
I can't trace the functionality of rateup as I'm not very conversant with C,
but it seems to me that rateup isn't treating the '-1' as unknown data. I
would say that it is actually converting the -1 to a +1 and then adding this
data into the log as though it were a standard data entry rather than
reusing the last good entry.
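
For the external-script case, one possible workaround is a wrapper that never hands MRTG an empty or negative value and instead repeats the last good pair. The sketch below assumes the wrapper has already collected the two raw values as strings. The function names and the JSON cache file are hypothetical, and this does not change how rateup itself handles '-1'.

```python
import json


def _parse(value):
    """Return a non-negative int, or None if the string is not one."""
    try:
        number = int(value.strip())
    except ValueError:
        return None
    return number if number >= 0 else None


def hold_last_good(values, cache_path):
    """Return the [in, out] pair to report to MRTG (hypothetical helper).

    If both raw values parse cleanly they are cached and returned.
    Otherwise the previously cached pair is returned, or None when
    nothing has been cached yet.
    """
    parsed = [_parse(v) for v in values]
    if len(parsed) == 2 and None not in parsed:
        with open(cache_path, "w") as fh:
            json.dump(parsed, fh)
        return parsed
    try:
        with open(cache_path) as fh:
            return json.load(fh)
    except (FileNotFoundError, json.JSONDecodeError):
        return None
```

The wrapper would then print the two values followed by the uptime and target-name lines that MRTG expects from an external program. Note that this only suits gauge-style targets such as free disk space, and it hides real outages from the graph, so it is best paired with separate availability monitoring.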

This then causes a large sawtooth dip in the graph and causes minimum
threshold alert scripts to be run. This is a bit of an issue as we use
minimum thresholds for diskspace available alerting and these alerts page
our on-call engineers. So, as you can see, it's an issue that is causing a few
headaches.

I've only started seeing this issue since upgrading from mrtg-2.8.11 to
2.9.18preX. The reason for the upgrade was to fix performance problems caused
by the number of devices that we poll with MRTG, most of which are polled via
fairly slow (256k or less) WAN links. I had gotten to the stage where the
number of config files starting was overloading the box and runs were failing
to finish inside a 5 min period. The forking and daemon features really help
cut down the load!

I can't easily downgrade to 2.8.11. The pre-production testing didn't show
up this issue (we didn't have any host or network issues), so 2.9.18pre1 was
implemented. After that, another 4 config files with approx 30 hosts were
added to the run list, so we would be hopelessly overloaded if we
downgraded.

Other possibly useful information includes:
MRTG 2.9.18pre3
RedHat Linux 6.2 running on AlphaServer
Perl 5.00503
UCD-snmp version: 4.1.1
Further details of the issue, with excerpts from log files and config files,
are in the email below.

Any help will be greatly appreciated.
Rob.

-----Original Message-----
From: Harrowfield, Robert - Axon AKL 
Sent: 25 March 2002 12:47PM
To: mrtg at list.ee.ethz.ch
Subject: [mrtg] Unknown data issues

Hi all,

Thanks for the couple of responses I had when I initially posted this query,
it's appreciated. I'm not sure whether I should be posting this to the general
mrtg list or to the dev list.

I'm having a bit of an issue with MRTG acting like it's defaulting to
UNKASZERO when an SNMP query fails to retrieve data from a remote source.
I've pasted a couple of sections of logs below that demonstrate what is
happening.

Up until recently, I was running MRTG 2.8.something and this worked fine: if
the SNMP source was unavailable, it just kept repeating its last known good
value, keeping a nice flat graph rather than dropping to zero. I had to
update to a newer version to get the forking and daemon functionality that
is now in MRTG (thanks Tobi!!) and so got v2.9.18pre1. This started giving
me the problem with unknown data (when a server isn't available) causing
large troughs in my graphs, which drop down to zero if the host is
unavailable for enough 5 min periods.

I then tried v2.9.18pre2 (and also pre3 this morning) with no better
success. Normally I wouldn't deem this to be a real issue, but we also use
MRTG for polling the available HDD space on NT servers for customers. We use
minimum threshold alerts for available disk space, whose final destination is
an on-call engineer's pager. Understandably, the engineers get more than a
little grumpy when the pager wakes them at 2am just because a few SNMP
packets have gone AWOL.

mrtglog
usr/local/mrtg-2/bin/mrtg line 1521
2002-03-21 17:22:50 -- WARNING: Expected a number but got ''
2002-03-21 17:22:50 -- WARNING: Expected a number but got ''
2002-03-21 17:23:01 -- SNMP Error:
no response received

logfile for disk during dip period
1016688160 -1 -1
1016688160 1 1 1 1
1016687843 8443 8443 8443 8443
1016687700 8698 8698 8986 8986
1016687400 9307 9307 9675 9675
1016687100 9737 9737 9807 9807
1016686800 9810 9810 9815 9815
1016686500 9816 9816 9819 9819

logfile about 5 mins after dip period
1016688443 7545 7545
1016688443 7545 7545 7545 7545
1016688160 1 1 1 1
1016688000 4025 4025 8443 8443
1016687700 8698 8698 8986 8986
1016687400 9307 9307 9675 9675

logfile 10 mins after dip period
1016688741 7351 7351
1016688741 7351 7351 7351 7351
1016688443 7545 7545 7545 7545
1016688300 3521 3521 7545 7545
1016688000 4025 4025 8443 8443
1016687700 8698 8698 8986 8986
1016687400 9307 9307 9675 9675
1016687100 9737 9737 9807 9807
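
To find every affected interval without reading the logs by eye, a small scan over the log file can flag rows that collapse relative to the next-older row. This is a rough sketch: the function name and the 10% ratio are arbitrary choices, and it only looks at the first average column of the five-field rows.

```python
def find_dips(log_text, ratio=0.1):
    """Return timestamps whose first average is below ratio * the older row's.

    Rows are expected newest first, as in the excerpts above. The leading
    three-field counter line is skipped because it does not have 5 fields.
    """
    rows = []
    for line in log_text.splitlines():
        fields = line.split()
        if len(fields) == 5:
            rows.append((int(fields[0]), int(fields[1])))

    dips = []
    # Compare each row with the row after it, which is the older sample.
    for (timestamp, value), (_, older) in zip(rows, rows[1:]):
        if value < ratio * older:
            dips.append(timestamp)
    return dips


sample = """1016688443 7545 7545
1016688443 7545 7545 7545 7545
1016688160 1 1 1 1
1016688000 4025 4025 8443 8443
1016687700 8698 8698 8986 8986
1016687400 9307 9307 9675 9675"""

print(find_dips(sample))  # prints [1016688160]
```

Run against the "5 mins after" excerpt, only the row of 1's is flagged. The 4025 average in the following bucket is not, even though it is clearly dragged down by the same bad sample.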

section of config file (all other entries similar)
Target[server1.memory]: .1.3.6.1.4.1.311.1.1.3.1.1.1.1.0&.1.3.6.1.4.1.311.1.1.3.1.1.1.2.0:community@server
Options[server.memory]: gauge, nopercent
AbsMax[server.memory]: 529088512
MaxBytes1[server.memory]: 267837440
MaxBytes2[server.memory]: 529088512
Unscaled[server.memory]: dwmy
YLegend[server.memory]: Bytes
ShortLegend[server.memory]: Bytes
Legend1[server.memory]: Availble Memory
Legend2[server.memory]: Committed Memory
LegendI[server.memory]: *Available:
LegendO[server.memory]: *Committed:
ThreshMaxO[server.memory]: 396816384
ThreshProgO[server.memory]: /usr/local/bin/tecalert.customer.pl
Title[server.memory]: SERVER: Available Real / Committed Virtual Memory
PageTop[server.memory]: <h1>GPNTS01: Available Real / Committed Virtual Memory</h1>
  Total Physical Memory: 256 MB
  <div align="center"><center>

Any help that anyone can provide will be greatly appreciated.
Rob.

-- 
The information contained in this e-mail message is intended only for the use of the person or entity to whom it is addressed and may contain information that is CONFIDENTIAL and may be exempt from disclosure under applicable laws. 

If you read this message and are not the addressee you are notified that use, dissemination, distribution, or reproduction of this message is prohibited. If you have received this message in error, please notify us immediately and delete the original message. You should scan this message and any attached files for viruses. 

Axon Computertime accepts no liability for any loss caused either directly or indirectly by a virus arising from the use of this message or any attached file.

--
Unsubscribe mailto:mrtg-request at list.ee.ethz.ch?subject=unsubscribe
Archive     http://www.ee.ethz.ch/~slist/mrtg
FAQ         http://faq.mrtg.org    Homepage     http://www.mrtg.org
WebAdmin    http://www.ee.ethz.ch/~slist/lsg2.cgi
