[mrtg] Re: Missed snmp poll causes unrealistic spike in charts

Rich Adamson radamson at routers.com
Sat Mar 2 15:16:51 MET 2002



> I'm having more of the same problem with some devices I'm logging (Cisco 7206
> and Catalyst 2924). Sometimes, for no reason, a major spike appears on the graphs
> (and in the log). Even interfaces which are polled but are down (no link),
> and therefore always return a zero, have a major spike in the graph up to 40
> Mbps. I can't figure out why.

I've not tried to analyze the source code; however, several people on this
list have complained about this exact same problem, and the problem is with
all of the recent mrtg versions (I don't know exactly which version it
started in). The problem is not limited to polling Cisco equipment; it is
observable with any 32-bit snmp counter mib variable that returns a large
value.

I've posted several problem summary emails to the mrtg-developers list;
however, there do not appear to be any developers listening/reading that
list anymore. (???)

Problem Summary (from casual observations):
1. Mrtg snmp polls a device and updates log files, charts, etc. All is fine.
2. Mrtg snmp misses one or two polls due to a connectivity disruption, 
   network congestion, or for any other reason.
3. Mrtg snmp does not get a response but writes a record onto the top of
   the log file recording the most recent response as zeros ("0").
4. The connectivity disruption is corrected (has nothing to do with mrtg).
5. The next successful mrtg snmp poll obtains the snmp mib value (which is
   likely to be a very large 32-bit number since the remote device being
   polled did not actually fail or have its snmp counters reset).
6. Mrtg calculates the difference between zero and the most recently 
   returned value (which is the very large number), and writes this large
   value onto the top of the log file, ignoring the MaxBytes value.
7. Mrtg plots the chart using this unrealistic large value, ignoring MaxBytes
   in the plotting routines.
8. If this large value (spike) is not recognized by a human within
   minutes/hours and manually edited/deleted from the log, then the
   weekly/monthly charts become distorted with the spike, and will not be
   corrected for several hours/days after one manually corrects the
   original log file error.
9. 32-bit counter rollovers, etc., seem to be handled correctly by mrtg.
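
The sequence in steps 1-6 can be reproduced with a small simulation. This is
only a sketch of the behaviour as described above, not mrtg's actual code; the
function name, the counter readings, and the 300-second interval are all
hypothetical.

```python
def simulate(readings, interval_s):
    """Return per-interval rates (bytes/s) using the behaviour described
    in the summary: a missed poll (None) is logged as zero, and that zero
    becomes the 'previous value' for the next successful poll."""
    last = readings[0]
    rates = []
    for r in readings[1:]:
        if r is None:
            rates.append(0.0)   # step 3: zero record written to the log
            last = 0            # the zero is now treated as the last value
            continue
        rates.append((r - last) / interval_s)  # step 6: difference from last
        last = r
    return rates

# Hypothetical 32-bit counter readings from a link carrying a steady
# 1000 bytes/s; None marks the poll that got no response.
readings = [3_000_000_000, 3_000_300_000, None, 3_000_900_000]
rates = simulate(readings, interval_s=300)
print(rates)
```

With these made-up numbers the last interval comes out at roughly 10 million
bytes/s on a link that never carried more than 1000 bytes/s, which is the
spike described in step 6.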

Mrtg Source Code changes needed (multiple choices, needs to be reviewed
by multiple eyes):
1. If mrtg misses a poll, it should not write a current log record 
   with zeros into the log file.
2. When mrtg calculates the difference between the last poll and the current
   poll, it should compare the calculated value to MaxBytes, and do one of
   the following:
   a. record the MaxBytes value instead of the calculated value,
   b. ignore the last poll and not write any record into the log file at
      all, or
   c. check the previous poll value, and if greater than zero:
      1. calculate the difference between last poll and current poll,
      2. if the calculated value is greater than MaxBytes, log the MaxBytes 
         value instead of the calculated value
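
A rough sketch of how changes 1, 2a and 2b might fit together follows. It is
not a patch against mrtg; safe_rates and its parameters are hypothetical
names, and two choices are my own assumptions: after a missed poll the next
rate is averaged over the whole gap, and counter rollover is left out for
brevity.

```python
def safe_rates(readings, interval_s, max_bytes, clamp=False):
    """Compute rates (bytes/s) while skipping missed polls and
    rejecting values above max_bytes.

    readings  -- counter values; None marks a missed poll (first is valid)
    max_bytes -- largest plausible rate in bytes/s (mrtg's MaxBytes)
    clamp     -- True: log max_bytes (choice 2a); False: log nothing (2b)
    """
    last = readings[0]
    elapsed = 0
    out = []
    for r in readings[1:]:
        elapsed += interval_s
        if r is None:
            out.append(None)    # change 1: no zero record, keep old 'last'
            continue
        rate = (r - last) / elapsed   # assumption: average over the gap
        if rate > max_bytes:
            rate = max_bytes if clamp else None
        out.append(rate)
        last, elapsed = r, 0
    return out

MAX_BYTES = 1_250_000  # hypothetical 10 Mbit/s link, in bytes/s

# A missed poll no longer produces a spike.
gap = safe_rates([3_000_000_000, 3_000_300_000, None, 3_000_900_000],
                 300, MAX_BYTES)

# An implausible jump is dropped (2b) or capped (2a).
dropped = safe_rates([0, 300_000, 3_000_900_000], 300, MAX_BYTES)
capped = safe_rates([0, 300_000, 3_000_900_000], 300, MAX_BYTES, clamp=True)
print(gap, dropped, capped)
```

Whether None entries should simply be omitted from the log or stored as some
"unknown" marker depends on how mrtg consumes its log records, which is the
open question raised below.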

Since I have no idea how mrtg is using the log file records (and there seems
to be some mysterious averaging going on), the realistic corrective action 
choice would seem to be limited to #1. Would that distort the charting 
and/or averaging in any realistic way (considering the significant negative
impact of the current method of operation)?

It would also seem very realistic/beneficial to compare the calculated 
value to MaxBytes, and if the calculated value is greater, simply ignore
the last poll. If the MaxBytes parameter really was intended to represent
the largest capacity value that could ever occur for a particular facility,
then one should _never_ see a calculated value greater than this. If it 
does happen, then something is seriously wrong.  If the MaxBytes parameter 
is to be used in this way, then a warning should be added to the man pages 
regarding "compression". (E.g., if a user implements compression on a T1 
serial link, the calculated snmp values can be larger than the T1 
bandwidth.)
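
To put rough numbers on the compression caveat: the sketch below assumes,
as the paragraph above does, that the counters reflect uncompressed bytes.
The 2:1 compression ratio and the 75% line utilisation are purely
hypothetical figures.

```python
T1_BITS_PER_S = 1_544_000
T1_BYTES_PER_S = T1_BITS_PER_S // 8      # 193000 bytes/s

max_bytes = T1_BYTES_PER_S               # MaxBytes set to the line rate
ASSUMED_RATIO = 2.0                      # hypothetical 2:1 compression
ASSUMED_UTILISATION = 0.75               # hypothetical: line 75% busy

# Rate seen in the counters if they count bytes before compression.
observed = T1_BYTES_PER_S * ASSUMED_UTILISATION * ASSUMED_RATIO

# A strict "discard anything above MaxBytes" rule would reject this
# sample even though the link is behaving normally.
would_discard = observed > max_bytes
print(observed, would_discard)
```

In that situation MaxBytes would have to be set above the physical line
rate for the check to be usable.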

Can anyone help with the source code changes needed to address this issue
and get them applied to the mrtg distribution?

