[mrtg] mrtg Digest, Vol 22, Issue 8 - snmp crashing a router

Fri Oct 10 15:16:55 CEST 2008

Steve,
   We have only 10 or so primary routers, but they are very heavily
loaded.  We carried two 200 Gig internet connections and sometimes hit
capacity on them, and just recently added another 200 Gig connection to
eliminate that problem.  We have been seeing problems with one of our
largest routers.

We also have about 4,000 switches, which are not monitored by mrtg, and
probably about 10,000 to 15,000 nodes.  We have been getting high CPU
warnings on this one router.  This one router handled all of our
wireless traffic, main campus traffic and our NAC product.  The NAC
product accesses the router to alter ACLs for wireless authentication.
[The NAC product is currently in use only for our wireless environment,
which is pretty large: all of our dorms, with about 10,000 students,
across about 20-30 buildings.]

   We also have other tools which we use for network monitoring and
analysis.  Our routers are all Cisco 6500s, by the way, and our switches
are also Cisco.  We use CiscoWorks to configure our switches, and once a
day it also goes out to everything and gathers configs for backup
purposes.  We also monitor our entire network (except wireless) using
NetVigil.  With NetVigil we use SNMP on our routers and gather
everything we can get.  On our aggregator switches (probably about 300)
we gather stats on interface errors and interface traffic.  We poll at
5 minute intervals.  Lastly, we were also gathering ARP history for
tracing connections [RIAA compliance] via SNMP.
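
With several tools polling the same router, as described above, it can
help to tally roughly how much each one asks for.  The following is a
minimal Python sketch of that bookkeeping.  The tool names are taken
from this message, but every number in it is a hypothetical placeholder,
not a measurement from our network, and the function names are made up
for illustration.

```python
# Hypothetical inventory of management tools polling one router.
# Every number below is a placeholder, not a measurement from this thread.
POLLERS = {
    "mrtg": {"oids_per_poll": 400, "interval_s": 300},
    "netvigil": {"oids_per_poll": 2000, "interval_s": 300},
    "arp_history": {"oids_per_poll": 15000, "interval_s": 900},
}


def oids_per_second(pollers):
    """Return the average number of OIDs requested per second, per poller."""
    return {
        name: p["oids_per_poll"] / p["interval_s"]
        for name, p in pollers.items()
    }


def total_oids_per_second(pollers):
    """Return the combined average OID request rate across all pollers."""
    return sum(oids_per_second(pollers).values())


if __name__ == "__main__":
    rates = oids_per_second(POLLERS)
    # List the heaviest poller first.
    for name, rate in sorted(rates.items(), key=lambda kv: -kv[1]):
        print(f"{name:12s} {rate:8.2f} OIDs/s")
    print(f"{'total':12s} {total_oids_per_second(POLLERS):8.2f} OIDs/s")
```

These are averages only.  A tool that walks a large table in one go will
load the router in bursts, so the average probably understates its
impact.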

   On this particular router we found that it was SNMP that seemed to be
impacting the router the worst.  We frequently saw CPU utilization over
90%, and many times right up to 98%.  A couple of times it did reload.

   We had Cisco in on it to determine what was causing our excessive
CPU, and it was felt that the SNMP traffic as well as the traffic from
our NAC product were putting too much load on it.

   What we did was move the wireless traffic to one of the wireless
controllers (also Cisco 6500s); we also changed our retrieval parameters
on mrtg.  Other smaller changes, such as polling interval and timeout
changes, did not alleviate the problem.  However, when they moved the
wireless traffic off of the router, the CPU on the wireless controller
router shot up to about 60%, which is fine for these boxes, and we saw
the CPU drop back to about 10-20% average for the other box that was
hitting capacity.

   We are still monitoring it pretty closely, but we have not seen
another high CPU warning since the move of wireless off the box.  This
box handled most of our main campus traffic as well, so it was very
heavily loaded.  Oh, we also changed our ARP history retrieval.  We are
depending more on our NAC product for it for wireless, and we are
changing the way we retrieve it so that it doesn't put as much load on
the router when it pulls it.  (Not sure how we changed it.  The person
that created this program is a heavily loaded guy, since he's also our
DNS expert and our NAC expert, as well as the author of and expert on
our IP address management program.  His work was really excellent, but
he was a little overworked and was the only one who really could work
with the programs he created.)

  Good luck; maybe some of what I mentioned will give you some hints of
where to look.  FYI, we do almost all of the networking part of our
network.  So mrtg, the ARP gathering, the NAC product, IP management
and our wireless computing all reside within my department, so we were
lucky because we knew pretty much everything that could be going against
our routers.  In other installations I wouldn't be surprised to find
that different departments handle different parts, which would make it
much harder to determine where all of the management traffic hitting a
router is coming from.

Ray

-----Original Message-----

----------------------------------------------------------------------

Message: 1
Date: Fri, 10 Oct 2008 11:06:46 +1300
From: Steve Shipway <s.shipway at auckland.ac.nz>
Subject: Re: [mrtg] Can Mrtg cause a Router to crash because of the
	snmp querys?
To: 'Jack Bauguer' <jbauguer at yahoo.es>,	"'mrtg at lists.oetiker.ch'"
	<mrtg at lists.oetiker.ch>
Message-ID: <6B587E8C999646469B54486AF219584606F3E1E414 at UXCHANGE7-1.UoA.auckland.ac.nz>

Content-Type: text/plain; charset="us-ascii"

Our network here is way larger (1500 switches, 60 routers, 1320 hosts),
and we poll not only network traffic but also stats for multicast
traffic and CPU use on the core switches and routers, all at 5min
intervals.  There is no discernible load on the switches from the SNMP
queries.  Of course, our main switches are only on about 20% load, but
remember that these queries are only simple network traffic ones.

Note that, since SNMP is UDP traffic, and low priority to a switch, the
SNMP queries will simply be dropped in favour of TCP traffic if a link
is overloaded, and they will not be resent.  You'll get greyout on your
MRTG graphs (if using routers2) or blanks.  Most switches/routers will
ignore SNMP queries in favour of forwarding packets.
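
To illustrate why a dropped query shows up as a gap rather than as bad
data, here is a simplified Python sketch of turning successive counter
readings into rates.  It is not MRTG's actual code.  The function name,
the use of None for a poll that got no reply, and the single 32-bit wrap
handling are all assumptions made for the example.

```python
COUNTER32_MAX = 2**32  # size of a 32-bit SNMP counter, assumed here


def rates_with_gaps(samples, interval_s=300):
    """Convert counter readings into per-second rates.

    samples: counter readings taken every interval_s seconds, with None
    standing in for a poll that received no reply.
    Returns one rate per interval, or None where either end is missing.
    """
    rates = []
    for prev, cur in zip(samples, samples[1:]):
        if prev is None or cur is None:
            # No reply on one side of the interval: leave a gap.
            rates.append(None)
            continue
        delta = cur - prev
        if delta < 0:
            # Assume the counter wrapped exactly once since the last poll.
            delta += COUNTER32_MAX
        rates.append(delta / interval_s)
    return rates


if __name__ == "__main__":
    # The third poll was dropped, so two intervals have no rate.
    print(rates_with_gaps([0, 3000, None, 9000, 12000]))
```

One dropped poll blanks the two intervals that touch it, but the
readings on either side are unaffected.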

If your network is running at 85% loading I'd say you have other more
serious issues in capacity planning to deal with!  If a few SNMP queries
every 5min would cause a network collapse, then you're at far more risk
from (e.g.) someone opening a web browser...

Steve
