[rrd-users] Weird peaks on loaded systems over NFS

Tue Nov 12 08:56:36 MET 2002

Hi list

This is my first posting, so bear with me. We have a fairly complex
platform here (all based on Linux) and I have to solve this problem
ASAP. But first some explanation.

The platform I am talking about is built around an NFS cluster
(heartbeat/linux-ha) with half a terabyte of storage on an EMC. The NFS
servers are connected to the storage using SCSI over Fibre Channel.
Around that NFS cluster, we have several "diskless" servers which are
booted from the NFS cluster. They have disks, but only for swap and
/tmp. We currently have two application servers (running RRDtool, more
on that below) and two webservers which read the RRDs later on and build
the graphs. The NFS servers as well as the clients are all dual P3/800
machines with 1GB RAM. The filesystems are exported over NFSv3 with the
async option.

On the application servers, we have two different Perl scripts polling
our customer's routers. One is quite old (~2 years) but well proven, and
the other one is new and fast (it forks). The old one has a list of
around 200 routers and polls only in/out traffic via SNMP. The new
script has a smaller list (currently around 50 routers), but it polls
several more details such as voltage, temperature and load. The goal is
to retire script #1 and replace it with #2. As we have two application
servers and the polling scripts write to the same RRDs, the crontabs
are shifted by one minute: app01 runs at 0-58/2 and app02 at 1-59/2 to
avoid locking problems.
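
To make the schedule concrete, here is a small sketch in Python (not
part of our polling scripts; the helper name expand_minutes is made up
for illustration) that expands the two crontab minute fields and checks
that the two servers never start in the same minute. It only handles
the start-end/step form used above.

```python
def expand_minutes(field):
    """Expand a cron minute field of the form 'start-end/step' into a set."""
    span, step = field.split("/")
    start, end = span.split("-")
    # The cron range is inclusive on both ends, hence the +1.
    return set(range(int(start), int(end) + 1, int(step)))


app01 = expand_minutes("0-58/2")  # even minutes
app02 = expand_minutes("1-59/2")  # odd minutes

# Minutes in which both servers would start a poll (should be none).
overlap = app01 & app02
# True if, together, the two servers start a poll in every minute.
covered = (app01 | app02) == set(range(60))

print(sorted(overlap), covered)
```

This only shows that the start times are disjoint. A run that takes
longer than one minute could still overlap with the next run on the
other server.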

So much for the situation. Everything works smoothly as long as the NFS
cluster is not under load, where load means both I/O and CPU load. As
soon as I "stress" the cluster (e.g. with simple things like cp -R or a
backup), the newer script starts to complain about locking problems and
the graphs show peaks which are sometimes 10 times the average. This is
ugly.

OK. Can someone tell me what could be done to avoid this, or has anyone
had a similar experience with rrdtool over NFS? To me, it is simply
incomprehensible that locking issues result in a peak. Are there any
fine-tuning tips for producing cleaner graphs without setting a MAX?

Thank you all in advance!

Greetings from foggy Switzerland

-- 
Real programmers do "cp /dev/audio a.out" and whistle into the mike.
                                                (Randal L. Schwartz)
