[rrd-users] Monitoring CPU stats on Linux? Watch your jiffies.
Clay Chambers
cchamb at gmail.com
Fri Dec 9 20:59:29 MET 2005
This is just a heads-up email intended to warn others who are
monitoring CPU utilization on Linux boxes by watching jiffy counters.
It seems that under extremely heavy load, the idle counter actually
goes down due to rounding. (I'm not completely certain that it's due
to rounding, but that's the most plausible cause I've come up with.)
The result is that your CPU utilization graphs could show 0% when
they're actually at 100% utilization.
Or maybe I'm doing something wrong and someone can enlighten me. I
think I've got things set up correctly, but it's quite possible that
I'm overlooking something.
First, I should give some background. I'm monitoring several
multi-processor servers that are running Red Hat Linux Advanced Server
2.1 with the 2.4.9 kernel. Each of these machines has four physical
CPUs with hyperthreading that makes it look like eight CPUs to the
kernel. I gather my CPU statistics by periodically looking at
/proc/stat. Here's an example of my /proc/stat:
cpu 129331 1782357 1738188 518385708
cpu0 13767 257345 414602 64568734
cpu1 23047 203356 174679 64853366
cpu2 28838 231905 211631 64782074
cpu3 23208 211822 192775 64826643
cpu4 9263 223316 173752 64848117
cpu5 10286 224565 191296 64828301
cpu6 12785 197899 176978 64866786
cpu7 8137 232149 202475 64811687
<snip>
As you can see, each physical CPU shows up as two CPUs because of
hyperthreading. The top line is just an aggregate of all the
individual CPU counters. As for the numbers themselves, they're just
counting jiffies (1/100th of a second) spent in user, nice, system,
and idle mode since reset/reboot. For my purposes, I'm only
considering the top line, which should give me a good overall measure
of CPU utilization for the box.
When programs execute, the kernel keeps track of how much CPU time was
spent executing them and increments the appropriate counters: most
often user mode, but also nice and system mode, depending on the apps.
Likewise, when the CPUs are idle, the idle counters get incremented.
A completely idle CPU's idle counter will be incremented by 100
jiffies for every second of idle uptime. For my 4-CPU systems with
hyperthreading, that's 800 jiffies per second.
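To make the counter layout concrete, here is a minimal Python sketch of reading the aggregate line and turning two samples into a utilization figure. It assumes the four-field 2.4-style format shown above (user, nice, system, idle); the function names and the second sample are hypothetical, made up for illustration only.

```python
# Sketch: parse the aggregate "cpu" line of a 2.4-style /proc/stat.
# Assumes the first four numeric fields are user, nice, system, idle.
from typing import Dict, Optional

FIELDS = ("user", "nice", "system", "idle")


def parse_cpu_line(line: str) -> Dict[str, int]:
    """Return the four jiffy counters from a line such as 'cpu 1 2 3 4'."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("not the aggregate cpu line")
    return dict(zip(FIELDS, (int(p) for p in parts[1:5])))


def utilization_pct(prev: Dict[str, int], curr: Dict[str, int]) -> Optional[float]:
    """Non-idle share of the jiffies elapsed between two samples."""
    deltas = {k: curr[k] - prev[k] for k in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        return None  # no progress (or counters went backwards)
    return 100.0 * (total - deltas["idle"]) / total


sample = parse_cpu_line("cpu 129331 1782357 1738188 518385708")
# Hypothetical sample one second later on 8 logical CPUs (800 jiffies):
later = dict(sample, user=sample["user"] + 200, idle=sample["idle"] + 600)
print(utilization_pct(sample, later))  # 25.0
```

Working from deltas between samples rather than the raw totals is what makes the result sensitive to a counter that moves backwards.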
Okay, enough background. Hopefully that gives enough info to make the
rest of this message meaningful.
The problem is that these counters can actually decrease over time
when the system is under extremely heavy load. This shouldn't be
possible, obviously, because jiffies are units of time, and time is
always moving forward. However, I observed my idle counter decreasing
very slightly over several minutes. I pulled the following table from
my log files:
time, timestamp, user, nice, sys, idle
15:30:08, 1132698601, 147818, 2550558, 838017, 668577943
15:35:08, 1132698901, 359466, 2550558, 838132, 668606300
15:40:08, 1132699201, 599100, 2550560, 838201, 668606299
15:45:08, 1132699501, 839503, 2550561, 838270, 668606298
15:50:08, 1132699801, 1079148, 2550561, 838392, 668606291
15:55:08, 1132700100, 1196048, 2550566, 838496, 668728922
So as you can see, between 15:35 and 15:50, the idle jiffy count
decreases by 9 while the other counters increase. The error is quite
small: 9 jiffies over the 25 minutes logged, on a 4-CPU system counting
800 jiffies per second, is an error of only 0.00075%. However, it did
have the unpleasant effect of making my graphs show 0% utilization
instead of 100%.
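The arithmetic can be checked directly against the logged rows. The sketch below copies the table as-is and prints the per-interval deltas; the 0.00075% figure corresponds to 9 jiffies measured against 25 minutes at 800 jiffies per second. The variable and function names are just for illustration.

```python
# Rows copied from the log table: (time, user, nice, sys, idle).
LOG = [
    ("15:30:08", 147818, 2550558, 838017, 668577943),
    ("15:35:08", 359466, 2550558, 838132, 668606300),
    ("15:40:08", 599100, 2550560, 838201, 668606299),
    ("15:45:08", 839503, 2550561, 838270, 668606298),
    ("15:50:08", 1079148, 2550561, 838392, 668606291),
    ("15:55:08", 1196048, 2550566, 838496, 668728922),
]


def deltas(rows):
    """Difference of each counter between consecutive samples."""
    out = []
    for prev, curr in zip(rows, rows[1:]):
        diff = tuple(c - p for p, c in zip(prev[1:], curr[1:]))
        out.append((curr[0],) + diff)
    return out


for row in deltas(LOG):
    print(row)  # the idle delta (last field) is negative for three intervals

idle_drop = LOG[1][4] - LOG[4][4]   # idle at 15:35 minus idle at 15:50
JIFFIES_PER_SEC = 800               # 8 logical CPUs x 100 jiffies
window_sec = 25 * 60                # length of the whole log excerpt
error_pct = 100.0 * idle_drop / (window_sec * JIFFIES_PER_SEC)
print(idle_drop, error_pct)         # 9 jiffies, about 0.00075 percent
```

Note that the user counter advances by roughly 240,000 jiffies per 300-second interval during the affected period, which is the full 800 jiffies per second.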
I create my RRDs with a DS for each mode and a few reasonable RRAs,
something like this:
rrdtool create cpu.rrd --step 300 \
    DS:user:COUNTER:600:0:U \
    DS:nice:COUNTER:600:0:U \
    DS:sys:COUNTER:600:0:U \
    DS:idle:COUNTER:600:0:U \
    RRA:AVERAGE:0.5:1:864 \
    <several more averages and maximum RRAs here>
Then I graph them using the following DEFs and CDEF:
DEF:user=cpu.rrd:user:AVERAGE
DEF:nice=cpu.rrd:nice:AVERAGE
DEF:sys=cpu.rrd:sys:AVERAGE
DEF:idle=cpu.rrd:idle:AVERAGE
CDEF:utilization_pct=user,nice,sys,+,+,user,nice,sys,idle,+,+,+,/,100,*
This normally works great and gives me a good view into total CPU
utilization for the whole box. But when all four CPUs are running at
full throttle, my graph breaks.
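One possible mechanism for the 0% reading, offered as a sketch rather than a confirmed diagnosis: RRDtool's documentation says a COUNTER data source assumes the value never decreases except on overflow, so a drop of even one jiffy would be treated as a wrap. The code below models that with the 15:35 and 15:40 samples. The wrap constants (32-bit first, then 64-bit) reflect my understanding of the behaviour and should be read as an assumption, as should the definition of utilization as the non-idle share.

```python
WRAP_32 = 2 ** 32
WRAP_64 = 2 ** 64


def counter_rate(prev: int, curr: int, seconds: int) -> float:
    """Per-second rate, treating a negative delta as a counter overflow.

    Assumption: a decrease is first taken as a 32-bit wrap and, if the
    delta is still negative, as a 64-bit wrap.
    """
    delta = curr - prev
    if delta < 0:
        delta += WRAP_32
    if delta < 0:
        delta += WRAP_64 - WRAP_32
    return delta / seconds


# Samples from the log at 15:35:08 and 15:40:08 (300 seconds apart).
prev = dict(user=359466, nice=2550558, sys=838132, idle=668606300)
curr = dict(user=599100, nice=2550560, sys=838201, idle=668606299)

rates = {k: counter_rate(prev[k], curr[k], 300) for k in prev}
busy = rates["user"] + rates["nice"] + rates["sys"]
pct = 100.0 * busy / (busy + rates["idle"])
print(f"idle rate {rates['idle']:.0f}/s, utilization {pct:.4f}%")
```

Under these assumptions the idle rate comes out in the millions of jiffies per second, far above the 800 the hardware can produce, and it swamps the other three rates in the percentage. With no upper bound on the DS (the `U` in `0:U`), nothing would reject such a value.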
Has anyone else encountered this problem? I googled for it but didn't
find anything. How should I create my RRD files to account for this?
Or am I doing something wrong in my graphs?
Thanks for reading,
Clay