On Tue, Dec 23, 2008 at 07:37:39AM +0000, Doug Winter wrote:
> 75 processes: 1 running, 74 sleeping
> CPU states: 0.5% user, 0.0% nice, 0.0% system, 99.5% idle, 0.0% iowait

This is missing a lot of states; I suspect this is a forked top. Mine shows:

Cpu(s): 8.9%us, 1.0%sy, 0.0%ni, 90.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
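(As an aside, those Cpu(s) percentages are derived from the cumulative
counters in /proc/stat. A rough, self-contained sketch of the arithmetic,
with made-up sample counters -- on a real system you would diff two readings,
since the counters only ever go up:)

```python
# Sketch: turn a /proc/stat "cpu" line into the percentages top displays.
# Field order per proc(5): user, nice, system, idle, iowait, irq (hi),
# softirq (si), steal (st). The sample counters below are invented.
sample = "cpu  890 0 100 9010 0 0 0 0"

names = ["us", "ni", "sy", "id", "wa", "hi", "si", "st"]
values = [int(v) for v in sample.split()[1:1 + len(names)]]
total = sum(values)

percentages = {n: 100.0 * v / total for n, v in zip(names, values)}
print(", ".join("%.1f%%%s" % (percentages[n], n) for n in names))
```

With those sample counters this prints 8.9%us, 1.0%sy, 90.1%id and zeros
elsewhere, i.e. the same shape of line top shows.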
I got iowait (wa) too, but missing are hi, si and st, which are hard
interrupts, soft interrupts and stolen time.

> PID USERNAME THR PRI NICE SIZE RES SHR STATE TIME CPU COMMAND
> 14357 zope 5 20 0 2186M 2036M 4696K sleep 7:59 20.00% python2.4

> The system this dump was taken from is a dual-core Athlon system that was
> under heavy load on a single thread. There are three things that are weird
> here:

The dual-core is important. Single CPU too?

> 1. the overall CPU states show the system being 99.5% idle, and yet the top
> process consumes 20% of CPU. That surely is just wrong ;)

I think you mean the python process. It is single threaded, so that is 20%
of a single core. It does seem low, but it looks like your other cores are
sitting around doing nothing, and the figure is an average. Hit '1' in top
to see a per-CPU percentage.

> 2. even under very heavy load, the process rarely consumes more than 20%
> CPU, when I would expect to see it use 100% of one core, if it was CPU
> bound. But if it's IO bound I would expect to see more iowait, hence...
>
> 3. the iowait is shown at 0.0%, which is very odd.

These look like one and the same problem: if the CPU is not flat out doing
something, then what is it doing? Why, it's waiting!

> During these periods when it shows this behaviour, strace shows the process
> is spending all of it's time in repeated calls to recvfrom(2) - from which
> I would expect to see a high iowait. Yet iowait is never reported at
> higher than 1-2%.

Ah! You've got the terms confused here. The percentages up the top are
*CPU* percentages; recvfrom(2) is a system call that a *process* uses.
Now let's consider two scenarios.

1) You like slow MFM drives. Every time you want to write something to disc,
the CPU has to hang around until the 20-year-old drive finally says "yeah,
got it", so the CPU spends a lot of time in iowait. Crummy SCSI buses used
to do this too. iowait is a deep-down, sitting-right-near-the-metal kind of
hanging around that only CPUs do.
2) You are on an island on a high-latency satellite link that is connected
to your computer's ethernet port. You send something to your remote server
and then wait with a recvfrom(). The iowait is almost 0 because the ethernet
card services the traffic real quick; the CPU hangs around doing nothing and
getting bored. Your process sits in recvfrom() because the 600ms round-trip
time means it has to wait for the far end, a lot. Blocking in a system call
is the very high-level waiting that processes do.

A process sitting in recvfrom() is not sitting in iowait, and it doesn't sit
there grabbing the CPU either. This is Unix: the system call blocks and the
process sleeps, i.e. 0% CPU. It really should use select() first, but for
this bug that's beside the point.

In short, your program is spending a lot of time waiting for messages to
arrive, and Linux, being clever, puts the process to sleep until there is
something to look at. That means the numbers are right. I think you were
getting iowait mixed up with the WCHAN field you can see in top and ps.

I'll leave the bug open momentarily in case I misunderstood something, but
to me your process is fine; the data channel or whatever it is talking to is
your bottleneck. My wild guess is that it's a database being sad.

 - Craig
-- 
Craig Small GnuPG:1C1B D893 1418 2AF4 45EE 95CB C76C E5AC 12CA DFA5
http://www.enc.com.au/ csmall at : enc.com.au
http://www.debian.org/ Debian GNU/Linux, software should be Free
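P.S. A minimal sketch of the select()-before-recvfrom() idea, in Python
since the reporter's process is python2.4. A local socketpair stands in for
the real network peer, and recv() for recvfrom(); the point is the bounded
wait:

```python
import select
import socket

# A connected local socket pair stands in for the real remote peer.
a, b = socket.socketpair()
b.send(b"hello")

# Wait up to 1 second for data instead of blocking indefinitely in recv().
# While the process waits here it is asleep: 0% CPU, and no iowait either.
readable, _, _ = select.select([a], [], [], 1.0)
if readable:
    data = a.recv(1024)
else:
    data = None  # timed out; the far end is being slow

a.close()
b.close()
```

With the timeout you get a chance to log, retry or give up instead of
hanging in the system call until the far end deigns to answer.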