Hello all,


I am using a cluster of machines running Debian 5.0.4 with kernel
2.6.26-2-amd64. Each machine has two Intel Xeon E5530 2.4GHz CPUs, which are
quad-core parts with hyperthreading, so each machine has 8 physical cores
and a total of 16 logical CPUs.
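
For reference, the pairing of logical CPUs and cores can be read from sysfs;
here is a minimal sketch, assuming the standard topology files are present:

/* Minimal sketch: print each logical CPU's hyper-thread siblings
 * from the standard sysfs topology files.  On these machines I would
 * expect cpu0 to report "0,8", i.e. CPU 0 and CPU 8 share one core. */
#include <stdio.h>

int main(void)
{
    char path[96], buf[64];
    int cpu;

    for (cpu = 0; cpu < 16; cpu++) {
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/topology/thread_siblings_list",
                 cpu);
        f = fopen(path, "r");
        if (!f)
            continue;
        if (fgets(buf, sizeof(buf), f))
            printf("cpu%d siblings: %s", cpu, buf);
        fclose(f);
    }
    return 0;
}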


I have run into an apparent issue with the kernel scheduler. Under the
circumstances described below, the scheduler will run two tasks on the two
logical CPUs of the same physical core, even if all the remaining cores are
idle. This obviously causes a large slowdown for those tasks.


What I'm doing is this. I have a simple process that reads a file from disk
and performs some computation. The process is largely CPU bound, so if one
such task takes N seconds to execute, I would expect two parallel tasks to
also take about N seconds in the absence of other load. However, if these
two tasks are the only thing running on the system, the scheduler will
consistently assign one task to CPU 0 and the other to CPU 8. Since these
are logical CPUs on the same physical core, the actual run time of the two
parallel tasks is closer to 1.8N, much slower than what the hardware allows.
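
To make this concrete, here is roughly the kind of test I am running; the
busy-loop worker below is a hypothetical stand-in for my real
read-and-compute program, not the actual code:

/* Sketch: fork two CPU-bound workers and report the combined wall time.
 * burn_cpu() is a stand-in for my real task.
 * Build with: gcc -O2 test.c -lrt */
#include <stdio.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

static void burn_cpu(void)
{
    volatile double x = 0.0;
    long i;

    for (i = 0; i < 400000000L; i++)    /* pure computation, no I/O */
        x += i * 1e-9;
}

int main(void)
{
    struct timespec t0, t1;
    int i;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < 2; i++) {
        if (fork() == 0) {              /* child: run one task */
            burn_cpu();
            _exit(0);
        }
    }
    while (wait(NULL) > 0)              /* parent: wait for both children */
        ;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("elapsed: %.2f s\n",
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
    return 0;
}

The workers are forked rather than threaded so that each appears to the
scheduler as an independent task, matching my setup.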


The problem seems to arise from I/O interrupt handling. If I look at
/proc/interrupts, it seems that all interrupts are handled by the first
physical core, and the interrupt work is then apparently processed on one of
that core's logical CPUs (CPU 0 or CPU 8). Once the tasks have run on these
CPUs, natural affinity ensures that the kernel scheduler keeps them there.
This leads to the interesting observation that if I create two tasks that do
no I/O (for example because all their I/O requests can be satisfied from the
cache), they are scheduled on two arbitrary CPUs and run fast, but if even a
single I/O operation causes an interrupt anywhere in the process, from that
point on the tasks stay on CPUs 0 and 8, even if they do no further I/O, and
run much more slowly.


It seems to me that the proper behavior for the kernel scheduler would be to
penalize running a task on a logical CPU whose sibling is already busy,
while other physical cores are idle, more heavily than it penalizes
migrating the task to a different CPU, but that does not appear to be the
case.


I can work around this issue by setting CPU affinity for the tasks to CPUs
0-7, effectively disabling hyperthreading. However, this is not an ideal
solution.
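
Concretely, the workaround amounts to something like the sketch below, using
sched_setaffinity(2) (running the tasks under "taskset -c 0-7" has the same
effect):

/* Sketch: restrict the calling process (and any children it forks or
 * execs) to CPUs 0-7, i.e. one logical CPU per physical core. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    cpu_set_t mask;
    int cpu;

    CPU_ZERO(&mask);
    for (cpu = 0; cpu < 8; cpu++)
        CPU_SET(cpu, &mask);

    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {  /* pid 0 = self */
        perror("sched_setaffinity");
        exit(1);
    }

    /* ... start the CPU-bound work (or exec the real program) here ... */
    return 0;
}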


My question then is twofold. Firstly, why are all interrupts being handled
by the first CPU? I checked the various /proc/irq/#/smp_affinity entries and
they are all 0000ffff, so that is not the issue. By writing a specific CPU
mask to those files I can get the interrupts handled by a different CPU, but
that just moves the problem; no matter what I do, I cannot get them handled
by more than one CPU. I have tried running irqbalance, but that did not help
either. Is there a way to prevent this interrupt CPU affinity, and if so,
would it fix my problem?
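
(For reference, "writing a specific CPU mask to those files" means something
like the sketch below; the IRQ number and mask are only examples, equivalent
to running "echo f0 > /proc/irq/24/smp_affinity" as root.)

/* Sketch: write a hex CPU mask to one IRQ's smp_affinity file,
 * e.g. "./setirq 24 f0" to steer the (example) IRQ 24 to CPUs 4-7. */
#include <stdio.h>

int main(int argc, char **argv)
{
    char path[64];
    FILE *f;

    if (argc != 3) {
        fprintf(stderr, "usage: %s <irq> <hexmask>\n", argv[0]);
        return 1;
    }
    snprintf(path, sizeof(path), "/proc/irq/%s/smp_affinity", argv[1]);
    f = fopen(path, "w");
    if (!f) {
        perror(path);
        return 1;
    }
    fprintf(f, "%s\n", argv[2]);   /* bitmask of CPUs allowed to handle this IRQ */
    fclose(f);
    return 0;
}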


Secondly, why does the scheduler not realize that honoring natural affinity
is a bad idea when the CPUs involved are hyper-thread siblings on the same
physical core? I thought that the Linux kernel was hyperthreading-aware and
would take these kinds of things into consideration. Is this a genuine
shortcoming of the scheduler, or is my system misconfigured somehow?


I hope you will be able to help.


Thanks,

Sven
