I still don't know whether this is a problem with the Linux kernel SATA
driver, a hardware problem, a flaw in the disk firmware or something else. I'm

the logs show that a command times out, and defies recovery.  I don't think
your chipset is the most common - is the SATA controller integrated, or
something like a Promise chip?

do you have any guess about whether your disks are getting enough power?
it seems to be fairly common for people to report this kind of "stops
working" bug to the list ([EMAIL PROTECTED]), only to discover later that
the problem was a marginal power supply.

looking for a way to track down the problem without substantially
interfering with the jobs on the cluster.

the sata developers hang out on linux-ide, and seem very responsive.
quite a lot of work has been done on exception handling, but as always,
it's the most common controllers which are best tested/supported.

I tried several Linux kernel versions (e.g. 2.6.18.1, currently: 2.6.20.3
from kernel.org), which seems to make no difference.

well, by kernel standards, 2.6.20.3 is fairly old; there have certainly been
plenty of SATA updates this year.

I also tried to reduce SATA bandwidth down to 150MB/s with a jumper at
the disk. This does not help either.

it wouldn't, unless you had a noise problem with the cable.
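
one way to confirm that the jumper actually took effect is to look at the
link speed libata reports when the link comes up.  a minimal sketch, assuming
the kernel log still holds the "SATA link up" lines from boot (the function
name link_speeds is made up for illustration):

```shell
# link_speeds: read kernel log text on stdin and print "<port> <speed> Gbps"
# for each libata "SATA link up" line. The function name is hypothetical.
link_speeds() {
    sed -n 's/.*\(ata[0-9][0-9]*\): SATA link up \([0-9.]*\) Gbps.*/\1 \2 Gbps/p'
}

# Typical use (reading the kernel log may need root on some systems):
#   dmesg | link_speeds
```

a port that reports 1.5 Gbps is running at the jumpered rate; if the boot
messages have already scrolled out of the log buffer this prints nothing.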

NCQ is disabled:
# cat  /sys/block/sda/device/queue_depth
1
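
On a node with several disks the same check can be looped over every device.
This is only a sketch: the function name queue_depths and its optional
sysfs-root argument are assumptions made for illustration, not an existing
tool.

```shell
# queue_depths: print "<disk> <queue_depth>" for every sd* disk.
# The optional first argument overrides the sysfs root (default /sys/block);
# both the function name and that argument are hypothetical.
queue_depths() {
    root="${1:-/sys/block}"
    for f in "$root"/sd*/device/queue_depth; do
        [ -r "$f" ] || continue      # skip if the glob matched nothing
        disk=${f#"$root"/}           # strip the leading root directory
        disk=${disk%%/*}             # keep only the disk name, e.g. sda
        printf '%s %s\n' "$disk" "$(cat "$f")"
    done
}

# Typical use:
#   queue_depths
```

A depth of 1, as above, means commands are issued one at a time, i.e. NCQ is
not in use on that disk.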

such features wouldn't cause the fairly low-level hang in your logs - to me
it looks like power, given that it appears to affect even the phy-level
disk interface.  it wouldn't hurt to see what SMART says about it (health,
metrics and even a self-test).  you might also try stressing the disk with
IO to see whether you can repeatably trigger the problem.
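
both suggestions can be sketched roughly as follows, assuming smartmontools
is installed and that /dev/sda and the file path are placeholders for your
own setup.  the function names, and the SMARTCTL override used to dry-run
the commands, are made up for illustration.

```shell
# smart_check: overall health, attribute table, then start a short self-test.
# SMARTCTL is a hypothetical override so the calls can be dry-run with echo.
smart_check() {
    dev="$1"
    smartctl_bin="${SMARTCTL:-smartctl}"
    "$smartctl_bin" -H "$dev"          # overall health assessment
    "$smartctl_bin" -A "$dev"          # vendor attributes (the "metrics")
    "$smartctl_bin" -t short "$dev"    # start a short self-test
}

# stress_file: write MIB MiB of random data to FILE, read it back and
# compare checksums, PASSES times over. Returns non-zero on a mismatch.
stress_file() {
    file="$1"; mib="${2:-64}"; passes="${3:-3}"
    pass=1
    while [ "$pass" -le "$passes" ]; do
        # checksum the stream while it is being written to the file
        written=$(dd if=/dev/urandom bs=1M count="$mib" 2>/dev/null \
                  | tee "$file" | cksum)
        sync                           # push dirty pages out to the disk
        readback=$(cksum < "$file")
        if [ "$written" != "$readback" ]; then
            echo "pass $pass: checksum mismatch on $file" >&2
            return 1
        fi
        echo "pass $pass: ok"
        pass=$((pass + 1))
    done
    rm -f "$file"
}

# Typical use (as root, on a filesystem that lives on the suspect disk):
#   smart_check /dev/sda
#   smartctl -l selftest /dev/sda      # read the result once the test is done
#   stress_file /path/on/suspect/disk/stress.bin 1024 10
```

the read-back may be served from the page cache unless the file is larger
than RAM, so treat this mainly as a way to generate write load while
watching the kernel log for the timeout, not as a data-integrity test.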

regards, mark hahn.
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
