Dear Mr. Hahn,

> the logs show that a command times out, and defies recovery.  I don't think
> your chipset is the most common - is the SATA controller integrated, or
> something like a Promise chip?
The HT1000 is an integrated controller for USB, IDE and SATA. As far as I
understand, it is the same chip as the Broadcom BCM5785.

> do you have any guess about whether your disks are getting enough power?
> it seems to be a fairly common occurrence for people to report this kind of
> "stops working" bug to the list ([EMAIL PROTECTED]), only later to
> discover that the problem was a marginal power supply.
24 of the 57 nodes have an additional InfiniBand HCA. If power were marginal,
I would expect this subset of nodes to show a higher error rate than the
others, but there seems to be no statistically significant difference.
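
One way such a comparison could be checked is Fisher's exact test on a 2x2
table of failed versus healthy nodes in each group. The sketch below is only
an illustration: the failure counts are hypothetical placeholders (the real
numbers are not given in this thread), and the function name
fisher_exact_two_sided is made up for this example.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d      # group sizes (e.g. with / without InfiniBand)
    col1 = a + c                   # total number of failed nodes
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability that x of the failed nodes fall in group 1
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))


# HYPOTHETICAL counts, not measured data: only the group sizes (24 and 33)
# follow from the 57 nodes mentioned above.
ib_failed, ib_ok = 5, 19           # 24 nodes with the extra adapter
other_failed, other_ok = 6, 27     # 33 nodes without it

p_value = fisher_exact_two_sided(ib_failed, ib_ok, other_failed, other_ok)
print(f"two-sided p-value: {p_value:.3f}")
```

A large p-value would be consistent with the observation that the extra
power draw of the adapter makes no measurable difference, although with only
57 nodes the test has limited power to detect a small effect.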

> > I also tried to reduce SATA bandwidth down to 150MB/s with a jumper at
> > the disk. This does not help either.
>
> it wouldn't, unless you had a noise problem with the cable.
It was advice from our hardware vendor.

Eoin McHugh gave me a hint that our disks might have a firmware bug and
that an update is available. (For whatever reason I had associated our disks
with Maxtor, so I had not found any firmware update on their website. But
of course the disks are from Western Digital.) This is the most promising
lead I am following now.

Thanks for your advice! SY, Th. Gebhardt
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
