Hello,

thank you all for your advice! Since upgrading the firmware of the SATA disks (20.06C03 -> 20.06C06) we have had no further incidents, so I'm pretty sure that we have caught the bug.
Thanks again,
Th. Gebhardt

On Wednesday 23 May 2007 11:13, Gebhardt Thomas wrote:
> we are running a cluster of 57 dual opteron nodes. Once or twice a week
> one of these nodes gets in an error state and can't connect to the
> I/O-subsystem anymore. I need to reboot that node. As far as I can see,
> the problem occurs randomly at any of our nodes, i.e., the MTBF of a single
> node is about 6-12 months.
>
> I still don't know whether this is a problem of the linux kernel sata
> driver, a hardware problem, a flaw of the disk firmware or something else.
> I'm looking for a possibility to track down the problem without
> substantially interfering with the jobs on the cluster.
>
> This is our environment:
> TYAN S3992 motherboard with Serverworks HT1000+2000 chipset.
> 2 DualCore Opteron 2216 HE 2.4GHz, 16GByte Mem
> Western Digital 250GByte SATA disk, WDC WD2500YS-01SHB0, firmware rev. 20.06C03
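
For anyone who wants to check the per-node MTBF figure in the quoted message, here is a rough sketch of the arithmetic. It assumes that failures are spread evenly and independently over all 57 nodes, which matches the "occurs randomly at any of our nodes" observation but is still only an assumption. The function name and the weeks-per-month constant are mine and are not taken from any tool.

```python
# Rough per-node MTBF estimate from a cluster-wide failure rate.
# Assumption: failures hit identical nodes independently and uniformly.

NODES = 57                          # cluster size from the original post
WEEKS_PER_MONTH = 365.25 / 12 / 7   # about 4.35 weeks per average month


def node_mtbf_months(nodes, cluster_failures_per_week):
    """Return the mean time between failures of one node, in months."""
    # If the whole cluster sees r failures per week, a single node sees r / nodes,
    # so its mean time between failures is nodes / r weeks.
    mtbf_weeks = nodes / cluster_failures_per_week
    return mtbf_weeks / WEEKS_PER_MONTH


if __name__ == "__main__":
    # "Once or twice a week" gives the two ends of the range.
    for rate in (1, 2):
        months = node_mtbf_months(NODES, rate)
        print(f"{rate} failure(s)/week -> per-node MTBF of about {months:.1f} months")
```

This gives roughly 6.5 to 13 months per node, in line with the "about 6-12 months" estimate.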