Hi,

we are running a cluster of 57 dual-Opteron nodes. Once or twice a week, one of these nodes enters an error state and can no longer connect to its I/O subsystem, so I have to reboot it. As far as I can see, the problem hits our nodes at random, i.e., the MTBF of a single node is about 6-12 months.
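The per-node figure follows from the cluster-wide failure rate, assuming failures are independent and evenly spread over all 57 nodes. A quick sketch of that arithmetic (the function name is made up for illustration):

```python
# Rough estimate: if failures are spread evenly over all nodes, the
# per-node MTBF is the node count divided by the cluster-wide failure rate.
NODES = 57
WEEKS_PER_MONTH = 365.25 / 12 / 7  # about 4.35 weeks in an average month


def node_mtbf_months(failures_per_week, nodes=NODES):
    """Per-node MTBF in months for a given cluster-wide failure rate."""
    weeks_between_failures_per_node = nodes / failures_per_week
    return weeks_between_failures_per_node / WEEKS_PER_MONTH


if __name__ == "__main__":
    for rate in (1, 2):
        print(f"{rate} failure(s)/week -> {node_mtbf_months(rate):.1f} months")
```

With one to two failures per week, this gives roughly 6.5 to 13 months per node, in line with the estimate above.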
I still don't know whether this is a problem with the Linux kernel SATA driver, a hardware problem, a flaw in the disk firmware, or something else. I'm looking for a way to track down the problem without substantially interfering with the jobs on the cluster.

This is our environment:

  TYAN S3992 motherboard with Serverworks HT1000+2000 chipset.
  2 DualCore Opteron 2216 HE 2.4GHz, 16GByte Mem
  Maxtor 250GByte SATA disk, WDC WD2500YS-01SHB0, firmware rev. 20.06C03
  Debian sarge amd64 (custom kernel)

I tried several Linux kernel versions (e.g. 2.6.18.1, currently: 2.6.20.3 from kernel.org), which seems to make no difference. I also tried to reduce the SATA bandwidth to 150MB/s with a jumper on the disk. This does not help either.

NCQ is disabled:

  # cat /sys/block/sda/device/queue_depth
  1

Any ideas?

Thanks,
Thomas

+++++++++++++++++++

Here is a typical console error log. As far as I can see, it means that the communication between the kernel and the disk suddenly gets interrupted.

May 17 04:39:51 ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x40000000 action 0x2 frozen
May 17 04:39:51 ata1.00: cmd ca/00:50:9a:32:7b/00:00:00:00:00/e0 tag 0 cdb 0x0 data 40960 out
May 17 04:39:51 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
May 17 04:39:58 ata1: port is slow to respond, please be patient (Status 0xd0)
May 17 04:40:21 ata1: port failed to respond (30 secs, Status 0xd0)
May 17 04:40:21 ata1: soft resetting port
May 17 04:40:28 ata1: port is slow to respond, please be patient (Status 0xd0)
May 17 04:40:51 ata1: port failed to respond (30 secs, Status 0xd0)
May 17 04:40:51 ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
May 17 04:40:51 ATA: abnormal status 0xD0 on port 0xFFFFC2000000401C
May 17 04:40:51 ATA: abnormal status 0xD0 on port 0xFFFFC2000000401C
May 17 04:40:51 ATA: abnormal status 0xD0 on port 0xFFFFC2000000401C
May 17 04:40:51 ATA: abnormal status 0xD0 on port 0xFFFFC2000000401C
May 17 04:40:51 ATA: abnormal status 0xD0 on port 0xFFFFC2000000401C
May 17 04:40:51 ATA: abnormal status 0xD0 on port 0xFFFFC2000000401C
May 17 04:40:51 ATA: abnormal status 0xD0 on port 0xFFFFC2000000401C
May 17 04:41:21 ata1.00: qc timeout (cmd 0xec)
May 17 04:41:22 ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
May 17 04:41:22 ata1.00: revalidation failed (errno=-5)
May 17 04:41:22 ata1: failed to recover some devices, retrying in 5 secs
May 17 04:41:26 ata1: hard resetting port
May 17 04:41:34 ata1: port is slow to respond, please be patient (Status 0xd0)
May 17 04:41:57 ata1: port failed to respond (30 secs, Status 0xd0)
May 17 04:41:57 ata1: COMRESET failed (device not ready)
May 17 04:41:57 ata1: hardreset failed, retrying in 5 secs
May 17 04:42:02 ata1: hard resetting port
May 17 04:42:09 ata1: port is slow to respond, please be patient (Status 0xd0)
May 17 04:42:32 ata1: port failed to respond (30 secs, Status 0xd0)
May 17 04:42:32 ata1: COMRESET failed (device not ready)
May 17 04:42:32 ata1: hardreset failed, retrying in 5 secs
May 17 04:42:37 ata1: hard resetting port
May 17 04:42:45 ata1: port is slow to respond, please be patient (Status 0xd0)
May 17 04:43:08 ata1: port failed to respond (30 secs, Status 0xd0)
May 17 04:43:08 ata1: COMRESET failed (device not ready)
May 17 04:43:08 ata1: reset failed, giving up
May 17 04:43:08 ata1.00: disabled
May 17 04:43:08 ata1: EH complete
May 17 04:43:08 sd 0:0:0:0: SCSI error: return code = 0x00040000
May 17 04:43:08 end_request: I/O error, dev sda, sector 8073882
May 17 04:43:08 Buffer I/O error on device sda2, logical block 9189
May 17 04:43:08 lost page write due to I/O error on sda2
May 17 04:43:08 sd 0:0:0:0: SCSI error: return code = 0x00040000
May 17 04:43:08 end_request: I/O error, dev sda, sector 16099660
May 17 04:43:08 Buffer I/O error on device sda3, logical block 12365
May 17 04:43:08 lost page write due to I/O error on sda3
May 17 04:43:08 sd 0:0:0:0: SCSI error: return code = 0x00040000
May 17 04:43:08 end_request: I/O error, dev sda, sector 73606884
May 17 04:43:08 Buffer I/O error on device sda3, logical block 7200768
May 17 04:43:08 lost page write due to I/O error on sda3
....

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
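One low-impact way to compare incidents across nodes is to summarize captured console logs like the one above offline, without touching the running jobs. The following is only a sketch under assumptions: the event labels and the `summarize` helper are hypothetical names chosen for illustration, the substrings are copied from the log in this post, and the status bits are the standard ATA status register definitions. It decodes the repeated status byte 0xD0 and counts the main error-handling events.

```python
import re
from collections import Counter

# Standard ATA status register bits, highest bit first.
STATUS_BITS = {
    0x80: "BSY",
    0x40: "DRDY",
    0x20: "DF",
    0x10: "DSC",
    0x08: "DRQ",
    0x04: "CORR",
    0x02: "IDX",
    0x01: "ERR",
}

# Hypothetical event labels mapped to substrings seen in the log above.
EVENT_PATTERNS = {
    "timeout": "(timeout)",
    "soft_reset": "soft resetting port",
    "hard_reset": "hard resetting port",
    "comreset_failed": "COMRESET failed",
    "device_disabled": ": disabled",
    "io_error": "end_request: I/O error",
}

# Matches "Status 0xd0" and "abnormal status 0xD0", but not "SStatus 113".
STATUS_RE = re.compile(r"[Ss]tatus (0x[0-9a-fA-F]+)")


def decode_status(value):
    """Return the names of the bits set in an ATA status byte."""
    return [name for bit, name in STATUS_BITS.items() if value & bit]


def summarize(lines):
    """Count known events and collect the distinct status bytes reported."""
    events = Counter()
    statuses = set()
    for line in lines:
        for label, needle in EVENT_PATTERNS.items():
            if needle in line:
                events[label] += 1
        match = STATUS_RE.search(line)
        if match:
            statuses.add(int(match.group(1), 16))
    return events, statuses


if __name__ == "__main__":
    import sys

    # Usage: python script.py < captured_console.log
    events, statuses = summarize(sys.stdin)
    for label, count in sorted(events.items()):
        print(f"{label}: {count}")
    for status in sorted(statuses):
        print(f"status {status:#04x}: {', '.join(decode_status(status))}")
```

In the log above, every reported status is 0xD0, which decodes to BSY, DRDY and DSC. BSY staying set through both the soft and the hard resets fits the picture of a device that has stopped responding, though that alone does not say whether the disk, the controller or the cabling is at fault.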