Hi,
I run a 2.2.12 kernel + latest RAID (990824) + latest knfsd (1.4.7).
The OS is on an IDE disk; I have an 8 x 9 GB SCSI disk RAID-5 array with
two SCSI controllers, each having four disks (four internal, four external).
After three days of uptime (and reading 50 GB of data from tapes) the
following crash occurred:
Sep 7 01:17:06 zorn kernel: scsi0: MEDIUM ERROR on channel 0, id 4, lun 0, CDB: Read (10) 00 00 22 bc 7f 00 00 02 00
Sep 7 01:17:06 zorn kernel: Info fld=0x22bc80, Current sd08:31: sense key Medium Error
Sep 7 01:17:06 zorn kernel: Additional sense indicates Error too long to correct
Sep 7 01:17:06 zorn kernel: scsidisk I/O error: dev 08:31, sector 2276416
Sep 7 01:17:06 zorn kernel: raid5: Disk failure on sdd1, disabling device. Operation continuing on 6 devices
Sep 7 01:17:06 zorn kernel: md: recovery thread got woken up ...
Sep 7 01:17:06 zorn kernel: md0: no spare disk to reconstruct array! -- continuing in degraded mode
Sep 7 01:17:06 zorn kernel: md: recovery thread finished ...
Sep 7 01:17:06 zorn kernel: md: updating md0 RAID superblock on device
Sep 7 01:17:06 zorn kernel: sdh1 [events: 00000003](write) sdh1's sb offset: 8883840
Sep 7 01:17:06 zorn kernel: sdg1 [events: 00000003](write) sdg1's sb offset: 8907904
Sep 7 01:17:06 zorn kernel: sdf1 [events: 00000003](write) sdf1's sb offset: 8907904
Sep 7 01:17:06 zorn kernel: sde1 [events: 00000003](write) sde1's sb offset: 8883840
Sep 7 01:17:06 zorn kernel: (skipping faulty sdd1 )
Sep 7 01:17:06 zorn kernel: sdc1 [events: 00000003](write) sdc1's sb offset: 8883840
Sep 7 01:17:06 zorn kernel: sdb1 [events: 00000003](write) sdb1's sb offset: 8883840
Sep 7 01:17:06 zorn kernel: (skipping faulty sda1 )
Sep 7 01:17:06 zorn kernel: .
Sep 7 01:17:06 zorn kernel: raid5: restarting stripe 2276416
Sep 7 01:17:06 zorn kernel: raid5: md0: unrecoverable I/O error for block 7967488
Sep 7 01:17:06 zorn kernel: raid5: restarting stripe 2276418
The strange thing is that after one disk failure (sdd1), a second disk
(sda1) is reported faulty without any error from the SCSI layer.
With two disks down, the RAID-5 array is beyond automatic recovery.
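To illustrate why two failures are fatal: RAID-5 keeps one XOR parity block per stripe, which is one equation and can therefore solve for only one unknown block. A minimal sketch with made-up data (not the actual on-disk layout):

```python
# RAID-5 stores, per stripe, the XOR of the data blocks as parity.
# With one block missing it can be recomputed from the rest; with
# two missing, one XOR equation cannot determine two unknowns.

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

# Hypothetical 4-byte blocks on four data disks of one stripe.
data = [b"disk", b"fail", b"demo", b"raid"]
parity = xor_blocks(data)

# One disk (index 2) lost: rebuild it from the survivors + parity.
survivors = data[:2] + data[3:]
rebuilt = xor_blocks(survivors + [parity])
assert rebuilt == data[2]
```

With a second block also missing, XOR-ing the survivors and parity yields only the XOR of the two lost blocks, so neither can be recovered individually.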
I would like to suggest the following:
- When an array is no longer recoverable, shut it down automatically; in my
case that means right after "(skipping faulty sda1)".
- For easier recovery, provide a tool that can back up the RAID superblock
to another disk (and restore it).
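As a sketch of what the backup half of such a tool could do, assuming the 0.90 md superblock format, where the superblock sits at the last 64 KB boundary at least 64 KB before the end of the member partition (the partition sizes below are hypothetical, chosen only to reproduce the sb offsets from the log above):

```python
def md_sb_offset_kb(partition_size_kb):
    """Offset in KB of the md 0.90 superblock: the last 64 KB-aligned
    point at least 64 KB before the end of the member partition."""
    return (partition_size_kb & ~63) - 64

# Hypothetical partition sizes that reproduce the logged offsets:
assert md_sb_offset_kb(8883950) == 8883840   # the 8883840 devices
assert md_sb_offset_kb(8907970) == 8907904   # the 8907904 devices

def backup_superblock(dev_path, size_kb, out_path, sb_size=4096):
    """Copy the 4 KB superblock from a member device into a file."""
    with open(dev_path, "rb") as dev, open(out_path, "wb") as out:
        dev.seek(md_sb_offset_kb(size_kb) * 1024)
        out.write(dev.read(sb_size))
```

In practice this is just a seek-and-read, so a `dd` one-liner with the right `skip=` would do the same job once the offset is known.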
Finally, I tried the suggestions on Jakob Østergaard's page. mkraid with one
failed disk (sdd1) worked without error messages, but I could not mount
the RAID device. Then I reset sdd1 to be a RAID disk again. mkraid then
happily made a RAID array, and for some reason automatically started doing
file-system activity, which screwed up everything.
With the tool I suggested, one could try to restore a superblock and then
run a consistency check to see whether the array was recreated correctly.
The tool should also be able to recreate the superblock in degraded mode.
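The restore half of that tool, under the same hypothetical 0.90-format assumption as the backup sketch, would write the saved block back to its fixed offset before any consistency check is run (a sketch, not tested against a real array):

```python
def restore_superblock(dev_path, size_kb, backup_path, sb_size=4096):
    """Write a saved md 0.90 superblock back to its fixed offset:
    the last 64 KB boundary at least 64 KB before the partition end."""
    offset = ((size_kb & ~63) - 64) * 1024
    with open(backup_path, "rb") as src, open(dev_path, "r+b") as dev:
        dev.seek(offset)
        dev.write(src.read(sb_size))
```

After restoring, one would run a read-only check (e.g. fsck -n on the assembled array) before trusting the data again.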
Groeten, David.
________________________________________________________________________
Dr. David van der Spoel Biomedical center, Dept. of Biochemistry
s-mail: Husargatan 3, Box 576, 75123 Uppsala, Sweden
e-mail: [EMAIL PROTECTED] www: http://zorn.bmc.uu.se/~spoel
phone: 46 18 471 4205 fax: 46 18 511 755
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++