Hi,
I run a 2.2.12 kernel + latest RAID (990824) + latest knfsd (1.4.7).
The OS is on an IDE disk; I have an 8 x 9 GB SCSI disk RAID-5 array with
two SCSI controllers, each having four disks (four internal, four external).
After three days of uptime (and reading 50 GB of data from tapes) the
following crash occurred:
Sep 7 01:17:06 zorn kernel: scsi0: MEDIUM ERROR on channel 0, id 4, lun 0, CDB: Read (10) 00 00 22 bc 7f 00 00 02 00
Sep 7 01:17:06 zorn kernel: Info fld=0x22bc80, Current sd08:31: sense key Medium Error
Sep 7 01:17:06 zorn kernel: Additional sense indicates Error too long to correct
Sep 7 01:17:06 zorn kernel: scsidisk I/O error: dev 08:31, sector 2276416
Sep 7 01:17:06 zorn kernel: raid5: Disk failure on sdd1, disabling device. Operation continuing on 6 devices
Sep 7 01:17:06 zorn kernel: md: recovery thread got woken up ...
Sep 7 01:17:06 zorn kernel: md0: no spare disk to reconstruct array! -- continuing in degraded mode
Sep 7 01:17:06 zorn kernel: md: recovery thread finished ...
Sep 7 01:17:06 zorn kernel: md: updating md0 RAID superblock on device
Sep 7 01:17:06 zorn kernel: sdh1 [events: 00000003](write) sdh1's sb offset: 8883840
Sep 7 01:17:06 zorn kernel: sdg1 [events: 00000003](write) sdg1's sb offset: 8907904
Sep 7 01:17:06 zorn kernel: sdf1 [events: 00000003](write) sdf1's sb offset: 8907904
Sep 7 01:17:06 zorn kernel: sde1 [events: 00000003](write) sde1's sb offset: 8883840
Sep 7 01:17:06 zorn kernel: (skipping faulty sdd1 )
Sep 7 01:17:06 zorn kernel: sdc1 [events: 00000003](write) sdc1's sb offset: 8883840
Sep 7 01:17:06 zorn kernel: sdb1 [events: 00000003](write) sdb1's sb offset: 8883840
Sep 7 01:17:06 zorn kernel: (skipping faulty sda1 )
Sep 7 01:17:06 zorn kernel: .
Sep 7 01:17:06 zorn kernel: raid5: restarting stripe 2276416
Sep 7 01:17:06 zorn kernel: raid5: md0: unrecoverable I/O error for block 7967488
Sep 7 01:17:06 zorn kernel: raid5: restarting stripe 2276418
The strange thing is that after one disk failure (sdd1), a second disk
(sda1) is reported faulty without any error from the SCSI layer.
With two disks down, the RAID-5 array is beyond automatic recovery.
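To illustrate why two failures are fatal: RAID-5 keeps one XOR parity block per stripe, which is one equation and can therefore solve for only one unknown block. A minimal sketch with made-up data (not the actual on-disk layout):

```python
# RAID-5 stores, per stripe, the XOR of the data blocks as parity.
# With one block missing it can be recomputed from the rest; with
# two missing, one XOR equation cannot determine two unknowns.

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

# Hypothetical 4-byte blocks on four data disks of one stripe.
data = [b"disk", b"fail", b"demo", b"raid"]
parity = xor_blocks(data)

# One disk (index 2) lost: rebuild it from the survivors + parity.
survivors = data[:2] + data[3:]
rebuilt = xor_blocks(survivors + [parity])
assert rebuilt == data[2]
```

With a second block also missing, XOR-ing the survivors and parity yields only the XOR of the two lost blocks, so neither can be recovered individually.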
I would like to suggest the following:
- When an array is no longer recoverable, shut it down automatically; in my
case that means right after "(skipping faulty sda1)".
- For easier recovery, provide a tool that can back up the RAID superblock
to another disk (and restore it).
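As a sketch of what the backup half of such a tool could do, assuming the 0.90 md superblock format, where the superblock sits at the last 64 KB boundary at least 64 KB before the end of the member partition (the partition sizes below are hypothetical, chosen only to reproduce the sb offsets from the log above):

```python
def md_sb_offset_kb(partition_size_kb):
    """Offset in KB of the md 0.90 superblock: the last 64 KB-aligned
    point at least 64 KB before the end of the member partition."""
    return (partition_size_kb & ~63) - 64

# Hypothetical partition sizes that reproduce the logged offsets:
assert md_sb_offset_kb(8883950) == 8883840   # the 8883840 devices
assert md_sb_offset_kb(8907970) == 8907904   # the 8907904 devices

def backup_superblock(dev_path, size_kb, out_path, sb_size=4096):
    """Copy the 4 KB superblock from a member device into a file."""
    with open(dev_path, "rb") as dev, open(out_path, "wb") as out:
        dev.seek(md_sb_offset_kb(size_kb) * 1024)
        out.write(dev.read(sb_size))
```

In practice this is just a seek-and-read, so a `dd` one-liner with the right `skip=` would do the same job once the offset is known.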
Finally, I tried the suggestions on Jakob Østergaard's page. mkraid with one
failed disk (sdd1) worked without error messages, but I could not mount
the RAID device. Then I reset sdd1 to be a RAID disk again. mkraid then
happily made a RAID array, and for some reason automatically started doing
file-system activity, which screwed up everything.
With the tool I suggested, one could try to restore a superblock and then
run a consistency check to see whether the array was recreated correctly.
The tool should also be able to recreate the superblock in degraded mode.
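The restore half of that tool, under the same hypothetical 0.90-format assumption as the backup sketch, would write the saved block back to its fixed offset before any consistency check is run (a sketch, not tested against a real array):

```python
def restore_superblock(dev_path, size_kb, backup_path, sb_size=4096):
    """Write a saved md 0.90 superblock back to its fixed offset:
    the last 64 KB boundary at least 64 KB before the partition end."""
    offset = ((size_kb & ~63) - 64) * 1024
    with open(backup_path, "rb") as src, open(dev_path, "r+b") as dev:
        dev.seek(offset)
        dev.write(src.read(sb_size))
```

After restoring, one would run a read-only check (e.g. fsck -n on the assembled array) before trusting the data again.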
Groeten, David.
________________________________________________________________________
Dr. David van der Spoel Biomedical center, Dept. of Biochemistry
s-mail: Husargatan 3, Box 576, 75123 Uppsala, Sweden
e-mail: [EMAIL PROTECTED] www: http://zorn.bmc.uu.se/~spoel
phone: 46 18 471 4205 fax: 46 18 511 755
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++