Without a power cycle, updating the drive firmware would be the only way to trick the drives into power-cycling. That is obviously very risky. A reboot should be low risk.
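Before scheduling the reboot, it may help to confirm exactly which filesystems the kernel still has mounted read-only. Below is a minimal sketch, assuming a Linux host that exposes /proc/mounts. The function name list_ro_mounts is made up for illustration. The remount command in the trailing comment is the usual thing to try, but nothing in this thread shows that it succeeds after the controller has write-locked the volume.

```shell
#!/bin/sh
# Print the mount point of every filesystem whose mount options include "ro".
# Argument: a file in /proc/mounts format (defaults to /proc/mounts).
# Fields per line: device, mount point, fs type, options, dump, pass.
list_ro_mounts() {
    awk '{
        n = split($4, opts, ",")          # options are comma-separated
        for (i = 1; i <= n; i++)
            if (opts[i] == "ro") {        # exact match, so "errors=remount-ro" is ignored
                print $2
                break
            }
    }' "${1:-/proc/mounts}"
}

# Usage (as root), for each mount point the function reports:
#   list_ro_mounts
#   mount -o remount,rw /home    # may be refused if the kernel flagged I/O errors
```

If the remount is refused, the kernel log (dmesg) usually says why, and the reboot remains the fallback.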
On Tue, Sep 5, 2017 at 12:28 PM, mathog <mat...@caltech.edu> wrote:

> Short form:
>
> An 8 disk (all 2TB SATA) RAID5 on an LSI MR-USAS2 SuperMicro controller
> (lspci shows " LSI Logic / Symbios Logic MegaRAID SAS 2008 [Falcon]")
> system was long ago configured with a small partition of one disk as /boot
> and logical volumes for / (root) and /home on a single large virtual drive
> on the RAID. Due to disk problems and a self goal (see below) the array
> went into a degraded=1 state (as reported by megacli) and write locked both
> root and home. When the failed disk was replaced and the rebuild completed
> those were both still write locked. "mount -a" didn't help in either
> case. A reboot brought them up normally but ideally that should not have
> been necessary. Is there a method to remount the logical volumes writable
> that does not require a reboot?
>
> Long form:
>
> Periodic testing of the disks inside this array turned up pending sectors
> with this command:
>
>   smartctl -a /dev/sda -d sat+megaraid,7
>
> A replacement disk was obtained and the usual replacement method applied:
>
>   megacli -pdoffline -physdrv[64:7] -a0
>   megacli -pdmarkmissing -physdrv[64:7] -a0
>   megacli -pdprprmv -physdrv[64:7] -a0
>   megacli -pdlocate -start -physdrv[64:7] -a0
>
> The disk with the flashing light was physically swapped. The smartctl was
> run again and unfortunately its values were unchanged. I had always
> assumed that the "7" in that smartctl was a physical slot; it turns out
> that it is actually the "Device ID". In my defense the smartctl man page
> does a very poor job describing this:
>
>   megaraid,N - [Linux only] the device consists of one or more SCSI/SAS
>   disks connected to a MegaRAID controller. The non-negative integer N (in
>   the range of 0 to 127 inclusive) denotes which disk on the controller
>   is monitored. Use syntax such as:
>
> In this system, unlike the others I had worked on previously, Device ID and
> slots were not 1:1.
>
> Anyway, about a nanosecond after this was discovered the disk at Device ID
> 7 was marked as Failed by the controller whereas previously it had been
> "Online, Spun Up".
> Ugh. At that point the logical volumes were all set read only and the OS
> became barely usable, with commands like "more" no longer functioning.
> Megacli and sshd, thankfully, still worked. Figuring that I had nothing to
> lose, the replacement disk was removed from slot 7 and the original,
> hopefully still good, disk was put back. That put the system into this
> state:
>
>   slot 4 (device ID 7) failed.
>   slot 7 (device ID 5) is Offline.
>
> and
>
>   megacli -PDOnline -physdrv[64:7] -a0
>
> put it at
>
>   slot 4 (device ID 7) failed.
>   slot 7 (device ID 5) Online, Spun Up
>
> The logical volumes were still read only but "more" and most other
> commands now worked again. Megacli still showed the "degraded" value as
> 1. I'm still not clear how the two "read only" states differed to cause
> this change.
>
> At that point the failed disk in slot 4 (not 7!) was replaced with the
> new disk (which had been briefly in slot 7) and it immediately began to
> rebuild. Something on the order of 48 hours later that rebuild completed,
> and the controller set "degraded" back to 0. However, the logical volumes
> were still read only. "mount -a" didn't fix it, so the system was rebooted,
> which worked.
>
>
> We have two of these backup systems. They are supposed to have identical
> contents but do not. Fixing that is another item on a long todo list.
> RAID 6 would have been a better choice for this much storage, but it does
> not look like this card supports it:
>
>   RAID0, RAID1, RAID5, RAID00, RAID10, RAID50, PRL 11, PRL 11 with
>   spanning, SRL 3 supported, PRL11-RLQ0 DDF layout with no span,
>   PRL11-RLQ0 DDF layout with span
>
> That rebuild is far too long for comfort. Had another disk failed in
> those two days that would have been it. Neither controller has battery
> backup, and the one in question is not even on a UPS, so a power glitch
> could be fatal too. Not a happy thought while record SoCal temperatures
> persisted throughout the entire rebuild! The systems are in different
> buildings on the same campus, sharing the same power grid. There are no
> other backups for most of this data.
>
> Even though the controller shows this system as no longer degraded, should
> I believe that there was no data loss? I can run checksums on all the
> files (even though it will take forever) and compare the two systems. But
> as I said previously, the files were not entirely 1:1, so there are
> certainly going to be some files on this system which have no match on the
> other.
>
> Regards,
>
> David Mathog
> mat...@caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech
> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>

--
- Andrew "lathama" Latham lath...@gmail.com http://lathama.com <http://lathama.org> -