Aren't the drives in the RAID hot-swappable? Removing the defective drive and installing a new one certainly cycled power on those two. But I'm weak on hardware, and have never knowingly relied on firmware on a disk.
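For what it's worth, when the kernel can see a disk directly (JBOD or passthrough, rather than hidden behind the MegaRAID virtual drive), it is sometimes possible to make the OS drop and re-probe the disk without touching power at all. A rough, untested sketch; sdX and hostN below are placeholders, not real device names:

  # tell the kernel to forget the device (sdX is a placeholder)
  echo 1 > /sys/block/sdX/device/delete
  # then ask the SCSI host it hangs off (hostN is a placeholder)
  # to rescan and rediscover it
  echo "- - -" > /sys/class/scsi_host/hostN/scan

Whether any of that reaches a drive the controller itself has already marked Failed is another matter entirely.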
On Tue, Sep 5, 2017 at 1:52 PM, Andrew Latham <lath...@gmail.com> wrote:
> Without a power cycle, updating the drive firmware would be the only
> method of tricking the drives into a power-cycle. Obviously very risky.
> A reboot should be low risk.
>
> On Tue, Sep 5, 2017 at 12:28 PM, mathog <mat...@caltech.edu> wrote:
>
>> Short form:
>>
>> An 8-disk (all 2 TB SATA) RAID5 system on an LSI MR-USAS2 SuperMicro
>> controller (lspci shows "LSI Logic / Symbios Logic MegaRAID SAS 2008
>> [Falcon]") was long ago configured with a small partition of one disk
>> as /boot and logical volumes for / (root) and /home on a single large
>> virtual drive on the RAID. Due to disk problems and a self goal (see
>> below) the array went into a degraded=1 state (as reported by megacli)
>> and write locked both root and home. When the failed disk was replaced
>> and the rebuild completed, both were still write locked. "mount -a"
>> didn't help in either case. A reboot brought them up normally, but
>> ideally that should not have been necessary. Is there a method to
>> remount the logical volumes writable that does not require a reboot?
>>
>> Long form:
>>
>> Periodic testing of the disks inside this array turned up pending
>> sectors with this command:
>>
>>   smartctl -a /dev/sda -d sat+megaraid,7
>>
>> A replacement disk was obtained and the usual replacement method
>> applied:
>>
>>   megacli -pdoffline -physdrv[64:7] -a0
>>   megacli -pdmarkmissing -physdrv[64:7] -a0
>>   megacli -pdprprmv -physdrv[64:7] -a0
>>   megacli -pdlocate -start -physdrv[64:7] -a0
>>
>> The disk with the flashing light was physically swapped. The smartctl
>> command was run again and, unfortunately, its values were unchanged.
>> I had always assumed that the "7" in that smartctl command was a
>> physical slot; it turns out that it is actually the "Device ID". In my
>> defense, the smartctl man page does a very poor job of describing this:
>>
>>   megaraid,N - [Linux only] the device consists of one or more
>>   SCSI/SAS disks connected to a MegaRAID controller. The non-negative
>>   integer N (in the range of 0 to 127 inclusive) denotes which disk
>>   on the controller is monitored. Use syntax such as:
>>
>> In this system, unlike the others I had worked on previously, Device
>> IDs and slots were not 1:1.
>>
>> Anyway, about a nanosecond after this was discovered, the disk at
>> Device ID 7 was marked as Failed by the controller, whereas previously
>> it had been "Online, Spun Up". Ugh. At that point the logical volumes
>> were all set read only and the OS became barely usable, with commands
>> like "more" no longer functioning. Megacli and sshd, thankfully, still
>> worked. Figuring that I had nothing to lose, the replacement disk was
>> removed from slot 7 and the original, hopefully still good, disk put
>> back. That put the system into this state:
>>
>>   slot 4 (device ID 7) Failed
>>   slot 7 (device ID 5) Offline
>>
>> and
>>
>>   megacli -PDOnline -physdrv[64:7] -a0
>>
>> put it at
>>
>>   slot 4 (device ID 7) Failed
>>   slot 7 (device ID 5) Online, Spun Up
>>
>> The logical volumes were still read only, but "more" and most other
>> commands now worked again. Megacli still showed the "degraded" value
>> as 1. I'm still not clear how the two "read only" states differed to
>> cause this change.
>>
>> At that point the failed disk in slot 4 (not 7!) was replaced with the
>> new disk (which had been briefly in slot 7) and it immediately began
>> to rebuild. Something on the order of 48 hours later that rebuild
>> completed, and the controller set "degraded" back to 0.
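Going back to the slot versus Device ID confusion a few paragraphs up: the mapping should be readable straight out of megacli, something like the following (untested here, and the exact field labels vary a little between MegaCli versions):

  # list enclosure, slot, and device ID for every physical drive
  megacli -PDList -aALL | grep -Ei 'slot number|device id'

That prints the Enclosure Device ID, Slot Number, and Device Id for each drive, which is enough to build both the -physdrv[enclosure:slot] address and the smartctl "-d sat+megaraid,N" argument without guessing.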
>> However, the logical volumes were still read only. "mount -a" didn't
>> fix it, so the system was rebooted, which worked.
>>
>> We have two of these backup systems. They are supposed to have
>> identical contents but do not. Fixing that is another item on a long
>> todo list. RAID 6 would have been a better choice for this much
>> storage, but it does not look like this card supports it:
>>
>>   RAID0, RAID1, RAID5, RAID00, RAID10, RAID50, PRL 11, PRL 11 with
>>   spanning, SRL 3 supported, PRL11-RLQ0 DDF layout with no span,
>>   PRL11-RLQ0 DDF layout with span
>>
>> That rebuild is far too long for comfort. Had another disk failed in
>> those two days, that would have been it. Neither controller has
>> battery backup, and the one in question is not even on a UPS, so a
>> power glitch could be fatal too. Not a happy thought while record
>> SoCal temperatures persisted throughout the entire rebuild! The
>> systems are in different buildings on the same campus, sharing the
>> same power grid. There are no other backups for most of this data.
>>
>> Even though the controller shows this system as no longer degraded,
>> should I believe that there was no data loss? I can run checksums on
>> all the files (even though it will take forever) and compare the two
>> systems. But as I said previously, the files were not entirely 1:1,
>> so there are certainly going to be some files on this system which
>> have no match on the other.
>>
>> Regards,
>>
>> David Mathog
>> mat...@caltech.edu
>> Manager, Sequence Analysis Facility, Biology Division, Caltech
>
> --
> - Andrew "lathama" Latham lath...@gmail.com http://lathama.com
> <http://lathama.org> -
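On the original question of remounting without a reboot: off the top of my head, and untested against a MegaRAID that has just finished a rebuild, I would try something along these lines. The volume group and LV names (vg0/root) are placeholders for whatever the system actually uses:

  # if only the filesystems were flipped read-only, a remount may be enough
  mount -o remount,rw /
  mount -o remount,rw /home

  # if the underlying block device was set read-only, check and clear
  # that first (vg0/root is a placeholder)
  blockdev --getro /dev/mapper/vg0-root
  blockdev --setrw /dev/mapper/vg0-root

  # only needed if LVM itself marked the LV read-only
  lvchange -p rw vg0/root

Depending on why the kernel forced things read-only, ext3/ext4 may refuse the read-write remount until the filesystem has been checked, in which case the reboot may genuinely have been the least-bad option.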
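On the checksum comparison, a minimal sketch, assuming the data lives under a single tree (/data, hostA, and hostB are placeholders):

  # run on each system at the top of the data tree
  cd /data
  find . -type f -print0 | xargs -0 md5sum | sort -k2 > /tmp/sums.$(hostname)

  # copy one list to the other machine, then compare; files present on
  # only one side show up as one-sided diff lines, corrupted files as
  # pairs with differing checksums
  diff /tmp/sums.hostA /tmp/sums.hostB | less

md5sum is plenty for spotting corruption and is noticeably faster than sha256sum over this much data, but it still cannot verify the files that only ever existed on one of the two systems.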
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf