Looking at that disk config, your metadata was striped across 4 devices and you lost 1/4 of that. Not much you can do to come back from that.
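For anyone who later lands in the easier variant of this (data disks lost, metadata intact), the file-level scan I describe below is roughly what GPFS's mmfileid utility does. A sketch only: the disk name is illustrative and the flag syntax is recalled from the 4.x docs, so verify it against your release before trusting the output.

```shell
# Sketch of the "which files had blocks on the dead disk" scan.
FS=grsnas_data      # filesystem device (from the mmlsdisk output below)
BADDISK=SAS_NSD_05  # illustrative: one lost *data* NSD

# mmfileid walks the filesystem metadata and prints inode/snapshot/path
# for every file with at least one block on the named disk, e.g.:
#   mmfileid "$FS" -d ":$BADDISK" > affected_files.txt
# ...then restore or delete only the paths listed there.
echo "scan $FS for files with blocks on $BADDISK"
```

That list is only as good as the surviving metadata, which is why it doesn't help once a metadata NSD itself is gone.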
I had a similar but easier situation in the past where I lost some data
disks (but not metadata!). Using some low-level tools one can scan the GPFS
metadata and make a list of the files which had data blocks on the lost
data disks, and then restore/delete just those files. But in your case, you
are out of luck. I'm not sure what the behavior would be after you
mmdeldisk that disk, but I imagine you will not be able to mount the fs
after that, and your only option will be mmdelfs.

Re "BeeGFS over ZFS" vs "GPFS": I think you will find the corner-case
failure modes are not much simpler in either case. "Better the devil you
know..."

On Sat, Apr 29, 2017 at 11:14 AM Evan Burness
<evan.burn...@cyclecomputing.com> wrote:

> ;-)
>
> On Sat, Apr 29, 2017 at 1:12 PM, John Hanks <griz...@gmail.com> wrote:
>
>> Thanks for the suggestions, but when this Phoenix rises from the ashes
>> it will be running BeeGFS over ZFS. The more I learn about GPFS the more
>> I am reminded of a quote seen recently on Twitter:
>>
>> "People bred, selected, and compensated to find complicated solutions do
>> not have an incentive to implement simplified ones." -- @nntaleb
>> <https://twitter.com/nntaleb>
>>
>> You can only read "you should contact support" so many times in
>> documentation and forum posts before you remember "oh yeah, IBM is a
>> _services_ company."
>>
>> jbh
>>
>> On Sat, Apr 29, 2017 at 8:58 PM Evan Burness
>> <evan.burn...@cyclecomputing.com> wrote:
>>
>>> Hi John,
>>>
>>> Yeah, I think the best word here is "ouch," unfortunately. I asked a
>>> few of my GPFS-savvy colleagues and they all agreed there aren't many
>>> good options here.
>>>
>>> The one "suggestion" (I promise, no Monday-morning quarterbacking) I
>>> and my storage-admin friends can offer, if you have the ability to do
>>> so (both from a technical and from a procurement/policy-change
>>> standpoint), is to swap out spinning drives for NVMe ones for your
>>> metadata servers.
>>> Yes, you'll still take the write performance hit from replication
>>> relative to a non-replicated state, but modern NAND and NVMe drives are
>>> so fast and low-latency that it will still be as fast as or faster than
>>> the replicated, spinning-disk approach it sounds like you have (please
>>> forgive me if I'm misunderstanding this piece).
>>>
>>> We took this very approach on a 10+ petabyte DDN SFA14k running GPFS
>>> 4.2.1 that was designed to house research and clinical data for a large
>>> US hospital. They had 600+ million files between 0 and 10 MB, so we had
>>> high-end requirements for both metadata performance AND reliability.
>>> Like you, we tagged 4 GPFS NSDs with metadata duty and gave each a
>>> 1.6 TB Intel P3608 NVMe disk, and the performance was still
>>> exceptionally good even with replication because these modern drives
>>> are such fire-breathing IOPS monsters. If you don't have as much data
>>> as this scenario, you could definitely get away with the 400 or 800 GB
>>> versions and save yourself a fair amount of $$.
>>>
>>> Also, if you're looking to experiment with whether a replicated
>>> approach can meet your needs, I suggest you check out AWS' I3 instances
>>> for short-term testing. They have up to 8 * 1.9 TB NVMe drives. At
>>> Cycle Computing we've helped a number of .com's and .edu's address
>>> high-end IO needs using these or similar instances. If you have a
>>> decent background with filesystems, these cloud instances can be
>>> excellent performers, either for test/lab scenarios like this or
>>> production environments.
>>>
>>> Hope this helps!
>>>
>>> Best,
>>>
>>> Evan Burness
>>>
>>> -------------------------
>>> Evan Burness
>>> Director, HPC
>>> Cycle Computing
>>> evan.burn...@cyclecomputing.com
>>> (919) 724-9338
>>>
>>> On Sat, Apr 29, 2017 at 11:13 AM, John Hanks <griz...@gmail.com> wrote:
>>>
>>>> There are no dumb questions in this snafu; I have already covered the
>>>> dumb aspects adequately :)
>>>>
>>>> Replication was not enabled. This was scratch space set up to be as
>>>> large and fast as possible. The fact that I can say "it was scratch"
>>>> doesn't make it sting less, thus the grasping at straws.
>>>>
>>>> jbh
>>>>
>>>> On Sat, Apr 29, 2017, 7:05 PM Evan Burness
>>>> <evan.burn...@cyclecomputing.com> wrote:
>>>>
>>>>> Hi John,
>>>>>
>>>>> I'm not a GPFS expert, but I did manage some staff that ran GPFS
>>>>> filesystems while I was at NCSA. Those folks reeeaaalllly knew what
>>>>> they were doing.
>>>>>
>>>>> Perhaps a dumb question, but should we infer from your note that
>>>>> metadata replication is not enabled across those 4 NSDs handling it?
>>>>>
>>>>> Best,
>>>>>
>>>>> Evan
>>>>>
>>>>> On Sat, Apr 29, 2017 at 9:36 AM, Peter St. John
>>>>> <peter.st.j...@gmail.com> wrote:
>>>>>
>>>>>> Just a friendly reminder that while the probability of a particular
>>>>>> coincidence might be very low, the probability that there will be
>>>>>> **some** coincidence is very high.
>>>>>>
>>>>>> Peter (pedant)
>>>>>>
>>>>>> On Sat, Apr 29, 2017 at 3:00 AM, John Hanks <griz...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I'm not getting much useful vendor information, so I thought I'd
>>>>>>> ask here in the hopes that a GPFS expert can offer some advice.
>>>>>>> We have a GPFS system which has the following disk config:
>>>>>>>
>>>>>>> [root@grsnas01 ~]# mmlsdisk grsnas_data
>>>>>>> disk         driver   sector     failure holds    holds                                  storage
>>>>>>> name         type       size       group metadata data  status        availability pool
>>>>>>> ------------ -------- ------ ----------- -------- ----- ------------- ------------ ------------
>>>>>>> SAS_NSD_00   nsd         512         100 No       Yes   ready         up           system
>>>>>>> SAS_NSD_01   nsd         512         100 No       Yes   ready         up           system
>>>>>>> SAS_NSD_02   nsd         512         100 No       Yes   ready         up           system
>>>>>>> SAS_NSD_03   nsd         512         100 No       Yes   ready         up           system
>>>>>>> SAS_NSD_04   nsd         512         100 No       Yes   ready         up           system
>>>>>>> SAS_NSD_05   nsd         512         100 No       Yes   ready         up           system
>>>>>>> SAS_NSD_06   nsd         512         100 No       Yes   ready         up           system
>>>>>>> SAS_NSD_07   nsd         512         100 No       Yes   ready         up           system
>>>>>>> SAS_NSD_08   nsd         512         100 No       Yes   ready         up           system
>>>>>>> SAS_NSD_09   nsd         512         100 No       Yes   ready         up           system
>>>>>>> SAS_NSD_10   nsd         512         100 No       Yes   ready         up           system
>>>>>>> SAS_NSD_11   nsd         512         100 No       Yes   ready         up           system
>>>>>>> SAS_NSD_12   nsd         512         100 No       Yes   ready         up           system
>>>>>>> SAS_NSD_13   nsd         512         100 No       Yes   ready         up           system
>>>>>>> SAS_NSD_14   nsd         512         100 No       Yes   ready         up           system
>>>>>>> SAS_NSD_15   nsd         512         100 No       Yes   ready         up           system
>>>>>>> SAS_NSD_16   nsd         512         100 No       Yes   ready         up           system
>>>>>>> SAS_NSD_17   nsd         512         100 No       Yes   ready         up           system
>>>>>>> SAS_NSD_18   nsd         512         100 No       Yes   ready         up           system
>>>>>>> SAS_NSD_19   nsd         512         100 No       Yes   ready         up           system
>>>>>>> SAS_NSD_20   nsd         512         100 No       Yes   ready         up           system
>>>>>>> SAS_NSD_21   nsd         512         100 No       Yes   ready         up           system
>>>>>>> SSD_NSD_23   nsd         512         200 Yes      No    ready         up           system
>>>>>>> SSD_NSD_24   nsd         512         200 Yes      No    ready         up           system
>>>>>>> SSD_NSD_25   nsd         512         200 Yes      No    to be emptied down         system
>>>>>>> SSD_NSD_26   nsd         512         200 Yes      No    ready         up
>>>>>>> system
>>>>>>>
>>>>>>> SSD_NSD_25 is a mirror in which both drives have failed due to a
>>>>>>> series of unfortunate events and will not be coming back. From the
>>>>>>> GPFS troubleshooting guide it appears that my only alternative is
>>>>>>> to run
>>>>>>>
>>>>>>> mmdeldisk grsnas_data SSD_NSD_25 -p
>>>>>>>
>>>>>>> which the documentation warns is irreversible, the sky is likely
>>>>>>> to fall, dogs and cats sleeping together, etc. But at this point
>>>>>>> I'm already in an irreversible situation. Of course this is a
>>>>>>> scratch filesystem, of course people were warned repeatedly about
>>>>>>> the risk of using a scratch filesystem that is not backed up, and
>>>>>>> of course many ignored that. I'd like to recover as much as
>>>>>>> possible here. Can anyone confirm/reject that deleting this disk
>>>>>>> is the best way forward, or whether there are other alternatives
>>>>>>> for recovering data from GPFS in this situation?
>>>>>>>
>>>>>>> Any input is appreciated. Adding salt to the wound is that until a
>>>>>>> few months ago I had a complete copy of this filesystem that I had
>>>>>>> made onto some new storage as a burn-in test, but then removed as
>>>>>>> that storage was consumed... As they say, sometimes you eat the
>>>>>>> bear, and sometimes, well, the bear eats you.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> jbh
>>>>>>>
>>>>>>> (Naively calculated probability of these two disks failing close
>>>>>>> together in this array: 0.00001758. I never get this lucky when
>>>>>>> buying lottery tickets.)
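An aside on that parenthetical: as Peter notes above, the chance that *some* mirror in the array suffers a double failure is much larger than the chance that one particular mirror does. A quick sketch, where the per-disk probability is back-derived from the quoted 0.00001758 figure (assuming independent failures) and the mirror count assumes each of the 26 listed NSDs is a two-drive mirror:

```python
# Peter's point, in numbers: P(a specific mirror loses both disks) vs
# P(at least one mirror in the array does).
p = 0.00001758 ** 0.5  # implied per-disk failure probability (assumption)
n_mirrors = 26         # assumption: each listed NSD is a two-drive mirror

p_specific = p * p                          # one named mirror double-fails
p_some = 1 - (1 - p_specific) ** n_mirrors  # any mirror double-fails

print(f"specific mirror: {p_specific:.8f}")
print(f"some mirror:     {p_some:.8f}")  # roughly n_mirrors times larger
```

With small probabilities the union bound applies, so the "some coincidence" figure scales almost linearly with the number of mirrors.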
>>>>>>> --
>>>>>>> ‘[A] talent for following the ways of yesterday, is not sufficient
>>>>>>> to improve the world of today.’
>>>>>>> - King Wu-Ling, ruler of the Zhao state in northern China, 307 BC
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin
>>>>>>> Computing
>>>>>>> To change your subscription (digest mode or unsubscribe) visit
>>>>>>> http://www.beowulf.org/mailman/listinfo/beowulf