Thanks for the suggestions, but when this Phoenix rises from the ashes it will be running BeeGFS over ZFS. The more I learn about GPFS, the more I am reminded of a quote seen recently on Twitter:
"People bred, selected, and compensated to find complicated solutions do not have an incentive to implement simplified ones." -- @nntaleb <https://twitter.com/nntaleb> You can only read "you should contact support" so many times in documentation and forum posts before you remember "oh yeah, IBM is a _services_ company." jbh On Sat, Apr 29, 2017 at 8:58 PM Evan Burness < evan.burn...@cyclecomputing.com> wrote: > Hi John, > > Yeah, I think the best word here is "ouch" unfortunately. I asked a few of > my GPFS-savvy colleagues and they all agreed there aren't many good options > here. > > The one "suggestion" (I promise, no Monday morning quarterbacking) I and > my storage admins friends can offer, if you have the ability to do so (both > from a technical but also from a procurement/policy change standpoint) is > to swap out spinning drives for NVMe ones for your metadata servers. Yes, > you'll still take the write performance hit from replication relative to a > non-replicated state, but modern NAND and NVMe drives are so fast and low > latency that it will still be as fast or faster than the replicated, > spinning disk approach it sounds like (please forgive me if I'm > misunderstanding this piece). > > We took this very approach on a 10+ petabyte DDN SFA14k running GPFS 4.2.1 > that was designed to house research and clinical data for a large US > hospital. They had 600+ million files b/t 0-10 MB, so we had high-end > requirements for both metadata performance AND reliability. Like you, we > tagged 4 GPFS NSD's with metadata duty and gave each a 1.6 TB Intel P3608 > NVMe disk, and the performance was still exceptionally good even with > replication because these modern drives are such fire-breathing IOPS > monsters. If you don't have as much data as this scenario, you could > definitely get away with 400 or 800 GB versions and save yourself a fair > amount of $$. > > Also, if you're looking to experiment with whether a replicated approach > can meet your needs, I suggest you check out AWS' I3 instances for > short-term testing. They have up to 8 * 1.9 TB NVMe drives. At Cycle > Computing we've helped a number of .com's and .edu's address high-end IO > needs using these or similar instances. If you have a decent background > with filesystems these cloud instances can be excellent performers, either > for test/lab scenarios like this or production environments. > > Hope this helps! > > > Best, > > Evan Burness > > ------------------------- > Evan Burness > Director, HPC > Cycle Computing > evan.burn...@cyclecomputing.com > (919) 724-9338 > > > > > > On Sat, Apr 29, 2017 at 11:13 AM, John Hanks <griz...@gmail.com> wrote: > >> There are no dumb questions in this snafu, I have already covered the >> dumb aspects adequately :) >> >> Replication was not enabled, this was scratch space set up to be as large >> and fast as possible. The fact that I can say "it was scratch" doesn't make >> it sting less, thus the grasping at straws. >> jbh >> >> On Sat, Apr 29, 2017, 7:05 PM Evan Burness < >> evan.burn...@cyclecomputing.com> wrote: >> >>> Hi John, >>> >>> I'm not a GPFS expert, but I did manage some staff that ran GPFS >>> filesystems while I was at NCSA. Those folks reeeaaalllly knew what they >>> were doing. >>> >>> Perhaps a dumb question, but should we infer from your note that >>> metadata replication is not enabled across those 4 NSDs handling it? 
>>> >>> >>> Best, >>> >>> Evan >>> >>> >>> ------------------------- >>> Evan Burness >>> Director, HPC >>> Cycle Computing >>> evan.burn...@cyclecomputing.com >>> (919) 724-9338 >>> >>> On Sat, Apr 29, 2017 at 9:36 AM, Peter St. John <peter.st.j...@gmail.com >>> > wrote: >>> >>>> just a friendly reminder that while the probability of a particular >>>> coincidence might be very low, the probability that there will be **some** >>>> coincidence is very high. >>>> >>>> Peter (pedant) >>>> >>>> On Sat, Apr 29, 2017 at 3:00 AM, John Hanks <griz...@gmail.com> wrote: >>>> >>>>> Hi, >>>>> >>>>> I'm not getting much useful vendor information so I thought I'd ask >>>>> here in the hopes that a GPFS expert can offer some advice. We have a GPFS >>>>> system which has the following disk config: >>>>> >>>>> [root@grsnas01 ~]# mmlsdisk grsnas_data >>>>> disk driver sector failure holds holds >>>>> storage >>>>> name type size group metadata data status >>>>> availability pool >>>>> ------------ -------- ------ ----------- -------- ----- ------------- >>>>> ------------ ------------ >>>>> SAS_NSD_00 nsd 512 100 No Yes ready >>>>> up system >>>>> SAS_NSD_01 nsd 512 100 No Yes ready >>>>> up system >>>>> SAS_NSD_02 nsd 512 100 No Yes ready >>>>> up system >>>>> SAS_NSD_03 nsd 512 100 No Yes ready >>>>> up system >>>>> SAS_NSD_04 nsd 512 100 No Yes ready >>>>> up system >>>>> SAS_NSD_05 nsd 512 100 No Yes ready >>>>> up system >>>>> SAS_NSD_06 nsd 512 100 No Yes ready >>>>> up system >>>>> SAS_NSD_07 nsd 512 100 No Yes ready >>>>> up system >>>>> SAS_NSD_08 nsd 512 100 No Yes ready >>>>> up system >>>>> SAS_NSD_09 nsd 512 100 No Yes ready >>>>> up system >>>>> SAS_NSD_10 nsd 512 100 No Yes ready >>>>> up system >>>>> SAS_NSD_11 nsd 512 100 No Yes ready >>>>> up system >>>>> SAS_NSD_12 nsd 512 100 No Yes ready >>>>> up system >>>>> SAS_NSD_13 nsd 512 100 No Yes ready >>>>> up system >>>>> SAS_NSD_14 nsd 512 100 No Yes ready >>>>> up system >>>>> SAS_NSD_15 nsd 512 100 No Yes ready >>>>> up system >>>>> SAS_NSD_16 nsd 512 100 No Yes ready >>>>> up system >>>>> SAS_NSD_17 nsd 512 100 No Yes ready >>>>> up system >>>>> SAS_NSD_18 nsd 512 100 No Yes ready >>>>> up system >>>>> SAS_NSD_19 nsd 512 100 No Yes ready >>>>> up system >>>>> SAS_NSD_20 nsd 512 100 No Yes ready >>>>> up system >>>>> SAS_NSD_21 nsd 512 100 No Yes ready >>>>> up system >>>>> SSD_NSD_23 nsd 512 200 Yes No ready >>>>> up system >>>>> SSD_NSD_24 nsd 512 200 Yes No ready >>>>> up system >>>>> SSD_NSD_25 nsd 512 200 Yes No to be emptied >>>>> down system >>>>> SSD_NSD_26 nsd 512 200 Yes No ready >>>>> up system >>>>> >>>>> SSD_NSD_25 is a mirror in which both drives have failed due to a >>>>> series of unfortunate events and will not be coming back. From the GPFS >>>>> troubleshooting guide it appears that my only alternative is to run >>>>> >>>>> mmdeldisk grsnas_data SSD_NSD_25 -p >>>>> >>>>> around which the documentation also warns is irreversible, the sky is >>>>> likely to fall, dogs and cats sleeping together, etc. But at this point >>>>> I'm >>>>> already in an irreversible situation. Of course this is a scratch >>>>> filesystem, of course people were warned repeatedly about the risk of >>>>> using >>>>> a scratch filesystem that is not backed up and of course many ignored >>>>> that. >>>>> I'd like to recover as much as possible here. Can anyone confirm/reject >>>>> that deleting this disk is the best way forward or if there are other >>>>> alternatives to recovering data from GPFS in this situation? 
>>>>> >>>>> Any input is appreciated. Adding salt to the wound is that until a few >>>>> months ago I had a complete copy of this filesystem that I had made onto >>>>> some new storage as a burn-in test but then removed as that storage was >>>>> consumed... As they say, sometimes you eat the bear, and sometimes, well, >>>>> the bear eats you. >>>>> >>>>> Thanks, >>>>> >>>>> jbh >>>>> >>>>> (Naively calculated probability of these two disks failing close >>>>> together in this array: 0.00001758. I never get this lucky when buying >>>>> lottery tickets.) >>>>> -- >>>>> ‘[A] talent for following the ways of yesterday, is not sufficient to >>>>> improve the world of today.’ >>>>> - King Wu-Ling, ruler of the Zhao state in northern China, 307 BC >>>>> >>>>> _______________________________________________ >>>>> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin >>>>> Computing >>>>> To change your subscription (digest mode or unsubscribe) visit >>>>> http://www.beowulf.org/mailman/listinfo/beowulf >>>>> >>>>> >>>> >>>> _______________________________________________ >>>> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin >>>> Computing >>>> To change your subscription (digest mode or unsubscribe) visit >>>> http://www.beowulf.org/mailman/listinfo/beowulf >>>> >>>> >>> >>> >>> -- >>> Evan Burness >>> Director, HPC Solutions >>> Cycle Computing >>> evan.burn...@cyclecomputing.com >>> (919) 724-9338 >>> >> -- >> ‘[A] talent for following the ways of yesterday, is not sufficient to >> improve the world of today.’ >> - King Wu-Ling, ruler of the Zhao state in northern China, 307 BC >> > > > > -- > Evan Burness > Director, HPC Solutions > Cycle Computing > evan.burn...@cyclecomputing.com > (919) 724-9338 > -- ‘[A] talent for following the ways of yesterday, is not sufficient to improve the world of today.’ - King Wu-Ling, ruler of the Zhao state in northern China, 307 BC
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf