;-)

On Sat, Apr 29, 2017 at 1:12 PM, John Hanks <griz...@gmail.com> wrote:
> Thanks for the suggestions, but when this Phoenix rises from the ashes it
> will be running BeeGFS over ZFS. The more I learn about GPFS, the more I am
> reminded of a quote seen recently on Twitter:
>
> "People bred, selected, and compensated to find complicated solutions do
> not have an incentive to implement simplified ones." -- @nntaleb
> <https://twitter.com/nntaleb>
>
> You can only read "you should contact support" so many times in
> documentation and forum posts before you remember "oh yeah, IBM is a
> _services_ company."
>
> jbh
>
> On Sat, Apr 29, 2017 at 8:58 PM Evan Burness <evan.burness@cyclecomputing.com> wrote:
>
>> Hi John,
>>
>> Yeah, I think the best word here is "ouch," unfortunately. I asked a few
>> of my GPFS-savvy colleagues and they all agreed there aren't many good
>> options here.
>>
>> The one "suggestion" (I promise, no Monday-morning quarterbacking) I and
>> my storage admin friends can offer, if you have the ability to do so (both
>> from a technical and from a procurement/policy-change standpoint), is to
>> swap out the spinning drives for NVMe ones on your metadata servers. Yes,
>> you'll still take the write-performance hit from replication relative to a
>> non-replicated state, but modern NAND and NVMe drives are so fast and
>> low-latency that it will still be as fast as or faster than a replicated,
>> spinning-disk approach (please forgive me if I'm misunderstanding this
>> piece).
>>
>> We took this very approach on a 10+ petabyte DDN SFA14K running GPFS
>> 4.2.1 that was designed to house research and clinical data for a large US
>> hospital. They had 600+ million files between 0 and 10 MB, so we had
>> high-end requirements for both metadata performance AND reliability. Like
>> you, we tagged 4 GPFS NSDs with metadata duty and gave each a 1.6 TB Intel
>> P3608 NVMe disk, and the performance was still exceptionally good even with
>> replication, because these modern drives are such fire-breathing IOPS
>> monsters. If you don't have as much data as in this scenario, you could
>> definitely get away with the 400 or 800 GB versions and save yourself a
>> fair amount of $$.
>>
>> Also, if you're looking to experiment with whether a replicated approach
>> can meet your needs, I suggest you check out AWS's I3 instances for
>> short-term testing. They have up to 8 x 1.9 TB NVMe drives. At Cycle
>> Computing we've helped a number of .com's and .edu's address high-end IO
>> needs using these or similar instances. If you have a decent background
>> with filesystems, these cloud instances can be excellent performers, either
>> for test/lab scenarios like this or for production environments.
>>
>> Hope this helps!
>>
>> Best,
>>
>> Evan Burness
>>
>> -------------------------
>> Evan Burness
>> Director, HPC
>> Cycle Computing
>> evan.burn...@cyclecomputing.com
>> (919) 724-9338
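A note for readers who find this thread in the archives: whether GPFS keeps a second copy of metadata is a per-filesystem replication setting, and it can sometimes be raised after creation. A minimal sketch using standard GPFS/Spectrum Scale administration commands; the filesystem name comes from the thread below, the option behavior is as documented for GPFS 4.x, and it should be verified against your release:

    # Show the default (-m) and maximum (-M) number of metadata replicas.
    mmlsfs grsnas_data -m -M

    # Raise the default metadata replica count to 2. This succeeds only if
    # MaxMetadataReplicas (-M) was set to at least 2 at mmcrfs time.
    mmchfs grsnas_data -m 2

    # Existing files keep their old replication until restriped; -R rewrites
    # the replication of existing files and metadata to match the new defaults.
    mmrestripefs grsnas_data -R

One caveat grounded in the disk listing further down: GPFS places replicas in distinct failure groups, and all four metadata NSDs there sit in failure group 200, so the failure-group assignments (changeable with mmchdisk) would need to be split before a second copy would land on independent hardware.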
>>
>> On Sat, Apr 29, 2017 at 11:13 AM, John Hanks <griz...@gmail.com> wrote:
>>
>>> There are no dumb questions in this snafu; I have already covered the
>>> dumb aspects adequately :)
>>>
>>> Replication was not enabled; this was scratch space set up to be as
>>> large and fast as possible. The fact that I can say "it was scratch"
>>> doesn't make it sting less, thus the grasping at straws.
>>>
>>> jbh
>>>
>>> On Sat, Apr 29, 2017, 7:05 PM Evan Burness <evan.burness@cyclecomputing.com> wrote:
>>>
>>>> Hi John,
>>>>
>>>> I'm not a GPFS expert, but I did manage some staff that ran GPFS
>>>> filesystems while I was at NCSA. Those folks reeeaaalllly knew what they
>>>> were doing.
>>>>
>>>> Perhaps a dumb question, but should we infer from your note that
>>>> metadata replication is not enabled across those 4 NSDs handling it?
>>>>
>>>> Best,
>>>>
>>>> Evan
>>>>
>>>> -------------------------
>>>> Evan Burness
>>>> Director, HPC
>>>> Cycle Computing
>>>> evan.burn...@cyclecomputing.com
>>>> (919) 724-9338
>>>>
>>>> On Sat, Apr 29, 2017 at 9:36 AM, Peter St. John <peter.st.j...@gmail.com> wrote:
>>>>
>>>>> Just a friendly reminder that while the probability of a particular
>>>>> coincidence might be very low, the probability that there will be *some*
>>>>> coincidence is very high.
>>>>>
>>>>> Peter (pedant)
>>>>>
>>>>> On Sat, Apr 29, 2017 at 3:00 AM, John Hanks <griz...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'm not getting much useful vendor information, so I thought I'd ask
>>>>>> here in the hope that a GPFS expert can offer some advice. We have a
>>>>>> GPFS system with the following disk config:
>>>>>>
>>>>>> [root@grsnas01 ~]# mmlsdisk grsnas_data
>>>>>> disk         driver   sector     failure holds    holds                              storage
>>>>>> name         type       size       group metadata data  status        availability  pool
>>>>>> ------------ -------- ------ ----------- -------- ----- ------------- ------------  ------------
>>>>>> SAS_NSD_00   nsd         512         100 No       Yes   ready         up            system
>>>>>> SAS_NSD_01   nsd         512         100 No       Yes   ready         up            system
>>>>>> SAS_NSD_02   nsd         512         100 No       Yes   ready         up            system
>>>>>> SAS_NSD_03   nsd         512         100 No       Yes   ready         up            system
>>>>>> SAS_NSD_04   nsd         512         100 No       Yes   ready         up            system
>>>>>> SAS_NSD_05   nsd         512         100 No       Yes   ready         up            system
>>>>>> SAS_NSD_06   nsd         512         100 No       Yes   ready         up            system
>>>>>> SAS_NSD_07   nsd         512         100 No       Yes   ready         up            system
>>>>>> SAS_NSD_08   nsd         512         100 No       Yes   ready         up            system
>>>>>> SAS_NSD_09   nsd         512         100 No       Yes   ready         up            system
>>>>>> SAS_NSD_10   nsd         512         100 No       Yes   ready         up            system
>>>>>> SAS_NSD_11   nsd         512         100 No       Yes   ready         up            system
>>>>>> SAS_NSD_12   nsd         512         100 No       Yes   ready         up            system
>>>>>> SAS_NSD_13   nsd         512         100 No       Yes   ready         up            system
>>>>>> SAS_NSD_14   nsd         512         100 No       Yes   ready         up            system
>>>>>> SAS_NSD_15   nsd         512         100 No       Yes   ready         up            system
>>>>>> SAS_NSD_16   nsd         512         100 No       Yes   ready         up            system
>>>>>> SAS_NSD_17   nsd         512         100 No       Yes   ready         up            system
>>>>>> SAS_NSD_18   nsd         512         100 No       Yes   ready         up            system
>>>>>> SAS_NSD_19   nsd         512         100 No       Yes   ready         up            system
>>>>>> SAS_NSD_20   nsd         512         100 No       Yes   ready         up            system
>>>>>> SAS_NSD_21   nsd         512         100 No       Yes   ready         up            system
>>>>>> SSD_NSD_23   nsd         512         200 Yes      No    ready         up            system
>>>>>> SSD_NSD_24   nsd         512         200 Yes      No    ready         up            system
>>>>>> SSD_NSD_25   nsd         512         200 Yes      No    to be emptied down          system
>>>>>> SSD_NSD_26   nsd         512         200 Yes      No    ready         up            system
>>>>>>
>>>>>> SSD_NSD_25 is a mirror in which both drives have failed due to a
>>>>>> series of unfortunate events and will not be coming back. From the GPFS
>>>>>> troubleshooting guide it appears that my only alternative is to run
>>>>>>
>>>>>> mmdeldisk grsnas_data SSD_NSD_25 -p
>>>>>>
>>>>>> which the documentation warns is irreversible: the sky is likely to
>>>>>> fall, dogs and cats sleeping together, etc. But at this point I'm
>>>>>> already in an irreversible situation.
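For anyone weighing the same decision later: mmdeldisk with -p removes a disk whose contents cannot be read, so any file whose only metadata copy lived on SSD_NSD_25 goes with it; the practical question is how much that is. A short sketch of scoping the damage first, again with standard GPFS commands (the offline check wants the filesystem unmounted; verify flags against your release):

    # List disks that are not in the normal ready/up state.
    mmlsdisk grsnas_data -e

    # Offline, no-change filesystem check: report inconsistencies without
    # repairing anything, to gauge how much metadata is actually gone.
    mmfsck grsnas_data -n

    # If deletion is the only way forward, -p discards the permanently
    # damaged disk; a follow-up mmfsck can then clean up dangling
    # references, and files whose metadata lives on the surviving NSDs
    # should remain intact.
    mmdeldisk grsnas_data SSD_NSD_25 -p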
>>>>>> Of course this is a scratch
>>>>>> filesystem, of course people were warned repeatedly about the risk of
>>>>>> using a scratch filesystem that is not backed up, and of course many
>>>>>> ignored that. I'd like to recover as much as possible here. Can anyone
>>>>>> confirm or reject that deleting this disk is the best way forward, or
>>>>>> suggest other alternatives for recovering data from GPFS in this
>>>>>> situation?
>>>>>>
>>>>>> Any input is appreciated. Adding salt to the wound is that until a
>>>>>> few months ago I had a complete copy of this filesystem, made onto some
>>>>>> new storage as a burn-in test, but then removed as that storage was
>>>>>> consumed... As they say, sometimes you eat the bear, and sometimes,
>>>>>> well, the bear eats you.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> jbh
>>>>>>
>>>>>> (Naively calculated probability of these two disks failing close
>>>>>> together in this array: 0.00001758. I never get this lucky when buying
>>>>>> lottery tickets.)
>>>>>> --
>>>>>> ‘[A] talent for following the ways of yesterday, is not sufficient to
>>>>>> improve the world of today.’
>>>>>> - King Wu-Ling, ruler of the Zhao state in northern China, 307 BC
>>>>
>>>> --
>>>> Evan Burness
>>>> Director, HPC Solutions
>>>> Cycle Computing
>>>> evan.burn...@cyclecomputing.com
>>>> (919) 724-9338
>>>
>>> --
>>> ‘[A] talent for following the ways of yesterday, is not sufficient to
>>> improve the world of today.’
>>> - King Wu-Ling, ruler of the Zhao state in northern China, 307 BC
>>
>> --
>> Evan Burness
>> Director, HPC Solutions
>> Cycle Computing
>> evan.burn...@cyclecomputing.com
>> (919) 724-9338
>
> --
> ‘[A] talent for following the ways of yesterday, is not sufficient to
> improve the world of today.’
> - King Wu-Ling, ruler of the Zhao state in northern China, 307 BC

--
Evan Burness
Director, HPC Solutions
Cycle Computing
evan.burn...@cyclecomputing.com
(919) 724-9338
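On the closing probability aside: the 0.00001758 figure reads like the chance of those two particular drives dying in the same window, and Peter's reminder is the birthday-paradox effect, i.e. across many mirrors and many windows, some double failure stops being surprising. A worked sketch in which the per-drive failure probability p is reverse-engineered from the thread's figure and the counts are purely illustrative:

    % Both drives of one specific mirror failing in one window, assuming
    % independent failures with per-drive probability p:
    \[ P_{\text{pair}} = p^{2}, \qquad p \approx 0.0042 \;\Rightarrow\; p^{2} \approx 1.76 \times 10^{-5} \]

    % Across m mirror pairs watched over k windows, the chance that at
    % least one pair double-fails is much larger:
    \[ P_{\text{any}} = 1 - \bigl(1 - p^{2}\bigr)^{mk} \approx m k \, p^{2} \]

    % Illustrative: m = 4 mirrors over k = 150 windows already gives
    % P_any of roughly 1%.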
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf