> On May 25, 2017, at 4:10 PM, Kilian Cavalotti <kilian.cavalotti.w...@gmail.com> wrote:
>
> On Thu, May 25, 2017 at 8:58 AM, Ryan Novosielski <novos...@rutgers.edu> wrote:
>> I’d be interested to hear what people are doing, generally, about backing up
>> very large volumes of data (that probably seem smaller to more established
>> centers), like 500TB to 1PB. It sounds to me like a combination of
>> replication and filesystem snapshots (those replicated or not) do protect
>> against hardware failure and user failure, depending on the frequency and
>> whether or not you have any other hidden weaknesses.
>
> At Stanford, we (Research Computing) have developed a PoC using Lustre
> HSM and a Google Drive backend to back up our /scratch filesystem,
> mostly because Google Drive is free and unlimited for .edu accounts
> (^_^). We didn't announce anything to our users, so they wouldn't start
> relying on it, and we use it more as insurance against user
> "creativity" than as a real disaster-recovery mechanism.
>
> We found that this works quite well for backing up large files, but
> not so well for smaller ones, because Google enforces secret
> file-operation rate limits (I say secret because they're not the ones
> that are documented, and support doesn't want to talk about them),
> which I guess is fair for a free and unlimited service. But it means
> that for a filesystem with hundreds of millions of files, this is not
> really appropriate.
>
> We did some tests of restoring data from the Google Drive backend, and
> another limitation of the current Lustre HSM implementation is that
> the HSM coordinator doesn't prioritize restore operations. That means
> that if you have thousands of "archive" operations in the queue, the
> coordinator needs to work through all of them before processing your
> "restore" ops, which, again, in real life might be a deal-breaker for
> disaster recovery.
>
> Anyway, we had quite some fun doing it, including some nice chats with
> the Networking people on campus (which actually led to a new 100G
> data link being deployed). We've released the open-source Lustre HSM
> to Google Drive copytool that we developed on GitHub
> (https://github.com/stanford-rc/ct_gdrive). And we're now the proud
> users of about 3.3 PB on Google Drive (screenshot attached, because
> it happened).
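For anyone on the list wanting to poke at this themselves: the archive and
restore requests Kilian describes are issued through the standard lfs hsm_*
commands on a Lustre client. Below is a minimal sketch of scripting them from
Python; the file path and archive index are hypothetical, and it assumes a
mounted Lustre filesystem with a copytool (such as ct_gdrive) registered:

#!/usr/bin/env python3
# Rough sketch of driving Lustre HSM operations, assuming a mounted
# Lustre client and a registered copytool (e.g. ct_gdrive).
# The file path and archive index below are hypothetical examples.
import subprocess

SCRATCH_FILE = "/scratch/users/jdoe/results.tar"  # hypothetical path
ARCHIVE_ID = "1"                                  # hypothetical archive index

def lfs(*args):
    """Run an 'lfs' subcommand and return its stdout."""
    result = subprocess.run(
        ["lfs", *args], check=True, capture_output=True, text=True
    )
    return result.stdout

# Queue an archive request; the HSM coordinator dispatches it to the
# copytool, which copies the file out to the backend (Google Drive here).
lfs("hsm_archive", "--archive", ARCHIVE_ID, SCRATCH_FILE)

# Optionally release the local data blocks, leaving a "released" stub
# whose contents are fetched back from the backend on access.
lfs("hsm_release", SCRATCH_FILE)

# Queue a restore. As described above, this request sits in the same
# queue as pending archive operations -- the coordinator does not move
# it to the front, which is the prioritization limitation in question.
lfs("hsm_restore", SCRATCH_FILE)

# Inspect the file's HSM flags (exists, archived, released, ...).
print(lfs("hsm_state", SCRATCH_FILE))

The hsm_restore step is exactly where the queue behavior Kilian mentions
would bite during a large recovery.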
Boy, that’s great, Kilian, thanks! I’m already glad I asked.

--
 ____
|| \\UTGERS,     |---------------------------*O*---------------------------
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ  | Office of Advanced Research Computing - MSB C630, Newark
     `'