On 2017-02-13 20:45, Ellis H. Wilson III wrote:
> On 02/13/17 14:00, Greg Lindahl wrote:
>> On Mon, Feb 13, 2017 at 07:55:43AM +0000, Tony Brian Albers wrote:
>>> Hi guys,
>>>
>>> So, we're running a small (as in a small number of nodes (10), not
>>> storage (170 TB)) Hadoop cluster here. Right now we're on IBM Spectrum
>>> Scale (GPFS), which works fine and has POSIX support. On top of GPFS we
>>> have a GPFS transparency connector so that HDFS uses GPFS.
>>
>> I don't understand the question. Hadoop comes with HDFS, and HDFS runs
>> happily on top of shared-nothing, direct-attached storage. Is there
>> something about your hardware or usage that makes this a non-starter?
>> If so, that might help folks make better suggestions.
>
> I'm guessing the "POSIX support" is the piece that's missing with a
> native HDFS installation. You can kinda-sorta get a form of it with
> plug-ins, but it's not a first-class citizen the way it is in most DFSs,
> and when I last used it, it was not performant. Native HDFS makes large
> datasets expensive to work with in anything but Hadoop-ready (largely
> MapReduce) applications. If there is a mixed workload, having a filesystem
> that can support both POSIX access and HDFS /without/ copies is invaluable.
> With extremely large datasets (170 TB is not that huge anymore), copies
> may be a non-starter. With dated codebases or applications that don't
> fit the MapReduce model, complete movement to HDFS may also be a
> non-starter.
>
> The questions I feel need to be answered here to get good answers,
> rather than a shotgun blast of random DFSs, are:
>
> 1. How much time and effort are you willing to commit to setup and
> administration of the DFS? For many completely open-source solutions
> (Lustre and HDFS come to mind), setup and, more critically, maintenance
> can become quite heavyweight, and performance tuning can grow to
> summer-grad-student-internship level.
>
> 2. Are you looking to replace the hardware, or just the DFS? These
> days, 170 TB is at the fringes (IMHO) of what can fit reasonably into a
> single (albeit rather large) box. It wouldn't be completely unthinkable
> to run all of your storage with ZFS/Btrfs, a very beefy server,
> redundant 10, 25 or 40GbE NICs, some SSD acceleration, a UPS, and
> plain-jane NFS (or your protocol of choice out of most Linux distros).
> You could even host the HDFS daemons on that node, pointing at POSIX
> paths rather than devices. But this falls into the category of "host it
> yourself," so that might be too much work.
>
> 3. How committed to HDFS are you (i.e., what features of it do your
> applications actually leverage)? Many MapReduce applications actually
> have zero attachment to HDFS whatsoever. You can reasonably re-point
> them at a POSIX-compliant NAS and they'll "just work." Plus you get
> cross-protocol access to the files without any wizardry, copying, etc.
> HBase is a notable example of where dependence on HDFS has been built
> into the code, but that's more the exception than the norm.
>
> Best,
>
> ellis
>
> Disclaimer: I work for Panasas, a storage appliance vendor. I don't
> think I'm shamelessly plugging anywhere above, as I love it when people
> host themselves, but it's not for everybody.
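A minimal sketch of the "re-pointing" in point 3 above, assuming a hypothetical NAS mount at /mnt/nas (both the mount point and the namenode name below are made up): the Hadoop FileSystem API resolves file:// and hdfs:// URIs through the same interface, so a job that only talks to that API does not care which backend it gets.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import java.net.URI;

    public class RePointExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Hypothetical NAS mount -- swap the URI for something like
            // hdfs://namenode:8020/ and the same calls run against HDFS instead.
            FileSystem fs = FileSystem.get(URI.create("file:///mnt/nas"), conf);

            // Write a file through the Hadoop API onto the POSIX mount,
            // then read its size back through the same API.
            Path p = new Path("/mnt/nas/hello.txt");
            try (FSDataOutputStream out = fs.create(p)) {
                out.writeUTF("written via the Hadoop FileSystem API onto a POSIX mount");
            }
            System.out.println("wrote " + fs.getFileStatus(p).getLen() + " bytes");
        }
    }

For jobs that take their filesystem from configuration rather than explicit URIs, the fs.defaultFS property in core-site.xml plays the same role.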
1) Pretty much whatever it takes. We have the aforementioned cluster, a
second one running only HBase (for now), and a third that is a storage
cluster for our DSpace installation, which will probably grow to tens of
petabytes within a couple of years. Being able to use the same FS on all
of them would be nice. (Yes, I know, there's probably no Swiss Army knife,
but we are willing to compromise.)

2) Just the DFS (we're having issues with IBM support, and not on the DFS
alone).

3) HBase. It doesn't work without HDFS, AFAIK.

/tony

-- 
Best regards,

Tony Albers
Systems administrator, IT-development
Royal Danish Library, Victor Albecks Vej 1, 8000 Aarhus C, Denmark.
Tel: +45 2566 2383 / +45 8946 2316
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf