On 3/4/19 1:38 AM, Jonathan Aquilina wrote:
> Good morning all,
>
> I am working on a project that I sadly can't go into much detail about, but there will be quite large amounts of data ingested by this system, and the output would need to be returned to the end user efficiently, in around 10 minutes or so. I am in discussions with another partner involved in this project about the best way forward.
>
> For me, given the amount of data (and it is a huge amount of data), an RDBMS such as PostgreSQL would be a major bottleneck. Another thing that was considered was flat files, and I think the best fit for that would be a Hadoop cluster with HDFS. But in the case of HPC, how can such an environment help with ingesting and analyzing large amounts of data? Would said flat files be put on a SAN/NAS and accessed through an NFS share for computational purposes?

There has been a lot of good discussion about various tools (databases, filesystems, processing frameworks, etc) on this thread, but I fear we're putting the cart before the horse in many respects. A few key questions/concerns that need to be answered/considered before you begin the tool selection process:

1. Is there existing storage already, and if so, in what ways does it fail to meet this project's needs? This will give you key clues as to what your new storage needs to deliver, or how you might ideally improve or expand the existing storage system to meet those needs.

2. Remember that every time you create a distinct storage pool for a distinct project, you are creating a nightmare down the road and disaggregating your capacity and performance. Especially with the extremely thin pipes into hard drives today, the more you can keep them working in concert, the better. Hadoop, for all of its benefits, is a typical example of storage isolation (usually from existing, more POSIX-compliant NAS storage) that can create problems when future projects come up and can't be ported to run atop HDFS.

3. Run the software in question against your sample dataset and collect block traces, then do some analysis (see the trace-analysis sketch after this list). Is the workload predominantly random I/O, sequential I/O, or mixed? Is it metadata-heavy or data-heavy? What does the file-size distribution look like? What kind of semantics does the application expect, or can these be adapted to what the storage can provide? You may or may not be able to share some of these stats, depending on your NDA.

4. Talk with the application designers -- are there discrete phases to their application(s)? This will help you intelligently block trace those phases rather than the entire run, which would be quite onerous (see the phase-tracing sketch after this list). Do they expect additional phases down the line, or will the software "always" behave this way, roughly speaking? If you hyper-tune for a highly sequential, metadata-light workload today and future workloads then attempt to index into it, you end up with another difficult project and another discrete storage pool, which is unfortunate.

5. What future consumers of the data in question might there be down the line? The more you purpose-build this system for just this project, the bigger the headache you create for whatever future project wants to use this data as well in a slightly different way. Some balance must be struck here.
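
To make item 3 concrete, here is a rough sketch of the kind of analysis I mean. It assumes you have already captured a trace with blktrace and dumped it to text with blkparse; the device and file names are placeholders, and the "starts where the last request ended" test is only a crude proxy for sequentiality -- treat it as a starting point, not a finished tool.

# Classify queued requests in a blkparse text dump as sequential vs. random.
# Assumed capture (device and paths are placeholders):
#   blktrace -d /dev/sdX -o trace        # run while the application works
#   blkparse -i trace > trace.txt
#   python3 classify_io.py trace.txt
import sys
from collections import defaultdict

def classify(path):
    counts = {"seq": 0, "rand": 0, "read": 0, "write": 0}
    last_end = defaultdict(lambda: None)   # previous request's end sector, per direction

    with open(path) as fh:
        for line in fh:
            f = line.split()
            # default blkparse layout:
            # dev cpu seq timestamp pid action rwbs sector + nsectors [process]
            if len(f) < 10 or f[5] != "Q":
                continue                    # only count queued requests
            rwbs, sector, nsect = f[6], f[7], f[9]
            if not (sector.isdigit() and nsect.isdigit()):
                continue
            sector, nsect = int(sector), int(nsect)
            direction = "write" if "W" in rwbs else "read"
            counts[direction] += 1
            if last_end[direction] == sector:
                counts["seq"] += 1          # picks up where the previous request ended
            else:
                counts["rand"] += 1
            last_end[direction] = sector + nsect
    return counts

if __name__ == "__main__":
    c = classify(sys.argv[1])
    total = c["seq"] + c["rand"]
    if total:
        print("reads=%d writes=%d sequential=%.1f%% random=%.1f%%"
              % (c["read"], c["write"], 100.0 * c["seq"] / total, 100.0 * c["rand"] / total))

The file-size distribution, by the way, is easier to get from a simple walk of the sample dataset than from the trace.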
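
And for item 4, once the designers have told you where the phase boundaries are, wrapping each phase in its own trace is straightforward. The device path and phase commands below are made-up placeholders; the only real assumptions are that blktrace is installed and that the phases can be invoked separately.

# Capture one blktrace output per application phase.
# DEVICE and the phase commands are placeholders for your environment;
# blktrace typically needs root privileges and debugfs mounted.
import subprocess

DEVICE = "/dev/sdX"
PHASES = {
    "ingest":  ["./app", "--phase", "ingest"],     # hypothetical phase invocations
    "analyze": ["./app", "--phase", "analyze"],
}

for name, cmd in PHASES.items():
    # one trace file prefix per phase, e.g. trace_ingest.blktrace.*
    tracer = subprocess.Popen(["blktrace", "-d", DEVICE, "-o", "trace_" + name])
    try:
        subprocess.run(cmd, check=True)             # run the phase to completion
    finally:
        tracer.terminate()                          # stop tracing when the phase ends
        tracer.wait()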

I think if you provided answers to the above (and to many of Joe's points), we could give you better advice. Trying to work out what a wooden jointer plane is and how to use it before fully understanding not only the immediate task at hand, but also the potential future ways the data might be used, is a recipe for disaster.

Best,

ellis
