On 3/4/19 1:38 AM, Jonathan Aquilina wrote:
> Good morning all,
>
> I am working on a project that I sadly can't go into much detail about, but there will be quite large amounts of data ingested by this system, and the output would need to be returned to the end user efficiently, in around 10 minutes or so. I am in discussions with another partner involved in this project about the best way forward.
>
> For me, given the amount of data (and it is a huge amount of data), an RDBMS such as PostgreSQL would be a major bottleneck. Another thing that was considered was flat files, and I think the best fit for that would be a Hadoop cluster with HDFS. But in the case of HPC, how can such an environment help with ingesting and analyzing large amounts of data? Would said flat files be put on a SAN/NAS and accessed through an NFS share for computational purposes?

There has been a lot of good discussion about various tools (databases, filesystems, processing frameworks, etc) on this thread, but I fear we're putting the cart before the horse in many respects. A few key questions/concerns that need to be answered/considered before you begin the tool selection process:

1. Is there existing storage already, and if so, in what ways does it fail to meet this project's needs? This will give you key clues as to what your new storage needs to deliver, or how you might ideally improve or expand the existing storage system to meet those needs.

2. Remember that every time you create a distinct storage pool for a distinct project, you are creating a nightmare down the road and disaggregating your capacity and performance. Especially with the extremely thin pipes into hard drives today, the more you can keep them working in concert, the better. Hadoop, for all of its benefits, is a typical example of storage isolation (usually from existing, more POSIX-compliant NAS storage) that can create problems when future projects come up and can't be ported to run atop HDFS.

3. Run the software in question against your sample dataset and collect block traces, then do some analysis (see the trace-analysis sketch after this list). Is the workload predominantly random I/O, sequential I/O, or mixed? Is it metadata-heavy or data-heavy? What does the file-size distribution look like? What kind of semantics does the application expect, or can these be adapted to what the storage can provide? You may or may not be able to share some of these stats, depending on your NDA.

4. Talk with the application designers -- are there discrete phases to their application(s)? This will help you intelligently block trace those phases rather than the entire run, which would be quite onerous (see the phase-tracing sketch after this list). Do they expect additional phases down the line, or will the software "always" behave this way, roughly speaking? If you hyper-tune for a highly sequential, metadata-light workload today and future workloads then attempt to index into it, you end up with another difficult project and another discrete storage pool, which is unfortunate.

5. What future consumers of the data in question might there be down the line? The more you purpose-build this system for just this project, the bigger the headache you create for whatever future project wants to use this data as well in a slightly different way. Some balance must be struck here.
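
To make item 3 concrete, here is a rough sketch of the kind of analysis I mean. It assumes you have already captured a trace with blktrace and dumped it to text with blkparse; the device and file names are placeholders, and the "starts where the last request ended" test is only a crude proxy for sequentiality -- treat it as a starting point, not a finished tool.

# Classify queued requests in a blkparse text dump as sequential vs. random.
# Assumed capture (device and paths are placeholders):
#   blktrace -d /dev/sdX -o trace        # run while the application works
#   blkparse -i trace > trace.txt
#   python3 classify_io.py trace.txt
import sys
from collections import defaultdict

def classify(path):
    counts = {"seq": 0, "rand": 0, "read": 0, "write": 0}
    last_end = defaultdict(lambda: None)   # previous request's end sector, per direction

    with open(path) as fh:
        for line in fh:
            f = line.split()
            # default blkparse layout:
            # dev cpu seq timestamp pid action rwbs sector + nsectors [process]
            if len(f) < 10 or f[5] != "Q":
                continue                    # only count queued requests
            rwbs, sector, nsect = f[6], f[7], f[9]
            if not (sector.isdigit() and nsect.isdigit()):
                continue
            sector, nsect = int(sector), int(nsect)
            direction = "write" if "W" in rwbs else "read"
            counts[direction] += 1
            if last_end[direction] == sector:
                counts["seq"] += 1          # picks up where the previous request ended
            else:
                counts["rand"] += 1
            last_end[direction] = sector + nsect
    return counts

if __name__ == "__main__":
    c = classify(sys.argv[1])
    total = c["seq"] + c["rand"]
    if total:
        print("reads=%d writes=%d sequential=%.1f%% random=%.1f%%"
              % (c["read"], c["write"], 100.0 * c["seq"] / total, 100.0 * c["rand"] / total))

The file-size distribution, by the way, is easier to get from a simple walk of the sample dataset than from the trace.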
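
And for item 4, once the designers have told you where the phase boundaries are, wrapping each phase in its own trace is straightforward. The device path and phase commands below are made-up placeholders; the only real assumptions are that blktrace is installed and that the phases can be invoked separately.

# Capture one blktrace output per application phase.
# DEVICE and the phase commands are placeholders for your environment;
# blktrace typically needs root privileges and debugfs mounted.
import subprocess

DEVICE = "/dev/sdX"
PHASES = {
    "ingest":  ["./app", "--phase", "ingest"],     # hypothetical phase invocations
    "analyze": ["./app", "--phase", "analyze"],
}

for name, cmd in PHASES.items():
    # one trace file prefix per phase, e.g. trace_ingest.blktrace.*
    tracer = subprocess.Popen(["blktrace", "-d", DEVICE, "-o", "trace_" + name])
    try:
        subprocess.run(cmd, check=True)             # run the phase to completion
    finally:
        tracer.terminate()                          # stop tracing when the phase ends
        tracer.wait()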

I think if you provided answers to the above (and to many of Joe's points), we could give you better advice. Trying to work out what a wooden jointer plane is and how to use it before fully understanding not only the immediate task at hand, but also the potential future ways the data might be used, is a recipe for disaster.

Best,

ellis
