On 3/4/19 1:55 AM, Jonathan Aquilina wrote:
Hi Tony,

Sadly I can't go into much detail due to being under an NDA. At this point 
with the prototype we have around 250GB of sample data, but again this data is 
dependent on the type of aircraft. Larger aircraft and longer flights will 
generate a lot more data, as they have more sensors and will log more data than 
the sample data that I have. The sample data is 250GB for 35 aircraft of the 
same type.


You need to return your answers in ~10m (600s), with an assumed data set size of 250GB or more (assuming you meant GB and not Gb).  Much depends upon the nature of the calculation: whether or not you can perform the calculations on subsets, and whether it requires multiple passes through the data.

I've noticed some recommendations popping up ahead of an understanding of what the rate-limiting factors are for returning results from calculations on this data set.  I'd suggest focusing on the analysis needs to start, as this will provide some level of guidance on the system(s) design required to meet your objectives.

First off, do you know whether your code will meet this 600s response time with this 250GB data set?  I am assuming this is unknown at the moment, but if you have response time data for smaller data sets, you could construct a rough scaling study and build a simple predictive model (a rough sketch of what I mean follows the list of questions below).

Second, do you need the entire bolus of data, all 250GB, in order to generate a response to within the required accuracy?  If not, great; what size do you need?

Third, will this data set grow over time (looking at your writeup, it looks like this is a definite "yes")?

Fourth, does the code require local physical access to the entire data bolus (whatever is needed for the calculation) in order to operate correctly?

Fifth, will the data access patterns for the code be streaming, searching, or random?  In only one of these cases would a database (SQL or noSQL) be a viable option.

Sixth, is your working data set size comparable to the bolus size (e.g. 250GB)?

Seventh, can your code work correctly with sharded data (a variation on the second point)?
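
On the first point, here is a minimal sketch of what I mean by a scaling study.  The timing numbers and the linear model are pure assumptions for illustration; replace them with your own measurements, and swap in whatever functional form actually fits your runtimes.

    # Rough scaling study sketch (Python).  The (size, time) pairs below are
    # made-up placeholders, NOT real measurements.
    import numpy as np

    sizes_gb = np.array([10.0, 25.0, 50.0, 100.0])   # measured data set sizes
    times_s  = np.array([30.0, 70.0, 140.0, 275.0])  # measured response times

    # Fit time = a * size + b.  Use a different model if your timings are
    # clearly not linear (multiple passes, super-linear algorithms, etc.).
    a, b = np.polyfit(sizes_gb, times_s, 1)

    predicted = a * 250.0 + b
    print(f"predicted response time at 250 GB: {predicted:.0f} s")
    print("within 600 s budget" if predicted <= 600.0 else "over the 600 s budget")

Crude, but it tells you quickly whether you are anywhere near the 600s target before you buy or build anything.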


Now some brief "data physics".

a) (data on durable storage) 250GB @ 1GB/s -> 250s to read, once, assuming large block sequential read.  For a 600s response time, that leaves you with 350s to calculate.  Is this enough time?  Is a single pass (streaming) workable?

b) (data in RAM) 250GB @ 100GB/s -> 2.5s to walk through once, in parallel amongst multiple cores.  If multiple/many passes through the data are required, this strongly suggests a large memory machine (512GB or larger).

c) If your data is shardable, and you can distribute it amongst N machines, the above analyses still hold, replacing the 250GB with the size of the shards.  If you can do this, how much information does your code need to share amongst the worker nodes in order to effect the calculation?  This will provide guidance on interconnect choices.  (The same arithmetic as a), b), and c) is sketched in a few lines of code below.)
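
If it helps, here is the same back-of-the-envelope arithmetic in a few lines of Python, so you can plug in your own bandwidths, pass counts, and shard counts.  The bandwidth figures are just the round numbers used above, not measurements of any particular system.

    # "Data physics" calculator.  All figures are assumed round numbers.
    DATA_GB       = 250.0   # working data set size
    DISK_GB_PER_S = 1.0     # large-block sequential read from durable storage
    RAM_GB_PER_S  = 100.0   # aggregate memory bandwidth across cores
    BUDGET_S      = 600.0   # required response time

    def time_budget(data_gb, bw_gb_per_s, passes=1):
        """Seconds spent just moving the data, and what is left to compute."""
        io_s = passes * data_gb / bw_gb_per_s
        return io_s, BUDGET_S - io_s

    # a) one streaming pass from disk over the full 250GB
    io_s, compute_s = time_budget(DATA_GB, DISK_GB_PER_S)
    print(f"disk, 1 pass  : {io_s:6.1f} s I/O, {compute_s:6.1f} s left to compute")

    # b) several passes through RAM on a large-memory machine
    io_s, compute_s = time_budget(DATA_GB, RAM_GB_PER_S, passes=10)
    print(f"RAM, 10 passes: {io_s:6.1f} s walk, {compute_s:6.1f} s left to compute")

    # c) sharded across N machines: same arithmetic on the shard size
    N = 8
    io_s, compute_s = time_budget(DATA_GB / N, DISK_GB_PER_S)
    print(f"disk, {N} shards: {io_s:6.1f} s I/O per node, {compute_s:6.1f} s left")

Note this only accounts for moving the data; it says nothing about whatever inter-node communication your calculation needs, which is the interconnect question above.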


Basically, I am advocating focusing on the analysis needs, how they scale/grow, and your near/medium/long term goals with this, before you commit to a specific design/implementation.  Avoid the "if all you have is a hammer, every problem looks like a nail" view as much as possible.


--
Joe Landman
e: joe.land...@gmail.com
t: @hpcjoe
w: https://scalability.org
g: https://github.com/joelandman
l: https://www.linkedin.com/in/joelandman

