On 3/4/19 1:55 AM, Jonathan Aquilina wrote:
Hi Tony,

Sadly I can't go into much detail due to being under an NDA. At this point 
with the prototype we have around 250GB of sample data, but again this data is 
dependent on the type of aircraft. Larger aircraft and longer flights will 
generate a lot more data, as they have more sensors and will log more data than 
the sample data that I have. The sample data is 250GB for 35 aircraft of the 
same type.


You need to return your answers in ~10m (600s), with an assumed data set size of 250GB or more (assuming you meant GB and not Gb).  Much depends upon the nature of the calculation: whether or not you can perform the calculations on subsets, and whether it requires multiple passes through the data.

I've noticed some recommendations popping up ahead of an understanding of what the rate-limiting factors are for returning results from calculations on this data set.  I'd suggest focusing on the analysis needs to start, as this will provide some level of guidance on the system(s) design required to meet your objectives.

First off, do you know whether your code will meet this 600s response time with this 250GB data set?  I am assuming this is unknown at the moment, but if you have response time data for smaller data sets, you could construct a rough scaling study and build a simple predictive model (a rough sketch of what I mean follows the list of questions below).

Second, do you need the entire bolus of data, all 250GB, in order to generate a response to within the required accuracy?  If not, great; what size do you need?

Third, will this data set grow over time (looking at your writeup, it looks like this is a definite "yes")?

Fourth, does the code require local physical access to the entire data bolus (whatever is needed for the calculation) in order to operate correctly?

Fifth, will the data access patterns for the code be streaming, searching, or random?  In only one of these cases would a database (SQL or noSQL) be a viable option.

Sixth, is your working data set size comparable to the bolus size (e.g. 250GB)?

Seventh, can your code work correctly with sharded data (a variation on the second point)?
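
On the first point, here is a minimal sketch of what I mean by a scaling study.  The timing numbers and the linear model are pure assumptions for illustration; replace them with your own measurements, and swap in whatever functional form actually fits your runtimes.

    # Rough scaling study sketch (Python).  The (size, time) pairs below are
    # made-up placeholders, NOT real measurements.
    import numpy as np

    sizes_gb = np.array([10.0, 25.0, 50.0, 100.0])   # measured data set sizes
    times_s  = np.array([30.0, 70.0, 140.0, 275.0])  # measured response times

    # Fit time = a * size + b.  Use a different model if your timings are
    # clearly not linear (multiple passes, super-linear algorithms, etc.).
    a, b = np.polyfit(sizes_gb, times_s, 1)

    predicted = a * 250.0 + b
    print(f"predicted response time at 250 GB: {predicted:.0f} s")
    print("within 600 s budget" if predicted <= 600.0 else "over the 600 s budget")

Crude, but it tells you quickly whether you are anywhere near the 600s target before you buy or build anything.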


Now some brief "data physics".

a) (data on durable storage) 250GB @ 1GB/s -> 250s to read, once, assuming large block sequential read.  For a 600s response time, that leaves you with 350s to calculate.  Is this enough time?  Is a single pass (streaming) workable?

b) (data in RAM) 250GB @ 100GB/s -> 2.5s to walk through once, in parallel amongst multiple cores.  If multiple/many passes through the data are required, this strongly suggests a large memory machine (512GB or larger).

c) If your data is shardable, and you can distribute it amongst N machines, the above analyses still hold, replacing the 250GB with the size of the shards.  If you can do this, how much information does your code need to share amongst the worker nodes in order to effect the calculation?  This will provide guidance on interconnect choices.  (The same arithmetic as a), b), and c) is sketched in a few lines of code below.)
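
If it helps, here is the same back-of-the-envelope arithmetic in a few lines of Python, so you can plug in your own bandwidths, pass counts, and shard counts.  The bandwidth figures are just the round numbers used above, not measurements of any particular system.

    # "Data physics" calculator.  All figures are assumed round numbers.
    DATA_GB       = 250.0   # working data set size
    DISK_GB_PER_S = 1.0     # large-block sequential read from durable storage
    RAM_GB_PER_S  = 100.0   # aggregate memory bandwidth across cores
    BUDGET_S      = 600.0   # required response time

    def time_budget(data_gb, bw_gb_per_s, passes=1):
        """Seconds spent just moving the data, and what is left to compute."""
        io_s = passes * data_gb / bw_gb_per_s
        return io_s, BUDGET_S - io_s

    # a) one streaming pass from disk over the full 250GB
    io_s, compute_s = time_budget(DATA_GB, DISK_GB_PER_S)
    print(f"disk, 1 pass  : {io_s:6.1f} s I/O, {compute_s:6.1f} s left to compute")

    # b) several passes through RAM on a large-memory machine
    io_s, compute_s = time_budget(DATA_GB, RAM_GB_PER_S, passes=10)
    print(f"RAM, 10 passes: {io_s:6.1f} s walk, {compute_s:6.1f} s left to compute")

    # c) sharded across N machines: same arithmetic on the shard size
    N = 8
    io_s, compute_s = time_budget(DATA_GB / N, DISK_GB_PER_S)
    print(f"disk, {N} shards: {io_s:6.1f} s I/O per node, {compute_s:6.1f} s left")

Note this only accounts for moving the data; it says nothing about whatever inter-node communication your calculation needs, which is the interconnect question above.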


Basically, I am advocating focusing on the analysis needs, how they scale/grow, and your near/medium/long term goals with this, before you commit to a specific design/implementation.  Avoid the "if all you have is a hammer, every problem looks like a nail" view as much as possible.


--
Joe Landman
e: joe.land...@gmail.com
t: @hpcjoe
w: https://scalability.org
g: https://github.com/joelandman
l: https://www.linkedin.com/in/joelandman

