Hi Michael,

As previously mentioned, we don't really need to have anything indexed, so I am 
thinking flat files are the way to go. My only concern is the performance of 
large flat files. Isn't that what HDFS is for, dealing with large flat files?

On 04/03/2019, 14:13, "Beowulf on behalf of Michael Di Domenico" 
<beowulf-boun...@beowulf.org on behalf of mdidomeni...@gmail.com> wrote:

    even though you've alluded to this being time series data, is there a
    requirement that you have to index into the data, or do you just read
    the data end-to-end and do some calculations?
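
    to be concrete about what i mean by "read end-to-end": something like
    the python sketch below.  no index, no seeking, just a straight scan
    with a running calculation.  the file name and the two-column record
    layout are made up for illustration:

        import csv

        total = 0.0
        n = 0
        # stream the whole file front to back -- no index, no seeking
        with open("sensor_2019.csv") as f:    # hypothetical file name
            for timestamp, value in csv.reader(f):
                total += float(value)
                n += 1

        if n:
            print("mean over %d records: %g" % (n, total / n))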
    
    i routinely face these kinds of issues, but we're not indexing into
    the data, so having things in hdfs or an rdbms doesn't give us any
    benefit.  we pull all the data into organized flat files and blow
    through them with HTCondor.  if a researcher wants to tweak the code,
    they do, and then just rerun the whole simulation.
    
    sometimes that's minutes, sometimes days.  but in either case the time
    to develop code is always much shorter, because the data is in flat
    files and easier for my "non-programmer" programmers.  no need to
    learn hdfs/hadoop or sql.
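
    the condor side is equally simple.  a minimal submit file for this
    kind of sweep might look like the following; the script and file
    names are made up for illustration:

        # sweep.sub -- one condor job per flat file
        executable   = crunch.py           # researcher's analysis script (hypothetical)
        arguments    = $(datafile)         # each job scans one file end-to-end
        output       = logs/$(Process).out
        error        = logs/$(Process).err
        log          = sweep.log
        request_cpus = 1
        # queue one job for every flat file in the data directory
        queue datafile matching files data/*.dat

    condor_submit sweep.sub kicks off the whole sweep; after a code tweak
    you just run it again.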
    
    if you need to index the data and jump around, hdfs is probably still
    not the best solution unless you want to index the files, and 250gb
    isn't really big enough to warrant an hdfs cluster.  i've generally
    found that unless you're dealing with multi-TB+ datasets, you can't
    scale the hardware out enough to get the speedup.  (yes, i know there
    are tweaks to change this, but i've found it's just simpler to buy a
    bigger lustre system)
    
    
    
    On Mon, Mar 4, 2019 at 1:39 AM Jonathan Aquilina
    <jaquil...@eagleeyet.net> wrote:
    >
    > Good Morning all,
    >
    >
    >
    > I am working on a project that I sadly can't go into much detail 
    > about, but there will be quite large amounts of data ingested by 
    > this system, and the output would need to be returned efficiently 
    > to the end user in around 10 minutes or so. I am in discussions 
    > with another partner involved in this project about the best way 
    > forward on this.
    >
    >
    >
    > For me, given the amount of data (and it is a huge amount of data), 
    > an RDBMS such as PostgreSQL would be a major bottleneck. Another 
    > option that was considered is flat files, and I think the best fit 
    > for that would be a Hadoop cluster with HDFS. But in the case of 
    > HPC, how can such an environment help in terms of ingestion and 
    > analysis of large amounts of data? Would said flat files be put on 
    > a SAN/NAS or something and accessed through an NFS share for 
    > computational purposes?
    >
    >
    >
    > Regards,
    >
    > Jonathan
    >

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
