What does your overall design look like?

On Mon, Mar 4, 2019, 5:19 AM Jonathan Aquilina <jaquil...@eagleeyet.net> wrote:
> Hi Michael,
>
> As previously mentioned, we don't really need anything indexed, so I am
> thinking flat files are the way to go. My only concern is the performance
> of large flat files. Isn't that what HDFS is for, dealing with large flat
> files?
>
> On 04/03/2019, 14:13, "Beowulf on behalf of Michael Di Domenico" <
> beowulf-boun...@beowulf.org on behalf of mdidomeni...@gmail.com> wrote:
>
> Even though you've alluded to this being time-series data, is there a
> requirement that you index into the data, or do you just read the data
> end-to-end and do some calculations?
>
> I routinely face these kinds of issues, but we're not indexing into the
> data, so having things in HDFS or an RDBMS doesn't give us any benefit.
> We pull all the data into organized flat files and blow through them
> with HTCondor. If the researchers want to tweak the code, they do, and
> then just rerun the whole simulation.
>
> Sometimes that's minutes, sometimes days, but in either case the time
> to develop code is always much shorter, because the data is in flat
> files and easier for my "non-programmer" programmers. There is no need
> to learn HDFS/Hadoop or SQL.
>
> If you need to index the data and jump around, HDFS is probably still
> not the best solution unless you want to index the files, and 250 GB
> isn't really big enough to warrant an HDFS cluster. I've generally found
> that unless you're dealing with multi-TB+ datasets, you can't scale the
> hardware out enough to get the speedup. (Yes, I know there are tweaks
> to change this, but I've found it's just simpler to buy a bigger Lustre
> system.)
>
> On Mon, Mar 4, 2019 at 1:39 AM Jonathan Aquilina
> <jaquil...@eagleeyet.net> wrote:
> >
> > Good morning all,
> >
> > I am working on a project that I sadly can't go into much detail
> > about, but quite large amounts of data will be ingested by this system
> > and would need to be efficiently returned as output to the end user in
> > around 10 minutes or so. I am in discussions with another partner
> > involved in this project about the best way forward on this.
> >
> > For me, given the amount of data (and it is a huge amount of data), an
> > RDBMS such as PostgreSQL would be a major bottleneck. Another option
> > considered was flat files, and I think the best fit for that would be a
> > Hadoop cluster with HDFS. But in the case of HPC, how can such an
> > environment help with ingesting and analyzing large amounts of data?
> > Would such flat files be put on a SAN/NAS and accessed through an NFS
> > share for computational purposes?
> >
> > Regards,
> >
> > Jonathan
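Michael's flat-file + HTCondor pattern is essentially one independent job per file shard, rerun in full whenever the code changes. A minimal sketch of such a submit description, assuming (hypothetically) that the ~250 GB has been split into 250 numbered shards under /data/shards and that the worker script is called analyze.py; neither name comes from the thread:

# Hypothetical HTCondor submit description: one job per flat-file shard.
# $(Process) counts 0..249, so each job reads a different shard.
executable      = analyze.py
arguments       = /data/shards/shard_$(Process).csv
output          = logs/shard_$(Process).out
error           = logs/shard_$(Process).err
log             = logs/analysis.log
request_cpus    = 1
request_memory  = 2GB
queue 250

Submitting with condor_submit and resubmitting after each code tweak is the whole development loop; there is no schema or index to maintain.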
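The per-shard worker can likewise stay simple enough for "non-programmer" programmers: a straight end-to-end scan with some calculations, no HDFS or SQL. A sketch in Python, assuming a timestamp,value CSV layout with a header row (the thread never specifies the actual format):

#!/usr/bin/env python3
"""Hypothetical per-shard worker: stream one flat file end-to-end
and print a summary statistic. The column layout is an assumption."""
import csv
import sys

def scan(path):
    count, total, peak = 0, 0.0, float("-inf")
    with open(path, newline="") as fh:
        reader = csv.reader(fh)
        next(reader)  # skip the (assumed) header row
        for _, value in reader:  # assumed layout: timestamp,value
            v = float(value)
            total += v
            peak = max(peak, v)
            count += 1
    if count:
        print(f"{path}: rows={count} mean={total / count:.6f} max={peak}")

if __name__ == "__main__":
    scan(sys.argv[1])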
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf