I read, though, that Postgres can handle time series data without problems. 
My concern is whether the clients will want to do complex big-data analytics 
on the data. At this stage we are just prototyping and things are very much 
up in the air, but I am wondering whether sticking with Hadoop and HDFS is 
the best way to go in terms of performance and overall analytical 
capabilities.
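
For a sense of what that looks like in stock PostgreSQL, here is a minimal 
sketch of the kind of analytical query involved, from Python via psycopg2 
(the table, columns, and connection string are hypothetical, purely for 
illustration):

    import psycopg2  # standard PostgreSQL client for Python

    conn = psycopg2.connect("dbname=flightdata")  # hypothetical DSN
    cur = conn.cursor()

    # Per-minute average of one aircraft's sensor values over one day:
    # plain SQL aggregation, which Postgres handles well when ts is
    # indexed (a BRIN index suits append-only time series data).
    cur.execute("""
        SELECT date_trunc('minute', ts) AS minute, avg(value)
        FROM sensor_readings
        WHERE aircraft_id = %s AND ts >= %s AND ts < %s
        GROUP BY minute
        ORDER BY minute
    """, ("AC-017", "2019-03-01", "2019-03-02"))
    for minute, avg_value in cur.fetchall():
        print(minute, avg_value)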

What I am trying to understand is how Hadoop, being written in Java, is so 
performant.

Regards,
Jonathan

On 04/03/2019, 12:11, "Beowulf on behalf of Fred Youhanaie" 
<beowulf-boun...@beowulf.org on behalf of f...@anydata.co.uk> wrote:

    Hi Jonathan
    
    I have used PostgreSQL for collecting data, but there's nothing there 
    that would be of use to you!
    
    A few years ago I set up a similar system (in a hurry) at a small 
    company. The bulk data was compressed and made available to the 
    applications via NFS (IPoIB). The applications were responsible for 
    decompressing and pre/post-processing the data. Later, one of the 
    developers created a PostgreSQL-based system to hold all the data, using 
    C++ for all the data handling. That system was never used, even though 
    all the historical data was loaded into the database!
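
    As an aside, the application side of that setup was simple. A minimal 
    Python sketch of the access pattern, with a made-up path, streaming 
    decompression straight off the NFS mount:

        import gzip

        # Hypothetical file on the NFS (IPoIB) mount; it is decompressed
        # as a stream rather than copied locally first.
        with gzip.open("/mnt/nfs/bulk/flight_0001.csv.gz", "rt") as f:
            for line in f:
                fields = line.rstrip("\n").split(",")
                # application-specific pre/post-processing went here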
    
    Your choice of components is going to depend on how your analytics 
    software is going to access the data. If the data are read and processed 
    only once, then loading them into a database and querying them once may 
    not pay off.
    
    Cheers,
    Fred
    
    On 04/03/2019 09:24, Jonathan Aquilina wrote:
    > Hi Fred,
    > 
    > My colleague and I did some research and found an extension for 
    > PostgreSQL called TimescaleDB, but upon further research plain Postgres 
    > is good for such data as well. The thing is, the data are not going to 
    > be given to us as they come in, but in bulk at the end, from the parent 
    > company.
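    > 
    > For illustration, making a plain table into a TimescaleDB hypertable is 
    > one extra call on top of ordinary SQL. A minimal sketch from Python, 
    > with hypothetical names, assuming the timescaledb extension is already 
    > installed:
    > 
    >     import psycopg2
    > 
    >     conn = psycopg2.connect("dbname=flightdata")  # hypothetical DSN
    >     cur = conn.cursor()
    >     cur.execute("CREATE TABLE readings ("
    >                 " ts TIMESTAMPTZ NOT NULL,"
    >                 " tag TEXT NOT NULL,"
    >                 " value DOUBLE PRECISION)")
    >     # create_hypertable() is TimescaleDB's function that partitions
    >     # an ordinary table into time-based chunks behind the scenes.
    >     cur.execute("SELECT create_hypertable('readings', 'ts')")
    >     conn.commit()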
    > 
    > Have you used PostgreSQL for these types of data, and how has it performed?
    > 
    > Regards,
    > Jonathan
    > 
    > On 04/03/2019, 10:19, "Beowulf on behalf of Fred Youhanaie" 
    > <beowulf-boun...@beowulf.org on behalf of f...@anydata.co.uk> wrote:
    > 
    >      Hi Jonathan,
    >      
    >      It seems you're collecting metrics and time series data. Perhaps a 
    >      time series database (TSDB) is an option for you. There are a few 
    >      of these out there, but I don't have a personal recommendation.
    >      
    >      Cheers,
    >      Fred
    >      
    >      On 04/03/2019 07:04, Jonathan Aquilina wrote:
    >      > These would be numerical data such as integers or floating-point 
    >      > numbers.
    >      >
    >      > -----Original Message-----
    >      > From: Tony Brian Albers <t...@kb.dk>
    >      > Sent: 04 March 2019 08:04
    >      > To: beowulf@beowulf.org; Jonathan Aquilina <jaquil...@eagleeyet.net>
    >      > Subject: Re: [Beowulf] Large amounts of data to store and process
    >      >
    >      > Hi Jonathan,
    >      >
    >      > From my limited knowledge of the technologies, I would say that 
    >      > HBase with file pointers to the files placed on HDFS would suit 
    >      > you well.
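    >      > 
    >      > To make the pattern concrete, a minimal sketch with the happybase 
    >      > Python client, where an HBase row holds only small metadata plus 
    >      > a pointer (an HDFS path) and the bulk payload stays on HDFS. All 
    >      > hosts, tables, and paths here are hypothetical:
    >      > 
    >      >     import happybase  # Thrift-based Python client for HBase
    >      > 
    >      >     # Assumes a 'flights' table with a 'meta' column family
    >      >     # already exists, and an HBase Thrift server is running.
    >      >     conn = happybase.Connection("hbase-master")  # hypothetical
    >      >     table = conn.table("flights")
    >      > 
    >      >     # Store only the pointer and small metadata in HBase; the
    >      >     # big sensor file itself lives on HDFS.
    >      >     table.put(b"AC-017#2019-03-01", {
    >      >         b"meta:aircraft": b"AC-017",
    >      >         b"meta:hdfs_path": b"/data/flights/2019/03/01/AC-017.bin",
    >      >     })
    >      > 
    >      >     # Lookup is a cheap single-row get; the application then
    >      >     # reads the file from HDFS at the returned path.
    >      >     row = table.row(b"AC-017#2019-03-01")
    >      >     print(row[b"meta:hdfs_path"])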
    >      >
    >      > But if the files are log files, consider tools suited to 
    >      > analyzing them, such as Kibana.
    >      >
    >      > /tony
    >      >
    >      >
    >      > On Mon, 2019-03-04 at 06:55 +0000, Jonathan Aquilina wrote:
    >      >> Hi Tony,
    >      >>
    >      >> Sadly, I can't go into much detail because I'm under an NDA. At 
    >      >> this point, with the prototype, we have around 250 GB of sample 
    >      >> data, but again, this depends on the type of aircraft: larger 
    >      >> aircraft and longer flights will generate a lot more data, as 
    >      >> they have more sensors and will log more than the sample data I 
    >      >> have. The 250 GB of sample data covers 35 aircraft of the same 
    >      >> type.
    >      >>
    >      >> Regards,
    >      >> Jonathan
    >      >>
    >      >> -----Original Message-----
    >      >> From: Tony Brian Albers <t...@kb.dk>
    >      >> Sent: 04 March 2019 07:48
    >      >> To: beowulf@beowulf.org; Jonathan Aquilina <jaquil...@eagleeyet.net>
    >      >> Subject: Re: [Beowulf] Large amounts of data to store and process
    >      >>
    >      >> On Mon, 2019-03-04 at 06:38 +0000, Jonathan Aquilina wrote:
    >      >>> Good Morning all,
    >      >>>
    >      >>> I am working on a project that I sadly can't go into much 
    >      >>> detail about, but quite large amounts of data will be ingested 
    >      >>> by this system and will need to be efficiently returned as 
    >      >>> output to the end user in around 10 minutes or so. I am in 
    >      >>> discussions with another partner involved in this project 
    >      >>> about the best way forward on this.
    >      >>>
    >      >>> For me, given the amount of data (and it is a huge amount), an 
    >      >>> RDBMS such as PostgreSQL would be a major bottleneck. Another 
    >      >>> option considered was flat files, and I think the best fit for 
    >      >>> those would be a Hadoop cluster with HDFS. But in the case of 
    >      >>> HPC, how can such an environment help with ingesting and 
    >      >>> analyzing large amounts of data? Would said flat files be put 
    >      >>> on a SAN/NAS or something similar, and accessed through an NFS 
    >      >>> share for computational purposes?
    >      >>>
    >      >>> Regards,
    >      >>> Jonathan
    >      >>
    >      >> Good morning,
    >      >>
    >      >> Around here, we're using HBase for similar purposes. We have a 
    >      >> bunch of smaller nodes storing the data, and all the management 
    >      >> nodes (Ambari, HDFS namenodes, etc.) are VMs.
    >      >>
    >      >> Our nodes are configured so that we have a maximum of 2 cores 
    >      >> per disk spindle and 4 GB of memory for each core. This seems 
    >      >> to do the trick and is pretty responsive.
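    >      >> As a worked example of that rule of thumb, a hypothetical node 
    >      >> with 12 disk spindles would get at most 24 cores, and those 24 
    >      >> cores would get 96 GB of RAM.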
    >      >>
    >      >> But to be able to provide better advice, you will probably need 
    >      >> to go into a bit more detail about what types of data you will 
    >      >> be storing and what kinds of calculations you want to perform.
    >      >>
    >      >> /tony
    >      >>
    >      >>
    >      >> --
    >      >> Tony Albers - Systems Architect - IT Development, Royal Danish 
    >      >> Library, Victor Albecks Vej 1, 8000 Aarhus C, Denmark
    >      >> Tel: +45 2566 2383 - CVR/SE: 2898 8842 - EAN: 5798000792142
    >      >
    >      > --
    >      > Tony Albers - Systems Architect - IT Development, Royal Danish 
    >      > Library, Victor Albecks Vej 1, 8000 Aarhus C, Denmark
    >      > Tel: +45 2566 2383 - CVR/SE: 2898 8842 - EAN: 5798000792142

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
