> Hi Doug,
>
> How have they managed to squeeze so much performance out of Java for
> such big data sets?
Nothing to do with Java. Originally it had to do with "moving
computation to data": Hadoop YARN can provide data locality for Map
Reduce, i.e. large files are sliced across HDFS data nodes, and the Map
process can operate on these slices in parallel, so you run each Map
task on the node that holds its data slice (it is a bit more
complicated than that, but that is the general idea).

Now the trick is to keep intermediate results in memory, because most
high-level analytics jobs involve multiple Map Reduce steps. This is
why Spark is billed as "faster than Hadoop": everything is done across
a distributed in-memory object (spread across the nodes). Hadoop Map
Reduce now gets the same effect with the "Tez" acceleration component.
(There is a rough Spark sketch of the caching idea at the bottom of
this message.)

Here is another important point: data engineering (data cleaning,
verification, building the feature matrix) is where scale comes into
play. Running models (unless you are training an ML model) usually does
not require a huge amount of computing power.

--
Doug

>
> Regards,
> Jonathan
>
> -----Original Message-----
> From: Beowulf <beowulf-boun...@beowulf.org> On Behalf Of Douglas Eadline
> Sent: 13 October 2020 15:55
> To: Oddo Da <oddodao...@gmail.com>
> Cc: beowulf@beowulf.org
> Subject: [Beowulf] ***UNCHECKED*** Re: Spark, Julia, OpenMPI etc. - all in
> one place
>
>
> I have noticed a lot of Hadoop/Spark references in the replies.
> The word "Hadoop" is probably the most misunderstood word in computing
> today, and many people have only a somewhat vague idea of what it
> actually is.
>
> Hadoop V1 was a monolithic Map Reduce framework written in Java. (BTW,
> Map Reduce is a SIMD algorithm.)
>
> In Hadoop V2, the Map Reduce component was separated from the scheduler
> (YARN) and the underlying distributed file system (HDFS). It is best
> thought of as a "platform" for developing big data systems. The most
> popular Map Reduce application is Hive. Developed by Facebook, it
> allows relational databases to be run at scale.
>
> Hadoop V3 and beyond is moving more toward a true cloud-based
> environment with a new file system called Ozone. Note, the need for
> HDFS made cloud migration difficult.
>
> Spark is a completely separate code base that has its own Map Reduce
> engine. It can work stand-alone, with the YARN scheduler, or with other
> schedulers. It can also take advantage of HDFS.
>
> Spark is a framework you program in; Hadoop is a platform. Map Reduce
> is a SIMD algorithm that works well with large amounts of read-only
> data.
>
> There is more to it, but that is the gist of it.
>
> --
> Doug
>
>> Hello,
>>
>> I used to be in HPC back when we built beowulf clusters by hand ;) and
>> wrote code in C/pthreads, PVM and MPI, back when anyone could walk
>> into fields like bioinformatics - all that was needed was a pulse,
>> some C and Perl, and a desire to do ;-). Then I left for the private
>> sector and stumbled into "big data" some years later - I wrote a lot
>> of code in Spark and Scala, worked in infrastructure to support it,
>> etc.
>>
>> Then I went back (in 2017) to HPC. I was surprised to find that not
>> much had changed - researchers and grad students still write code in
>> MPI and C/C++ and maybe some Python or R for visualization or
>> localized data analytics. I also noticed that it was not easy to
>> "marry" things like big data with HPC clusters - tools like
>> Spark/Hadoop do not really have the same underlying infrastructure
>> assumptions as things like MPI/supercomputers do. However, I find it
>> wasteful for a university to run separate clusters to support a data
>> science/big data load vs. traditional HPC.
>>
>> I then stumbled upon languages like Julia - I like its approach: code
>> is data, visualization is easy, decent ML/DS tooling.
>>
>> How does it fare on a traditional HPC cluster? Are people using it to
>> substitute for their MPI loads? On the opposite side, has it caught up
>> to Spark in terms of the quality of its DS/ML offering? In other
>> words, can it be used in one fell swoop as a unifying substitute for
>> both opposing approaches?
>>
>> I realize that many people have already committed to certain
>> tech/paradigms, but this is mostly educational debt (if MPI, or Spark
>> on the other side, is working for me, why go to something different?)
>> - but is there anything substantial stopping new people with no debt
>> from starting out with a different approach (offerings like Julia)?
>>
>> I do not have too much experience with Julia (and hence may be barking
>> up the wrong tree) - in that case I am wondering what people are doing
>> to "marry" the loads of traditional HPC with "big data" as practiced
>> by commercial/industry entities on a single underlying hardware
>> offering. I know there are things like Twister2, but it is unclear to
>> me (from cursory examination) what it actually offers in the context
>> of my questions above.
>>
>> Any input, corrections, schooling me, etc. are appreciated.
>>
>> Thank you!

--
Doug

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
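P.S. Since "keep intermediate results in memory" is the crux of the
Spark-vs-classic-Hadoop comparison above, here is a minimal Spark sketch
in Scala. It is only an illustration, not anyone's production code: the
HDFS paths, the app name, and the word-count job are made up, and it
assumes a working Spark install (run it with spark-submit, or paste the
body into spark-shell). The point is that persist() keeps the cleaned
RDD in distributed memory, so the second job reuses it instead of
re-reading HDFS the way chained classic Map Reduce jobs would spill to
and re-read disk between steps.

  // Minimal sketch of "keep intermediate results in memory"
  // (illustrative paths and names only).
  import org.apache.spark.sql.SparkSession
  import org.apache.spark.storage.StorageLevel

  object CacheSketch {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder.appName("cache-sketch").getOrCreate()
      val sc = spark.sparkContext

      // The "Map" side: HDFS has already sliced this file into blocks,
      // and each task is scheduled on a node holding one of the slices.
      val lines = sc.textFile("hdfs:///data/events.txt")  // hypothetical path

      // Data engineering: clean and tokenize (usually the expensive part).
      val tokens = lines.filter(_.nonEmpty)
                        .flatMap(_.toLowerCase.split("\\s+"))

      // Keep the intermediate result in distributed memory so later
      // Map Reduce steps reuse it instead of re-reading from disk.
      tokens.persist(StorageLevel.MEMORY_ONLY)

      // Job 1: a classic reduce (word count).
      val counts = tokens.map(w => (w, 1L)).reduceByKey(_ + _)
      counts.saveAsTextFile("hdfs:///out/wordcounts")     // hypothetical path

      // Job 2: a second pass over the SAME cached tokens. Classic Hadoop
      // Map Reduce would write to and re-read HDFS between these jobs;
      // Spark keeps the data in memory (Tez gives Hadoop a similar
      // optimization).
      println(s"distinct words: ${tokens.distinct().count()}")

      spark.stop()
    }
  }

Nothing here is specific to word counting; the only design point is
where the data between the two jobs lives - on disk for chained classic
Map Reduce, in memory for Spark or Tez.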