Hi Jonathan,

You need a "cluster" if your task is too large to run on a single system, or if it takes too long to run on a single system.
So the primary motivation is practical, not theoretical. If you can run your
task on just one computer, you should always do that rather than build a
cluster, with all its associated headaches.

One aspect of your question seems to be about performance, which is
ultimately limited by the hardware resources. E.g. a "Hadoop cluster" might
be 10 servers, each with lots of CPUs, RAM, disks, etc. For Hadoop workloads
the bottleneck is typically I/O, so the primary parameter is the number of
disks. The programming language is not really an issue if you are spending
all your time waiting on disk I/O.

Regards,
Alex

On Tue, Nov 24, 2020 at 10:22 AM Jonathan Aquilina via Beowulf
<beowulf@beowulf.org> wrote:

> Hi Doug,
>
> So what is the advantage then of a cluster?
>
> Regards,
> Jonathan
>
> -----Original Message-----
> From: Douglas Eadline <deadl...@eadline.org>
> Sent: 24 November 2020 18:21
> To: Jonathan Aquilina <jaquil...@eagleeyet.net>
> Cc: beowulf@beowulf.org
> Subject: RE: [Beowulf] Clustering vs Hadoop/spark
>
> First, I am not a Java expert (very far from it).
>
> Second, Java holds up quite well against Julia, compared to Python (so
> does Lisp!):
>
> https://benchmarksgame-team.pages.debian.net/benchmarksgame/fastest/julia.html
>
> Something else to consider is that the underlying Hadoop plumbing is
> written in Java (or Scala for Spark). However, it is possible with Hadoop
> to create mapper and reducer functions in any language (text-based
> stdin/stdout). Similarly, Spark can use Python, R, Java, or Scala as a
> front end.
>
> So there is a bit of decoupling between the parallel compute mechanism and
> the compute code (i.e. in all these languages the user does not think
> about cores, interconnect, communications, etc.). A higher-level
> abstraction.
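To make that last point concrete: the "text-based stdin/stdout" contract
means a mapper and a reducer are just programs that read lines on stdin and
write tab-separated key/value lines on stdout, in whatever language you
like. A minimal word-count pair in Python might look something like this
(the script names are mine, purely for illustration):

    #!/usr/bin/env python3
    # mapper.py -- emit a "word<TAB>1" line for every word on stdin
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    #!/usr/bin/env python3
    # reducer.py -- stdin arrives already sorted by key, so the counts for
    # each word can be summed in a single streaming pass
    import sys

    current_word, total = None, 0
    for line in sys.stdin:
        if not line.strip():
            continue
        word, _, count = line.rstrip("\n").partition("\t")
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{total}")
            current_word, total = word, 0
        total += int(count)
    if current_word is not None:
        print(f"{current_word}\t{total}")

Because the contract is just pipes, you can test the pair with no cluster
at all:

    cat input.txt | ./mapper.py | sort | ./reducer.py

On a real cluster the Hadoop streaming jar runs the same two scripts, with
HDFS supplying the input slices and the framework doing the sort/shuffle
between the two phases.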
> Much of the early Hadoop performance story was about running large
> capability jobs on gobs of data, jobs that could not be run otherwise
> (except by Google), so any performance was good. Spark came along and
> said: let's put it all in a redundant, distributed, in-memory structure.
> The speed-up over traditional Hadoop was large, so Hadoop created the Tez
> API, which does the same thing. Performance evened out.
>
> Plus, analytics jobs are mostly integer. Floating point often comes into
> play when running the models, which is often not a big-data problem (i.e.
> you don't need a big cluster to run it).
>
> --
> Doug
>
> > Hi Doug,
> >
> > Appreciate the clarification. Where I am not clear is: given that Hadoop
> > and its derivatives are Java based, where does all of this performance
> > suddenly come from? Is it due to where the data resides?
> >
> > At one of my previous jobs I worked with Hadoop through Amazon AWS EMR
> > and managed to churn through 5 years' worth of historical data in 1
> > week, the data being calculations on vehicular tracking data.
> >
> > When I learned Java as part of my degree I used to see it as clunky. Why
> > go for an interpreted language such as Java over something more low
> > level like C/C++ on a traditional cluster?
> >
> > Regards,
> > Jonathan
> >
> > -----Original Message-----
> > From: Douglas Eadline <deadl...@eadline.org>
> > Sent: 24 November 2020 17:38
> > To: Jonathan Aquilina <jaquil...@eagleeyet.net>
> > Cc: beowulf@beowulf.org
> > Subject: Re: [Beowulf] Clustering vs Hadoop/spark
> >
> >> Hi Guys,
> >>
> >> I am just wondering what advantages setting up a cluster has in
> >> relation to big data analytics vs using something like Hadoop/Spark?
> >>
> >
> > Long email, and the details are important.
> >
> > It all comes down to filesystems and schedulers. But first remember:
> > most data analytics projects use many different tools and have various
> > stages that often require iteration and development (e.g. ETL -> feature
> > matrix -> running models, repeat; 80% of the work is in the first two
> > steps). And many end users do not use the Java map-reduce APIs; they use
> > higher-level tools.
> >
> > Filesystems:
> >
> > 1) The traditional Hadoop filesystem (HDFS) is about slicing large data
> > files (or large numbers of files) across multiple servers, then doing
> > the map phase on all servers at the same time (moving computation to
> > where the data "live"). The reduce phase requires a shuffle (data
> > movement) and a final reduction of the data.
> >
> > 2) On-prem HDFS still makes some sense (longer story); however, in the
> > cloud there is a move to native cloud storage using Apache Ozone FS. You
> > lose the "data locality" but gain all the cloud Kubernetes machinery.
> >
> > 3) Both Hadoop map-reduce (mostly Hive RDB applications now) and Spark
> > do "in-memory" map-reduce for performance reasons. In this case data
> > locality for processing is not as important; however, loading and
> > storing files for large multi-server memory-resident jobs still gains
> > from HDFS. Very often Spark writes/reads results into Hive tables.
> >
> > Schedulers:
> >
> > 1) Map-reduce scheduling is different from traditional HPC scheduling.
> > The primary Hadoop scheduler is called YARN (Yet Another Resource
> > Negotiator). It has two main features not found in most HPC schedulers:
> > data locality as a resource, and dynamic resource allocation.
> >
> > 2) Data locality is about moving jobs to where the data (slice) lives on
> > the storage nodes (hyper-converged storage/compute nodes).
> >
> > 3) Dynamic resource allocation developed because most map-reduce jobs
> > need a lot of containers for the map phase but far fewer for the reduce
> > phase, so Hadoop map-reduce can give resources back and ask for more
> > later in other stages of the DAG (multiple map-reduce phases are run as
> > a Directed Acyclic Graph).
> >
> > Thus, this model is hard to map onto a traditional HPC cluster. There
> > are map-reduce libraries for MPI. Another way to think about it: data
> > analytics is almost always SIMD, and all the tools, languages, and
> > platforms are optimized to take advantage of map-reduce SIMD operations
> > and data flow.
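As a rough illustration of that mapping: here is a toy word-count map-reduce
written directly against MPI with mpi4py (a sketch of the pattern, not any
particular library). Rank 0 stands in for HDFS by dealing out slices, each
rank maps over its own slice in parallel, and an MPI reduction stands in for
the shuffle and reduce:

    # toy_mapreduce.py -- run with e.g.: mpiexec -n 4 python3 toy_mapreduce.py
    from collections import Counter
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    # "Slice" the data: rank 0 deals lines out round-robin, one slice per
    # rank (on a real Hadoop system the slices already live on the nodes).
    if rank == 0:
        lines = ["the quick brown fox", "jumps over the lazy dog",
                 "the dog barks", "the fox runs"]
        slices = [lines[i::size] for i in range(size)]
    else:
        slices = None
    my_lines = comm.scatter(slices, root=0)

    # Map phase: every rank counts words in its own slice, in parallel.
    local = Counter(word for line in my_lines for word in line.split())

    # Reduce phase: Counters add element-wise, so a SUM reduction over
    # Python objects plays the role of the shuffle + reduce.
    totals = comm.reduce(local, op=MPI.SUM, root=0)

    if rank == 0:
        print(totals.most_common(3))

The interesting part is what the sketch is missing: there is no data
locality (rank 0 ships the data out), no dynamic allocation (the job holds
all its ranks for its whole life), and no DAG of stages, which is exactly
the gap YARN fills on the Hadoop side.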
> >
> > --
> > Doug
> >
> >> Regards,
> >> Jonathan
>
> --
> Doug

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf