The question then becomes: with all the file systems out there, how does one 
choose the best-performing one, given that each has its own unique advantages?

Regards,
Jonathan

From: Jim Cownie <jcow...@gmail.com>
Sent: 25 November 2020 10:05
To: Jonathan Aquilina <jaquil...@eagleeyet.net>
Cc: Douglas Eadline <deadl...@eadline.org>; beowulf@beowulf.org
Subject: Re: [Beowulf] Clustering vs Hadoop/spark

From: Douglas Eadline <deadl...@eadline.org>
Sent: 24 November 2020 18:21
…
So there is a bit of decoupling of parallel compute mechanism and compute code. 
(I.e. in all these languages the user does think about cores, interconnect, 
communications, etc) A higher level abstraction.

I think you’re missing a “not”, Doug: “in all these languages the user does not 
think about cores, interconnect, communications, etc.”

More generally “It's always the data movement.” We see that here, but it also 
applies to our CPUs. Look at the performance of Fugaku and the A64FX which has 
optimised for that rather than lots of (generally unusable) FLOPS, or consider 
why we use roofline models when tuning compute kernels, and,  when we do, how 
often they show that we’re bandwidth bound.
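As a rough illustration of the roofline point, here is a minimal sketch. The 
function name is hypothetical and the machine numbers are made-up round figures, 
not measurements of Fugaku, the A64FX, or any other system. It only shows why a 
kernel with low arithmetic intensity ends up bandwidth bound.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    """Roofline bound: the lower of the compute roof and the bandwidth roof."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)


# Assumed machine: 3000 GFLOP/s peak, 1000 GB/s memory bandwidth (illustrative only).
PEAK, BANDWIDTH = 3000.0, 1000.0

# A daxpy-like kernel (y = a*x + y in doubles): 2 flops per 24 bytes moved.
daxpy_intensity = 2 / 24
bound = attainable_gflops(PEAK, BANDWIDTH, daxpy_intensity)
print(f"daxpy-like kernel: at most {bound:.1f} GFLOP/s of {PEAK:.0f} peak")

# Arithmetic intensity needed before the kernel stops being bandwidth bound.
print(f"ridge point: {PEAK / BANDWIDTH:.1f} flops/byte")
```

With these assumed numbers the kernel can reach only a few percent of peak, 
however good the compute units are.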

If we assume that most of the time here is I/O, then it's unsurprising that 
changing the language used for the processing makes little difference. You can 
think of it as a variant of Amdahl's law, where you map I/O time to Amdahl’s 
serial time (i.e. it’s invariant) and compute to parallel time (i.e. you can 
change it).
For example, if the compute is 10% of the time, the best you can hope for, even 
if you could use my new “psychic” language and run the compute in zero time, is 
a speedup of 1/0.9 ≈ 1.11x, i.e. about 11%.
In that context it should be no surprise that the precise performance of the 
compute doesn't matter much, so you get comparable overall performance 
independent of the choice of language used for the compute operations.
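The arithmetic above can be sketched in a few lines. This is only an 
illustration of the Amdahl-style argument; the function name and the sample 
speedup factors are hypothetical, and the 10% compute fraction is the one 
assumed in the example.

```python
def overall_speedup(compute_fraction, compute_speedup):
    """Amdahl-style bound: I/O time is fixed, only the compute part gets faster."""
    if not 0.0 <= compute_fraction < 1.0:
        raise ValueError("compute_fraction must be in [0, 1)")
    io_fraction = 1.0 - compute_fraction
    new_time = io_fraction + compute_fraction / compute_speedup
    return 1.0 / new_time


# Compute is 10% of the run time; try ever faster "languages" for the compute.
for factor in (2, 10, float("inf")):
    print(f"compute {factor}x faster -> overall {overall_speedup(0.10, factor):.3f}x")
```

Even the infinitely fast case is capped at 1/0.9, which is why the choice of 
language for the compute barely shows up in the overall run time.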

-- Jim
James Cownie <jcow...@gmail.com>
Mob: +44 780 637 7146



-----Original Message-----
From: Douglas Eadline <deadl...@eadline.org>
Sent: 24 November 2020 18:21
To: Jonathan Aquilina <jaquil...@eagleeyet.net>
Cc: beowulf@beowulf.org
Subject: RE: [Beowulf] Clustering vs Hadoop/spark


First I am not a Java expert (very far from it).

Second, Java holds up quite well against Julia as compared to Python. (so does 
Lisp!)

 https://benchmarksgame-team.pages.debian.net/benchmarksgame/fastest/julia.html

Another thing to consider is that the underlying Hadoop plumbing is written in 
Java (or Scala for Spark).
However, with Hadoop it is possible to
create mapper and reducer functions in any language (text-based stdin/stdout). 
Spark is similar: it can use Python, R, Java, or Scala as a front-end.
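To make the "any language over text stdin/stdout" point concrete, here is a 
minimal word-count sketch in the streaming style: the mapper emits one 
tab-separated key/value record per line, and the reducer expects its input 
already sorted by key. The function names and sample input are assumptions 
for illustration, and the sort is simulated locally rather than done by the 
framework.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Sum the counts per key; records must arrive sorted by key."""
    pairs = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


# Local stand-in for the framework: map, sort by key, then reduce.
sample = ["Big data big", "data"]
counts = list(reducer(sorted(mapper(sample))))
print(counts)
```

In a real streaming job each function would live in its own script that reads 
sys.stdin and prints to stdout, which is what lets the mapper and reducer be 
written in any language.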

So there is a bit of decoupling of parallel compute mechanism and compute code. 
(I.e. in all these languages the user does think about cores, interconnect, 
communications, etc) A higher level abstraction.

Much of the early Hadoop performance was based on running large capability jobs 
on gobs of data,
jobs that could not be run otherwise (except by Google). So any performance was 
good. Spark comes along and says, let's put it all in a redundant distributed 
memory structure.
It is much faster than traditional Hadoop, so Hadoop creates the Tez API, which 
does the same thing.
Performance evens out.

Plus, analytics jobs are mostly integer. The floating point often comes into 
play when running the models, which is often not a big data problem (i.e. you 
don't need a big cluster to run them).

--
Doug


Hi Doug,

I appreciate the clarification. Where I am not clear is, given that Hadoop and
its derivatives are Java based, where all of this performance suddenly
comes from. Is it due to where the data resides?

At one of my previous jobs I worked with Hadoop through Amazon AWS EMR
and managed to churn through 5 years' worth of historical data in 1 week.
The data was calculations on vehicular tracking data.

When I learned Java as part of my degree I used to see it as clunky.
Why go for an interpreted language such as Java over something more
low level like C/C++ on a traditional cluster?

Regards,
Jonathan

-----Original Message-----
From: Douglas Eadline <deadl...@eadline.org>
Sent: 24 November 2020 17:38
To: Jonathan Aquilina <jaquil...@eagleeyet.net>
Cc: beowulf@beowulf.org
Subject: Re: [Beowulf] Clustering vs Hadoop/spark



Hi Guys,

I am just wondering what advantages setting up a cluster has for big
data analytics compared with using something like Hadoop/Spark?

Long email and the details are important.

It all comes down to filesystems and schedulers. But first remember,
most Data Analytics projects use many different tools and have various
stages that often require iteration and development (e.g. ETL -> Feature
Matrix -> running models, repeat, with 80% of the work in the first two
steps). Also, many end-users do not use the Java map-reduce APIs. They use
higher level tools.

Filesystems:

1) Traditional Hadoop filesystem (HDFS) is about slicing large data
files (or a large number of files) across multiple servers, then doing
the map phase on all servers at the same time (moving computation to
where the data "live"). The reduce phase requires a shuffle (data movement)
and a final reduction of the data.

2) On-prem HDFS still makes some sense (longer story); however, in the
Cloud there is a move to using native cloud storage using Apache Ozone FS.
You lose the "data locality," but gain all the cloud Kubernetes stuff.

3) Both Hadoop Map-Reduce (mostly Hive RDB applications now) and Spark
do "in-memory" map-reduce for performance reasons.
In this case, data locality for processing is not as important.
However, loading and storing files for large multi-server memory-resident
jobs still gains from HDFS. Very often Spark writes results into, and reads
them from, Hive tables.
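The slice / map / shuffle / reduce flow described in the filesystem notes above 
can be sketched on one machine as follows. This is a toy model, not how HDFS or 
Hadoop is implemented: the "servers" are just Python lists, the vehicle records 
are invented sample data, and crc32 stands in for whatever partitioner a real 
system would use.

```python
from collections import defaultdict
from zlib import crc32


def map_slice(lines):
    """Map phase: runs next to one data slice and emits (key, value) pairs."""
    pairs = []
    for line in lines:
        vehicle, km = line.split(",")
        pairs.append((vehicle, float(km)))
    return pairs


def shuffle(mapped_slices, n_reducers):
    """Shuffle: the data-movement step, sending each key to the reducer that owns it."""
    partitions = [defaultdict(list) for _ in range(n_reducers)]
    for pairs in mapped_slices:
        for key, value in pairs:
            owner = crc32(key.encode()) % n_reducers
            partitions[owner][key].append(value)
    return partitions


def reduce_partition(partition):
    """Reduce phase: final reduction of all values collected for each key."""
    return {key: sum(values) for key, values in partition.items()}


# Three slices of one logical file, as if stored on three servers.
slices = [["v1,10.0", "v2,5.0"], ["v1,2.5"], ["v3,7.0", "v2,1.0"]]
mapped = [map_slice(s) for s in slices]  # would run in parallel, one per server
totals = {}
for part in shuffle(mapped, n_reducers=2):
    totals.update(reduce_partition(part))
print(totals)
```

Only the shuffle moves data between "servers"; the map step touches just the 
slice it was given, which is the locality being described.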

Schedulers:

1) Map-Reduce scheduling is different from traditional HPC scheduling.
The primary Hadoop scheduler is called YARN (Yet Another Resource
Negotiator). It has two main features not found in most HPC schedulers:
data locality as a resource and dynamic resource allocation.

2) Data locality is about moving jobs to where the data (slice) lives
on the storage nodes (hyper-converged storage/compute nodes).

3) Dynamic resource allocation developed because most map-reduce jobs
need a lot of containers for the map phase, but far fewer for the reduce
phase, so Hadoop map-reduce can give back resources and ask for more
later in other stages of the DAG (multiple map-reduce phases are run
as a Directed Acyclic Graph).

Thus, this model is hard to map onto a traditional HPC cluster.
There are map-reduce libraries for MPI. Another way to think about it
is that Data Analytics is almost always SIMD: all tools, languages, and
platforms are optimized to take advantage of map-reduce SIMD operations and 
data flow.
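The "data locality as a resource" idea from the scheduler notes can be 
sketched as a toy placement routine. This is a greedy illustration only, not 
YARN's actual algorithm; the function name, node names, and container counts 
are all hypothetical.

```python
def place_map_tasks(slice_locations, free_containers):
    """Greedy placement: prefer a node that stores the slice, else any free node."""
    free = dict(free_containers)  # copy so the caller's view is untouched
    placement = {}
    for slice_id, replicas in slice_locations.items():
        local = [node for node in replicas if free.get(node, 0) > 0]
        candidates = local or [node for node, count in free.items() if count > 0]
        if not candidates:
            placement[slice_id] = None  # no container free: the task has to wait
            continue
        node = candidates[0]
        free[node] -= 1
        placement[slice_id] = (node, node in replicas)  # (node, ran data-local?)
    return placement


# Which nodes hold a replica of each slice, and how many containers each has free.
slice_locations = {"s0": ["n1", "n2"], "s1": ["n1"], "s2": ["n3"]}
free_containers = {"n1": 1, "n2": 1, "n3": 0}
plan = place_map_tasks(slice_locations, free_containers)
print(plan)
```

Here s0 runs data-local on n1, s1 falls back to a non-local container on n2, 
and s2 waits. A smarter scheduler would have put s0 on n2 to keep n1 for s1. 
A traditional HPC scheduler usually has no notion of where the data lives, so 
it cannot express that trade-off at all.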


--
Doug





Regards,
Jonathan
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf




