I'm not an expert on Big Data at all, but I hear the phrase "Hadoop"
less and less these days. Where I work, most data analysts are using R,
Python, or Spark in the form of PySpark. For machine learning, most of
the researchers I support are using Python tools like TensorFlow or
PyTorch.
I don't know much about Julia replacing MPI, etc., but I wish I did. I
would like to know more about Julia.
Prentice
On 10/12/20 12:14 PM, Oddo Da wrote:
Hello,
I used to be in HPC back when we built Beowulf clusters by hand ;) and
wrote code in C/pthreads, PVM and MPI, and back when anyone could walk
into fields like bioinformatics: all that was needed was a pulse, some
C and Perl, and a desire to do it ;-). Then I left for the private sector
and stumbled into "big data" some years later - I wrote a lot of code
in Spark and Scala, worked in infrastructure to support it etc.
Then I went back (in 2017) to HPC. I was surprised to find that not
much had changed - researchers and grad students still write code in
MPI and C/C++ and maybe some Python or R for visualization or
localized data analytics. I also noticed that it was not easy to
"marry" things like big data with HPC clusters - tools like
Spark/Hadoop do not really have the same underlying infrastructure
assumptions as do things like MPI/supercomputers. However, I find it
wasteful for a university to run separate clusters to support a data
science/big data load vs traditional HPC.
I then stumbled upon languages like Julia - I like its approach, code
is data, visualization is easy, decent ML/DS tooling.
How does it fare on a traditional HPC cluster? Are people using it to
substitute their MPI loads? On the opposite side, has it caught up to
Spark in terms of DS/ML quality of offering? In other words, can it be
used as a one fell swoop unifying substitute for both opposing
approaches?
I realize that many people have already committed to certain
tech/paradigms but this is mostly educational debt (if MPI or Spark on
the other side is working for me, why go to something different?) -
but is there anything substantial stopping new people with no debt
starting out in a different approach (offerings like Julia)?
I do not have too much experience with Julia (and hence may be barking
up the wrong tree) - in that case I am wondering what people are doing
to "marry" the loads of traditional HPC with "big data" as practiced
by the commercial/industry entities on a single underlying hardware
offering. I know there are things like Twister2 but it is unclear to
me (from cursory examination) what it actually offers in the context
of my questions above.
Any input, corrections, schooling me etc. are appreciated.
Thank you!
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
--
Prentice Bisbal
Lead Software Engineer
Research Computing
Princeton Plasma Physics Laboratory
http://www.pppl.gov