I'm not an expert on Big Data at all, but I hear the phrase "Hadoop" less and less these days. Where I work, most data analysts are using R, Python, or Spark in the form of PySpark. For machine learning, most of the researchers I support are using Python tools like TensorFlow or PyTorch.

I don't know much about Julia replacing MPI and the like, but I'd like to learn more about it.

Prentice

On 10/12/20 12:14 PM, Oddo Da wrote:
Hello,

I used to be in HPC back when we built Beowulf clusters by hand ;) and wrote code in C/pthreads, PVM, and MPI, back when anyone could walk into fields like bioinformatics - all that was needed was a pulse, some C and Perl, and a desire to do the work ;-). Then I left for the private sector and stumbled into "big data" some years later - I wrote a lot of code in Spark and Scala, worked on infrastructure to support it, etc.

Then I went back (in 2017) to HPC. I was surprised to find that not much had changed - researchers and grad students still write code in MPI and C/C++, with maybe some Python or R for visualization or localized data analytics. I also noticed that it was not easy to "marry" things like big data with HPC clusters - tools like Spark/Hadoop do not share the same underlying infrastructure assumptions as MPI/supercomputers. However, I find it wasteful for a university to run separate clusters to support a data science/big data load versus traditional HPC.

I then stumbled upon languages like Julia - I like its approach: code is data, visualization is easy, and the ML/DS tooling is decent.

How does it fare on a traditional HPC cluster? Are people using it to replace their MPI workloads? On the other side, has it caught up to Spark in terms of the quality of its DS/ML offering? In other words, can it serve, in one fell swoop, as a unifying substitute for both opposing approaches?

I realize that many people have already committed to certain tech/paradigms, but this is mostly educational debt (if MPI - or Spark on the other side - is working for me, why switch to something different?). Is there anything substantial stopping new people with no such debt from starting out with a different approach (offerings like Julia)?

I do not have much experience with Julia (and hence may be barking up the wrong tree) - in that case, I am wondering what people are doing to "marry" traditional HPC workloads with "big data" as practiced by commercial/industry entities on a single underlying hardware offering. I know there are things like Twister2, but it is unclear to me (from a cursory examination) what it actually offers in the context of my questions above.

Any input, corrections, schooling, etc. are appreciated.

Thank you!

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf

--
Prentice Bisbal
Lead Software Engineer
Research Computing
Princeton Plasma Physics Laboratory
http://www.pppl.gov

