> cost was one factor that accelerated spark/hadoop, it's not the only
> or even the biggest factor. the ML folks didn't start with MPI
> because the AI frameworks were bred on workstations and then ported to
> non HPC hardware (aka cloud platforms) where MPI isn't the dominant
> paradigm. now that ML/AI is taking hold in the HPC community for
> different aspects and the models are starting to expand beyond the 4-8
> gpus you can stick in a single box they are adding MPI underneath
> (look at Horovod) to spread the models over multiple machines (scale
> out vs scale up).
>
A few things. Hadoop is a platform and framework for running MapReduce (MR) at scale. Parallel MR is a highly defined algorithm with highly defined data flows. Data are sliced across server storage so the map stage can be run in parallel. Spark works the same way, except that by default the data are sliced across memory on the servers. Hadoop now does this as well.

The map stage can be anything you want it to be. MR is a SIMD-style algorithm (the same map function is applied to every slice of the data), so if your application fits the algorithm, MR *might* be useful.

Spark does have machine learning (Spark MLlib). And not all ML is GPUs and neural nets; there are many statistics-based libraries for ML, scikit-learn for instance. The basic Spark MLlib libraries also use statistical methods, but they are scalable. There are now libraries that use neural nets as well, and Spark V3 supports tight integration with things like TensorFlow, etc.

IMO, both Hadoop and Spark did not use MPI because they had a highly defined algorithm with specific performance goals. Many MR jobs, like those run on Hadoop, are dynamic, requiring a varied resource load over the course of their lifetime. (Mapping uses a lot of resources; reducing usually uses much less.) Thus the Hadoop scheduler, YARN, can dynamically reduce or increase the resources assigned to a running job. MPI does not provide such dynamic resource allocation. Basically, MPI did not address their project goals. The authors were certainly aware of MPI (I worked with some of them on a book about YARN).

--
Doug

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
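
P.S. For anyone who has not seen the MR data flow spelled out, here is a rough single-machine sketch in Python. It is not how Hadoop or Spark implement it; the function names (slice_data, map_stage, reduce_stage, map_reduce) are made up for illustration, and a thread pool stands in for the servers that would each hold one slice of the data. The point is only the shape: slice the data, run the same map function on every slice in parallel, then reduce the partial results.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def slice_data(records, n_slices):
    """Split records into n_slices slices (stand-in for data spread across servers)."""
    return [records[i::n_slices] for i in range(n_slices)]


def map_stage(data_slice):
    """The same map function runs on every slice; here it counts words."""
    counts = Counter()
    for line in data_slice:
        counts.update(line.lower().split())
    return counts


def reduce_stage(left, right):
    """Merge two partial results; Counter addition sums the per-word counts."""
    return left + right


def map_reduce(records, n_slices=4):
    """Hypothetical driver: slice, map in parallel, then reduce."""
    slices = slice_data(records, n_slices)
    # Threads are only a stand-in for cluster nodes in this sketch.
    with ThreadPoolExecutor(max_workers=n_slices) as pool:
        partials = list(pool.map(map_stage, slices))
    return reduce(reduce_stage, partials, Counter())


if __name__ == "__main__":
    lines = ["spark and hadoop", "hadoop and yarn", "mpi and spark"]
    print(map_reduce(lines, n_slices=2))
```

Note that the map stage does most of the work while the reduce stage only merges small partial results, which is the uneven resource profile mentioned above.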