On Tue, Oct 13, 2020 at 8:52 AM Guy Coates <guy.coa...@gmail.com> wrote:
>
> Having just spent some time looking at parallelising some ML/AI workloads,
> it was enlightening to see that as you scratch beneath the various
> frameworks like PyTorch or Horovod, you find... MPI. And RDMA. And
> workloads that can quickly become IO bound. Plus ça change...
Yup, we're working on that too. We've been watching these frameworks for a while now as they mature. We have a lot of models that want to spread across multiple GPUs on multiple hosts. Some of the frameworks have been less than pleasant for accomplishing this, but the ones that took the MPI/GPUDirect route seem to be making the most ground.

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf