On 2/10/23 11:06 am, Analabha Roy wrote:

I'm having some complex issues coordinating OpenMPI, SLURM, and DMTCP in my cluster.

If you're looking to try checkpointing MPI applications, you may want to experiment with the MANA ("MPI-Agnostic, Network-Agnostic MPI") plugin for DMTCP, available here: https://github.com/mpickpt/mana

We (NERSC) are collaborating with the developers, and it is installed on Cori (our older Cray system) for people to experiment with. The documentation for it may be useful to others who'd like to try it out - it also has a nice description of how it works, which even I, as a non-programmer, can understand: https://docs.nersc.gov/development/checkpoint-restart/mana/

Pay special attention to the caveats in our docs though!

I've not used it myself, though I'm peripherally involved, giving advice on system-related issues.

All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA

