Hi, I am not sure this is related to the GPUs. I rather think the issue has to do with how your OpenMPI has been built.
What does the ompi_info command show? Look for "Configure command line" in the output. Does it include the '--with-slurm' and '--with-pmi' flags? To the best of my knowledge, both flags need to be set for OpenMPI to work with srun; a rough sketch of how to check this (and of a matching configure line) follows below your quoted message.

Also see:
https://www.open-mpi.org/faq/?category=slurm#slurm-direct-srun-mpi-apps
https://slurm.schedmd.com/mpi_guide.html#open_mpi

Best regards
Jürgen

* Saksham Pande 5-Year IDD Physics <saksham.pande.ph...@itbhu.ac.in> [230519 07:42]:
> Hi everyone,
> I am trying to run a simulation software on slurm using openmpi-4.1.1 and
> cuda/11.1.
> On executing, I get the following error:
>
> srun --mpi=pmi2 --nodes=1 --ntasks-per-node=5 --partition=gpu --gres=gpu:1
> --time=02:00:00 --pty bash -i
> ./<execultable>
>
> ```
> ._____________________________________________________________________________________
> |
> | Initial checks...
> | All good.
> |_____________________________________________________________________________________
> [gpu008:162305] OPAL ERROR: Not initialized in file pmix3x_client.c at line 112
> --------------------------------------------------------------------------
> The application appears to have been direct launched using "srun",
> but OMPI was not built with SLURM's PMI support and therefore cannot
> execute. There are several options for building PMI support under
> SLURM, depending upon the SLURM version you are using:
>
> version 16.05 or later: you can use SLURM's PMIx support. This
> requires that you configure and build SLURM --with-pmix.
>
> Versions earlier than 16.05: you must use either SLURM's PMI-1 or
> PMI-2 support. SLURM builds PMI-1 by default, or you can manually
> install PMI-2. You must then build Open MPI using --with-pmi pointing
> to the SLURM PMI library location.
>
> Please configure as appropriate and try again.
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> *** and potentially your MPI job)
> [gpu008:162305] Local abort before MPI_INIT completed completed
> successfully, but am not able to aggregate error messages, and not able to
> guarantee that all other processes were killed!
> ```
>
> using the following modules: gcc/10.2 openmpi/4.1.1 cuda/11.1
> on using which mpic++ or mpirun or nvcc, I get the module paths only, which
> looks correct.
> I also changed the $PATH and $LD_LIBRARY_PATH based on ldd <executable>,
> but still the same error.
>
> [sakshamp.phy20.itbhu@login2 menura]$ srun --mpi=list
> srun: MPI types are...
> srun: cray_shasta
> srun: none
> srun: pmi2
>
> What should I do from here, been stuck on this error for 6 days now? If
> there is any build difference, I will have to tell the sysadmin.
> Since there is an openmpi pairing error with slurm, are there other error I
> could expect between cuda and openmpi?
>
> Thanks
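P.S. Here is a minimal sketch of the check I have in mind, assuming a fairly standard Open MPI 4.x installation; the exact wording of the ompi_info output can differ between versions, so adjust the grep patterns as needed:

```
# Show the configure line the installed Open MPI was built with
ompi_info | grep -i "configure command line"

# List the PMI/PMIx components that were actually built.
# On builds I have seen with Slurm PMI support, this includes
# entries like "MCA pmix: s1" and/or "MCA pmix: s2".
ompi_info | grep -i pmix
```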
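And this is roughly what a rebuild with Slurm PMI (and CUDA) support could look like. The install prefix, the PMI location and the CUDA path below are placeholder assumptions on my part, not your site's values, so treat this as a starting point for the discussion with your sysadmin rather than a verified recipe:

```
# Illustrative configure invocation only -- all paths are assumptions
./configure --prefix=/opt/openmpi/4.1.1 \
            --with-slurm \
            --with-pmi=/usr \
            --with-cuda=/usr/local/cuda-11.1
make -j 8
make install
```

With a build like that, 'srun --mpi=pmi2' should be able to get your executable past MPI_Init. The alternative mentioned in the error text is to have Slurm itself built with PMIx support, in which case 'pmix' would also show up in 'srun --mpi=list'.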