Hello Angelines,

Do you know how the Open MPI 4.0.3 package was configured and built? That information would be useful to help diagnose the problem.
Thanks,
Howard

From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of "Alberto Morillas, Angelines" <angelines.albe...@ciemat.es>
Reply-To: Slurm User Community List <slurm-users@lists.schedmd.com>
Date: Friday, May 29, 2020 at 4:25 AM
To: "slurm-users@lists.schedmd.com" <slurm-users@lists.schedmd.com>
Subject: [EXTERNAL] [slurm-users] problems with OpenMPI 4.0.3

Good morning,

We have a cluster with two kinds of InfiniBand cards, one ConnectX-4 and the other ConnectX-6. openmpi-3.1.3 works fine, but when we started with ConnectX-6 we began using openmpi-4.0.3 (which supports ConnectX-6). Since then, programs that run in several parts, first a call to a sequential program and inside it a call to a parallel program (in our case the program is WRF, but we have others like this with the same problem), suddenly stop:

…..
0 S 4556 87383 87361 0 80 0 - 126676 hrtime ? 00:05:25 real.exe
0 S 4556 87384 87361 0 80 0 - 126677 hrtime ? 00:05:33 real.exe
0 S 4556 87385 87361 0 80 0 - 126675 hrtime ? 00:05:28 real.exe
……

WCHAN is hrtime, so the processes look as if they are running, but they are not actually doing any work.

We don't know whether this could be a problem with Slurm and this version of Open MPI.

Any ideas?

________________________________________________
Angelines Alberto Morillas
Unidad de Arquitectura Informática
Despacho: 22.1.32
Telf.: +34 91 346 6119
Fax: +34 91 346 6537
skype: angelines.alberto
CIEMAT
Avenida Complutense, 40
28040 MADRID
________________________________________________