OK so OpenMPI works fine. That means SLURM, OFED and hardware are fine.

Which mvapich2 package are you using: a home-built one or the one provided by 
Bright?


Regards,

--

Jan-Albert


Jan-Albert van Ree | Linux System Administrator | Digital Services
MARIN | T +31 317 49 35 48 | j.a.v....@marin.nl | www.marin.nl


________________________________
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Chris 
Woelkers - NOAA Federal <chris.woelk...@noaa.gov>
Sent: Wednesday, December 11, 2019 01:11
To: Slurm User Community List
Subject: Re: [slurm-users] Multi-node job failure

Thanks for the reply and the things to try. Here are the answers to your 
questions/tests in order:

- I tried mpiexec and the same issue occurred.
- While the job was listed as running I checked all the nodes. None of them 
have processes spawned. I don't know about the hydra process.
- I have version 4.7 of the OFED stack installed on all nodes.
- Using openmpi with the hello world example you listed gives output that 
seems to match what should normally be given. I upped the number of ranks to 
16, because 4 doesn't help much, and ran it again on four nodes with 4 ranks 
each, and got the following, which looks like good output.
Hello world from processor bearnode14, rank 4 out of 16 processors
Hello world from processor bearnode14, rank 5 out of 16 processors
Hello world from processor bearnode14, rank 6 out of 16 processors
Hello world from processor bearnode15, rank 10 out of 16 processors
Hello world from processor bearnode15, rank 8 out of 16 processors
Hello world from processor bearnode16, rank 13 out of 16 processors
Hello world from processor bearnode15, rank 11 out of 16 processors
Hello world from processor bearnode13, rank 3 out of 16 processors
Hello world from processor bearnode14, rank 7 out of 16 processors
Hello world from processor bearnode15, rank 9 out of 16 processors
Hello world from processor bearnode16, rank 12 out of 16 processors
Hello world from processor bearnode16, rank 14 out of 16 processors
Hello world from processor bearnode16, rank 15 out of 16 processors
Hello world from processor bearnode13, rank 1 out of 16 processors
Hello world from processor bearnode13, rank 0 out of 16 processors
Hello world from processor bearnode13, rank 2 out of 16 processors
- I have not tested our test model with openmpi as it was compiled with Intel 
compilers and expects Intel MPI. It might work but for now I will hold that for 
later. I did test the hello world again using the Intel modules instead of the 
openmpi modules and it still worked.
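
A quick way to double-check hello world output like the listing above is to 
count ranks per node and look for missing or duplicated ranks. This is only a 
minimal sketch and was not part of the tests described here; the function name 
`summarize` is hypothetical, and the regular expression assumes the exact 
"Hello world from processor ..." line format shown above.

```python
import re
from collections import Counter

# Assumes lines of the form:
#   Hello world from processor <node>, rank <r> out of <n> processors
LINE_RE = re.compile(
    r"Hello world from processor ([^,\s]+), rank (\d+) out of (\d+) processors"
)


def summarize(output):
    """Return (ranks_per_node, missing_ranks, duplicate_ranks).

    `output` is the captured stdout of the hello world job as one string.
    """
    per_node = Counter()   # node name -> number of ranks that reported from it
    seen = Counter()       # rank -> how many times it reported
    world_size = None
    for match in LINE_RE.finditer(output):
        node = match.group(1)
        rank = int(match.group(2))
        world_size = int(match.group(3))
        per_node[node] += 1
        seen[rank] += 1
    if world_size is None:
        raise ValueError("no hello world lines found")
    # Ranks that should exist (0 .. world_size - 1) but never reported.
    missing = sorted(set(range(world_size)) - set(seen))
    # Ranks that reported more than once.
    duplicates = sorted(r for r, n in seen.items() if n > 1)
    return dict(per_node), missing, duplicates
```

Applied to the 16 lines above, this should report four ranks on each of 
bearnode13 through bearnode16, with no missing and no duplicated ranks.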

Thanks,

Chris Woelkers

