Dear Slurm user community,

I have a Slurm cluster on CentOS 7, installed through yum, and I also have MPICH installed.
I can ssh into one of the nodes and run an MPI job directly:

# /usr/lib64/mpich/bin/mpirun --hosts nid001001-bae562bc0bd98e50ad5c03200efaf799d6e82469,nid001002-bae562bc0bd98e50ad5c03200efaf799d6e82469,nid001003-bae562bc0bd98e50ad5c03200efaf799d6e82469 /scratch/mpi-helloworld
Warning: Permanently added 'nid001003-bae562bc0bd98e50ad5c03200efaf799d6e82469,10.233.88.25' (ECDSA) to the list of known hosts.
Hello world from processor nid001003-bae562bc0bd98e50ad5c03200efaf799d6e82469, rank 2 out of 3 processors
Hello world from processor nid001001-bae562bc0bd98e50ad5c03200efaf799d6e82469, rank 0 out of 3 processors
Hello world from processor nid001002-bae562bc0bd98e50ad5c03200efaf799d6e82469, rank 1 out of 3 processors

(A minimal version of the hello-world source is at the end of this mail for reference.)

However, I can't make it work through Slurm. These are the logs from running the job:

# srun --mpi=pmi2 -N3 -vvv /usr/lib64/mpich/bin/mpirun /scratch/mpi-helloworld
srun: defined options
srun: -------------------- --------------------
srun: mpi : pmi2
srun: nodes : 3
srun: verbose : 3
srun: -------------------- --------------------
srun: end of defined options
srun: debug: propagating RLIMIT_CPU=18446744073709551615
srun: debug: propagating RLIMIT_FSIZE=18446744073709551615
srun: debug: propagating RLIMIT_DATA=18446744073709551615
srun: debug: propagating RLIMIT_STACK=8388608
srun: debug: propagating RLIMIT_CORE=18446744073709551615
srun: debug: propagating RLIMIT_RSS=18446744073709551615
srun: debug: propagating RLIMIT_NPROC=18446744073709551615
srun: debug: propagating RLIMIT_NOFILE=1048576
srun: debug: propagating RLIMIT_AS=18446744073709551615
srun: debug: propagating SLURM_PRIO_PROCESS=0
srun: debug: propagating UMASK=0022
srun: debug2: srun PMI messages to port=33065
srun: debug: Entering slurm_allocation_msg_thr_create()
srun: debug: port from net_stream_listen is 44387
srun: debug: Entering _msg_thr_internal
srun: debug: auth/munge: init: Munge authentication plugin loaded
srun: jobid 8: nodes(3):`nid001001-bae562bc0bd98e50ad5c03200efaf799d6e82469,nid001002-bae562bc0bd98e50ad5c03200efaf799d6e82469,nid001003-bae562bc0bd98e50ad5c03200efaf799d6e82469', cpu counts: 1(x3)
srun: debug2: creating job with 3 tasks
srun: debug: requesting job 8, user 0, nodes 3 including ((null))
srun: debug: cpus 3, tasks 3, name mpirun, relative 65534
srun: launch/slurm: launch_p_step_launch: CpuBindType=(null type)
srun: debug: Entering slurm_step_launch
srun: debug: mpi type = (null)
srun: debug: mpi/pmi2: p_mpi_hook_client_prelaunch: mpi/pmi2: client_prelaunch
srun: debug: mpi/pmi2: _get_proc_mapping: mpi/pmi2: processor mapping: (vector,(0,3,1))
srun: debug: mpi/pmi2: _setup_srun_socket: mpi/pmi2: srun pmi port: 37029
srun: debug2: mpi/pmi2: _tree_listen_readable: mpi/pmi2: _tree_listen_readable
srun: debug: mpi/pmi2: pmi2_start_agent: mpi/pmi2: started agent thread
srun: debug: Entering _msg_thr_create()
srun: debug: initialized stdio listening socket, port 41275
srun: debug: Started IO server thread (140538792195840)
srun: debug: Entering _launch_tasks
srun: launching StepId=8.0 on host nid001001-bae562bc0bd98e50ad5c03200efaf799d6e82469, 1 tasks: 0
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: launching StepId=8.0 on host nid001002-bae562bc0bd98e50ad5c03200efaf799d6e82469, 1 tasks: 1
srun: launching StepId=8.0 on host nid001003-bae562bc0bd98e50ad5c03200efaf799d6e82469, 1 tasks: 2
srun: route/default: init: route default plugin loaded
srun: debug2: Tree head got back 0 looking for 3
srun: debug2: Tree head got back 1
srun: debug2: Tree head got back 2
srun: debug2: Tree head got back 3
srun: debug: launch returned msg_rc=0 err=0 type=8001
srun: debug: launch returned msg_rc=0 err=0 type=8001
srun: debug: launch returned msg_rc=0 err=0 type=8001
srun: debug2: Activity on IO listening socket 17
srun: debug2: Entering io_init_msg_read_from_fd
srun: debug2: Leaving io_init_msg_read_from_fd
srun: debug2: Entering io_init_msg_validate
srun: debug2: Leaving io_init_msg_validate
srun: debug2: Validated IO connection from 10.233.88.26:33470, node rank 0, sd=18
srun: debug2: eio_message_socket_accept: got message connection from 10.233.88.26:53410 19
srun: debug2: received task launch
srun: launch/slurm: _task_start: Node nid001001-bae562bc0bd98e50ad5c03200efaf799d6e82469, 1 tasks started
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Activity on IO listening socket 17
srun: debug2: Entering io_init_msg_read_from_fd
srun: debug2: Leaving io_init_msg_read_from_fd
srun: debug2: Entering io_init_msg_validate
srun: debug2: Leaving io_init_msg_validate
srun: debug2: Validated IO connection from 10.233.88.25:52764, node rank 2, sd=19
srun: debug2: Entering io_init_msg_read_from_fd
srun: debug2: Leaving io_init_msg_read_from_fd
srun: debug2: Entering io_init_msg_validate
srun: debug2: Leaving io_init_msg_validate
srun: debug2: Validated IO connection from 10.233.88.27:52768, node rank 1, sd=20
srun: debug2: eio_message_socket_accept: got message connection from 10.233.88.25:47948 21
srun: debug2: received task launch
srun: launch/slurm: _task_start: Node nid001003-bae562bc0bd98e50ad5c03200efaf799d6e82469, 1 tasks started
srun: debug2: eio_message_socket_accept: got message connection from 10.233.88.27:41996 21
srun: debug2: received task launch
srun: launch/slurm: _task_start: Node nid001002-bae562bc0bd98e50ad5c03200efaf799d6e82469, 1 tasks started
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug2: Entering _file_write
srun: Job 8 step creation temporarily disabled, retrying (Requested nodes are busy)

The output clearly says the nodes are busy, but they are not: the queue is empty, all three nodes are idle, and other srun jobs work fine:

# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)

# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
workq* up infinite 3 idle nid001001-bae562bc0bd98e50ad5c03200efaf799d6e82469,nid001002-bae562bc0bd98e50ad5c03200efaf799d6e82469,nid001003-bae562bc0bd98e50ad5c03200efaf799d6e82469

[root@nid001001-bae562bc0bd98e50ad5c03200efaf799d6e82469 /]# srun -N3 hostname
nid001003-bae562bc0bd98e50ad5c03200efaf799d6e82469
nid001001-bae562bc0bd98e50ad5c03200efaf799d6e82469
nid001002-bae562bc0bd98e50ad5c03200efaf799d6e82469

Any idea what is stopping the MPI job from starting?

Thank you very much.
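P.S. For reference, /scratch/mpi-helloworld is essentially the textbook MPI hello-world. The exact source may differ slightly, but a minimal sketch along those lines, built with the mpicc from the same MPICH install, is:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    /* Initialize the MPI runtime; rank and size are provided by the
     * launcher (mpirun's process manager, or Slurm via PMI2). */
    MPI_Init(&argc, &argv);

    int world_size, world_rank;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);   /* total number of ranks */
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);   /* this process's rank   */

    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);

    printf("Hello world from processor %s, rank %d out of %d processors\n",
           processor_name, world_rank, world_size);

    MPI_Finalize();
    return 0;
}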