I am having trouble running a python-mpi program on more than one
node. The python-mpi program is very simple.
do you think there's something unique about the python program?
(also, you mean mpi4py, right?)
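Since the program is presumably mpi4py-based, one quick check is a minimal sketch like the one below. Its contents are hypothetical; only the file name python-mpi.py comes from the question. Each rank reports the host it runs on, and rank 0 gathers the names. The gather is a real message between ranks, so it exercises communication across nodes as well as process startup.

```python
from collections import Counter


def ranks_per_host(hostnames):
    """Count how many MPI ranks landed on each host (list indexed by rank)."""
    return dict(Counter(hostnames))


def main():
    # Imported here so the helper above can be used without mpi4py installed.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()
    host = MPI.Get_processor_name()

    # Every rank sends its hostname to rank 0; with -N2 this crosses nodes,
    # so it tests the inter-node path and not just the launch.
    hosts = comm.gather(host, root=0)
    if rank == 0:
        print(f"{size} ranks: {ranks_per_host(hosts)}")


if __name__ == "__main__":
    main()
```

Run under `srun -N2 -n8 python3 python-mpi.py`, you would expect two distinct hostnames in rank 0's output. If this hangs or aborts while the `-N1` version works, the problem is in inter-node MPI rather than in the program's logic.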
Since authentication in Slurm is handled by munge, do I need passwordless
SSH between slurmctld and the nodes? (I found a guide, probably outdated,
stating that passwordless SSH is a necessity for Slurm:
HTTP://admin-magazine.com/HPC/Articles/Resource-Management-with-Slurm).
no, you don't need it. the combination of slurmd (which does the actual
spawning) and munge (for credentials/authentication) is how slurmctld
starts jobs.
I suspect that's an editing escape: you do usually want mutual access among
user-accessible nodes (login, compute, but not usually admin things like
slurmctld or slurmdbd nodes).
I run it with 'srun -N2 -n8 python3 python-mpi.py'.
using srun does not depend on ssh. if you use mpirun/mpiexec, it *might*
depend on ssh (but only among the compute nodes).
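Because srun launches tasks without ssh, you can separate a launch problem from an MPI problem by running a script that does not use MPI at all under the same srun line. The sketch below assumes a hypothetical file name, slurm_env.py. It only reads variables that srun sets for each task (SLURM_PROCID, SLURM_NTASKS, SLURM_NODEID, SLURM_JOB_NODELIST) and prints them next to the local hostname.

```python
import os
import socket


def slurm_task_info(env):
    """Pick out per-task variables that srun sets; missing ones show as <unset>."""
    keys = ("SLURM_PROCID", "SLURM_NTASKS", "SLURM_NODEID", "SLURM_JOB_NODELIST")
    return {k: env.get(k, "<unset>") for k in keys}


if __name__ == "__main__":
    info = slurm_task_info(os.environ)
    # Prefix with the real hostname so output from different nodes is easy to tell apart.
    print(socket.gethostname(), " ".join(f"{k}={v}" for k, v in info.items()))
```

With `srun -N2 -n8 python3 slurm_env.py`, eight lines spread over two hostnames would mean srun itself launches fine on both nodes, which points the failure at the MPI layer instead.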
It works fine when running on a single node (with '-N1' instead of '-N2'), but
it aborts or stops when running on two nodes.
I would guess you need to look at slurmd logs on the nodes.
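To skim a slurmd log for problems, a small filter like the sketch below can help. The log path is an assumption: slurmd writes wherever `SlurmdLogFile` in slurm.conf points on that node, so the path is passed as an argument. The choice of "error" and "fatal" as search words is likewise only a reasonable default, not a Slurm guarantee.

```python
import sys


def error_lines(lines, needles=("error", "fatal")):
    """Return log lines that mention any of the given words (case-insensitive)."""
    lowered = tuple(n.lower() for n in needles)
    return [ln.rstrip("\n") for ln in lines if any(n in ln.lower() for n in lowered)]


if __name__ == "__main__":
    # First argument: path to the slurmd log on this node (see SlurmdLogFile).
    path = sys.argv[1]
    with open(path, encoding="utf-8", errors="replace") as fh:
        for line in error_lines(fh):
            print(line)
```

Run it on each node of the failing two-node job around the time the job aborted.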
regards, mark hahn.