How are your individual MPI processes crashing when run under Torque? Are
there any error messages?
The environment for a Torque job on a worker node under OpenMPI is inherited
from the pbs_mom process. Differences between this environment and the
standard login environment can sometimes cause trouble. For example, on
InfiniBand clusters the "maximum locked memory" ulimit may need to be adjusted
by editing the script used to launch pbs_mom (usually the pbs-client init.d
script). I've also seen stack size problems in some user binaries that require
a similar ulimit adjustment to mimic what users may have in their
.bash_profile.
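As a rough sketch of how one might check for such differences, the snippet
below prints the limits mentioned above in a form that is easy to diff. The
function name report_limits and the output labels are only illustrative; they
are not part of Torque or OpenMPI, and the units assume bash's ulimit builtin.

```shell
#!/bin/bash
# Print the resource limits discussed above, one per line, so the output
# from a login shell can be diffed against the output from a Torque job.
# report_limits is a hypothetical helper name, not a Torque or OpenMPI tool.
report_limits() {
    echo "max_locked_memory_kb=$(ulimit -l)"   # matters for InfiniBand
    echo "stack_size_kb=$(ulimit -s)"          # matters for some Fortran binaries
    echo "core_file_size_blocks=$(ulimit -c)"  # 0 means no core dumps on a crash
}

report_limits
```

Running this once from a login shell on the node and once from inside a job
script, then comparing the two outputs, should show whether pbs_mom is handing
the job tighter limits than the login environment has.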
Instead of logging into the node directly, you might want to try an interactive
job (use "qsub -I") and then try your mpirun. This may give you messages that
for some reason aren't getting back to you in your job's .o or .e files.
Don Holmgren
Fermilab
On Tue, 31 Mar 2009, Rahul Nabar wrote:
I've a strange OpenMPI/Torque problem while trying to run a job on our
Opteron-SC-1435 based cluster:
Each node has 8 CPUs.
If I go to a node and run it like so, the job works:
mpirun -np 6 ${EXE_PATH}/${DACAPOEXE_PAR} ${ARGS}
If I submit the same job through PBS/Torque, it starts running but the
individual processes keep crashing:
mpirun -np 6 ${EXE_PATH}/${DACAPOEXE_PAR} ${ARGS}
I know that the --hostfile directive is not needed in the latest
torque-OpenMPI jobs.
I also tried including:
mpirun -np 6 --hosts node17,node17,node17,node17,node17,node17
${EXE_PATH}/${DACAPOEXE_PAR} ${ARGS}
Still does not work.
What could be going wrong? Are there other things I need to worry
about when PBS steps in? Any tips?
The ${DACAPOEXE_PAR} refers to a Fortran executable for the
computational chemistry code DACAPO.
What's the difference between submitting a job on a node via mpirun
directly vs. via Torque? Shouldn't both be transparent to the
Fortran calls? I am assuming I don't have to dig into the Fortran code.
Any debug tips?
Thanks!
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org