I have a strange OpenMPI/Torque problem while trying to run a job on our Opteron SC-1435 based cluster:
Each node has 8 CPUs. If I log in to a node and run directly, the job works:

mpirun -np 6 ${EXE_PATH}/${DACAPOEXE_PAR} ${ARGS}

If I submit the same job through PBS/Torque, it starts running, but the individual processes keep crashing:

mpirun -np 6 ${EXE_PATH}/${DACAPOEXE_PAR} ${ARGS}

I know that the --hostfile directive is not needed with recent Torque-aware OpenMPI builds, but I also tried listing the hosts explicitly:

mpirun -np 6 --host node17,node17,node17,node17,node17,node17 ${EXE_PATH}/${DACAPOEXE_PAR} ${ARGS}

Still no luck. What could be going wrong? Are there other things I need to worry about when PBS steps in? Any tips?

${DACAPOEXE_PAR} refers to a Fortran executable for the computational chemistry code DACAPO.

What is the difference between launching a job on a node via mpirun directly versus via Torque? Shouldn't both be transparent to the Fortran code? I am assuming I don't have to dig into the Fortran source. Any debug tips?

Thanks!

--
Rahul

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
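For reference, here is a minimal sketch of the kind of job script involved, with a couple of diagnostics added to compare the PBS run against the direct run (the resource line and paths are illustrative placeholders, not the actual setup):

```shell
#!/bin/bash
#PBS -l nodes=1:ppn=6
#PBS -j oe

# Show what Torque actually allocated. When Open MPI is built with
# Torque (tm) support, mpirun reads this allocation automatically,
# so a mismatch here would explain different behavior under PBS.
echo "PBS_NODEFILE = $PBS_NODEFILE"
cat "$PBS_NODEFILE"

# Jobs start in $HOME by default under Torque, not the submit dir.
cd "$PBS_O_WORKDIR"

# --display-allocation and --display-map print where Open MPI thinks
# each rank will land, which is useful for spotting launch-time
# differences between the interactive and batch cases.
mpirun -np 6 --display-allocation --display-map \
    ${EXE_PATH}/${DACAPOEXE_PAR} ${ARGS}
```

Comparing the printed node file and rank map against the working interactive run is usually the quickest way to see whether the crash is a placement/environment issue rather than something inside the Fortran code.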