On Tue, Mar 31, 2009 at 6:43 PM, Don Holmgren <djh...@fnal.gov> wrote:
>
> How are your individual MPI processes crashing when run under Torque? Are
> there any error messages?
Thanks, Don! There aren't any useful error messages. My job hierarchy is
actually like so:

  {shell script submitted to Torque} --> calls Python --> loop until
  convergence {calls a Fortran executable}

The Fortran executable is the one that makes the MPI calls to parallelize
over processors. The crash is *not* so bad that Torque kills the job. What
happens is that the Fortran executable crashes and Python keeps looping it
over and over again. The crash only occurs when I submit via Torque. If I do
this instead:

  mpirun from node --> shell wrapper --> calls Python --> loop until
  convergence {calls a Fortran executable}

then everything works fine. Note that the Python and shell layers are not
truly parallelized; the Fortran code is the only place where actual
parallelization happens.

> The environment for a Torque job on a worker node under openMPI is
> inherited from the pbs_mom process. Sometimes differences between this
> environment and the standard login environment can cause troubles.

Exactly. Can I somehow obtain a dump of this environment to compare the
direct mpirun run vs. the Torque run? What would be the best way? Just a
dump from "set"? Any crucial variables to look for? Maybe a ulimit?

>
> Instead of logging into the node directly, you might want to try an
> interactive job (use "qsub -I") and then try your mpirun.

I'm trying that now.

--
Rahul
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf
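P.S. One way to get the environment dump discussed above: a minimal sketch
of a script to submit via qsub that captures the environment and resource
limits pbs_mom hands to the job, so they can be diffed against a login
shell's. The file names (env_torque.txt, env_login.txt) are placeholders,
not anything Torque-specific.

```shell
#!/bin/sh
# Submit this with qsub to record the environment a Torque job actually
# inherits from pbs_mom on the worker node.

env | sort > env_torque.txt     # environment variables, sorted for diffing
ulimit -a >> env_torque.txt     # resource limits often differ too

# Then, from an ordinary login shell (or a "qsub -I" interactive job)
# on the same node:
#   env | sort > env_login.txt
#   ulimit -a >> env_login.txt
# and compare the two:
#   diff env_torque.txt env_login.txt
```

PATH, LD_LIBRARY_PATH, and the PBS_* variables are the usual first suspects
in the diff.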