> On 07.04.2019 at 19:15, Mahmood Naderan <mahmood...@gmail.com> wrote:
> 
> Hi,
> A multinode MPI job terminated with the following messages in the log file
> 
> =------------------------------------------------------------------------------=
>    JOB DONE.
> =------------------------------------------------------------------------------=
> STOP 2
> STOP 2
> STOP 2
> STOP 2
> STOP 2
> STOP 2
> STOP 2
> STOP 2
> STOP 2
> STOP 2
> STOP 2
> STOP 2
> -------------------------------------------------------
> Primary job  terminated normally, but 1 process returned
> a non-zero exit code.. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> STOP 2
> STOP 2
> --------------------------------------------------------------------------
> mpirun detected that one or more processes exited with non-zero status, thus causing
> the job to be terminated. The first process to do so was:
> 
>   Process name: [[9801,1],8]
>   Exit code:    2
> ------------------------------------
> 
> 
> Although it says the job is done, I would like to know whether it
> terminated abnormally.
> Moreover, I cannot figure out whether there is a problem with the input
> files. For example, maybe the calculations diverged, but this error does
> not clarify that.
> Any ideas?

This seems to be unrelated to SLURM.

I assume you are using Open MPI. In Open MPI *all* processes must exit with an
exit code of zero; otherwise an error in the application is assumed, even if
MPI_Finalize() was called beforehand and MPI_ABORT() was not. This is of course
open to discussion: at least rank zero should be able to return a non-zero exit
code to the calling script (IMO). I suggest raising this question on the
Open MPI mailing list.
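
As a toy model of the rule described above (not Open MPI's actual
implementation), the check can be pictured like this. The function name and
the list of per-rank exit codes are made up for illustration:

```python
def first_nonzero_rank(exit_codes):
    """Return (rank, code) for the first non-zero exit code, or None.

    exit_codes is a hypothetical list indexed by rank. A real launcher
    reports the first process to exit non-zero in time, not by rank order.
    """
    for rank, code in enumerate(exit_codes):
        if code != 0:
            return rank, code
    return None


# Every rank exits with 2 (as "STOP 2" in the log suggests): job is flagged.
print(first_nonzero_rank([2] * 14))   # (0, 2)
# Every rank exits with 0: nothing to report.
print(first_nonzero_rank([0, 0, 0]))  # None
```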

I don't know what the MPI standard says about it, but with Intel MPI it's 
different: an exit after MPI_Finalize() is treated as a normal program 
termination. The highest value returned by any of the processes will be 
returned to the job script and no application error is raised. Hence one can 
act on this return code in a proper way in the job script.
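
A minimal sketch of acting on such a return code, written in Python rather
than as a shell job script. The helper names are hypothetical, the "highest
value" function only models the behaviour described above, and the command
that is run is a stand-in for the real launch line:

```python
import subprocess
import sys


def highest_exit_code(exit_codes):
    """Toy model: the job's status is the largest per-process exit code."""
    return max(exit_codes, default=0)


def run_and_report(argv):
    """Run a command and turn its return code into a short verdict."""
    returncode = subprocess.run(argv).returncode
    if returncode == 0:
        verdict = "ok"
    elif returncode < 0:
        # subprocess reports termination by a signal as a negative value.
        verdict = "killed by signal %d" % -returncode
    else:
        verdict = "application returned %d" % returncode
    return returncode, verdict


if __name__ == "__main__":
    # Stand-in for the real launch line (e.g. mpirun plus the application).
    stand_in = [sys.executable, "-c", "import sys; sys.exit(2)"]
    print(run_and_report(stand_in))         # (2, 'application returned 2')
    print(highest_exit_code([0, 2, 0, 1]))  # 2
```

What a non-zero value actually means (diverged calculation, bad input, ...)
is up to the application, so the branches would have to follow its
documentation.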

-- Reuti
