[slurm-users] Jobs stop after 1:05:11 with segmentation fault.

Zacarias Benta Thu, 12 Sep 2019 03:53:12 -0700

Greetings everyone,

I have an issue with jobs I'm submitting, and I have no idea how to
solve it.


I submit the following script using sbatch:

#!/bin/bash
#SBATCH -t 1-00:00:00
#SBATCH --mem 8000 
#SBATCH -n 512
#SBATCH -p all # partition
#SBATCH -J 1day8GB # job name


ulimit -s unlimited

module purge
module load gcc63/netcdf/4.6.1
module load gcc63/netcdf-fortran/4.4.4

{ sleep 30 && while killall -q -0 pschism_LNEC_WS_GNU_VL_HA; do ps fux >> psFux.out; sleep 10; done; } &

mpirun pschism_LNEC_WS_GNU_VL_HA


The job runs for exactly 1 hour, 5 minutes and 11 seconds every time I
submit it, and then the slurm+++.out gives me the following output:

--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 421 with PID 74958 on node wn058
exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

When I look at the slurmd.log on the node named in the error message,
I see the following:

[2019-09-12T10:19:32.693] launch task 3684.0 request from UID:4000005 GID:4000001 HOST:x.x.x.22 PORT:44211
[2019-09-12T10:19:32.695] _run_prolog: run job script took usec=15
[2019-09-12T10:19:32.695] _run_prolog: prolog with lock for job 3684 ran for 0 seconds
[2019-09-12T11:24:37.306] [3684.0] done with job
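
For reference, the gap between the first and last timestamps in this log
excerpt can be worked out with a small shell sketch. It assumes GNU date
is available; the names start, end and to_epoch are illustrative only and
are not part of Slurm.

```shell
#!/bin/bash
# Elapsed time between two slurmd.log timestamps (assumes GNU date).
start="2019-09-12T10:19:32.693"   # "launch task 3684.0" line
end="2019-09-12T11:24:37.306"     # "done with job" line

to_epoch() {
    # Drop the fractional seconds and the "T" separator,
    # then convert the timestamp to seconds since the epoch.
    local ts="${1%.*}"
    TZ=UTC date -d "${ts/T/ }" +%s
}

elapsed=$(( $(to_epoch "$end") - $(to_epoch "$start") ))
printf '%02d:%02d:%02d\n' $((elapsed / 3600)) $((elapsed % 3600 / 60)) $((elapsed % 60))
```

For the two timestamps above this prints 01:05:05, which is close to the
1:05:11 runtime reported for the job.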

This issue is affecting a lot of users and I have no idea where else
to look.
I tried submitting several jobs using openmpi to see if that was the
issue, and they all finished OK.
Does anyone have any idea how to debug this issue?



Cumprimentos / Best Regards,
Zacarias Benta
INCD @ LIP - UMinho
