Greetings everyone, I have an issue with jobs I'm submitting and I have no idea how to solve it.
I submit the following script using sbatch:

#!/bin/bash
#SBATCH -t 1-00:00:00
#SBATCH --mem 8000
#SBATCH -n 512
#SBATCH -p all       # partition
#SBATCH -J 1day8GB   # job name

ulimit -s unlimited
module purge
module load gcc63/netcdf/4.6.1
module load gcc63/netcdf-fortran/4.4.4

# After 30 s, append the process tree to psFux.out every 10 s for as long as the solver is running
{ sleep 30 && while killall -q -0 pschism_LNEC_WS_GNU_VL_HA; do ps fux >> psFux.out; sleep 10; done; } &
mpirun pschism_LNEC_WS_GNU_VL_HA

The job runs for exactly 1 hour, 5 minutes and 11 seconds every time I submit it, and then the slurm+++.out file contains the following output:

--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 421 with PID 74958 on node wn058 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

When I look at the slurmd.log on the node named in the error message, I see the following:

[2019-09-12T10:19:32.693] launch task 3684.0 request from UID:4000005 GID:4000001 HOST:x.x.x.22 PORT:44211
[2019-09-12T10:19:32.695] _run_prolog: run job script took usec=15
[2019-09-12T10:19:32.695] _run_prolog: prolog with lock for job 3684 ran for 0 seconds
[2019-09-12T11:24:37.306] [3684.0] done with job

This issue is happening to a lot of users and I no longer know where to look. I tried submitting several jobs using openmpi to see if that was the issue, and they all finished OK.

Does anyone have any idea how to debug this issue?

Best regards,
Zacarias Benta
INCD @ LIP - UMinho