Re: [slurm-users] All user's jobs killed at the same time on all nodes

2018-06-29 Thread Thomas M. Payerle
A couple of comments/possible suggestions. First, it looks to me like all the jobs are run from the same directory with the same input/output files. Or am I missing something? Also, what MPI library is being used? I would suggest verifying whether any of the jobs in question are terminating normally. I.e.
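One quick way to check for abnormal terminations is to ask Slurm's accounting database for the state and exit code of each job. A minimal sketch, assuming accounting is enabled and with <username> as a placeholder:

    # Show state and exit code for the user's jobs started today;
    # CANCELLED/FAILED states or non-zero exit codes indicate abnormal endings.
    sacct -u <username> --starttime=today \
          --format=JobID,JobName,State,ExitCode,End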

Re: [slurm-users] All user's jobs killed at the same time on all nodes

2018-06-29 Thread John Hearns
I have got this all wrong. Paddy Doyle has got it right. However, are you SURE that mpirun is not creating tasks on the other machines? I would look at the compute nodes while the job is running and do ps -eaf --forest. Also, using mpirun to run a single core gives me the heebie-jeebies... https://
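A sketch of how one might do that from a login node, assuming passwordless ssh to the compute nodes and <jobid> as a placeholder:

    # Expand the job's node list and dump the process tree on each node;
    # any mpirun-spawned children on remote nodes will show up in the forest.
    for node in $(scontrol show hostnames "$(squeue -j <jobid> -h -o %N)"); do
        echo "== $node =="
        ssh "$node" 'ps -eaf --forest'
    done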

Re: [slurm-users] All user's jobs killed at the same time on all nodes

2018-06-29 Thread Matteo Guglielmi
You are right, but I'm actually supporting the system administrator of that cluster; I'll mention this to him. Besides that, the user runs this for loop to submit the jobs:

    # submit.sh
    #
    typeset -i i=1
    typeset -i j=12500  # number of frames goes to each core = number of frames (100)/40 (cor
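A submission loop of that general shape, sketched with hypothetical frame ranges and a hypothetical job-script name (not the user's actual script), might look like:

    # sketch only -- frame counts, loop bounds, and run_frames.sh are made up
    typeset -i i=1       # first frame handled by this job
    typeset -i j=12500   # last frame handled by this job
    for n in $(seq 1 36); do
        sbatch -n 1 run_frames.sh "$i" "$j"
        i=$(( i + 12500 ))
        j=$(( j + 12500 ))
    done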

Re: [slurm-users] All user's jobs killed at the same time on all nodes

2018-06-29 Thread Paddy Doyle
Hi Matteo, On Fri, Jun 29, 2018 at 10:13:33AM +, Matteo Guglielmi wrote: > Dear community, > > I have a user who usually submits 36 (identical) jobs at a time using a > simple for loop, > thus jobs are sbatched all at the same time. > > Each job requests a single core and all jobs are independ

Re: [slurm-users] All user's jobs killed at the same time on all nodes

2018-06-29 Thread John Hearns
Matteo, a stupid question, but if these are single-CPU jobs why is mpirun being used? Is your user using these 36 jobs to construct a parallel job to run charmm? If the mpirun is killed, yes, all the other processes which are started by it on the other compute nodes will be killed. I suspect your u
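If the jobs really are serial, one way to rule the MPI launcher out entirely is to skip mpirun and call the binary directly from a single-core batch script. A sketch only, with a hypothetical charmm input/output naming scheme:

    #!/bin/bash
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=1

    # Run the binary directly rather than via 'mpirun -np 1 ...',
    # so no MPI launcher can spawn (or kill) anything on other nodes.
    charmm < input_$1.inp > output_$1.out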

[slurm-users] All user's jobs killed at the same time on all nodes

2018-06-29 Thread Matteo Guglielmi
Dear community, I have a user who usually submits 36 (identical) jobs at a time using a simple for loop, thus jobs are sbatched all at the same time. Each job requests a single core and all jobs are independent of one another (they read different input files and write to different output files). Jobs
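For what it's worth, a batch of independent single-core jobs like this also maps naturally onto a Slurm job array instead of a shell loop; a sketch with hypothetical program and file names:

    #!/bin/bash
    #SBATCH --array=1-36
    #SBATCH --ntasks=1

    # One array task per input file; each task reads and writes its own files,
    # and killing one task does not touch its siblings.
    ./my_program input_${SLURM_ARRAY_TASK_ID}.dat > output_${SLURM_ARRAY_TASK_ID}.log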