Hi Matteo,

On Fri, Jun 29, 2018 at 10:13:33AM +0000, Matteo Guglielmi wrote:
> Dear community,
>
> I have a user who usually submits 36 (identical) jobs at a time using a
> simple for loop, so the jobs are all sbatched at the same time.
>
> Each job requests a single core, and all jobs are independent of one
> another (they read different input files and write to different output
> files).
>
> The jobs are then usually started over the next couple of hours, at
> somewhat random times.
>
> What happens is that after a certain amount of time (anywhere from 2 to
> 12 hours), ALL jobs belonging to this particular user are killed by
> Slurm on all nodes at exactly the same time.
>
> One example:
>
> ### master: /var/log/slurmctld.log ###
>
> [2018-06-28T18:43:06.871] _slurm_rpc_submit_batch_job: JobId=718560 InitPrio=4294185624 usec=255
> ...
> [2018-06-28T19:29:04.671] backfill: Started JobID=718560 in partition on node38
> ...
> [2018-06-28T23:37:53.471] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 718560 uid 1007
> [2018-06-28T23:37:53.472] _job_signal: 9 of running JobID=718560 State=0x8004 NodeCnt=1 successful 0x8004

That REQUEST_KILL_JOB line looks like the user (presuming that uid 1007 is
them; otherwise it's an operator who can kill jobs) killed their own job.

Have a look in the slurmctld.log for more lines with 'REQUEST_KILL_JOB'; if
they all appear at basically the same time, then it looks like uid 1007 did
something like 'scancel -u theusername'.

That might not be it, but that would be my first guess.

Paddy

-- 
Paddy Doyle
Trinity Centre for High Performance Computing,
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
Phone: +353-1-896-3725
http://www.tchpc.tcd.ie/
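A minimal sketch of that log check, run on the controller node (assumptions:
the log path is the one shown in the excerpt above, job accounting is enabled
for the sacct query, and 'theusername' is the placeholder from the reply):

  # Every kill request the controller handled; a burst of these for one
  # uid at the same timestamp suggests a bulk 'scancel -u'.
  grep 'REQUEST_KILL_JOB' /var/log/slurmctld.log

  # Map uid 1007 back to a username.
  getent passwd 1007

  # With accounting enabled, cancelled jobs report "CANCELLED by <uid>"
  # in the State field, which identifies who sent the signal.
  sacct -u theusername -S 2018-06-28 -E 2018-06-29 \
        --format=JobID,State%25,Start,End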