Hi all,

Sorry for the noise; this turned out to be a problem with our configless setup.

Really must start running slurmd everywhere and get rid of the compute-only version of slurm.conf...

Cheers,

Mark

On Mon, 13 Dec 2021, Mark Dixon wrote:

Hi all,

Just wondering if anyone else had seen this.

Running Slurm 21.08.2, we're seeing srun work normally if it is able to
run immediately. However, if the job start is delayed, for example while
waiting for another job to end, srun fails:

  [test@foo ~]$ srun -p test --pty bash
  [test@bar ~]$ exit
  exit
  [test@foo ~]$

  [test@foo ~]$ sbatch -p test --exclusive sleep.sh
  Submitted batch job 3407
  [test@foo ~]$ srun -p test --pty bash
  srun: job 3409 queued and waiting for resources
  srun: error: Security violation, slurm message from uid 456
  srun: error: Security violation, slurm message from uid 456
  srun: error: Job allocation 3409 has been revoked
  [test@foo ~]$

With --slurmd-debug=verbose, I see:

  srun: job 3390 queued and waiting for resources
  srun: error: Security violation, slurm message from uid 456
  srun: error: Security violation, slurm message from uid 456
  srun: error: Job allocation 3390 has been revoked

Meanwhile, the slurmd log shows:

  [2021-12-13T13:08:06.028] Job 3390 already killed, do not launch extern step


Any ideas, please?

Thanks!

Mark



