Hi all,

Just wondering if anyone else has seen this.

We're running Slurm 21.08.2, and srun works normally when the job can start immediately. However, if the job start is delayed (for instance, because it has to wait for another job to end), srun fails:

  [test@foo ~]$ srun -p test --pty bash
  [test@bar ~]$ exit
  exit
  [test@foo ~]$

  [test@foo ~]$ sbatch -p test --exclusive sleep.sh
  Submitted batch job 3407
  [test@foo ~]$ srun -p test --pty bash
  srun: job 3409 queued and waiting for resources
  srun: error: Security violation, slurm message from uid 456
  srun: error: Security violation, slurm message from uid 456
  srun: error: Job allocation 3409 has been revoked
  [test@foo ~]$

With --slurmd-debug=verbose, I see:

  srun: job 3390 queued and waiting for resources
  srun: error: Security violation, slurm message from uid 456
  srun: error: Security violation, slurm message from uid 456
  srun: error: Job allocation 3390 has been revoked

Meanwhile, the slurmd log shows:

  [2021-12-13T13:08:06.028] Job 3390 already killed, do not launch extern step


Any ideas, please?

Thanks!

Mark
