Hi all,

Sorry for the noise; this was down to a problem with our configless setup.
Really must start running slurmd everywhere and get rid of the compute-only version of slurm.conf...
Cheers,

Mark

On Mon, 13 Dec 2021, Mark Dixon wrote:
Hi all,

Just wondering if anyone else had seen this.

Running slurm 21.08.2, we're seeing srun work normally if it is able to run immediately. However, if there is a delay in job start, for example after a wait for another job to end, srun fails. e.g.

    [test@foo ~]$ srun -p test --pty bash
    [test@bar ~]$ exit
    exit
    [test@foo ~]$
    [test@foo ~]$ sbatch -p test --exclusive sleep.sh
    Submitted batch job 3407
    [test@foo ~]$ srun -p test --pty bash
    srun: job 3409 queued and waiting for resources
    srun: error: Security violation, slurm message from uid 456
    srun: error: Security violation, slurm message from uid 456
    srun: error: Job allocation 3409 has been revoked
    [test@foo ~]$

With --slurmd-debug=verbose, I see:

    srun: job 3390 queued and waiting for resources
    srun: error: Security violation, slurm message from uid 456
    srun: error: Security violation, slurm message from uid 456
    srun: error: Job allocation 3390 has been revoked

Meanwhile, the slurmd log shows:

    [2021-12-13T13:08:06.028] Job 3390 already killed, do not launch extern step

Any ideas, please?

Thanks!

Mark