Hello all,

I made some modifications to my slurm.conf, then restarted slurmctld on the master followed by slurmd on the nodes. During this process I lost some jobs (*); curiously, all of them were running on Ubuntu nodes. The jobs were within their requested resources (**).
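For reference, the restart sequence was roughly the following (a sketch only: node01/node02 are placeholder hostnames, and the config path is assumed to be /etc/slurm/slurm.conf; the Debian/Ubuntu packages may place it under /etc/slurm-llnl/ instead):

    # on the master, after editing slurm.conf
    sudo systemctl restart slurmctld

    # on each compute node
    sudo systemctl restart slurmd

    # quick sanity check that every node really sees the same file:
    # all checksums printed here should be identical
    md5sum /etc/slurm/slurm.conf
    for h in node01 node02; do ssh "$h" md5sum /etc/slurm/slurm.conf; done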
Any idea what the problem could be?

Thanks in advance.
Best regards,
Christine Leroy

(*) Excerpt from the slurmctld log:

[2021-11-29T14:17:09.205] error: Node xxx appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2021-11-29T14:17:10.162] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2021-11-29T14:17:42.223] _job_complete: JobId=4546 WTERMSIG 15
[2021-11-29T14:17:42.223] _job_complete: JobId=4546 done
[2021-11-29T14:17:42.224] _job_complete: JobId=4666 WTERMSIG 15
[2021-11-29T14:17:42.224] _job_complete: JobId=4666 done
[2021-11-29T14:17:42.236] _job_complete: JobId=4665 WTERMSIG 15
[2021-11-29T14:17:42.236] _job_complete: JobId=4665 done
[2021-11-29T14:17:46.072] _job_complete: JobId=4533 WTERMSIG 15
[2021-11-29T14:17:46.072] _job_complete: JobId=4533 done
[2021-11-29T14:17:59.005] _job_complete: JobId=4664 WTERMSIG 15
[2021-11-29T14:17:59.005] _job_complete: JobId=4664 done
[2021-11-29T14:17:59.006] _job_complete: JobId=4663 WTERMSIG 15
[2021-11-29T14:17:59.007] _job_complete: JobId=4663 done
[2021-11-29T14:17:59.021] _job_complete: JobId=4539 WTERMSIG 15
[2021-11-29T14:17:59.021] _job_complete: JobId=4539 done

(**) sacct output for the affected jobs:

# sacct --format=JobID,JobName,ReqCPUS,ReqMem,Start,State,CPUTime,MaxRSS | grep -f /tmp/job15
4533         xterm          1   16Gn  2021-11-24T16:31:32     FAILED   4-21:46:14
4533.batch   batch          1   16Gn  2021-11-24T16:31:32  CANCELLED   4-21:46:14   8893664K
4533.extern  extern         1   16Gn  2021-11-24T16:31:32  COMPLETED   4-21:46:11          0
4539         xterm         16   16Gn  2021-11-24T16:34:25     FAILED  78-11:37:04
4539.batch   batch         16   16Gn  2021-11-24T16:34:25  CANCELLED  78-11:37:04  23781384K
4539.extern  extern        16   16Gn  2021-11-24T16:34:25  COMPLETED  78-11:32:48          0
4546         xterm         16   16Gn  2021-11-24T17:17:54     FAILED  77-23:56:48
4546.batch   batch         16   16Gn  2021-11-24T17:17:54  CANCELLED  77-23:56:48  18541468K
4546.extern  extern        16   16Gn  2021-11-24T17:17:54  COMPLETED  77-23:56:00          0
4663         xterm          1   12Gn  2021-11-26T16:51:12     FAILED   2-21:26:47
4663.batch   batch          1   12Gn  2021-11-26T16:51:12  CANCELLED   2-21:26:47   2275232K
4663.extern  extern         1   12Gn  2021-11-26T16:51:12  COMPLETED   2-21:26:34          0
4664         xterm          1   12Gn  2021-11-26T17:13:42     FAILED   2-21:04:17
4664.batch   batch          1   12Gn  2021-11-26T17:13:42  CANCELLED   2-21:04:17   1484036K
4664.extern  extern         1   12Gn  2021-11-26T17:13:42  COMPLETED   2-21:04:17          0
4665         xterm          1    8Gn  2021-11-26T17:18:12     FAILED   2-20:59:30
4665.batch   batch          1    8Gn  2021-11-26T17:18:12  CANCELLED   2-20:59:30   1159140K
4665.extern  extern         1    8Gn  2021-11-26T17:18:12  COMPLETED   2-20:59:27          0
4666         xterm          1    8Gn  2021-11-26T17:22:12     FAILED   2-20:55:30
4666.batch   batch          1    8Gn  2021-11-26T17:22:12  CANCELLED   2-20:55:30   2090708K
4666.extern  extern         1    8Gn  2021-11-26T17:22:12  COMPLETED   2-20:55:27          0
4711         xterm          4    3Gn  2021-11-29T14:47:09     FAILED     00:20:08
4711.batch   batch          4    3Gn  2021-11-29T14:47:09  CANCELLED     00:20:08     37208K
4711.extern  extern         4    3Gn  2021-11-29T14:47:09  COMPLETED     00:20:00          0
4714         deckbuild     10   30Gn  2021-11-29T14:51:46     FAILED     00:05:20
4714.batch   batch         10   30Gn  2021-11-29T14:51:46  CANCELLED     00:05:20      4036K
4714.extern  extern        10   30Gn  2021-11-29T14:51:46  COMPLETED     00:05:10          0
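PS: As the log message in (*) suggests, the hash check can be silenced when differing configs are intentional; I have not set this, since the files are supposed to be identical on all nodes. For reference, it would look like this in slurm.conf:

    # only appropriate when differing slurm.conf files are expected (per the log message)
    DebugFlags=NO_CONF_HASH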