Hi,

I am getting an error with SLURM's slurmctld on Ubuntu 20.04 when starting the service through systemctl.
I installed munge and SLURM version 19.05.5-1 through the package manager from the default repository:

    apt-get install munge slurm-client slurm-wlm slurm-wlm-doc slurmctld slurmd

    systemctl start slurmctld &
    [1] 2735

    18:55 [root@slurm ~]# systemctl status slurmctld
    ● slurmctld.service - Slurm controller daemon
         Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled; vendor preset: enabled)
         Active: activating (start) since Wed 2021-03-17 18:55:49 CET; 5s ago
           Docs: man:slurmctld(8)
        Process: 2737 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
          Tasks: 12
         Memory: 2.5M
         CGroup: /system.slice/slurmctld.service
                 └─2759 /usr/sbin/slurmctld

    Mar 17 18:55:49 slurm systemd[1]: Starting Slurm controller daemon...
    Mar 17 18:55:49 slurm systemd[1]: slurmctld.service: Can't open PID file /run/slurmctld.pid (yet?) after start: Operation not permitted

After about 60 seconds slurmctld terminates:

    -- A stop job for unit slurmctld.service has finished.
    --
    -- The job identifier is 1043 and the job result is done.
    Mar 17 18:55:49 slurm systemd[1]: Starting Slurm controller daemon...
    -- Subject: A start job for unit slurmctld.service has begun execution
    -- Defined-By: systemd
    -- Support: http://www.ubuntu.com/support
    --
    -- A start job for unit slurmctld.service has begun execution.
    --
    -- The job identifier is 1044.
    Mar 17 18:55:49 slurm systemd[1]: slurmctld.service: Can't open PID file /run/slurmctld.pid (yet?) after start: Operation not permitted
    Mar 17 18:57:19 slurm systemd[1]: slurmctld.service: start operation timed out. Terminating.
    Mar 17 18:57:19 slurm systemd[1]: slurmctld.service: Failed with result 'timeout'.
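
The journal message above names the PID file that systemd is waiting for, while slurmctld itself writes its PID to whatever SlurmctldPidFile= in slurm.conf says. A minimal sketch for comparing the two settings character by character follows. It is written in Python, the helper names read_setting and pid_paths_match are hypothetical (they are not part of SLURM or systemd), and the sample strings only mirror paths quoted in this post. To check a real system, the contents of the unit file and of slurm.conf would be read from disk and passed in.

```python
import re


def read_setting(text, key):
    """Return the value of the last `key=value` line in text, or None.

    Text after a '#' is treated as a comment. Matching is case-insensitive,
    which is a simplification for this sketch.
    """
    value = None
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        m = re.match(rf"{re.escape(key)}\s*=\s*(.+)", line, flags=re.IGNORECASE)
        if m:
            value = m.group(1).strip()
    return value


def pid_paths_match(unit_text, slurm_conf_text):
    """Compare PIDFile= from a unit file with SlurmctldPidFile= from slurm.conf.

    Returns (unit_pid, conf_pid, same), where same is True only if both
    settings are present and identical.
    """
    unit_pid = read_setting(unit_text, "PIDFile")
    conf_pid = read_setting(slurm_conf_text, "SlurmctldPidFile")
    same = unit_pid is not None and unit_pid == conf_pid
    return unit_pid, conf_pid, same


if __name__ == "__main__":
    # Sample contents only; they mirror paths quoted in this post.
    unit = "[Service]\nPIDFile=/run/slurmctld.pid\n"
    conf = (
        "SlurmctldPidFile=/run/slurm-llnl/slurmctld.pid\n"
        "SlurmdPidFile=/run/slurm-llnl/slurmd.pid\n"
    )
    unit_pid, conf_pid, same = pid_paths_match(unit, conf)
    print("unit file :", unit_pid)
    print("slurm.conf:", conf_pid)
    print("match     :", same)
```

Comparing the full strings rather than eyeballing them also catches small spelling differences in the directory name.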

My slurm.conf file lists custom PID file locations for slurmctld and slurmd:

    /etc/slurm-llnl/slurm.conf
    SlurmctldPidFile=/run/slurm-llnl/slurmctld.pid
    SlurmdPidFile=/run/slurm-llnl/slurmd.pid

Starting the slurmctld executable by hand works fine:

    /usr/sbin/slurmctld &
    pgrep slurmctld
    2819
    [1]+  Done                    /usr/sbin/slurmctld
    pgrep slurmctld
    2819
    squeue
    JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
    sinfo -lNe
    Wed Mar 17 19:01:45 2021
    NODELIST  NODES  PARTITION  STATE     CPUS  S:C:T  MEMORY  TMP_DISK  WEIGHT  AVAIL_FE  REASON
    ekgen1    1      cluster*   unknown*  16    2:8:1  480000  0         1       (null)    none
    ekgen2    1      cluster*   down*     16    2:8:1  250000  0         1       (null)    Not responding
    ekgen3    1      debian     unknown*  16    2:8:1  250000  0         1       (null)    none
    ekgen4    1      cluster*   unknown*  16    2:8:1  250000  0         1       (null)    none
    ekgen5    1      cluster*   unknown*  16    2:8:1  250000  0         1       (null)    none
    ekgen6    1      debian     unknown*  16    2:8:1  250000  0         1       (null)    none
    ekgen7    1      cluster*   unknown*  16    2:8:1  250000  0         1       (null)    none
    ekgen8    1      debian     down*     16    2:8:1  250000  0         1       (null)    Not responding
    ekgen9    1      cluster*   unknown*  16    2:8:1  192000  0         1       (null)    none

I then tried to modify /lib/systemd/system/slurmd.service:

    cp /lib/systemd/system/slurmd.service /lib/systemd/system/slurmd.service.orig

and changed

    PIDFile=/run/slurmd.pid

to

    PIDFile=/run/slurm-llnl/slurmd.pid

Then:

    systemctl start slurmctld &
    [1] 1869
    pgrep slurm
    1875
    squeue
    JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)

After ca. 60 seconds:

    Job for slurmctld.service failed because a timeout was exceeded.
    See "systemctl status slurmctld.service" and "journalctl -xe" for details

    -- Subject: A start job for unit packagekit.service has finished successfully
    -- Defined-By: systemd
    -- Support: http://www.ubuntu.com/support
    --
    -- A start job for unit packagekit.service has finished successfully.
    --
    -- The job identifier is 586.
    Mar 17 18:28:08 slurm systemd[1]: slurmctld.service: start operation timed out. Terminating.
    Mar 17 18:28:08 slurm systemd[1]: slurmctld.service: Failed with result 'timeout'.
    -- Subject: Unit failed
    -- Defined-By: systemd
    -- Support: http://www.ubuntu.com/support
    --
    -- The unit slurmctld.service has entered the 'failed' state with result 'timeout'.
    Mar 17 18:28:08 slurm systemd[1]: Failed to start Slurm controller daemon.
    -- Subject: A start job for unit slurmctld.service has failed
    -- Defined-By: systemd
    -- Support: http://www.ubuntu.com/support
    --
    -- A start job for unit slurmctld.service has finished with a failure.
    --
    -- The job identifier is 511 and the job result is failed.
    Mar 17 18:31:18 slurm systemd[1]: Starting Slurm controller daemon...
    -- Subject: A start job for unit slurmctld.service has begun execution
    -- Defined-By: systemd
    -- Support: http://www.ubuntu.com/support
    --
    -- A start job for unit slurmctld.service has begun execution.
    --
    -- The job identifier is 662.
    Mar 17 18:31:18 slurm systemd[1]: slurmctld.service: Can't open PID file /run/slurm-lnll/slurmctld.pid (yet?) after start: Operation not permitted
    Mar 17 18:32:48 slurm systemd[1]: slurmctld.service: start operation timed out. Terminating.
    Mar 17 18:32:48 slurm systemd[1]: slurmctld.service: Failed with result 'timeout'.
    -- Subject: Unit failed
    -- Defined-By: systemd
    -- Support: http://www.ubuntu.com/support

I then created the directory by hand:

    mkdir /run/slurm-lnll/
    chown slurm: /run/slurm-lnll/
    ls -lthrd /run/slurm-lnll/
    drwxr-xr-x 2 slurm slurm 40 Mar 17 18:34 /run/slurm-lnll/

It doesn't create the PID file:

    ls -lthr /run/slurm-lnll/
    total 0

A work-around, writing the PID manually to the PID file, does work:

    systemctl start slurmctld & sleep 10; echo `pgrep slurmctld` > /run/slurm-lnll/slurmctld.pid && chown slurm: /run/slurm-lnll/slurmctld.pid && cat /run/slurm-lnll/slurmctld.pid

The status output still reports a problem:

    systemctl status slurmctld
    ● slurmctld.service - Slurm controller daemon
         Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled; vendor preset: enabled)
         Active: active (running) since Wed 2021-03-17 18:37:28 CET; 1min 4s ago
           Docs: man:slurmctld(8)
        Process: 2272 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
       Main PID: 2287 (slurmctld)
          Tasks: 7
         Memory: 2.3M
         CGroup: /system.slice/slurmctld.service
                 └─2287 /usr/sbin/slurmctld

    Mar 17 18:37:18 slurm systemd[1]: Starting Slurm controller daemon...
    Mar 17 18:37:18 slurm systemd[1]: slurmctld.service: Can't open PID file /run/slurm-lnll/slurmctld.pid (yet?) after start: Operation not permitted
    Mar 17 18:37:28 slurm systemd[1]: Started Slurm controller daemon.

But the slurmctld process doesn't crash anymore. Stopping the service does work:

    systemctl stop slurmctld.service
    systemctl status slurmctld
    ● slurmctld.service - Slurm controller daemon
         Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled; vendor preset: enabled)
         Active: inactive (dead) since Wed 2021-03-17 18:50:47 CET; 1s ago
           Docs: man:slurmctld(8)
        Process: 2272 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
       Main PID: 2287 (code=exited, status=0/SUCCESS)

    Mar 17 18:37:18 slurm systemd[1]: Starting Slurm controller daemon...
    Mar 17 18:37:18 slurm systemd[1]: slurmctld.service: Can't open PID file /run/slurm-lnll/slurmctld.pid (yet?) after start: Operation not permitted
    Mar 17 18:37:28 slurm systemd[1]: Started Slurm controller daemon.
    Mar 17 18:50:47 slurm systemd[1]: Stopping Slurm controller daemon...
    Mar 17 18:50:47 slurm systemd[1]: slurmctld.service: Succeeded.
    Mar 17 18:50:47 slurm systemd[1]: Stopped Slurm controller daemon.

I am a little astonished that slurmctld shows this strange behaviour when installed from the default package through the package manager. The base system is a plain Ubuntu 20.04 server installation, where I made no modifications apart from installing the SLURM-wlm packages and importing my existing configuration and munge.key.

Best wishes,
Sven Duscha

--
Sven Duscha
Deutsches Herzzentrum München
Technische Universität München
Lazarettstraße 36
80636 München