Hi, thanks for all the responses.
On 18.03.21 11:29, Stefan Staeglich wrote:
> I think it makes more sense to adjust the config file
> /etc/slurm-llnl/slurm.conf
> and not the systemd units:
> SlurmctldPidFile=/run/slurmctld.pid
> SlurmdPidFile=/run/slurmd.pid

That was of course my first approach. I had used the directory
/run/slurm-lnll/ on my CentOS 7 installations, from which I copied the
slurm.conf file over. It turned out that the directories I defined there
weren't used. The error message suggested that slurmctld still tried to
write to /run/slurmctld.pid.

Changing the systemd file was my last resort. And as mentioned, I don't
expect to have to do that much fiddling with a (relatively old, 19.05.5)
version from the package manager. It seems "snap" provides a more current
version, 20.02.1:

    snap install slurm         # version 20.02.1, or
    apt install slurm-client   # version 19.05.5-1

I also haven't modified the underlying distribution installation. I want
to use Ubuntu 20.04 as my future cluster OS, and the KVM-virtualized SLURM
controller was the first thing I tried.

Brian Andrus suggested:

On 17.03.21 21:32, Brian Andrus wrote:
> That is looking like your /run folder does not have world execute
> permissions, making it impossible for anything to access sub-directories.

But I can write as user "sven" (I haven't set up the LDAP connection yet)
in a subdirectory of /run/slurm-lnll, if it belongs to user "sven".

Furthermore, I used the option "SlurmUser=slurm" in my slurm.conf file,
because it is good practice not to use root. Changing this to "root",
which should give universal access to all directories, doesn't make a
difference:

    #SlurmUser=slurm
    SlurmdUser=root

My initial response, that /var/run/slurm-lnll/slurmctld.pid worked for me,
was also premature. It kind of works for the first start after a reboot
with

    systemctl start slurmctld

and

    systemctl stop slurmctld

works, but the start then lingers until the timeout. During that time
slurmctld still runs: I see the process and can use squeue, sinfo, etc.
After the PID file timeout, systemd shows the service as terminated. This
time the log does not complain about being unable to write the
slurmctld.pid file, but instead points at my change to the legacy location
/var/run, which itself is only a symlink to /run:

Mar 18 12:30:43 slurm systemd[1]: Reloading.
Mar 18 12:30:43 slurm systemd[1]: /lib/systemd/system/dbus.socket:5: ListenStream= references a path below legacy directory /var/run/, updating /var/>
Mar 18 12:30:43 slurm systemd[1]: /lib/systemd/system/slurmd.service:12: PIDFile= references a path below legacy directory /var/run/, updating /var/r>
Mar 18 12:31:59 slurm systemd[1]: slurmctld.service: start operation timed out. Terminating.
Mar 18 12:31:59 slurm systemd[1]: slurmctld.service: Failed with result 'timeout'.

    time systemctl start slurmctld
Job for slurmctld.service failed because a timeout was exceeded.
See "systemctl status slurmctld.service" and "journalctl -xe" for details.

real    1m1.314s
user    0m0.003s
sys     0m0.002s

-- A session with the ID 1 has been terminated.
Mar 18 12:30:43 slurm systemd[1]: Reloading.
Mar 18 12:30:43 slurm systemd[1]: /lib/systemd/system/dbus.socket:5: ListenStream= references a path below legacy directory /var/run/, updating /var/>
Mar 18 12:30:43 slurm systemd[1]: /lib/systemd/system/slurmd.service:12: PIDFile= references a path below legacy directory /var/run/, updating /var/r>
Mar 18 12:31:59 slurm systemd[1]: slurmctld.service: start operation timed out. Terminating.
Mar 18 12:31:59 slurm systemd[1]: slurmctld.service: Failed with result 'timeout'.

I initially put the "&" after the systemctl command because I wanted to
get my prompt back to investigate the problem. Normal behaviour, as I
expect it, would be a start-up time of 1-2 seconds.
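
With Type=forking, systemd waits for the file named in the unit's PIDFile= line, while slurmctld writes the file named by SlurmctldPidFile in slurm.conf. One thing worth ruling out for a start timeout like this is that the two paths differ. The following is only a minimal sketch of such a check. The function names are made up for illustration, and it assumes the key is spelled in slurm.conf exactly as passed in:

```shell
#!/bin/sh
# Hypothetical helpers (names invented for this sketch).

# Print the last PIDFile= value found in a systemd unit file.
unit_pidfile() {
    sed -n 's/^PIDFile=//p' "$1" | tail -n 1
}

# Print the last value of key $2 in slurm.conf-style file $1.
# Lines starting with '#' do not match and are therefore skipped.
conf_value() {
    sed -n "s/^[[:space:]]*$2[[:space:]]*=[[:space:]]*//p" "$1" | tail -n 1
}

# Map the legacy /var/run/ prefix to /run/ so both spellings compare equal.
normalize_run() {
    printf '%s\n' "$1" | sed 's|^/var/run/|/run/|'
}

# Compare the unit's PIDFile= with SlurmctldPidFile from slurm.conf.
# Returns 0 on match, 1 on mismatch, 2 if either value is missing.
compare_pidfiles() {
    unit_path=$(normalize_run "$(unit_pidfile "$1")")
    conf_path=$(normalize_run "$(conf_value "$2" SlurmctldPidFile)")
    if [ -z "$unit_path" ] || [ -z "$conf_path" ]; then
        echo "missing: unit=$unit_path conf=$conf_path"
        return 2
    fi
    if [ "$unit_path" = "$conf_path" ]; then
        echo "match: $unit_path"
    else
        echo "MISMATCH: unit=$unit_path conf=$conf_path"
        return 1
    fi
}
```

It could be called as `compare_pidfiles /lib/systemd/system/slurmctld.service /etc/slurm-llnl/slurm.conf`. The unit path here is an assumption, based on slurmd.service living under /lib/systemd/system in the log. A difference of a single character in the directory name is enough to produce a mismatch.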

I am back to my work-around:

    systemctl start slurmctld & sleep 10; echo `pgrep slurmctld` > /run/slurm-lnll/slurmctld.pid && chown slurm: /run/slurm-lnll/slurmctld.pid && cat /run/slurm-lnll/slurmctld.pid

My configuration file is read, though, as I can check with scontrol:

    scontrol show config | grep run
    SlurmdPidFile           = /var/run/slurm-llnl/slurmd.pid
    SlurmctldPidFile        = /var/run/slurm-llnl/slurmctld.pid

So none of this hassle should occur, and my fiddling with systemd should
be entirely unnecessary.

Mar 18 12:37:13 slurm systemd[1]: slurmctld.service: Can't open PID file /run/slurm-lnll/slurmctld.pid (yet?) after start: Operation not permitted
Mar 18 12:38:43 slurm systemd[1]: slurmctld.service: start operation timed out. Terminating.
Mar 18 12:38:43 slurm systemd[1]: slurmctld.service: Failed with result 'timeout'.

Unmodified systemd file:

[Unit]
Description=Slurm controller daemon
After=network.target munge.service
ConditionPathExists=/etc/slurm-llnl/slurm.conf
Documentation=man:slurmctld(8)

[Service]
Type=forking
EnvironmentFile=-/etc/default/slurmctld
ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/run/slurm-lnll/slurmctld.pid
LimitNOFILE=65536
TasksMax=infinity

[Install]
WantedBy=multi-user.target

I do know of some file permission issues that I encountered on CentOS 7,
but as far as I can tell from checking the permissions, it should work
with these permissions on the subdirectory:

    ls -lthrd /run/slurm-lnll/
    drwxrwxr-x 2 root slurm 40 Mar 18 12:31 /run/slurm-lnll/

But the following suggests that it ignores the setting in the slurm.conf
file:

    SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
    SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid

-- The job identifier is 2259.
Mar 18 12:41:34 slurm systemd[1]: slurmctld.service: Can't open PID file /run/slurm-lnll/slurmctld.pid (yet?) after start: Operation not permitted
Mar 18 12:43:04 slurm systemd[1]: slurmctld.service: start operation timed out. Terminating.
Mar 18 12:43:04 slurm systemd[1]: slurmctld.service: Failed with result 'timeout'.

Though scontrol show config claims otherwise:

    scontrol show config | grep run
    SlurmdPidFile           = /var/run/slurm-llnl/slurmd.pid
    SlurmctldPidFile        = /var/run/slurm-llnl/slurmctld.pid
    SrunEpilog              = (null)
    SrunPortRange           = 0-0
    SrunProlog              = (null)

I would put it down to my own mistake, but I started yesterday with a
"vanilla" installation of Ubuntu 20.04 server, and the only purpose of
this VM is to run slurmctld. Either this should be happening to many more
people, or I am missing something obvious.

If it were down to permissions, making the directory /run/slurm-lnll
world-writable:

    ls -lthrd /run/slurm-lnll/
    drwxrwxrwx 2 root slurm 40 Mar 18 12:31 /run/slurm-lnll/

should "fix" the problem. I could live with that, even though I try to
adhere to strict permission management. That doesn't work either:

Mar 18 12:46:33 slurm systemd[1]: slurmctld.service: Can't open PID file /run/slurm-lnll/slurmctld.pid (yet?) after start: Operation not permitted
Mar 18 12:46:38 slurm systemd[1]: Reloading.

So I am going round in circles here.

Best wishes,

Sven

-- 
Sven Duscha
Deutsches Herzzentrum München
Technische Universität München
Lazarettstraße 36
80636 München
+49 89 1218 2602