Re: [slurm-users] How to fix “slurmd.service: Can't open PID file” error

2019-06-18 Thread mercan
Hi; As the noki user, could you try to read the /var/run/slurm-llnl/slurmd.pid and /var/run/slurm-llnl/slurmctld.pid files? Are these files present, readable, and writable? Maybe the parent directories lack read/execute permission. Regards; Ahmet M. On 19.06.2019 07:26, Noki
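A minimal sketch of the check suggested above, assuming a POSIX shell. The function name check_pid_file is hypothetical (it is not part of Slurm); it only reports whether the calling user can see, read, and write a file, and whether each parent directory is searchable.

```shell
# check_pid_file PATH
# Hypothetical helper (not a Slurm tool): reports whether the current user
# can see, read and write PATH, and whether every parent directory has the
# execute (search) bit for that user.
check_pid_file() {
    f="$1"
    rc=0
    # -e also fails when a parent directory cannot be traversed.
    if [ ! -e "$f" ]; then
        echo "missing or unreachable: $f"
        return 1
    fi
    [ -r "$f" ] || { echo "not readable: $f"; rc=1; }
    [ -w "$f" ] || { echo "not writable: $f"; rc=1; }
    # Walk up the directory chain and test the search permission.
    d=$(dirname "$f")
    while [ "$d" != "/" ] && [ "$d" != "." ]; do
        [ -x "$d" ] || { echo "no execute permission on directory: $d"; rc=1; }
        d=$(dirname "$d")
    done
    [ "$rc" -eq 0 ] && echo "ok: $f"
    return "$rc"
}
```

To follow the suggestion, define the function in a shell running as the noki user and call it once with /var/run/slurm-llnl/slurmd.pid and once with /var/run/slurm-llnl/slurmctld.pid.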

Re: [slurm-users] status of cloud nodes

2019-06-18 Thread nathan norton
Hi, It just shows "Node $NODE not found", whereas the others all work as expected (i.e., they are running). Without knowing the internals of Slurm, it feels like nodes that are powered off and in the cloud state don't exist in the system until they are on. Any other ideas? Thanks, Nathan On Wed., 19 Jun. 2019, 4:

Re: [slurm-users] status of cloud nodes

2019-06-18 Thread Chris Samuel
On Tuesday, 18 June 2019 9:36:56 PM PDT nathan norton wrote: > Just tried running that command, but it only shows nodes that are up and > running, doesn’t tell me about any nodes that are down and turned off, as > an example please see below. There is a job running that should be using > the 100 n

Re: [slurm-users] status of cloud nodes

2019-06-18 Thread nathan norton
Hi, I just tried running that command, but it only shows nodes that are up and running; it doesn’t tell me about any nodes that are down and turned off. As an example, please see below. There is a job running that should be using the 100 nodes, but only 52 are allocated (plus 2 down* (that I know about

Re: [slurm-users] How to fix “slurmd.service: Can't open PID file” error

2019-06-18 Thread mercan
Hi; Sorry, as you can see, I made a mistake again. I wrote two different directories: "The owner of the /var/run/slurm-llnl directory and the slurmctld.pid and slurmd.pid files should be "noki" user. chown -R noki:root /var/spool/slurm-llnl" You should run: chown -R noki:root /var/run/slurm

Re: [slurm-users] How to fix “slurmd.service: Can't open PID file” error

2019-06-18 Thread Noki Lee
Hi, slurm-users and mercan. I tried what you said. noki@noki-System-Product-Name:~$ sudo chown -R noki:root /var/spool/slurm-llnl/ noki@noki-System-Product-Name:/var/spool/slurm-llnl$ ls -l total 92 -rw--- 1 noki root 198 Jun 19 11:36 assoc_mgr_state -rw--- 1 noki root 198 Jun 18 20:31 ass

[slurm-users] Manage access to specialized nodes: Reservation, Queue, or Features

2019-06-18 Thread E.M. Dragowsky
Greetings -- We're running Slurm 17.02.2. - We have implemented OnDemand in our cluster, including the Jupyter app across all the compute nodes. The Interactive Desktop application, however, is installed on a small set of compute nodes during an extended validation period. Installatio

Re: [slurm-users] How to fix “slurmd.service: Can't open PID file” error

2019-06-18 Thread mercan
Hi; I did not notice the SlurmUser=noki line. The owner of the /var/run/slurm-llnl directory and the slurmctld.pid and slurmd.pid files should be "noki" user. chown -R noki:root /var/spool/slurm-llnl Regards; Ahmet M. On 18.06.2019 15:15, mercan wrote: Hi; The owner of the /var/run/slurm-l

Re: [slurm-users] How to fix “slurmd.service: Can't open PID file” error

2019-06-18 Thread mercan
Hi; The owner of the /var/run/slurm-llnl directory and the slurmctld.pid and slurmd.pid files should be "slurm" user. Your files are owned by root and noki. chown -R slurm:slurm /var/spool/slurm-llnl Regards; Ahmet M. On 18.06.2019 15:03, Noki Lee wrote: Though SLURM works fine for job su

[slurm-users] How to fix “slurmd.service: Can't open PID file” error

2019-06-18 Thread Noki Lee
Though SLURM works fine for job submission, running, and queueing, I get the minor error below. sudo systemctl status slurmd Jun 12 10:20:40 noki-System-Product-Name systemd[1]: slurmd.service: Can't open PID file /var/run/slurm-llnl/slurmd.pid (yet?) after start: No such file or directory sudo sy

Re: [slurm-users] status of cloud nodes

2019-06-18 Thread Sam Gallop (NBI)
Hi Nathan, The command I use to get the reason for failed nodes is ... 'sinfo -Ral'. If you need to extend the width of the output then ... 'sinfo -Ral -O reason:35,user,timestamp,statelong,nodelist'. Using the timestamp of the failure, look in the slurmd or slurmctld logs. --- Sam Gallop

[slurm-users] status of cloud nodes

2019-06-18 Thread nathan norton
Hi all, I am using Slurm with a cloud provider, and it is all working a treat. Let's say I have 100 nodes, all working fine and able to be scheduled. $ srun -N100 hostname works fine. For some unknown reason, after machines shut down, for example over the weekend, if no jobs g