Hi,
I managed to isolate the problem: the UID of the slurm user was not
the same on all machines in the network.
A simple job now runs without problems.
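
For anyone who hits the same symptom, here is a minimal sketch of the
kind of check that would have caught it. It is not the exact procedure
I used. It assumes passwordless ssh to every host and that the account
is called "slurm"; the function names are made up for illustration.

```python
import subprocess
from collections import defaultdict


def remote_uid(host, user="slurm"):
    """Return the numeric UID of `user` on `host` by running `id -u` over ssh.

    Assumption: passwordless ssh to `host` works for the calling user.
    """
    result = subprocess.run(
        ["ssh", host, "id", "-u", user],
        capture_output=True, text=True, check=True,
    )
    return int(result.stdout.strip())


def collect_uids(hosts, user="slurm"):
    """Map each host name to the UID it reports for `user`."""
    return {host: remote_uid(host, user) for host in hosts}


def find_uid_mismatches(uids_by_host):
    """Group hosts by reported UID.

    Returns an empty dict when every host agrees, otherwise a dict
    mapping each distinct UID to the sorted list of hosts reporting it.
    """
    hosts_by_uid = defaultdict(list)
    for host, uid in uids_by_host.items():
        hosts_by_uid[uid].append(host)
    if len(hosts_by_uid) <= 1:
        return {}
    return {uid: sorted(hosts) for uid, hosts in hosts_by_uid.items()}
```

Calling find_uid_mismatches(collect_uids(["server", "lxclient10"]))
returns an empty dict when the UIDs agree ("server" is a placeholder
host name here). Any non-empty result lists which hosts report which
UID.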
Regards,
P
On 2019-07-05 11:39, Pär Lundö wrote:
Hi,
I am running Slurm 19.05 on Ubuntu 18.05 (controller and server) and
18.10 (nodes).
My problem is that I cannot get the nodes to change their state from
"DOWN*" (the "*" indicating that communication is lost) to UP or IDLE.
I can ping both the node's hostname and its IP address. I have added
the IP address of the node (with only one node running) to the
"NodeAddr" field in the "slurm.conf" file as follows:
"NodeName=lxclient10 NodeAddr=192.168.1.10 "... as stated by the
configurator tool.
When I run "scontrol show node", the stated "REASON" is "Node
unexpectedly rebooted".
However, after running "scontrol update NodeName=lxclient10
State=RESUME" the state changes to IDLE. Happy with that, I submit a
job. The job is queued, but it is shown as "PD" (pending) with the
reason "Nodes required for job are DOWN, DRAINED or reserved for jobs
in higher priority partitions", and the node is shown as
"IDLE*+COMPLETING" (according to the "scontrol show node" command).
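
As a reading aid for state strings like the one above, here is a small
sketch that splits one into its parts. It assumes only what is
described in this message: "*" marks a node that is not responding,
and "+" separates the base state from additional flags. Other suffix
characters that Slurm may print are not handled, and the function name
is made up for illustration.

```python
def parse_node_state(state):
    """Split a node state string such as "IDLE*+COMPLETING".

    Returns the base state, any extra "+"-separated flags, and whether
    the "*" (not responding) marker was present anywhere in the string.
    """
    not_responding = "*" in state
    parts = state.replace("*", "").split("+")
    return {
        "base": parts[0],
        "flags": parts[1:],
        "not_responding": not_responding,
    }
```

For "IDLE*+COMPLETING" this gives base "IDLE", flags ["COMPLETING"]
and not_responding True, which matches the picture of a node that was
resumed by hand but still cannot be reached.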
After a while, when I run "squeue" to check what is happening, the
job's state is "CG" ("Completing").
Running "scontrol show node" at the same time, I can see that CPULoad
is small or 0 and that no CPUs are allocated ("CPUAlloc=0").
My network is a gigabit network and no firewalls are active. The node
can ping the server and the server can ping the node (by both IP
address and hostname).
Any thoughts on why this is happening?
Best regards,
P
--
Regards, Pär