Hi,

I managed to isolate the problem: the slurm UID was not the same across the network.
Now a (simple) job runs without problems.
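
For anyone hitting the same symptom, here is a minimal sketch of how the UID mismatch could be checked. It assumes the account is named "slurm", that passwordless ssh to each host works, and that the host names in the usage comment are placeholders; the function names are made up for this example and are not part of Slurm.

```shell
#!/bin/sh
# Sketch: compare the numeric UID of the "slurm" account across hosts.
# Assumptions: account name "slurm", passwordless ssh, placeholder host names.

# Print "<host> <uid>" for every host given as an argument.
collect_slurm_uids() {
    for host in "$@"; do
        # `id -u slurm` prints the numeric UID of the slurm account;
        # -n keeps ssh from reading our stdin.
        uid=$(ssh -n "$host" id -u slurm 2>/dev/null) || uid="missing"
        printf '%s %s\n' "$host" "$uid"
    done
}

# Read "<host> <uid>" lines on stdin and report every host whose UID
# differs from the first one. Exit status is non-zero on any mismatch.
uids_consistent() {
    awk '
        NR == 1     { first = $2 }
        $2 != first { bad = 1
                      printf "mismatch: %s has %s, expected %s\n", $1, $2, first }
        END         { exit (bad ? 1 : 0) }
    '
}

# Usage (hypothetical host names):
#   collect_slurm_uids controller lxclient10 | uids_consistent
```

If the check reports a mismatch, the slurm account needs the same UID on the controller and on every node before the daemons are restarted.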

Regards,

P

On 2019-07-05 11:39, Pär Lundö wrote:
Hi,

I am running Slurm 19.05 on Ubuntu 18.05 (controller and server) and 18.10 (nodes).

My problem is that I cannot get the nodes to change their state from "DOWN*" to UP or IDLE (the "*" indicates that communication has been lost).

I can ping both the node's name (its hostname) and its IP address. I have added the IP address of the node (with only one node running) to the "NodeAddr" field in the "slurm.conf" file as follows: "NodeName=lxclient10 NodeAddr=192.168.1.10 "... as stated by the configurator tool.

Running "scontrol show node" the stated "REASON" is "Node unexpectedly rebooted".

However, after running "scontrol update NodeName=lxclient10 State=RESUME", the state changes to IDLE. Happy with that, I submit a job. The job is queued, but it is listed as "PD" (pending) with the reason "Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions", and the node is listed as "IDLE*+COMPLETING" (as shown by the "scontrol show node" command).

After a while, running "squeue" to check what is happening shows the job's state as "CG" ("Completing").

Simultaneously running "scontrol show node", I can see that the CPULoad is small or 0, and that no CPUs are allocated ("CPUAlloc=0").
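
The fields mentioned above can be pulled out of the "scontrol show node" output with a small filter. This is only a sketch: it assumes the output consists of whitespace-separated Key=Value fields with the reason on its own "Reason=" line, and the function name is made up for this example.

```shell
#!/bin/sh
# Sketch: print only the State, CPUAlloc and Reason fields from
# `scontrol show node` output read on stdin.
# Assumes whitespace-separated Key=Value fields and a separate Reason= line.
node_summary() {
    awk '
        {
            # Match whole fields only, so e.g. "NextState=" is not picked up.
            for (i = 1; i <= NF; i++)
                if ($i ~ /^(State|CPUAlloc)=/) print $i
        }
        /^[ \t]*Reason=/ {
            # The reason text contains spaces, so print the whole line.
            sub(/^[ \t]*/, "")
            print
        }
    '
}

# Usage:
#   scontrol show node lxclient10 | node_summary
```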

My network is a gigabit network and no firewalls are active. The node can ping the server and the server can ping the node (both by IP and by hostname).

Any thoughts on why this is happening?

Best regards,

P


--
Regards, Pär
________________________________
Pär Lundö
Researcher
Division for Command and Control Systems

FOI
Totalförsvarets forskningsinstitut
164 90 Stockholm

Visiting address:
Olau Magnus väg 33, Linköping


Tel: +46 13 37 86 01
Mob: +46 734 447 815
Vxl: +46 13 37 80 00
par.lu...@foi.se
www.foi.se

