Hi,

I am running Slurm 19.05 on Ubuntu 18.04 (controller and server) and 18.10 (nodes).

My problem is that I cannot get the nodes to change their state from "DOWN*" to UP or IDLE (the "*" indicating that communication has been lost).

I can ping both the node's hostname and its IP address. With only one node running, I have added the node's IP address to the "NodeAddr" field in the "slurm.conf" file as follows: "NodeName=lxclient10 NodeAddr=192.168.1.10 "... as suggested by the configurator tool.

When I run "scontrol show node", the stated "REASON" is "Node unexpectedly rebooted".

However, after running "scontrol update NodeName=lxclient10 State=RESUME", the state changes to IDLE. Happy with that, I submit a job. The job is queued, but it is listed as "PD" (pending) with the reason "Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions", and the node is listed as "IDLE*+COMPLETING" (as shown by "scontrol show node").
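
To make the "IDLE*+COMPLETING" string easier to read, here is a minimal Python sketch of how such a state string can be taken apart. It assumes only what is visible above: the base state comes first, extra flags are joined with "+", and a "*" marks a node that is not responding. The function name split_node_state is made up for illustration and is not part of Slurm.

```python
def split_node_state(state):
    """Split a node state string such as 'IDLE*+COMPLETING'.

    Returns a tuple (base_state, flags, not_responding).
    Assumption: '+' separates the base state from extra flags and
    '*' marks a node that is not responding, as in the output quoted above.
    """
    # A '*' anywhere in the string means the node is not responding.
    not_responding = "*" in state
    # Drop the '*' marker from each token and ignore empty tokens.
    tokens = [t.strip("*") for t in state.split("+") if t.strip("*")]
    if not tokens:
        raise ValueError("empty node state string")
    base_state, flags = tokens[0], tokens[1:]
    return base_state, flags, not_responding
```

For the state shown above, split_node_state("IDLE*+COMPLETING") gives the base state IDLE, the extra flag COMPLETING, and not_responding set to True, so the node is still flagged as not responding even after the RESUME.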

After a while, running "squeue" to check what is happening shows the job's state as "CG" ("Completing").

At the same time, "scontrol show node" shows that the CPULoad is small or 0 and that no CPUs are allocated ("CPUAlloc=0").

My network is a gigabit network and no firewalls are active. The node can ping the server and the server can ping the node (by both IP address and hostname).
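
Ping only exercises ICMP, so a stricter reachability test would be a TCP connection to the Slurm daemons themselves. The following is a minimal Python sketch of such a check. It assumes the default ports (6817 for slurmctld, 6818 for slurmd), which may not match the SlurmctldPort and SlurmdPort values in this cluster's slurm.conf. The helper name tcp_port_open is made up for illustration.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        # create_connection resolves the host name and connects in one step.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Example calls (assumed default ports; adjust to match slurm.conf):
#   tcp_port_open("lxclient10", 6818)    # from the controller to slurmd
#   tcp_port_open("192.168.1.10", 6818)  # same check by IP address
```

Run from the controller against the node and from the node against the controller (port 6817 under the same assumption), this would show whether the daemons are reachable over TCP in both directions, which ping alone does not.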

Any thoughts on why this is happening?

Best regards,

P

