Hi Janna,
If you're running an old Slurm version, you may be hitting bugs that have
already been resolved in later versions. You can search for bugs with
ReqNodeNotAvail in the title:
https://bugs.schedmd.com/buglist.cgi?quicksearch=ReqNodeNotAvail
For example, this one might be relevant:
https://bugs.schedmd.com/show_bug.cgi?id=9257
Upgrading to Slurm 20.02 is highly recommended.
/Ole
On 7/12/20 3:36 PM, Ole Holm Nielsen wrote:
In case your ARP cache is the problem, there is some advice on the Wiki page:
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-arp-cache-for-large-networks
I think there are other possible causes of ReqNodeNotAvail, for example, the
node being allocated to other jobs. Running "scontrol show node" and
"scontrol show job" should reveal more details.
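As a sketch of what to look at (the node name and job ID below are placeholders; substitute your own):

```shell
# Inspect the node: check the State, Reason, and AllocTRES fields.
scontrol show node gpu001

# Inspect a pending job: check the Reason field and any ReqNodeList.
scontrol show job 123456

# Or show job ID, state, reason, and nodelist(reason) in one line.
squeue --job 123456 -o "%i %T %r %R"
```

If the job names specific nodes via ReqNodeList (e.g. from sbatch -w), ReqNodeNotAvail can appear even when other idle nodes exist.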
/Ole
On 11-07-2020 06:00, mercan wrote:
Hi Janna;
It sounds like an ARP cache table problem to me. If your Slurm head node can
reach ~1000 or more network devices (all connected network cards, switches,
etc., even if they are reachable through different ports of the server), you
need to increase some network settings on the head node and on any servers
that can reach the same number of network devices:
http://docs.adaptivecomputing.com/torque/5-0-3/Content/topics/torque/12-appendices/otherConsiderations.htm
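On Linux, the settings in question are the neighbor table garbage-collection thresholds. A minimal sysctl fragment, with illustrative values (tune gc_thresh3 to exceed the number of reachable devices on your network):

```
# /etc/sysctl.conf fragment -- illustrative ARP cache thresholds.
# gc_thresh1: below this many entries, no garbage collection runs.
# gc_thresh2: soft limit; entries above this are aggressively pruned.
# gc_thresh3: hard limit on the neighbor table size.
net.ipv4.neigh.default.gc_thresh1 = 2048
net.ipv4.neigh.default.gc_thresh2 = 4096
net.ipv4.neigh.default.gc_thresh3 = 8192
```

Apply with "sysctl -p" (no reboot needed) and verify with "sysctl net.ipv4.neigh.default".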
There is also some advice for big clusters in the Slurm documentation:
https://slurm.schedmd.com/big_sys.html
Regards,
Ahmet M.
On 11.07.2020 at 01:34, Janna Ore Nugent wrote:
Hi All,
I’ve got an intermittent situation with GPU nodes that sinfo says are
available and idle, but squeue reports as “ReqNodeNotAvail”. We’ve
cycled the nodes to restart services, but it hasn’t helped. Any
suggestions for resolving this or digging into it more deeply?