Hi Jeremy,

What is the value of TreeWidth in your slurm.conf? If there is no entry, then I recommend setting it to a value a bit larger than the number of nodes in your cluster and then restarting slurmctld.
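For example, with your 148 nodes, something along these lines should do it (the value below is just an illustration, not a tuned recommendation):

    # slurm.conf on the controller -- illustrative value for ~148 nodes
    TreeWidth=160

    # pick up the change (assuming a systemd-based install):
    systemctl restart slurmctld

TreeWidth sets the fanout of the communication tree slurmctld uses to reach the slurmds; setting it at or above your node count effectively has the controller talk to every node directly.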
Best,

Steve

On Wed, Feb 2, 2022 at 12:59 AM Jeremy Fix <jeremy....@centralesupelec.fr> wrote:
> Hi,
>
> A follow-up. I thought some of the nodes were OK, but that's not the case.
> This morning, another pool of consecutive compute nodes is idle* (why
> consecutive, by the way? they are always failing consecutively). And some
> of the nodes which were drained came back to life as idle and have now
> switched back to idle*.
>
> One thing I should mention is that the master is now handling a total of
> 148 nodes; it is the new pool of 100 nodes that has the cycling state.
> The previous 48 nodes that were already handled by this master are OK.
>
> I do not know if this should be considered a large system, but we tried
> to have a look at settings such as the ARP cache [1] on the slurm
> master. I'm not very familiar with that; it seems to me it enlarges the
> cache of the node names/IPs table. This morning, the master has 125
> lines in "arp -a" (before changing the settings in sysctl, it was
> like 20 or so). Do you think this setting is also necessary on the
> compute nodes?
>
> Best,
>
> Jeremy.
>
>
> [1]
> https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-arp-cache-for-large-networks

--
________________________________________________________________
 Steve Cousins             Supercomputer Engineer/Administrator
 Advanced Computing Group            University of Maine System
 244 Neville Hall (UMS Data Center)              (207) 581-3574
 Orono ME 04469                      steve.cousins at maine.edu
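For reference, the ARP cache tuning described in [1] comes down to raising the kernel's neighbor-table garbage-collection thresholds via sysctl. A sketch of the relevant knobs (the values below are illustrative for a network of a few hundred hosts, not tuned recommendations):

    # /etc/sysctl.d/99-neigh.conf -- illustrative thresholds, adjust to your network size
    net.ipv4.neigh.default.gc_thresh1 = 2048  # below this many entries, no garbage collection
    net.ipv4.neigh.default.gc_thresh2 = 4096  # soft limit; GC becomes aggressive above it
    net.ipv4.neigh.default.gc_thresh3 = 8192  # hard limit on neighbor-table entries

    # apply without rebooting:
    sysctl --system

Since the ARP cache is per host, a compute node only needs headroom for the peers it actually exchanges traffic with, which is typically far fewer than the controller sees.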