Re: [slurm-users] enable_configless, srun and DNS vs. hosts file

Mark Dixon Tue, 16 Nov 2021 06:02:38 -0800

Hi Paul,

Thanks for the thought, but no: we'd restarted all slurmctld, slurmdbd and slurmd daemons since changing any of the Slurm config files.

I have a very cut-down slurm.conf on the non-slurmctld nodes, which seems to be consulted when running srun (regardless of whether slurmd is running or not).

Removing the simplified NodeName lines from the cut-down slurm.conf causes srun to immediately return to the "can't find address for host" behaviour I outlined at the start. I've seen this both on clients running slurmd and on those that don't.

The cut-down slurm.conf is slowly growing: I've found that I also need to add GresTypes, otherwise srun/sbatch don't know what users can put in their "--gres" flag and so reject it. I guess at least that makes sense - the tools need to get that information from somewhere.
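
As a rough illustration only, a cut-down slurm.conf of the kind described above might look something like the sketch below. The cluster name, controller host name and GRES type are placeholders assumed for the example, not values from this thread; only the NodeName range is taken from the original post.

```
# Sketch of a cut-down slurm.conf for submit/compute hosts (assumed values).
ClusterName=mycluster        # placeholder cluster name
SlurmctldHost=head01         # placeholder controller host name

# Simplified node list, no per-node detail: lets srun map node names.
NodeName=cn[001-999]

# Needed so srun/sbatch accept "--gres=..." requests; "gpu" is an assumption.
GresTypes=gpu
```

Whether a file this small is sufficient will depend on the site; the full configuration still lives with slurmctld.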


Interesting!

Best,

Mark

On Fri, 12 Nov 2021, Paul Brunk wrote:

Hi:

We run configless. If we add a node to slurm.conf and don't restart slurmd on our submit nodes, then attempts to submit to that new node will get the error you saw. Restarting slurmd on the submit node fixes it. This is the documented behavior (adding nodes needs slurmd restarted everywhere). Could this be what you're seeing (as opposed to /etc/hosts vs DNS)?


--
Wishing that I'd just listened this time,
Paul Brunk, system administrator, Workstation Support Group
GACRC (formerly RCC)
UGA EITS  (formerly UCNS)


-----Original Message-----
From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Mark Dixon
Sent: Wednesday, November 10, 2021 10:14
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] enable_configless, srun and DNS vs. hosts file


Hi,

I'm using the "enable_configless" mode to avoid the need for a shared slurm.conf file, 
and am having similar trouble to others when running "srun", e.g.

  srun: error: fwd_tree_thread: can't find address for host cn120, check slurm.conf
  srun: error: Task launch for StepId=113.0 failed on node cn120: Can't find an address, check slurm.conf
  srun: error: Application launch failed: Can't find an address, check slurm.conf
  srun: Job step aborted: Waiting up to 32 seconds for job step to finish.

I understand that the accepted solution is to add the nodenames to DNS. Is that 
really correct?

I ask because it would be a great help if slurm instead used the more usual 
mechanism and consulted the sources listed in /etc/nsswitch.conf. We use a large 
/etc/hosts file instead of DNS for our cluster and would rather not start 
running named if we can help it.
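
One way to see what the system resolver returns for a node name, independently of Slurm, is a small Python sketch like the one below. Python's socket.getaddrinfo calls the C library resolver, which on glibc systems consults the sources listed in /etc/nsswitch.conf (so /etc/hosts entries count). The helper name "resolve" is made up for this example, and the result says nothing about how srun itself looks up addresses.

```python
import socket
import sys


def resolve(host):
    """Return the sorted, de-duplicated addresses the system resolver gives for host.

    An empty list means the lookup failed.
    """
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # Pass node names on the command line, e.g. "cn120"; defaults to localhost.
    for name in sys.argv[1:] or ["localhost"]:
        addrs = resolve(name)
        print(f"{name}: {', '.join(addrs) if addrs else 'NOT RESOLVED'}")
```

If a name such as cn120 resolves here on the submit host but srun still reports "can't find address", that would suggest the problem is not in the hosts file itself.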

Thanks,

Mark

PS Adding a line like "NodeName=cn[001-999]" to the submit/compute host
   slurm.conf file makes this go away (I hope skipping the node detail, or
   adding nodes that don't exist [yet] won't cause other problems).
