Please check your slurm.conf on the compute nodes; I suspect your compute node isn't defined properly in slurm.conf.
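For reference, a minimal sketch of the node and partition entries involved (assuming the node names node[01-08] from the sinfo output quoted below; the CPU count and partition settings are placeholders, not taken from this thread):

    # /etc/slurm-llnl/slurm.conf -- sketch, values are assumptions
    # Each NodeName must match the node's short hostname (`hostname -s`),
    # or slurmd fails with: "Unable to determine this slurmd's NodeName"
    NodeName=node[01-08] CPUs=4 State=UNKNOWN
    PartitionName=batch Nodes=node[01-08] Default=YES MaxTime=INFINITE State=UP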
On Jan 15, 2018 07:45, "John Hearns" <hear...@googlemail.com> wrote:

> That's it. I am calling it JohnH's Law:
> "Any problem with a batch queueing system is due to hostname resolution"
>
> On 15 January 2018 at 16:30, Elisabetta Falivene <e.faliv...@ilabroma.com> wrote:
>
>> slurmd -Dvvv says
>>
>> slurmd: fatal: Unable to determine this slurmd's NodeName
>>
>> b
>>
>> 2018-01-15 15:58 GMT+01:00 Douglas Jacobsen <dmjacob...@lbl.gov>:
>>
>>> The fact that sinfo is responding shows that at least slurmctld is
>>> running. slurmd, on the other hand, is not. Please also get the output
>>> of the slurmd log, or of running "slurmd -Dvvv".
>>>
>>> On Jan 15, 2018 06:42, "Elisabetta Falivene" <e.faliv...@ilabroma.com> wrote:
>>>
>>>> > Anyway I suggest updating the operating system to stretch and
>>>> > fixing your configuration under a more recent version of slurm.
>>>>
>>>> I think I'll soon get to that :)
>>>> b
>>>>
>>>> 2018-01-15 14:08 GMT+01:00 Gennaro Oliva <oliv...@na.icar.cnr.it>:
>>>>
>>>>> Hi Elisabetta,
>>>>>
>>>>> On Mon, Jan 15, 2018 at 01:13:27PM +0100, Elisabetta Falivene wrote:
>>>>> > The error messages are not helping me much in guessing what is
>>>>> > going on. What should I check to find out what is failing?
>>>>>
>>>>> Check slurmctld.log and slurmd.log; you can find them under
>>>>> /var/log/slurm-llnl
>>>>>
>>>>> > PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
>>>>> > batch     up    infinite      8  unk* node[01-08]
>>>>> >
>>>>> > Running
>>>>> >
>>>>> > systemctl status slurmctld.service
>>>>> >
>>>>> > returns
>>>>> >
>>>>> > slurmctld.service - Slurm controller daemon
>>>>> >    Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled)
>>>>> >    Active: failed (Result: timeout) since Mon 2018-01-15 13:03:39 CET; 41s ago
>>>>> >   Process: 2098 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
>>>>> >
>>>>> > slurmctld[2100]: cons_res: select_p_reconfigure
>>>>> > slurmctld[2100]: cons_res: select_p_node_init
>>>>> > slurmctld[2100]: cons_res: preparing for 1 partitions
>>>>> > slurmctld[2100]: Running as primary controller
>>>>> > slurmctld[2100]: SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=4,partition_job_depth=0
>>>>> > slurmctld.service start operation timed out. Terminating.
>>>>> > Terminate signal (SIGINT or SIGTERM) received
>>>>> > slurmctld[2100]: Saving all slurm state
>>>>> > Failed to start Slurm controller daemon.
>>>>> > Unit slurmctld.service entered failed state.
>>>>>
>>>>> Do you have a backup controller?
>>>>> Check your slurm.conf under:
>>>>> /etc/slurm-llnl
>>>>>
>>>>> Anyway, I suggest updating the operating system to stretch and fixing
>>>>> your configuration under a more recent version of slurm.
>>>>> Best regards
>>>>> --
>>>>> Gennaro Oliva
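A quick way to test the hostname/NodeName mismatch theory above (a sketch; the paths assume the Debian slurm-llnl packaging used in this thread, and node01 is a placeholder name):

    # On the failing compute node:
    hostname -s                                      # the name slurmd will look up
    grep -i 'NodeName' /etc/slurm-llnl/slurm.conf    # the names the config defines

    # If they differ, fix slurm.conf (the same file must be on every node),
    # or start slurmd with an explicit node name to confirm the diagnosis:
    slurmd -Dvvv -N node01

    # Then watch the daemon log:
    tail -f /var/log/slurm-llnl/slurmd.log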