Hello David, the slurmd daemon is not running (while slurmctld and slurmdbd are).
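A minimal way to check and (re)start it might be something along these lines (a sketch, assuming slurmd was installed as a systemd service); running the daemon in the foreground with extra verbosity is usually the quickest way to see why it refuses to start:

    systemctl status slurmd              # is the daemon running, and why did it exit?
    sudo slurmd -D -vvv                  # run slurmd in the foreground with verbose debug output
    tail -f /var/log/slurm/slurmd.log    # node-side log; the path depends on SlurmdLogFile in slurm.conf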
slurmd.log (different from slurmctld.log) should contain more information.

Regards,
Pierre-Marie Le Biot

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of david vilanova
Sent: Thursday, November 30, 2017 9:32 AM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] slurm conf with single machine with multi cores.

Sorry for the delay, I was trying to fix it but it is still not working: the node is always down. The master machine is also the compute machine; it's a single server that I use for that, with 1 node and 12 CPUs.

In the log below I see this line:

[2017-11-30T09:24:41.764] agent/is_node_resp: node:linuxcluster RPC:REQUEST_NODE_REGISTRATION_STATUS : Communication connection failure

Here below is my slurm.conf file:

ControlMachine=linuxcluster
AuthType=auth/munge
CryptoType=crypto/munge
MailProg=/usr/bin/mail
MpiDefault=none
PluginDir=/usr/local/lib/slurm
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurm/d
SlurmUser=slurm
StateSaveLocation=/var/spool/slurm/ctld
SwitchType=switch/none
TaskPlugin=task/none
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
FastSchedule=1
SchedulerType=sched/backfill
AccountingStorageHost=linuxcluster
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageUser=slurm
AccountingStoreJobComment=YES
ClusterName=linuxcluster
JobCompType=jobcomp/none
JobCompUser=slurm
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/cgroup
SlurmctldDebug=5
SlurmctldLogFile=/var/log/slurm/slurmctrl.log
SlurmdDebug=5
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
NodeName=linuxcluster CPUs=12
PartitionName=testq Nodes=linuxcluster Default=YES MaxTime=INFINITE State=UP

slurmctrld.log:

[2017-11-30T09:24:28.025] debug: Log file re-opened
[2017-11-30T09:24:28.025] debug: sched: slurmctld starting
[2017-11-30T09:24:28.025] slurmctld version 17.11.0 started on cluster linuxcluster
[2017-11-30T09:24:28.026] Munge cryptographic signature plugin loaded
[2017-11-30T09:24:28.026] Consumable Resources (CR) Node Selection plugin loaded with argument 1
[2017-11-30T09:24:28.026] preempt/none loaded
[2017-11-30T09:24:28.026] debug: Checkpoint plugin loaded: checkpoint/none
[2017-11-30T09:24:28.026] debug: AcctGatherEnergy NONE plugin loaded
[2017-11-30T09:24:28.026] debug: AcctGatherProfile NONE plugin loaded
[2017-11-30T09:24:28.026] debug: AcctGatherInterconnect NONE plugin loaded
[2017-11-30T09:24:28.026] debug: AcctGatherFilesystem NONE plugin loaded
[2017-11-30T09:24:28.026] debug: Job accounting gather cgroup plugin loaded
[2017-11-30T09:24:28.026] ExtSensors NONE plugin loaded
[2017-11-30T09:24:28.026] debug: switch NONE plugin loaded
[2017-11-30T09:24:28.026] debug: power_save module disabled, SuspendTime < 0
[2017-11-30T09:24:28.026] debug: No backup controller to shutdown
[2017-11-30T09:24:28.026] Accounting storage SLURMDBD plugin loaded with AuthInfo=(null)
[2017-11-30T09:24:28.027] debug: Munge authentication plugin loaded
[2017-11-30T09:24:28.030] debug: slurmdbd: Sent PersistInit msg
[2017-11-30T09:24:28.030] slurmdbd: recovered 0 pending RPCs
[2017-11-30T09:24:28.429] debug: Reading slurm.conf file: /usr/local/etc/slurm.conf
[2017-11-30T09:24:28.430] layouts: no layout to initialize
[2017-11-30T09:24:28.430] topology NONE plugin loaded
[2017-11-30T09:24:28.430] debug: No DownNodes
[2017-11-30T09:24:28.435] debug: Log file re-opened
[2017-11-30T09:24:28.435] sched: Backfill scheduler plugin loaded
[2017-11-30T09:24:28.435] route default plugin loaded
[2017-11-30T09:24:28.435] layouts: loading entities/relations information
[2017-11-30T09:24:28.435] debug: layouts: 1/1 nodes in hash table, rc=0
[2017-11-30T09:24:28.435] debug: layouts: loading stage 1
[2017-11-30T09:24:28.435] debug: layouts: loading stage 1.1 (restore state)
[2017-11-30T09:24:28.435] debug: layouts: loading stage 2
[2017-11-30T09:24:28.435] debug: layouts: loading stage 3
[2017-11-30T09:24:28.435] Recovered state of 1 nodes
[2017-11-30T09:24:28.435] Down nodes: linuxcluster
[2017-11-30T09:24:28.435] Recovered JobID=15 State=0x4 NodeCnt=0 Assoc=6
[2017-11-30T09:24:28.435] Recovered information about 1 jobs
[2017-11-30T09:24:28.435] cons_res: select_p_node_init
[2017-11-30T09:24:28.436] cons_res: preparing for 1 partitions
[2017-11-30T09:24:28.436] debug: Updating partition uid access list
[2017-11-30T09:24:28.436] Recovered state of 0 reservations
[2017-11-30T09:24:28.436] State of 0 triggers recovered
[2017-11-30T09:24:28.436] _preserve_plugins: backup_controller not specified
[2017-11-30T09:24:28.436] cons_res: select_p_reconfigure
[2017-11-30T09:24:28.436] cons_res: select_p_node_init
[2017-11-30T09:24:28.436] cons_res: preparing for 1 partitions
[2017-11-30T09:24:28.436] Running as primary controller
[2017-11-30T09:24:28.436] debug: No BackupController, not launching heartbeat.
[2017-11-30T09:24:28.436] Registering slurmctld at port 6817 with slurmdbd.
[2017-11-30T09:24:28.677] debug: No feds to retrieve from state
[2017-11-30T09:24:28.757] debug: Priority BASIC plugin loaded
[2017-11-30T09:24:28.758] No parameter for mcs plugin, default values set
[2017-11-30T09:24:28.758] mcs: MCSParameters = (null). ondemand set.
[2017-11-30T09:24:28.758] debug: mcs none plugin loaded
[2017-11-30T09:24:28.758] debug: power_save mode not enabled
[2017-11-30T09:24:31.761] debug: Spawning registration agent for linuxcluster 1 hosts
[2017-11-30T09:24:41.764] agent/is_node_resp: node:linuxcluster RPC:REQUEST_NODE_REGISTRATION_STATUS : Communication connection failure
[2017-11-30T09:24:58.435] debug: backfill: beginning
[2017-11-30T09:24:58.435] debug: backfill: no jobs to backfill
[2017-11-30T09:25:28.435] debug: backfill: beginning
[2017-11-30T09:25:28.436] debug: backfill: no jobs to backfill
[2017-11-30T09:25:28.830] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2017-11-30T09:25:28.830] debug: sched: Running job scheduler
[2017-11-30T09:25:58.436] debug: backfill: beginning
[2017-11-30T09:25:58.436] debug: backfill: no jobs to backfill

ubuntu@linuxcluster:/home/dvi/$ ps -ef | grep slurm
slurm 11388     1  0 09:24 ?        00:00:00 /usr/local/sbin/slurmdbd
slurm 11430     1  0 09:24 ?        00:00:00 /usr/local/sbin/slurmctld

Any idea ?
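For reference, a quick way to see why the controller has marked the node DOWN, and to clear that state once slurmd is actually running, might be something like the sketch below (node name taken from the slurm.conf above; with ReturnToService=1 the node should normally come back on its own once slurmd registers):

    sinfo -N -l                                           # per-node view including STATE and REASON
    scontrol show node linuxcluster                       # the Reason= field explains the DOWN state
    scontrol update NodeName=linuxcluster State=RESUME    # manual return to service, if it stays down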
On Wed, 29 Nov 2017 at 18:21, Le Biot, Pierre-Marie <pierre-marie.leb...@hpe.com> wrote:

Hello David,

So linuxcluster is the Head node and also a Compute node? Is slurmd running? What does /var/log/slurm/slurmd.log say?

Regards,
Pierre-Marie Le Biot

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of david vilanova
Sent: Wednesday, November 29, 2017 4:33 PM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] slurm conf with single machine with multi cores.

Hi, I have updated the slurm.conf as follows:

SelectType=select/cons_res
SelectTypeParameters=CR_CPU
NodeName=linuxcluster CPUs=2
PartitionName=testq Nodes=linuxcluster Default=YES MaxTime=INFINITE State=UP

I still get the testq node in down status. Any idea? Below is the log from the db and controller:

==> /var/log/slurm/slurmctrl.log <==
[2017-11-29T16:28:30.446] slurmctld version 17.11.0 started on cluster linuxcluster
[2017-11-29T16:28:30.850] error: SelectType specified more than once, latest value used
[2017-11-29T16:28:30.851] layouts: no layout to initialize
[2017-11-29T16:28:30.855] layouts: loading entities/relations information
[2017-11-29T16:28:30.855] Recovered state of 1 nodes
[2017-11-29T16:28:30.855] Down nodes: linuxcluster
[2017-11-29T16:28:30.855] Recovered information about 0 jobs
[2017-11-29T16:28:30.855] cons_res: select_p_node_init
[2017-11-29T16:28:30.855] cons_res: preparing for 1 partitions
[2017-11-29T16:28:30.856] Recovered state of 0 reservations
[2017-11-29T16:28:30.856] _preserve_plugins: backup_controller not specified
[2017-11-29T16:28:30.856] cons_res: select_p_reconfigure
[2017-11-29T16:28:30.856] cons_res: select_p_node_init
[2017-11-29T16:28:30.856] cons_res: preparing for 1 partitions
[2017-11-29T16:28:30.856] Running as primary controller
[2017-11-29T16:28:30.856] Registering slurmctld at port 6817 with slurmdbd.
[2017-11-29T16:28:31.098] No parameter for mcs plugin, default values set
[2017-11-29T16:28:31.098] mcs: MCSParameters = (null). ondemand set.
[2017-11-29T16:29:31.169] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2

David

On Wed, 29 Nov 2017 at 15:59, Steffen Grunewald <steffen.grunew...@aei.mpg.de> wrote:

Hi David,

On Wed, 2017-11-29 at 14:45:06 +0000, david vilanova wrote:
> Hello,
> I have installed the latest 17.11 release and my node is shown as down.
> I have a single physical server with 12 cores so I am not sure the conf below is
> correct ?? Can you help ??
>
> In slurm.conf the node is configured as follows:
>
> NodeName=linuxcluster CPUs=1 RealMemory=991 Sockets=12 CoresPerSocket=1
> ThreadsPerCore=1 Feature=local

12 Sockets? Certainly not... 12 cores per socket, yes.
(IIRC CPUs shouldn't be specified if the detailed topology is given.
You may try CPUs=12 and drop the details.)

> PartitionName=testq Nodes=inuxcluster Default=YES MaxTime=INFINITE State=UP
                            ^^ typo?

Cheers,
 Steffen
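Following Steffen's suggestion, the node and partition lines might end up looking like the sketch below. This is only a sketch: the memory and topology values should be checked against what "slurmd -C" prints on the machine itself, and it assumes (as Steffen suspects) one socket with 12 cores rather than 12 sockets. Note also the missing "l" in Nodes=inuxcluster above.

    # simplest form: just tell Slurm the CPU count
    NodeName=linuxcluster CPUs=12 RealMemory=991
    # or describe the real topology instead and drop CPUs
    # NodeName=linuxcluster Sockets=1 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=991
    PartitionName=testq Nodes=linuxcluster Default=YES MaxTime=INFINITE State=UP

Either way, the node will stay DOWN until slurmd itself is running on it and registers with the controller.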