We do have hyperthreading enabled. Here are some log extracts from various attempts to get it working.
[2017-11-28T15:52:30.466] error: we don't have select plugin type 101
[2017-11-28T15:52:30.466] error: select_g_select_jobinfo_unpack: unpack error
[2017-11-28T15:52:30.466] error: Malformed RPC of type REQUEST_ABORT_JOB(6013) received
[2017-11-28T15:52:30.466] error: slurm_receive_msg_and_forward: Header lengths are longer than data received
[2017-11-28T15:52:30.476] error: service_connection: slurm_receive_msg: Header lengths are longer than data received
[2017-11-28T15:58:55.683] error: we don't have select plugin type 101
[2017-11-28T15:58:55.683] error: select_g_select_jobinfo_unpack: unpack error
[2017-11-28T16:02:21.490] error: we don't have select plugin type 101
[2017-11-28T16:02:21.490] error: select_g_select_jobinfo_unpack: unpack error
[2017-11-28T16:02:21.490] error: Malformed RPC of type REQUEST_TERMINATE_JOB(6011) received
[2017-11-28T16:02:21.490] error: slurm_receive_msg_and_forward: Header lengths are longer than data received
[2017-11-28T16:02:21.491] error: service_connection: slurm_receive_msg: Header lengths are longer than data received
[2017-11-28T16:02:21.491] error: service_connection: slurm_receive_msg: Header lengths are longer than data received
[2017-11-28T16:02:21.492] error: service_connection: slurm_receive_msg: Header lengths are longer than data received
[2017-11-28T16:02:21.492] error: service_connection: slurm_receive_msg: Header lengths are longer than data received
[2017-11-28T16:02:21.493] error: service_connection: slurm_receive_msg: Header lengths are longer than data received
[2017-11-28T16:02:21.496] error: service_connection: slurm_receive_msg: Header lengths are longer than data received
[2017-11-28T16:02:21.496] error: service_connection: slurm_receive_msg: Header lengths are longer than data received
[2017-11-28T16:02:21.498] error: service_connection: slurm_receive_msg: Header lengths are longer than data received
[2017-11-28T16:02:21.498] error: service_connection: slurm_receive_msg: Header lengths are longer than data received
[2017-11-28T16:02:21.498] error: service_connection: slurm_receive_msg: Header lengths are longer than data received
[2017-11-28T16:02:21.498] error: service_connection: slurm_receive_msg: Header lengths are longer than data received
[2017-11-28T16:02:21.498] error: service_connection: slurm_receive_msg: Header lengths are longer than data received
[2017-11-28T16:02:21.500] error: service_connection: slurm_receive_msg: Header lengths are longer than data received
[2017-11-28T16:02:21.500] error: service_connection: slurm_receive_msg: Header lengths are longer than data received
[2017-11-28T16:02:21.500] error: service_connection: slurm_receive_msg: Header lengths are longer than data received
[2017-11-28T16:02:21.500] error: service_connection: slurm_receive_msg: Header lengths are longer than data received
[2017-11-28T16:02:21.500] error: service_connection: slurm_receive_msg: Header lengths are longer than data received
[2017-11-28T16:02:21.500] error: service_connection: slurm_receive_msg: Header lengths are longer than data received
[2017-11-28T16:03:21.535] error: we don't have select plugin type 101

At one point, using linear (I think), I was able to get 4 jobs to run at once on this node. We have 40 CPUs.
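Side note on those "we don't have select plugin type 101" / unpack errors: as written, they mean the receiving daemon was handed data packed for a select plugin it did not have loaded, which suggests the controller and the slurmd on this node were not running with the same SelectType at that moment. A minimal sanity check, a sketch assuming a standard install with scontrol, slurmd and sinfo on the PATH, would be:

  # what the running controller reports for the select plugin settings
  scontrol show config | grep -i select

  # what the node hardware actually looks like (run on the compute node itself);
  # the Sockets/CoresPerSocket/ThreadsPerCore values it prints should match slurm.conf
  slurmd -C

  # what slurmctld currently believes about the node (CPUs, S:C:T, state)
  sinfo -N -l

If the topology that slurmd -C reports and the node definition in slurm.conf disagree, that is worth fixing before retrying cons_res.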
[2017-11-28T16:37:36.023] _run_prolog: run job script took usec=4
[2017-11-28T16:37:36.023] _run_prolog: prolog with lock for job 6637 ran for 0 seconds
[2017-11-28T16:37:36.023] Launching batch job 6637 for UID 1000
[2017-11-28T16:37:36.024] _run_prolog: run job script took usec=4
[2017-11-28T16:37:36.024] _run_prolog: prolog with lock for job 6638 ran for 0 seconds
[2017-11-28T16:37:36.024] _run_prolog: run job script took usec=5
[2017-11-28T16:37:36.024] _run_prolog: prolog with lock for job 6639 ran for 0 seconds
[2017-11-28T16:37:36.025] _run_prolog: run job script took usec=4
[2017-11-28T16:37:36.025] _run_prolog: prolog with lock for job 6640 ran for 0 seconds
[2017-11-28T16:37:36.030] Launching batch job 6640 for UID 1000
[2017-11-28T16:37:36.037] Launching batch job 6638 for UID 1000
[2017-11-28T16:37:36.044] Launching batch job 6639 for UID 1000
[2017-11-28T16:38:18.011] [6639] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 59648
[2017-11-28T16:38:18.011] [6638] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 59648
[2017-11-28T16:38:18.011] [6640] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 59648
[2017-11-28T16:38:18.012] [6637] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 59648
[2017-11-28T16:38:18.015] [6640] done with job
[2017-11-28T16:38:18.015] [6639] done with job
[2017-11-28T16:38:18.015] [6638] done with job
[2017-11-28T16:38:18.015] [6637] done with job

Ethan VanMatre
Informatics Research Analyst
Institute on Development and Disability
Oregon Health & Science University
CSLU - GH40
3181 SW Sam Jackson Park Rd
Portland, OR 97239
(503) 346-3764
vanma...@ohsu.edu

________________________________
From: slurm-users [slurm-users-boun...@lists.schedmd.com] on behalf of Williams, Jenny Avis [jen...@email.unc.edu]
Sent: Tuesday, November 28, 2017 5:45 PM
To: Slurm User Community List
Subject: Re: [slurm-users] fail when trying to set up selection=con_res

We run in that manner using this config on 3.10.0-693.5.2.el7.x86_64. This is Slurm 17.02.4.

Do your compute nodes have hyperthreading enabled?

AuthType=auth/munge
CryptoType=crypto/munge
AccountingStorageEnforce=limits,qos,safe
AccountingStoragePort=ANumber
AccountingStorageType=accounting_storage/slurmdbd
AccountingStoreJobComment='yes'
AccountingStorageUser=slurm
CacheGroups=0
EnforcePartLimits='yes'
FastSchedule=1
GresTypes=gpu
InactiveLimit=0
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/cgroup
KillWait=30
Licenses=mplus:1
MaxArraySize=40001
MaxJobCount=350000
MinJobAge=300
MpiDefault=none
PriorityDecayHalfLife=14-0
PriorityFavorSmall='no'
PriorityFlags=fair_tree
PriorityMaxAge=60-0
PriorityType=priority/multifactor
PriorityWeightAge=1000
PriorityWeightFairshare=10000
PriorityWeightJobSize=1000
PriorityWeightPartition=1000
PriorityWeightQOS=1000
ProctrackType=proctrack/cgroup
RebootProgram=/usr/sbin/reboot
ReturnToService=2
SallocDefaultCommand='"srun -n1 -N1 --gres=gpu:0 --mem-per-cpu=0 --pty --preserve-env --mpi=none $SHELL"'
SchedulerPort=ANumber
SchedulerParameters=kill_invalid_depend
SchedulerType=sched/backfill
SelectTypeParameters=CR_CPU_Memory
SelectType=select/cons_res
SlurmctldDebug=3
SlurmctldPort=NumberRange
SlurmctldTimeout=120
SlurmdDebug=3
SlurmdPort=ANumber
SlurmdTimeout=300
SlurmUser=slurm
SwitchType=switch/none
TaskPlugin=task/cgroup
Waittime=0

"ANumber" stands in for actual port numbers or ranges.
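For comparison with the config above, the node and partition lines for a 40-CPU hyperthreaded box would normally spell out the topology rather than a flat CPU count. The lines below are only a sketch: the node name, memory and socket/core/thread counts are made up and would need to match whatever slurmd -C reports on the real hardware.

  NodeName=compute01 Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=192000 State=UNKNOWN
  PartitionName=batch Nodes=compute01 Default=YES MaxTime=INFINITE State=UP

Roughly speaking, with SelectTypeParameters=CR_Core (or CR_Core_Memory) jobs are handed whole cores, so a node like this runs at most 20 single-core jobs at once, while CR_CPU lets the 40 logical CPUs be scheduled individually.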
From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of Ethan Van Matre
Sent: Tuesday, November 28, 2017 7:32 PM
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] fail when trying to set up selection=con_res

I've been trying to set up a Slurm cluster with cons_res enabled, with no luck so far. We are running on Ubuntu 16.04.

When using linear selection, everything works as expected: jobs are scheduled, run their course, and then exit, with exclusive use of the node granted. We would like to schedule based on CPUs (cores, actually) and have set this:

# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
#SchedulerPort=7321
#SelectType=select/linear
SelectType=select/cons_res
SelectTypeParameters=CR_CORE
#SelectTypeParameters=CR_CPU

When we launch more than one job at a time per node, the jobs become hung in a COMPLETING state; I am not sure they ever started. Can anyone point me to how to set up Slurm so that allocation is on a CPU (core) basis, with as many jobs running on each node as there are cores?

Regards

Ethan VanMatre
Informatics Research Analyst
Institute on Development and Disability
Oregon Health & Science University
CSLU - GH40
3181 SW Sam Jackson Park Rd
Portland, OR 97239
(503) 346-3764
vanma...@ohsu.edu
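Given the errors in the first log extract, one thing worth checking is whether every daemon was actually restarted after SelectType was switched from linear to cons_res: scontrol reconfigure does not pick up a SelectType change, and state packed under the old plugin may not be unpackable by daemons running the new one, which could also leave jobs stuck in COMPLETING since the terminate/abort RPCs are being rejected as malformed. A sketch of a full restart, assuming systemd-managed slurmctld/slurmd units on Ubuntu 16.04 and a test cluster where losing the current queue is acceptable:

  # on the head node (controller)
  systemctl restart slurmctld

  # on every compute node
  systemctl restart slurmd

  # if the "select plugin type" unpack errors persist, saved state from the
  # old plugin can be discarded by starting the controller once with:
  #   slurmctld -c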