Hi, I have a node that is registered in two separate partitions, "ML" and "shared", and I'm running Slurm 19.05.5-1.

Everything works well in batch: users submit into the "shared" partition with the "scavenger" QoS, which is preemptable by "normal" QoS submissions. The default QoS is set up to be "scavenger" on "shared" and "normal" everywhere else, and users can only submit to the shared partition with the scavenger QoS.
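For context, the preemption side of this is configured roughly along the following lines. This is a paraphrased sketch rather than a verbatim copy of my config (in particular the PreemptMode value, the MaxTime values and the AllowQos restriction are illustrative), and it leaves out how the per-partition default QoS is actually applied:

# slurm.conf (sketch, not verbatim)
PreemptType=preempt/qos
# REQUEUE here is illustrative; the exact mode in my config may differ
PreemptMode=REQUEUE
PartitionName=ml Nodes=ml-gpu01 MaxTime=04:00:00 State=UP
# assumption: the shared partition is restricted to the scavenger QoS via AllowQos
PartitionName=shared Nodes=ml-gpu01 MaxTime=04:00:00 State=UP AllowQos=scavenger

# QoS relationship (sacctmgr, paraphrased): "normal" may preempt "scavenger"
sacctmgr modify qos normal set preempt=scavenger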
However, users are getting stalls when running srun, and it only seems to happen for srun/salloc in the "shared" partition, not in the "ML" partition. For example:

$ srun -vv -A shared -p shared -w ml-gpu01 --pty /bin/bash
srun: defined options
srun: -------------------- --------------------
srun: account : shared
srun: nodelist : ml-gpu01
srun: partition : shared
srun: pty : set
srun: verbose : 2
srun: -------------------- --------------------
srun: end of defined options
srun: debug: propagating RLIMIT_CPU=18446744073709551615
srun: debug: propagating RLIMIT_FSIZE=18446744073709551615
srun: debug: propagating RLIMIT_DATA=18446744073709551615
srun: debug: propagating RLIMIT_STACK=8388608
srun: debug: propagating RLIMIT_CORE=0
srun: debug: propagating RLIMIT_RSS=18446744073709551615
srun: debug: propagating RLIMIT_NPROC=4096
srun: debug: propagating RLIMIT_NOFILE=1024
srun: debug: propagating RLIMIT_AS=18446744073709551615
srun: debug: propagating SLURM_PRIO_PROCESS=0
srun: debug: propagating UMASK=0022
srun: debug: Entering slurm_allocation_msg_thr_create()
srun: debug: _is_port_ok: bind() failed port 61704 sock 6 Address already in use
srun: debug: port from net_stream_listen is 61705
srun: debug: Entering _msg_thr_internal
srun: debug: _is_port_ok: bind() failed port 61704 sock 9 Address already in use
srun: debug: _is_port_ok: bind() failed port 61705 sock 9 Address already in use
srun: debug: Munge authentication plugin loaded
srun: job 1432 queued and waiting for resources
<stalled>

$ scontrol show jobid 1432
JobId=1432 JobName=bash
   UserId=ytl(7017) GroupId=sf(1051) MCS_label=N/A
   Priority=9311 Nice=0 Account=shared QOS=scavenger
   JobState=FAILED Reason=NonZeroExitCode Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=1:0
   RunTime=00:00:10 TimeLimit=04:00:00 TimeMin=N/A
   SubmitTime=2020-02-05T19:48:08 EligibleTime=2020-02-05T19:48:08
   AccrueTime=2020-02-05T19:48:08
   StartTime=2020-02-05T19:48:09 EndTime=2020-02-05T19:48:19 Deadline=N/A
   PreemptEligibleTime=2020-02-05T19:48:09 PreemptTime=None
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-02-05T19:48:09
   Partition=shared AllocNode:Sid=ocio-gpu01:28326
   ReqNodeList=ml-gpu01 ExcNodeList=(null)
   NodeList=ml-gpu01 BatchHost=ml-gpu01
   NumNodes=1 NumCPUs=1 NumTasks=0 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=2000M,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/bin/bash
   WorkDir=/gpfs/slac/cryo/fs1/u/ytl
   Power=
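So while srun sits there reporting the job as queued, scontrol shows it actually started and then failed with ExitCode=1:0 about ten seconds later. If more detail on why the step exited non-zero would help, I'm happy to pull it; this is roughly what I would run (the slurmd log path below is an assumption, it's wherever SlurmdLogFile points on my nodes):

# accounting record for the failed interactive job
sacct -j 1432 --format=JobID,JobName,Partition,State,ExitCode,Elapsed,NodeList

# on ml-gpu01: anything slurmd logged for that job
grep 1432 /var/log/slurm/slurmd.log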
However, if I submit into the "ML" partition (or indeed any other partition that uses the "normal" QoS), it works fine:

$ srun -vv -A ml -p ml -w ml-gpu01 --pty /bin/bash
srun: defined options
srun: -------------------- --------------------
srun: account : ml
srun: nodelist : ml-gpu01
srun: partition : ml
srun: pty : set
srun: verbose : 2
srun: -------------------- --------------------
srun: end of defined options
srun: debug: propagating RLIMIT_CPU=18446744073709551615
srun: debug: propagating RLIMIT_FSIZE=18446744073709551615
srun: debug: propagating RLIMIT_DATA=18446744073709551615
srun: debug: propagating RLIMIT_STACK=8388608
srun: debug: propagating RLIMIT_CORE=0
srun: debug: propagating RLIMIT_RSS=18446744073709551615
srun: debug: propagating RLIMIT_NPROC=4096
srun: debug: propagating RLIMIT_NOFILE=1024
srun: debug: propagating RLIMIT_AS=18446744073709551615
srun: debug: propagating SLURM_PRIO_PROCESS=0
srun: debug: propagating UMASK=0022
srun: debug: Entering slurm_allocation_msg_thr_create()
srun: debug: _is_port_ok: bind() failed port 61323 sock 6 Address already in use
srun: debug: port from net_stream_listen is 61324
srun: debug: Entering _msg_thr_internal
srun: debug: _is_port_ok: bind() failed port 61323 sock 9 Address already in use
srun: debug: _is_port_ok: bind() failed port 61324 sock 9 Address already in use
srun: debug: Munge authentication plugin loaded
srun: jobid 1434: nodes(1):`ml-gpu01', cpu counts: 1(x1)
srun: debug: requesting job 1434, user 7017, nodes 1 including (ml-gpu01)
srun: debug: cpus 1, tasks 1, name bash, relative 65534
srun: debug: _is_port_ok: bind() failed port 61323 sock 9 Address already in use
srun: debug: _is_port_ok: bind() failed port 61324 sock 9 Address already in use
srun: debug: _is_port_ok: bind() failed port 61323 sock 10 Address already in use
srun: debug: _is_port_ok: bind() failed port 61324 sock 10 Address already in use
srun: debug: _is_port_ok: bind() failed port 61325 sock 10 Address already in use
srun: CpuBindType=(null type)
srun: debug: Entering slurm_step_launch
srun: debug: mpi type = (null)
srun: debug: Using mpi/none
srun: debug: Entering _msg_thr_create()
srun: debug: _is_port_ok: bind() failed port 61323 sock 15 Address already in use
srun: debug: _is_port_ok: bind() failed port 61324 sock 15 Address already in use
srun: debug: _is_port_ok: bind() failed port 61325 sock 15 Address already in use
srun: debug: _is_port_ok: bind() failed port 61326 sock 15 Address already in use
srun: debug: _is_port_ok: bind() failed port 61323 sock 18 Address already in use
srun: debug: _is_port_ok: bind() failed port 61324 sock 18 Address already in use
srun: debug: _is_port_ok: bind() failed port 61325 sock 18 Address already in use
srun: debug: _is_port_ok: bind() failed port 61326 sock 18 Address already in use
srun: debug: _is_port_ok: bind() failed port 61327 sock 18 Address already in use
srun: debug: initialized stdio listening socket, port 61328
srun: debug: Started IO server thread (140594085246720)
srun: debug: Entering _launch_tasks
srun: launching 1434.0 on host ml-gpu01, 1 tasks: 0
srun: route default plugin loaded
srun: debug: launch returned msg_rc=0 err=0 type=8001
srun: Node ml-gpu01, 1 tasks started

$ scontrol show jobid 1434
JobId=1434 JobName=bash
   UserId=ytl(7017) GroupId=sf(1051) MCS_label=N/A
   Priority=17988 Nice=0 Account=ml QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:00:12 TimeLimit=04:00:00 TimeMin=N/A
   SubmitTime=2020-02-05T19:53:50 EligibleTime=2020-02-05T19:53:50
   AccrueTime=Unknown
   StartTime=2020-02-05T19:53:50 EndTime=2020-02-05T23:53:50 Deadline=N/A
   PreemptEligibleTime=2020-02-05T19:53:50 PreemptTime=None
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-02-05T19:53:50
   Partition=ml AllocNode:Sid=ocio-gpu01:28326
   ReqNodeList=ml-gpu01 ExcNodeList=(null)
   NodeList=ml-gpu01 BatchHost=ml-gpu01
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=4000M,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=4000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/bin/bash
   WorkDir=/gpfs/slac/cryo/fs1/u/ytl
   Power=

I do not have any limits set in any of the QoS's yet, and I'm fairly sure there is no firewall between the node where srun runs and the node the job lands on (ml-gpu01).

Any ideas?

Cheers,
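P.S. To back up the "no limits" claim, this is roughly how I have been checking the QoS definitions and the scheduler-side settings (the field list is trimmed for readability, and SrunPortRange may well not be set; the grep just shows whether it is):

# QoS definitions: preemption relationship and any per-user/wall limits
sacctmgr show qos format=Name,Priority,Preempt,PreemptMode,Flags,MaxWall,MaxTRESPU,MaxJobsPU

# scheduler-side preemption settings, plus SrunPortRange if present
scontrol show config | grep -i -E 'preempt|srunport'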