Re: [slurm-users] [External] Cancel "reboot ASAP" for a node

2020-08-07 Thread Prentice Bisbal
Are you looking at the man page on the SchedMD website, or on your computer? If you're looking at the website, those pages are for the latest version and may not match what you have installed, so this could be a feature in a later version than 18.08. -- Prentice On 8/7/20 11:43 AM, Hanby, Mike

[slurm-users] SlurmdTimeout and keeping jobs running

2020-08-07 Thread Jacob Chappell
Dear Slurm Community, We recognize that SlurmdTimeout has a default value of 300 seconds, and that if the controller is unable to communicate with a node for that long it will mark the node down. We have two questions regarding this: 1. Won't individual compute nodes also kill their own jobs if
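A minimal sketch of where this timeout lives, assuming the stock slurm.conf parameter and the standard scontrol query; the value shown is just the default mentioned above:

    # slurm.conf: how long slurmctld waits for a slurmd to respond before marking the node down
    SlurmdTimeout=300

    # verify the value the running controller is actually using
    scontrol show config | grep SlurmdTimeout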

Re: [slurm-users] Only 2 jobs will start per GPU node despite 4 GPU's being present

2020-08-07 Thread Renfro, Michael
I’ve only got 2 GPUs in my nodes, but I’ve always used non-overlapping CPUs= or COREs= settings. Currently, they’re: NodeName=gpunode00[1-4] Name=gpu Type=k80 File=/dev/nvidia[0-1] COREs=0-7,9-15 and I’ve got 2 jobs currently running on each node that’s available. So maybe: NodeName=c0005
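For comparison, a sketch of what non-overlapping core bindings could look like in gres.conf for a four-GPU node; the node name c0005 is taken from the message, but the GPU type and core ranges are illustrative, not from the thread:

    # gres.conf: give each GPU its own, non-overlapping set of cores (example values)
    NodeName=c0005 Name=gpu Type=k80 File=/dev/nvidia0 COREs=0-7
    NodeName=c0005 Name=gpu Type=k80 File=/dev/nvidia1 COREs=8-15
    NodeName=c0005 Name=gpu Type=k80 File=/dev/nvidia2 COREs=16-23
    NodeName=c0005 Name=gpu Type=k80 File=/dev/nvidia3 COREs=24-31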

Re: [slurm-users] Only 2 jobs will start per GPU node despite 4 GPU's being present

2020-08-07 Thread Jodie H. Sprouse
Hi Tina, Thank you so much for looking at this. slurm 18.08.8. nvidia-smi topo -m shows: GPU0 GPU1 GPU2 GPU3 mlx5_0 CPU Affinity; GPU0: X NV2 NV2 NV2 NODE 0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26,28-28,30-30,32-32,34-34,36-36,38-

Re: [slurm-users] Cancel "reboot ASAP" for a node

2020-08-07 Thread Hanby, Mike
This is what's in /var/log/slurmctld: Invalid node state transition requested for node c01 from=DRAINING to=CANCEL_REBOOT. So it looks like, for version 18.08 at least, you have to first undrain, then cancel the reboot: scontrol update NodeName="c01" State=undrain Reason="cancelling reboot" scontrol
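A sketch of the two-step sequence described above, assuming the second (truncated) command is the State=CANCEL_REBOOT update named in the error message:

    # step 1: take the node out of DRAINING first
    scontrol update NodeName="c01" State=undrain Reason="cancelling reboot"

    # step 2: the CANCEL_REBOOT transition is now accepted
    scontrol update NodeName="c01" State=CANCEL_REBOOT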

Re: [slurm-users] Only 2 jobs will start per GPU node despite 4 GPU's being present

2020-08-07 Thread Tina Friedrich
Hi Jodie, what version of SLURM are you using? I'm pretty sure newer versions pick the topology up automatically (although I'm on 18.08 so I can't verify that). Is what you're wanting to do - basically - forcefully feed a 'wrong' gres.conf to make SLURM assume all GPUs are on one CPU? (I don

[slurm-users] Cancel "reboot ASAP" for a node

2020-08-07 Thread Hanby, Mike
Howdy, (Slurm 18.08) We have a bunch of nodes that we've updated with "scontrol reboot ASAP". We'd like to cancel a few of those. From the man page, it's suggested that either of the following should work; however, both report the same error "slurm_update error: Invalid node state specified": sco

Re: [slurm-users] Only 2 jobs will start per GPU node despite 4 GPU's being present

2020-08-07 Thread Jodie H. Sprouse
Tina, Thank you. Yes, jobs will run on all 4 GPUs if I submit with --gres-flags=disable-binding. Yet my goal is to have the GPUs bound to a CPU so that a CPU-only job never runs on that particular CPU (keeping it bound to the GPU and always free for a GPU job) and give the CPU job t
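A sketch of the two submission modes being compared; job.sh and the single-GPU request are hypothetical, and the enforce-binding variant is mentioned only as the documented counterpart of disable-binding, not something quoted from the thread:

    # runs regardless of which CPU the GPU is attached to (what works today)
    sbatch --gres=gpu:1 --gres-flags=disable-binding job.sh

    # enforces the CPU/GPU binding from gres.conf (the behaviour Jodie is after)
    sbatch --gres=gpu:1 --gres-flags=enforce-binding job.sh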

Re: [slurm-users] Only 2 jobs will start per GPU node despite 4 GPU's being present

2020-08-07 Thread Tina Friedrich
Hello, This is something I've seen once on our systems & it took me a while to figure out what was going on. It turned out that the system topology was such that all GPUs were connected to one CPU. There were no free cores on that particular CPU, so SLURM did not schedule any more jobs to

Re: [slurm-users] Only 2 jobs will start per GPU node despite 4 GPU's being present

2020-08-07 Thread Jodie H. Sprouse
Good morning. I am having the same experience here. Wondering if you had a resolution? Thank you. Jodie On Jun 11, 2020, at 3:27 PM, Rhian Resnick wrote: We have several users submitting single GPU jobs to our cluster. We expected the jobs to fill each node and fu

Re: [slurm-users] Tuning MaxJobs and MaxJobsSubmit per user and for the whole cluster?

2020-08-07 Thread Paul Edmon
My rule of thumb is that the MaxJobs for the entire cluster is twice the number of cores you have available. That way you have enough jobs running to fill all the cores and enough jobs pending to refill them. As for per-user MaxJobs, it just depends on what you think the maximum number any u
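One way those limits could be expressed, assuming a hypothetical 1,000-core cluster and a made-up account/user; the cluster-wide cap shown here is the slurm.conf MaxJobCount parameter, and the per-user cap is an association limit set with sacctmgr:

    # slurm.conf: cap on jobs in the system, roughly 2x the core count per the rule of thumb
    MaxJobCount=2000

    # per-user limit on concurrently running jobs, set on the association
    sacctmgr modify user name=jdoe account=research set MaxJobs=100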

[slurm-users] PrivateData does not filter the billing info "scontrol show assoc_mgr flags=qos"

2020-08-07 Thread Hemanta Sahu
Hi All, I have configured the "PrivateData" parameter in "slurm.conf" in our test cluster as below. >> [testuser1@centos7vm01 ~]$ cat /etc/slurm/slurm.conf|less PrivateData=accounts,jobs,reservations,usage,users,events,partitions,nodes MCSPlugin=mcs/user MCSParameters=enforced,select,privatedat
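For reference, a condensed sketch of the configuration and the query in question, both taken from the message and the subject line (the hostname and user are from Hemanta's own example):

    # slurm.conf
    PrivateData=accounts,jobs,reservations,usage,users,events,partitions,nodes

    # the query whose QOS/billing output is not being filtered as expected
    scontrol show assoc_mgr flags=qos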