Hi,

I suggest removing "Boards=1". The docs say to include it, but in previous discussions with SchedMD we were advised to remove it.
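For example, a node definition without the Boards entry might look like the following. This is only a sketch based on the slurmd -C output quoted further down; the Sockets= form (rather than SocketsPerBoard=) is the usual way to write it when Boards is dropped, and RealMemory/Gres are carried over from your existing NodeName line, so adjust to your hardware:

    NodeName=shavak-DIT400TR-55L CPUs=64 Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=95311 Gres=gpu:1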
While your jobs are running, execute "scontrol show node <nodename>" and look at the CfgTRES and AllocTRES lines. The former is what the maître d' believes is available, the latter what has been allocated. Then run "scontrol show job <jobid>" and look down at the "NumNodes" line, which will show you what the job requested. I suspect there is a syntax error in the submit script.

Doug

On Sun, Feb 26, 2023 at 2:43 AM Analabha Roy <hariseldo...@gmail.com> wrote:

> Hi Doug,
>
> Again, many thanks for your detailed response. Based on my understanding of your previous note, I did the following:
>
> I set the NodeName with CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2
>
> and the partitions with OverSubscribe=FORCE:2.
>
> Then I put further restrictions in place with the default QOS: MaxTRESPerNode:cpu=32 and MaxJobsPU=MaxSubmit=2.
>
> That way, no single user can legally request more than 2 × 32 cores.
>
> I launched two jobs, sbatch -n 32 each, as one user. They started running immediately, taking up all 64 cores.
>
> Then I logged in as another user and launched the same job with sbatch -n 2. To my dismay, it started to run!
>
> Shouldn't slurm have figured out that all 64 cores were occupied and left the -n 2 job pending?
>
> AR
>
> On Sun, 26 Feb 2023 at 02:18, Doug Meyer <dameye...@gmail.com> wrote:
>
>> Hi,
>>
>> You got me, I didn't know that "OverSubscribe=FORCE:2" is an option. I'll need to explore that.
>>
>> I missed the question about srun. srun is preferred, I believe. I am not associated with drafting the submit scripts but can ask my peer. You do need to stipulate the number of cores you want. Your "sbatch -n 1" should be changed to the number of MPI ranks you desire.
>>
>> As good as slurm is, many come to assume it does far more than it does. I explain slurm as a maître d' in a very exclusive restaurant, aware of every table and the resources they afford. When a reservation is placed (a job submitted), the request is reviewed against the resources, matching the pending guest/job against what is available and against when the other diners/jobs are expected to finish. If a guest requests resources that are not available in the restaurant, the reservation is denied. If a guest arrives and does not need all the resources, the place settings requested but unused are held in the reservation until the job finishes. Slurm manages requests against an inventory. Without enforcement, a job that requests 1 core but uses 12 will run. If your 64-core system accepts 64 such single-core reservations, slurm believing only 64 cores are needed, 64 jobs will start, and then the wait staff (the OS) is left to deal with 768 tasks running on 64 cores. It becomes a sad comedy as the system will probably run out of RAM, triggering the OOM killer, or just run horribly slowly. Never assume slurm is going to prevent bad actors once they begin running unless you have configured it to do so.
>>
>> We run a very lax environment. We set a default of 6 GB per job unless the sbatch declares otherwise, and a default maximum runtime. Without an estimated runtime to work with, the backfill scheduler is crippled. In an environment mixing single-thread and MPI jobs of various sizes, it is critical that jobs are honest about their requirements, giving slurm the information it needs to correctly assign resources.
>>
>> Doug
>>
>> On Sat, Feb 25, 2023 at 12:04 PM Analabha Roy <hariseldo...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Thanks for your considered response.
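For reference, the checks suggested at the top of this reply would look something like the following; the node name is taken from the slurmd -C output quoted further down, and the grep patterns just pick out the relevant fields:

    scontrol show node shavak-DIT400TR-55L | grep -E 'CfgTRES|AllocTRES'
    scontrol show job <jobid> | grep -E 'NumNodes|NumCPUs|TRES'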
>>> A couple of questions linger...
>>>
>>> On Sat, 25 Feb 2023 at 21:46, Doug Meyer <dameye...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Declaring CPUs=64 will absolutely work, but if you start running MPI you'll want a more detailed config description. The easy way to read it is "128 = 2 sockets * 32 cores per socket * 2 threads per core":
>>>>
>>>> NodeName=hpc[306-308] CPUs=128 Sockets=2 CoresPerSocket=32 ThreadsPerCore=2 RealMemory=512000 TmpDisk=100
>>>>
>>>> But if you just want to work with logical cores, the "CPUs=128" will work.
>>>>
>>>> If you go with the more detailed description, then you need to declare oversubscription (hyperthreading) in the partition declaration.
>>>
>>> Yeah, I'll try that.
>>>
>>>> By default slurm will not let two different jobs share the logical cores comprising a physical core. For example, if Sue has an array of 1-1000, her array tasks could each take a logical core on a physical core, but if Jamal is also running, his jobs and hers would not be able to share a physical core (as I understand it).
>>>>
>>>> PartitionName=a Nodes=[301-308] Default=No OverSubscribe=YES:2 MaxTime=Infinite State=Up AllowAccounts=cowboys
>>>>
>>>> In the sbatch/srun the user needs to add an oversubscribe declaration ("--oversubscribe"), telling slurm the job can run on both of the logical cores available.
>>>
>>> How about setting OverSubscribe=FORCE:2? That way, users need not add a setting in their scripts.
>>>
>>>> In the days of Knights Landing each core could handle four logical cores, but I don't believe there are any current AMD or Intel processors supporting more than two logical cores (hyperthreads) per core. The conversation about hyperthreads is difficult, as the Intel terminology is "logical cores" for hyperthreading and "cores" for physical cores, but the tendency is to call the logical cores threads or hyperthreaded cores. This can be very confusing for consumers of the resources.
>>>>
>>>> In any case, if you create an array job of 1-100 sleep jobs, my simplest logical test job, then you can use scontrol show node <nodename> to see the node's resource configuration as well as its consumption. squeue -w <nodename> -i 10 will iterate every ten seconds to show you the node chomping through the job.
>>>>
>>>> Hope this helps. Once you are comfortable I would urge you to use the NodeName/Partition descriptor format above and encourage your users to declare oversubscription in their jobs. It is a little more work up front but far easier than correcting scripts later.
>>>>
>>>> Doug
>>>>
>>>> On Thu, Feb 23, 2023 at 9:41 PM Analabha Roy <hariseldo...@gmail.com> wrote:
>>>>
>>>>> Howdy, and thanks for the warm welcome,
>>>>>
>>>>> On Fri, 24 Feb 2023 at 07:31, Doug Meyer <dameye...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Did you configure your node definition with the outputs of slurmd -C? Ignore boards. I don't know if it is still true, but several years ago declaring boards made things difficult.
>>>>>
>>>>> $ slurmd -C
>>>>> NodeName=shavak-DIT400TR-55L CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=95311 UpTime=0-00:47:51
>>>>> $ grep NodeName /etc/slurm-llnl/slurm.conf
>>>>> NodeName=shavak-DIT400TR-55L CPUs=64 RealMemory=95311 Gres=gpu:1
>>>>>
>>>>> There is a difference. I, too, discarded the Boards and sockets in slurm.conf.
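As an aside, the 1-100 sleep-array test Doug describes above can be as small as the following sketch; the job name and time limit are arbitrary, and the commented-out --oversubscribe line only matters if the partition uses OverSubscribe=YES:

    #!/bin/bash
    #SBATCH --job-name=sleeptest
    #SBATCH --array=1-100
    #SBATCH --ntasks=1
    #SBATCH --time=00:05:00
    ##SBATCH --oversubscribe
    sleep 100

and then, while it runs:

    squeue -w shavak-DIT400TR-55L -i 10
    scontrol show node shavak-DIT400TR-55L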
>>>>> Is that the problem?
>>>>>
>>>>>> Also, if you have hyperthreaded AMD or Intel processors, your partition declaration should include OverSubscribe:2.
>>>>>
>>>>> Yes, I do. It's actually 2 sockets × 16 cores with hyperthreading, but the BIOS is set to show them as 64 cores.
>>>>>
>>>>>> Start with a very simple job with a script containing sleep 100 or something else without any runtime issues.
>>>>>
>>>>> I ran this MPI hello world thing <https://github.com/hariseldon99/buparamshavak/blob/main/shavak_root/usr/local/share/examples/mpi_runs_inf/mpi_count.c> with this sbatch script <https://github.com/hariseldon99/buparamshavak/blob/main/shavak_root/usr/local/share/examples/mpi_runs_inf/mpi_count_normal.sbatch>. It should be the same thing as your suggestion, basically. Should I switch to 'srun' in the batch file?
>>>>>
>>>>> AR
>>>>>
>>>>>> When I started with slurm I built the sbatch one small step at a time: nodes, cores, memory, partition, mail, etc.
>>>>>>
>>>>>> It sounds like your config is very close, but your problem may be in the submit script.
>>>>>>
>>>>>> Best of luck and welcome to slurm. It is very powerful, with a huge community.
>>>>>>
>>>>>> Doug
>>>>>>
>>>>>> On Thu, Feb 23, 2023 at 6:58 AM Analabha Roy <hariseldo...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi folks,
>>>>>>>
>>>>>>> I have a single-node "cluster" running Ubuntu 20.04 LTS with the distribution packages for slurm (slurm-wlm 19.05.5). Slurm only ran one job on the node at a time with the default configuration, leaving all other jobs pending. This happened even if that one job only requested a few cores (the node has 64 cores, and slurm.conf is configured accordingly).
>>>>>>>
>>>>>>> In slurm.conf, SelectType is set to select/cons_res and SelectTypeParameters to CR_Core. NodeName is set with CPUs=64. The path to the file is referenced below.
>>>>>>>
>>>>>>> So I set OverSubscribe=FORCE in the partition config and restarted the daemons.
>>>>>>>
>>>>>>> Multiple jobs now run concurrently, but when Slurm is oversubscribed, it is *truly* *oversubscribed*. That is to say, it runs so many jobs that there are more processes running than cores/threads. How should I configure slurm so that it runs multiple jobs at once per node, but ensures that it doesn't run more processes than there are cores? Is there some TRES magic for this that I can't seem to figure out?
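For concreteness, the direction this thread points in can be sketched roughly as follows. The partition name "shared" is a placeholder, the node values come from the slurmd -C output above, and this is only a sketch of the idea, not a tested drop-in config:

    SelectType=select/cons_res
    SelectTypeParameters=CR_Core
    NodeName=shavak-DIT400TR-55L CPUs=64 Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=95311 Gres=gpu:1
    PartitionName=shared Nodes=shavak-DIT400TR-55L OverSubscribe=FORCE:2 MaxTime=INFINITE State=UP

The idea, as Doug describes above, is that with the full topology declared, cons_res/CR_Core hands out whole cores and the :2 lets at most two jobs share a physical core, one per hardware thread, whereas plain OverSubscribe=FORCE allows more jobs per core, which would be consistent with the overload described above. Either way, the bookkeeping only holds if every job requests as many tasks/CPUs as it actually runs.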
>>>>>>> My slurm.conf is here on github:
>>>>>>> https://github.com/hariseldon99/buparamshavak/blob/main/shavak_root/etc/slurm-llnl/slurm.conf
>>>>>>> The only gres I've set is for the GPU:
>>>>>>> https://github.com/hariseldon99/buparamshavak/blob/main/shavak_root/etc/slurm-llnl/gres.conf
>>>>>>>
>>>>>>> Thanks for your attention,
>>>>>>> Regards,
>>>>>>> AR
>
> --
> Analabha Roy
> Assistant Professor
> Department of Physics <http://www.buruniv.ac.in/academics/department/physics>
> The University of Burdwan <http://www.buruniv.ac.in/>
> Golapbag Campus, Barddhaman 713104
> West Bengal, India
> Emails: dan...@utexas.edu, a...@phys.buruniv.ac.in, hariseldo...@gmail.com
> Webpage: http://www.ph.utexas.edu/~daneel/