Hi,

I suggest removing "Boards=1". The docs say to include it, but in previous discussions with SchedMD we were advised to remove it.
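For example, a node definition without the Boards entry might look like the following. This is only a sketch based on the slurmd -C output quoted further down; the Sockets= form (rather than SocketsPerBoard=) is the usual way to write it when Boards is dropped, and RealMemory/Gres are carried over from your existing NodeName line, so adjust to your hardware:

    NodeName=shavak-DIT400TR-55L CPUs=64 Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=95311 Gres=gpu:1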
While your jobs are running, execute "scontrol show node <nodename>" and look at the CfgTRES and AllocTRES lines. The former is what the maître d' believes is available, the latter what has been allocated. Then run "scontrol show job <jobid>" and look down at the "NumNodes" line, which will show you what the job requested. I suspect there is a syntax error in the submit script.

Doug

On Sun, Feb 26, 2023 at 2:43 AM Analabha Roy <hariseldo...@gmail.com> wrote:

> Hi Doug,
>
> Again, many thanks for your detailed response. Based on my understanding of your previous note, I did the following:
>
> I set the NodeName with CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2
>
> and the partitions with OverSubscribe=FORCE:2.
>
> Then I put further restrictions in place with the default QOS: MaxTRESPerNode:cpu=32 and MaxJobsPU=MaxSubmit=2.
>
> That way, no single user can legally request more than 2 × 32 cores.
>
> I launched two jobs, sbatch -n 32 each, as one user. They started running immediately, taking up all 64 cores.
>
> Then I logged in as another user and launched the same job with sbatch -n 2. To my dismay, it started to run!
>
> Shouldn't slurm have figured out that all 64 cores were occupied and left the -n 2 job pending?
>
> AR
>
> On Sun, 26 Feb 2023 at 02:18, Doug Meyer <dameye...@gmail.com> wrote:
>
>> Hi,
>>
>> You got me, I didn't know that "OverSubscribe=FORCE:2" is an option. I'll need to explore that.
>>
>> I missed the question about srun. srun is preferred, I believe. I am not associated with drafting the submit scripts but can ask my peer. You do need to stipulate the number of cores you want. Your "sbatch -n 1" should be changed to the number of MPI ranks you desire.
>>
>> As good as slurm is, many come to assume it does far more than it does. I explain slurm as a maître d' in a very exclusive restaurant, aware of every table and the resources they afford. When a reservation is placed (a job submitted), the request is reviewed against the resources, matching the pending guest/job against what is available and against when the other diners/jobs are expected to finish. If a guest requests resources that are not available in the restaurant, the reservation is denied. If a guest arrives and does not need all the resources, the place settings requested but unused are held in the reservation until the job finishes. Slurm manages requests against an inventory. Without enforcement, a job that requests 1 core but uses 12 will run. If your 64-core system accepts 64 such single-core reservations, slurm believing only 64 cores are needed, 64 jobs will start, and then the wait staff (the OS) is left to deal with 768 tasks running on 64 cores. It becomes a sad comedy as the system will probably run out of RAM, triggering the OOM killer, or just run horribly slowly. Never assume slurm is going to prevent bad actors once they begin running unless you have configured it to do so.
>>
>> We run a very lax environment. We set a default of 6 GB per job unless the sbatch declares otherwise, and a default maximum runtime. Without an estimated runtime to work with, the backfill scheduler is crippled. In an environment mixing single-thread and MPI jobs of various sizes, it is critical that jobs are honest about their requirements, giving slurm the information it needs to correctly assign resources.
>>
>> Doug
>>
>> On Sat, Feb 25, 2023 at 12:04 PM Analabha Roy <hariseldo...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Thanks for your considered response.
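For reference, the checks suggested at the top of this reply would look something like the following; the node name is taken from the slurmd -C output quoted further down, and the grep patterns just pick out the relevant fields:

    scontrol show node shavak-DIT400TR-55L | grep -E 'CfgTRES|AllocTRES'
    scontrol show job <jobid> | grep -E 'NumNodes|NumCPUs|TRES'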
>>> A couple of questions linger...
>>>
>>> On Sat, 25 Feb 2023 at 21:46, Doug Meyer <dameye...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Declaring CPUs=64 will absolutely work, but if you start running MPI you'll want a more detailed config description. The easy way to read it is "128 = 2 sockets * 32 cores per socket * 2 threads per core":
>>>>
>>>> NodeName=hpc[306-308] CPUs=128 Sockets=2 CoresPerSocket=32 ThreadsPerCore=2 RealMemory=512000 TmpDisk=100
>>>>
>>>> But if you just want to work with logical cores, the "CPUs=128" will work.
>>>>
>>>> If you go with the more detailed description, then you need to declare oversubscription (hyperthreading) in the partition declaration.
>>>
>>> Yeah, I'll try that.
>>>
>>>> By default slurm will not let two different jobs share the logical cores comprising a physical core. For example, if Sue has an array of 1-1000, her array tasks could each take a logical core on a physical core, but if Jamal is also running, his jobs and hers would not be able to share a physical core (as I understand it).
>>>>
>>>> PartitionName=a Nodes=[301-308] Default=No OverSubscribe=YES:2 MaxTime=Infinite State=Up AllowAccounts=cowboys
>>>>
>>>> In the sbatch/srun the user needs to add an oversubscribe declaration ("--oversubscribe"), telling slurm the job can run on both of the logical cores available.
>>>
>>> How about setting OverSubscribe=FORCE:2? That way, users need not add a setting in their scripts.
>>>
>>>> In the days of Knights Landing each core could handle four logical cores, but I don't believe there are any current AMD or Intel processors supporting more than two logical cores (hyperthreads) per core. The conversation about hyperthreads is difficult, as the Intel terminology is "logical cores" for hyperthreading and "cores" for physical cores, but the tendency is to call the logical cores threads or hyperthreaded cores. This can be very confusing for consumers of the resources.
>>>>
>>>> In any case, if you create an array job of 1-100 sleep jobs, my simplest logical test job, then you can use scontrol show node <nodename> to see the node's resource configuration as well as its consumption. squeue -w <nodename> -i 10 will iterate every ten seconds to show you the node chomping through the job.
>>>>
>>>> Hope this helps. Once you are comfortable I would urge you to use the NodeName/Partition descriptor format above and encourage your users to declare oversubscription in their jobs. It is a little more work up front but far easier than correcting scripts later.
>>>>
>>>> Doug
>>>>
>>>> On Thu, Feb 23, 2023 at 9:41 PM Analabha Roy <hariseldo...@gmail.com> wrote:
>>>>
>>>>> Howdy, and thanks for the warm welcome,
>>>>>
>>>>> On Fri, 24 Feb 2023 at 07:31, Doug Meyer <dameye...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Did you configure your node definition with the outputs of slurmd -C? Ignore boards. I don't know if it is still true, but several years ago declaring boards made things difficult.
>>>>>
>>>>> $ slurmd -C
>>>>> NodeName=shavak-DIT400TR-55L CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=95311 UpTime=0-00:47:51
>>>>> $ grep NodeName /etc/slurm-llnl/slurm.conf
>>>>> NodeName=shavak-DIT400TR-55L CPUs=64 RealMemory=95311 Gres=gpu:1
>>>>>
>>>>> There is a difference. I, too, discarded the Boards and sockets in slurm.conf.
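As an aside, the 1-100 sleep-array test Doug describes above can be as small as the following sketch; the job name and time limit are arbitrary, and the commented-out --oversubscribe line only matters if the partition uses OverSubscribe=YES:

    #!/bin/bash
    #SBATCH --job-name=sleeptest
    #SBATCH --array=1-100
    #SBATCH --ntasks=1
    #SBATCH --time=00:05:00
    ##SBATCH --oversubscribe
    sleep 100

and then, while it runs:

    squeue -w shavak-DIT400TR-55L -i 10
    scontrol show node shavak-DIT400TR-55L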
>>>>> Is that the problem?
>>>>>
>>>>>> Also, if you have hyperthreaded AMD or Intel processors, your partition declaration should include OverSubscribe:2.
>>>>>
>>>>> Yes, I do. It's actually 2 sockets × 16 cores with hyperthreading, but the BIOS is set to show them as 64 cores.
>>>>>
>>>>>> Start with a very simple job with a script containing sleep 100 or something else without any runtime issues.
>>>>>
>>>>> I ran this MPI hello world thing <https://github.com/hariseldon99/buparamshavak/blob/main/shavak_root/usr/local/share/examples/mpi_runs_inf/mpi_count.c> with this sbatch script <https://github.com/hariseldon99/buparamshavak/blob/main/shavak_root/usr/local/share/examples/mpi_runs_inf/mpi_count_normal.sbatch>. It should be the same thing as your suggestion, basically. Should I switch to 'srun' in the batch file?
>>>>>
>>>>> AR
>>>>>
>>>>>> When I started with slurm I built the sbatch one small step at a time: nodes, cores, memory, partition, mail, etc.
>>>>>>
>>>>>> It sounds like your config is very close, but your problem may be in the submit script.
>>>>>>
>>>>>> Best of luck and welcome to slurm. It is very powerful, with a huge community.
>>>>>>
>>>>>> Doug
>>>>>>
>>>>>> On Thu, Feb 23, 2023 at 6:58 AM Analabha Roy <hariseldo...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi folks,
>>>>>>>
>>>>>>> I have a single-node "cluster" running Ubuntu 20.04 LTS with the distribution packages for slurm (slurm-wlm 19.05.5). Slurm only ran one job on the node at a time with the default configuration, leaving all other jobs pending. This happened even if that one job only requested a few cores (the node has 64 cores, and slurm.conf is configured accordingly).
>>>>>>>
>>>>>>> In slurm.conf, SelectType is set to select/cons_res and SelectTypeParameters to CR_Core. NodeName is set with CPUs=64. The path to the file is referenced below.
>>>>>>>
>>>>>>> So I set OverSubscribe=FORCE in the partition config and restarted the daemons.
>>>>>>>
>>>>>>> Multiple jobs now run concurrently, but when Slurm is oversubscribed, it is *truly* *oversubscribed*. That is to say, it runs so many jobs that there are more processes running than cores/threads. How should I configure slurm so that it runs multiple jobs at once per node, but ensures that it doesn't run more processes than there are cores? Is there some TRES magic for this that I can't seem to figure out?
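For concreteness, the direction this thread points in can be sketched roughly as follows. The partition name "shared" is a placeholder, the node values come from the slurmd -C output above, and this is only a sketch of the idea, not a tested drop-in config:

    SelectType=select/cons_res
    SelectTypeParameters=CR_Core
    NodeName=shavak-DIT400TR-55L CPUs=64 Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=95311 Gres=gpu:1
    PartitionName=shared Nodes=shavak-DIT400TR-55L OverSubscribe=FORCE:2 MaxTime=INFINITE State=UP

The idea, as Doug describes above, is that with the full topology declared, cons_res/CR_Core hands out whole cores and the :2 lets at most two jobs share a physical core, one per hardware thread, whereas plain OverSubscribe=FORCE allows more jobs per core, which would be consistent with the overload described above. Either way, the bookkeeping only holds if every job requests as many tasks/CPUs as it actually runs.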
>>>>>>> My slurm.conf is here on github:
>>>>>>> https://github.com/hariseldon99/buparamshavak/blob/main/shavak_root/etc/slurm-llnl/slurm.conf
>>>>>>> The only gres I've set is for the GPU:
>>>>>>> https://github.com/hariseldon99/buparamshavak/blob/main/shavak_root/etc/slurm-llnl/gres.conf
>>>>>>>
>>>>>>> Thanks for your attention,
>>>>>>> Regards,
>>>>>>> AR
>
> --
> Analabha Roy
> Assistant Professor
> Department of Physics <http://www.buruniv.ac.in/academics/department/physics>
> The University of Burdwan <http://www.buruniv.ac.in/>
> Golapbag Campus, Barddhaman 713104
> West Bengal, India
> Emails: dan...@utexas.edu, a...@phys.buruniv.ac.in, hariseldo...@gmail.com
> Webpage: http://www.ph.utexas.edu/~daneel/