Hey,
Thanks for sticking with this.

On Sun, 26 Feb 2023 at 23:43, Doug Meyer <dameye...@gmail.com> wrote:

> Hi,
>
> Suggest removing "boards=1". The docs say to include it, but in previous
> discussions with SchedMD we were advised to remove it.

I just did. Then ran scontrol reconfigure.

> While the job is running, execute "scontrol show node <nodename>" and look
> at the lines CfgTRES and AllocTRES. The former is what the maître d'
> believes is available, the latter what has been allocated.
>
> Then run "scontrol show job <jobid>" and look at the "NumNodes" line, which
> will show you what the job requested.
>
> I suspect there is a syntax error in the submit.

Okay. Now this is strange.

First, I launched this job twice <https://pastebin.com/s21yXFH2>.

This should take up 20 + 20 = 40 cores, because of:

   1. #SBATCH -n 20              # Number of tasks
   2. #SBATCH --cpus-per-task=1

Running scontrol show job on both jobids yields:

   - NumNodes=1 NumCPUs=20 NumTasks=20 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   - NumNodes=1 NumCPUs=20 NumTasks=20 CPUs/Task=1 ReqB:S:C:T=0:0:*:*

Then, running scontrol on the node yields:

   - scontrol show node $HOSTNAME
   - CfgTRES=cpu=64,mem=95311M,billing=64,gres/gpu=1
   - AllocTRES=cpu=40

So far so good. Both show 40 cores allocated.

However, if I now add another job with 60 cores <https://pastebin.com/C0uW0Aut>, this happens:

scontrol on the node:

   CfgTRES=cpu=64,mem=95311M,billing=64,gres/gpu=1
   AllocTRES=cpu=60

squeue:

   JOBID PARTITION     NAME  USER ST   TIME NODES NODELIST(REASON)
     413       CPU   normal admin  R  21:22     1 shavak-DIT400TR-55L
     414       CPU   normal admin  R  19:53     1 shavak-DIT400TR-55L
     417       CPU elevated admin  R   1:31     1 shavak-DIT400TR-55L

scontrol on the jobids:

   admin@shavak-DIT400TR-55L:~/mpi_runs_inf$ scontrol show job 413|grep NumCPUs
      NumNodes=1 NumCPUs=20 NumTasks=20 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   admin@shavak-DIT400TR-55L:~/mpi_runs_inf$ scontrol show job 414|grep NumCPUs
      NumNodes=1 NumCPUs=20 NumTasks=20 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   admin@shavak-DIT400TR-55L:~/mpi_runs_inf$ scontrol show job 417|grep NumCPUs
      NumNodes=1 NumCPUs=60 NumTasks=60 CPUs/Task=1 ReqB:S:C:T=0:0:*:*

So, according to the jobs, 100 CPUs are allocated in total (20 + 20 + 60), but according to scontrol on the node, only 60?

The submission scripts are on pastebin:

https://pastebin.com/s21yXFH2
https://pastebin.com/C0uW0Aut

AR

> Doug
>
> On Sun, Feb 26, 2023 at 2:43 AM Analabha Roy <hariseldo...@gmail.com>
> wrote:
>
>> Hi Doug,
>>
>> Again, many thanks for your detailed response.
>> Based on my understanding of your previous note, I did the following:
>>
>> I set the nodename with CPUs=64 Boards=1 SocketsPerBoard=2
>> CoresPerSocket=16 ThreadsPerCore=2
>>
>> and the partitions with oversubscribe=force:2,
>>
>> then I put further restrictions on the default qos:
>> MaxTRESPerNode:cpu=32, MaxJobsPU=MaxSubmit=2.
>>
>> That way, no single user can request more than 2 x 32 cores legally.
>>
>> I launched two jobs, sbatch -n 32 each, as one user. They started running
>> immediately, taking up all 64 cores.
>>
>> Then I logged in as another user and launched the same job with
>> sbatch -n 2. To my dismay, it started to run!
>>
>> Shouldn't slurm have figured out that all 64 cores were occupied and
>> queued the -n 2 job as pending?
>>
>> AR
>>
>> On Sun, 26 Feb 2023 at 02:18, Doug Meyer <dameye...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> You got me, I didn't know that "oversubscribe=FORCE:2" is an option.
>>> I'll need to explore that.
>>>
>>> I missed the question about srun. srun is the preferred one, I believe.
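
As an aside on the srun point: a stripped-down version of the 20-task request I described earlier in this mail, launched through srun as suggested, would look roughly like the sketch below. This is illustrative only, not the actual pastebin contents; the partition, walltime and binary name are placeholders, and it assumes Slurm's MPI integration (PMI/PMIx) is configured so srun can start the ranks directly:

   #!/bin/bash
   #SBATCH -n 20               # number of MPI tasks
   #SBATCH --cpus-per-task=1   # one core per task
   #SBATCH --partition=CPU     # placeholder partition name
   #SBATCH --time=01:00:00     # placeholder walltime

   # srun inherits -n and --cpus-per-task from the allocation,
   # so no explicit rank count is needed here.
   srun ./mpi_count            # placeholder binary name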
>>> I am not associated with drafting the submit scripts, but can ask my peer.
>>> You do need to stipulate the number of cores you want. Your "sbatch -n 1"
>>> should be changed to the number of MPI ranks you desire.
>>>
>>> As good as slurm is, many come to assume it does far more than it does.
>>> I explain slurm as a maître d' in a very exclusive restaurant, aware of
>>> every table and the resources they afford. When a reservation is placed (a
>>> job submitted), a review of the request versus the resources matches the
>>> pending guest/job against the resources and against when the other
>>> diners/jobs are expected to finish. If a guest requests resources that are
>>> not available in the restaurant, the reservation is denied. If a guest
>>> arrives and does not need all the resources, the place settings requested
>>> but unused are left in reservation until the job finishes. Slurm manages
>>> requests against an inventory. Without enforcement, a job that requests 1
>>> core but uses 12 will run. If your 64-core system accepts 64 single-core
>>> reservations, slurm believing 64 cores are needed, 64 jobs will start, and
>>> then the wait staff (the OS) is left to deal with 768 tasks running on 64
>>> cores. It becomes a sad comedy, as the system will probably run out of
>>> RAM, triggering the OOM killer, or just run horribly slowly. Never assume
>>> slurm is going to prevent bad actors once they begin running unless you
>>> have configured it to do so.
>>>
>>> We run a very lax environment. We set a standard of 6 GB per job unless
>>> the sbatch declares otherwise, and a default max runtime. Without an
>>> estimated runtime to work with, the backfill scheduler is crippled. In an
>>> environment mixing single-thread and MPI jobs of various sizes, it is
>>> critical that the jobs are honest in their requirements, providing slurm
>>> the information needed to correctly assign resources.
>>>
>>> Doug
>>>
>>> On Sat, Feb 25, 2023 at 12:04 PM Analabha Roy <hariseldo...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> Thanks for your considered response. A couple of questions linger...
>>>>
>>>> On Sat, 25 Feb 2023 at 21:46, Doug Meyer <dameye...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Declaring cores=64 will absolutely work, but if you start running MPI
>>>>> you'll want a more detailed config description. The easy way to read
>>>>> it is "128 = 2 sockets * 32 cores per socket * 2 threads per core".
>>>>>
>>>>> NodeName=hpc[306-308] CPUs=128 Sockets=2 CoresPerSocket=32
>>>>> ThreadsPerCore=2 RealMemory=512000 TmpDisk=100
>>>>>
>>>>> But if you just want to work with logical cores, the "cpus=128" will
>>>>> work.
>>>>>
>>>>> If you go with the more detailed description, then you need to declare
>>>>> oversubscription (hyperthreading) in the partition declaration.
>>>>
>>>> Yeah, I'll try that.
>>>>
>>>>> By default slurm will not let two different jobs share the logical
>>>>> cores comprising a physical core. For example, if Sue has an array of
>>>>> 1-1000, her array tasks could each take a logical core on a physical
>>>>> core. But if Jamal is also running, they would not be able to share
>>>>> the physical core (as I understand it).
>>>>>
>>>>> PartitionName=a Nodes=[301-308] Default=No OverSubscribe=YES:2
>>>>> MaxTime=Infinite State=Up AllowAccounts=cowboys
>>>>>
>>>>> In the sbatch/srun, the user needs to add a declaration
>>>>> "oversubscribe=yes" telling slurm the job can run on both logical
>>>>> cores available.
>>>>
>>>> How about setting oversubscribe=FORCE:2? That way, users need not add a
>>>> setting in their scripts.
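
For reference, the FORCE:2 setup I described in my previous mail (quoted further up) corresponds roughly to the slurm.conf lines below, minus the Boards=1 that I have now dropped per the suggestion at the top of this mail. The partition name is a placeholder, not my actual partition; the topology and memory numbers are the ones slurmd -C reports further down the thread. Treat this as a sketch, not a copy of my file:

   NodeName=shavak-DIT400TR-55L CPUs=64 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=95311 Gres=gpu:1
   PartitionName=compute Nodes=shavak-DIT400TR-55L OverSubscribe=FORCE:2 MaxTime=INFINITE State=UP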
>>>>> In the days of Knights Landing, each core could handle four logical
>>>>> cores, but I don't believe there are any current AMD or Intel
>>>>> processors supporting more than two logical cores (hyperthreads) per
>>>>> core. The conversation about hyperthreads is difficult, as the Intel
>>>>> terminology is logical cores for hyperthreading and cores for physical
>>>>> cores, but the tendency is to call the logical cores threads or
>>>>> hyperthreaded cores. This can be very confusing for consumers of the
>>>>> resources.
>>>>>
>>>>> In any case, if you create an array job of 1-100 sleep jobs, my
>>>>> simplest logical test job, then you can use scontrol show node
>>>>> <nodename> to see the node's resource configuration as well as its
>>>>> consumption. squeue -w <nodename> -i 10 will iterate every ten seconds
>>>>> to show you the node chomping through the job.
>>>>>
>>>>> Hope this helps. Once you are comfortable, I would urge you to use the
>>>>> NodeName/Partition descriptor format above and encourage your users to
>>>>> declare oversubscription in their jobs. It is a little more work up
>>>>> front but far easier than correcting scripts later.
>>>>>
>>>>> Doug
>>>>>
>>>>> On Thu, Feb 23, 2023 at 9:41 PM Analabha Roy <hariseldo...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Howdy, and thanks for the warm welcome,
>>>>>>
>>>>>> On Fri, 24 Feb 2023 at 07:31, Doug Meyer <dameye...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Did you configure your node definition with the outputs of slurmd
>>>>>>> -C? Ignore boards. Don't know if it is still true, but several years
>>>>>>> ago declaring boards made things difficult.
>>>>>>
>>>>>> $ slurmd -C
>>>>>> NodeName=shavak-DIT400TR-55L CPUs=64 Boards=1 SocketsPerBoard=2
>>>>>> CoresPerSocket=16 ThreadsPerCore=2 RealMemory=95311
>>>>>> UpTime=0-00:47:51
>>>>>> $ grep NodeName /etc/slurm-llnl/slurm.conf
>>>>>> NodeName=shavak-DIT400TR-55L CPUs=64 RealMemory=95311 Gres=gpu:1
>>>>>>
>>>>>> There is a difference. I, too, discarded the Boards and sockets in
>>>>>> slurm.conf. Is that the problem?
>>>>>>
>>>>>>> Also, if you have hyperthreaded AMD or Intel processors, your
>>>>>>> partition declaration should be oversubscribe:2.
>>>>>>
>>>>>> Yes, I do. It's actually 16 x 2 cores with hyperthreading, but the
>>>>>> BIOS is set to show them as 64 cores.
>>>>>>
>>>>>>> Start with a very simple job with a script containing sleep 100 or
>>>>>>> something else without any runtime issues.
>>>>>>
>>>>>> I ran this MPI hello world thing
>>>>>> <https://github.com/hariseldon99/buparamshavak/blob/main/shavak_root/usr/local/share/examples/mpi_runs_inf/mpi_count.c>
>>>>>> with this sbatch script:
>>>>>> <https://github.com/hariseldon99/buparamshavak/blob/main/shavak_root/usr/local/share/examples/mpi_runs_inf/mpi_count_normal.sbatch>
>>>>>> Should be the same thing as your suggestion, basically.
>>>>>> Should I switch to 'srun' in the batch file?
>>>>>>
>>>>>> AR
>>>>>>
>>>>>>> When I started with slurm I built the sbatch one small step at a
>>>>>>> time: nodes, cores, memory, partition, mail, etc.
>>>>>>>
>>>>>>> It sounds like your config is very close, but your problem may be in
>>>>>>> the submit script.
>>>>>>>
>>>>>>> Best of luck and welcome to slurm. It is very powerful with a huge
>>>>>>> community.
>>>>>>>
>>>>>>> Doug
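
The sleep-array sanity check Doug describes above is easy to spell out; a minimal version would be something like the sketch below. The job name and time limit are placeholders; the 1-100 range and sleep 100 are straight from his suggestion:

   #!/bin/bash
   #SBATCH -J sleeptest          # placeholder job name
   #SBATCH --array=1-100         # 100 independent array tasks
   #SBATCH -n 1                  # each task asks for a single core
   #SBATCH --cpus-per-task=1
   #SBATCH --time=00:10:00       # placeholder time limit
   sleep 100

and then, from another shell, watch the node work through the array:

   scontrol show node shavak-DIT400TR-55L
   squeue -w shavak-DIT400TR-55L -i 10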
>>>>>>> On Thu, Feb 23, 2023 at 6:58 AM Analabha Roy <hariseldo...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi folks,
>>>>>>>>
>>>>>>>> I have a single-node "cluster" running Ubuntu 20.04 LTS with the
>>>>>>>> distribution packages for slurm (slurm-wlm 19.05.5).
>>>>>>>> Slurm only ran one job on the node at a time with the default
>>>>>>>> configuration, leaving all other jobs pending.
>>>>>>>> This happened even if that one job only requested a few cores
>>>>>>>> (the node has 64 cores, and slurm.conf is configured accordingly).
>>>>>>>>
>>>>>>>> In slurm.conf, SelectType is set to select/cons_res, and
>>>>>>>> SelectTypeParameters to CR_Core. NodeName is set with CPUs=64. The
>>>>>>>> path to the file is referenced below.
>>>>>>>>
>>>>>>>> So I set OverSubscribe=FORCE in the partition config and restarted
>>>>>>>> the daemons.
>>>>>>>>
>>>>>>>> Multiple jobs now run concurrently, but when Slurm is
>>>>>>>> oversubscribed, it is *truly* *oversubscribed*. That is to say, it
>>>>>>>> runs so many jobs that there are more processes running than
>>>>>>>> cores/threads.
>>>>>>>> How should I configure slurm so that it runs multiple jobs at once
>>>>>>>> per node, but ensures that it doesn't run more processes than there
>>>>>>>> are cores?
>>>>>>>> Is there some TRES magic for this that I can't seem to figure out?
>>>>>>>>
>>>>>>>> My slurm.conf is here on github:
>>>>>>>> https://github.com/hariseldon99/buparamshavak/blob/main/shavak_root/etc/slurm-llnl/slurm.conf
>>>>>>>> The only gres I've set is for the GPU:
>>>>>>>> https://github.com/hariseldon99/buparamshavak/blob/main/shavak_root/etc/slurm-llnl/gres.conf
>>>>>>>>
>>>>>>>> Thanks for your attention,
>>>>>>>> Regards,
>>>>>>>> AR
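
P.S. In case anyone wants to reproduce the 100-versus-60 comparison above without running scontrol show job once per jobid, something like the following should do it (standard squeue/scontrol options; the node name is mine):

   squeue -h -w shavak-DIT400TR-55L -o "%i %C"
   scontrol show node shavak-DIT400TR-55L | grep -E "CfgTRES|AllocTRES"

The first command prints the jobid and the CPU count each running job claims; summing that column is what I would expect to match the AllocTRES=cpu= figure from the second, which is exactly what is not happening here.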
--
Analabha Roy
Assistant Professor
Department of Physics <http://www.buruniv.ac.in/academics/department/physics>
The University of Burdwan <http://www.buruniv.ac.in/>
Golapbag Campus, Barddhaman 713104
West Bengal, India
Emails: dan...@utexas.edu, a...@phys.buruniv.ac.in, hariseldo...@gmail.com
Webpage: http://www.ph.utexas.edu/~daneel/