Hi Juergen,

Thanks for the guidance.
>> is PrivateData also set in your slurmdbd.conf?

No, it is not set in slurmdbd.conf. I will set it and verify; a rough sketch of what I intend to add is below.

Thanks
Hemanta
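For reference, this is roughly what I plan to put in slurmdbd.conf, simply mirroring the categories already set in slurm.conf. Whether every one of these flags is meaningful to slurmdbd is my assumption and still needs to be checked against the slurmdbd.conf man page:

    # /etc/slurm/slurmdbd.conf (sketch only, to be verified)
    # Hide accounting data (including usage/billing) from regular users,
    # matching the PrivateData categories already used in slurm.conf.
    PrivateData=accounts,jobs,usage,users,events

After that I would restart slurmdbd, and probably slurmctld as well so that the cached association/QOS data is refreshed (the need to restart slurmctld is my assumption).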
On Fri, Aug 20, 2021 at 2:02 PM <slurm-users-requ...@lists.schedmd.com> wrote:

> Today's Topics:
>
>    1. Re: PrivateData does not filter the billing info "scontrol
>       show assoc_mgr flags=qos" (Juergen Salk)
>    2. Preemption not working for jobs in higher priority partition
>       (Russell Jones)
>    3. GPU jobs not running correctly (Andrey Malyutin)
>    4. Re: GPU jobs not running correctly (Fulcomer, Samuel)
>    5. jobs stuck in "CG" state (Durai Arasan)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Thu, 19 Aug 2021 22:51:57 +0200
> From: Juergen Salk <juergen.s...@uni-ulm.de>
> To: Slurm User Community List <slurm-users@lists.schedmd.com>
> Subject: Re: [slurm-users] PrivateData does not filter the billing
>         info "scontrol show assoc_mgr flags=qos"
>
> Hi Hemanta,
>
> is PrivateData also set in your slurmdbd.conf?
>
> Best regards
> Juergen
>
>
> * Hemanta Sahu <hemantaku.s...@gmail.com> [210818 15:04]:
> > I am still searching for a solution for this.
> >
> > On Fri, Aug 7, 2020 at 1:15 PM Hemanta Sahu <hemantaku.s...@gmail.com>
> > wrote:
> >
> > > Hi All,
> > >
> > > I have configured the "PrivateData" parameter in "slurm.conf" on our
> > > test cluster as below:
> > >
> > > [testuser1@centos7vm01 ~]$ cat /etc/slurm/slurm.conf | less
> > > PrivateData=accounts,jobs,reservations,usage,users,events,partitions,nodes
> > > MCSPlugin=mcs/user
> > > MCSParameters=enforced,select,privatedata
> > >
> > > The command "scontrol show assoc_mgr flags=Association" filters the
> > > relevant information for the user, but "scontrol show assoc_mgr
> > > flags=qos" does not filter anything; rather, it shows the information
> > > about all QOS to normal users who do not even have Slurm Operator or
> > > Slurm Administrator privileges. Basically, I want to hide the billing
> > > details from users who are not coordinators of a particular account.
> > >
> > > Appreciate any help or guidance.
> > >
> > > [testuser1@centos7vm01 ~]$ scontrol show assoc_mgr flags=qos | egrep "QOS|GrpTRESMins"
> > > QOS Records
> > > QOS=normal(1)
> > > GrpTRESMins=cpu=N(0),mem=N(78),energy=N(0),node=N(0),billing=N(0),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0)
> > > QOS=testfac1(7)
> > > GrpTRESMins=cpu=N(0),mem=N(143),energy=N(0),node=N(0),billing=6000000(0),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0)
> > > QOS=cdac_fac1(10)
> > > GrpTRESMins=cpu=N(10),mem=N(163830),energy=N(0),node=N(4),billing=10000000(11),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0)
> > > QOS=iitkgp_fac1(11)
> > > GrpTRESMins=cpu=N(0),mem=N(20899),energy=N(0),node=N(0),billing=10000000(0),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0)
> > > QOS=iitkgp_faculty(13)
> > > GrpTRESMins=cpu=N(92),mem=N(379873),energy=N(0),node=N(35),billing=N(175),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0)
> > >
> > > [testuser1@centos7vm01 ~]$ scontrol show assoc_mgr flags=Association | grep GrpTRESMins
> > > GrpTRESMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),billing=N(0),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0)
> > > [testuser1@centos7vm01 ~]$
> > >
> > > Regards,
> > > Hemanta
> > >
> > > Hemanta Kumar Sahu
> > > Senior System Engineer
> > > CCDS, JC Bose Annexe
> > > Phone: 03222-304604 / Ext: 84604
> > > I I T Kharagpur-721302
> > > E-Mail: hks...@iitkgp.ac.in
> > >         hemantaku.s...@gmail.com
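As a side note, the same per-QOS limits that "scontrol show assoc_mgr flags=qos" prints can also be pulled straight from the accounting database; a small sketch for comparison (run with operator/administrator or coordinator rights, and assuming a reasonably recent sacctmgr):

    # List each QOS with its TRES-minute limits, including the billing TRES
    sacctmgr show qos format=Name,GrpTRESMins%60

Comparing what a regular user sees from this command before and after setting PrivateData in slurmdbd.conf should show whether the filtering is actually coming from slurmdbd.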
>
> ------------------------------
>
> Message: 2
> Date: Thu, 19 Aug 2021 16:49:05 -0500
> From: Russell Jones <arjone...@gmail.com>
> To: Slurm User Community List <slurm-users@lists.schedmd.com>
> Subject: [slurm-users] Preemption not working for jobs in higher
>         priority partition
>
> Hi all,
>
> I could use some help understanding why preemption is not working properly
> for me. I have a job blocking other jobs in a way that doesn't make sense to
> me. Any assistance is appreciated, thank you!
>
> I have two partitions defined in Slurm, a day-time and a night-time
> partition:
>
> Day partition - PriorityTier of 5, always Up. Limited resources under this
> QOS.
> Night partition - PriorityTier of 5 during night time; during day time it is
> set to Down and its PriorityTier is changed to 1. Jobs can be submitted to
> the night queue under an unlimited QOS as long as resources are available.
>
> The thought here is that jobs can continue to run in the night partition,
> even during the day time, until resources are requested from the day
> partition. Jobs in the night partition would then be requeued/canceled to
> satisfy those requirements.
>
> Current output of "scontrol show part":
>
> PartitionName=day
>    AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
>    AllocNodes=ALL Default=NO QoS=part_day
>    DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
>    MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
>    Nodes=cluster-r1n[01-13],cluster-r2n[01-08]
>    PriorityJobFactor=1 PriorityTier=5 RootOnly=NO ReqResv=NO OverSubscribe=NO
>    OverTimeLimit=NONE PreemptMode=REQUEUE
>    State=UP TotalCPUs=336 TotalNodes=21 SelectTypeParameters=NONE
>    JobDefaults=(null)
>    DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
>
> PartitionName=night
>    AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
>    AllocNodes=ALL Default=NO QoS=part_night
>    DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
>    MaxNodes=22 MaxTime=7-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
>    Nodes=cluster-r1n[01-13],cluster-r2n[01-08]
>    PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
>    OverTimeLimit=NONE PreemptMode=REQUEUE
>    State=DOWN TotalCPUs=336 TotalNodes=21 SelectTypeParameters=NONE
>    JobDefaults=(null)
>    DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
>
> I currently have a job in the night partition that is blocking jobs in the
> day partition, even though the day partition has a PriorityTier of 5 and the
> night partition is Down with a PriorityTier of 1.
>
> My current slurm.conf preemption settings are:
>
> PreemptMode=REQUEUE
> PreemptType=preempt/partition_prio
>
> The blocking job's "scontrol show job" output is:
>
> JobId=105713 JobName=jobname
>    Priority=1986 Nice=0 Account=xxx QOS=normal
>    JobState=RUNNING Reason=None Dependency=(null)
>    Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
>    RunTime=17:49:39 TimeLimit=7-00:00:00 TimeMin=N/A
>    SubmitTime=2021-08-18T22:36:36 EligibleTime=2021-08-18T22:36:36
>    AccrueTime=2021-08-18T22:36:36
>    StartTime=2021-08-18T22:36:39 EndTime=2021-08-25T22:36:39 Deadline=N/A
>    PreemptEligibleTime=2021-08-18T22:36:39 PreemptTime=None
>    SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-08-18T22:36:39
>    Partition=night AllocNode:Sid=cluster-1:1341505
>    ReqNodeList=(null) ExcNodeList=(null)
>    NodeList=cluster-r1n[12-13],cluster-r2n[04-06]
>    BatchHost=cluster-r1n12
>    NumNodes=5 NumCPUs=80 NumTasks=5 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>    TRES=cpu=80,node=5,billing=80,gres/gpu=20
>    Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>    MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
>    Features=(null) DelayBoot=00:00:00
>    OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
>
> The job that is being blocked:
>
> JobId=105876 JobName=bash
>    Priority=2103 Nice=0 Account=xxx QOS=normal
>    JobState=PENDING
>    Reason=Nodes_required_for_job_are_DOWN,_DRAINED_or_reserved_for_jobs_in_higher_priority_partitions
>    Dependency=(null)
>    Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
>    RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A
>    SubmitTime=2021-08-19T16:19:23 EligibleTime=2021-08-19T16:19:23
>    AccrueTime=2021-08-19T16:19:23
>    StartTime=Unknown EndTime=Unknown Deadline=N/A
>    SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-08-19T16:26:43
>    Partition=day AllocNode:Sid=cluster-1:2776451
>    ReqNodeList=(null) ExcNodeList=(null)
>    NodeList=(null)
>    NumNodes=3 NumCPUs=40 NumTasks=40 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>    TRES=cpu=40,node=1,billing=40
>    Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>    MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
>    Features=(null) DelayBoot=00:00:00
>    OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
>
> Why is the day job not preempting the night job?
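For orientation, partition-priority preemption is usually wired up along the following lines. This is a generic sketch with hypothetical values, not the actual slurm.conf behind the scontrol output above:

    # slurm.conf (sketch): preemption driven by partition PriorityTier
    PreemptType=preempt/partition_prio
    PreemptMode=REQUEUE

    # Two partitions overlapping on the same nodes. Jobs in the partition
    # with the higher PriorityTier may preempt jobs in the lower one;
    # the per-partition PreemptMode says what happens to preempted jobs.
    PartitionName=day   Nodes=cluster-r1n[01-13],cluster-r2n[01-08] PriorityTier=5 PreemptMode=OFF     State=UP
    PartitionName=night Nodes=cluster-r1n[01-13],cluster-r2n[01-08] PriorityTier=1 PreemptMode=REQUEUE State=UP

Whether flipping the night partition to State=Down during the day interacts badly with preemption is exactly the open question in this message; the sketch only shows the pieces that have to agree (PreemptType, per-partition PreemptMode, and PriorityTier).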
>
> ------------------------------
>
> Message: 3
> Date: Thu, 19 Aug 2021 17:35:29 -0700
> From: Andrey Malyutin <malyuti...@gmail.com>
> To: slurm-users@lists.schedmd.com
> Subject: [slurm-users] GPU jobs not running correctly
>
> Hello,
>
> We are in the process of finishing up the setup of a cluster with 3 nodes,
> 4 GPUs each. One node has RTX3090s and the other 2 have RTX6000s. Any job
> asking for 1 GPU in the submission script will wait to run on the 3090
> node, regardless of resource availability. The same job requesting 2 or more
> GPUs will run on any node. I don't even know where to begin troubleshooting
> this issue; entries for the 3 nodes are effectively identical in slurm.conf.
> Any help would be appreciated. (If helpful - this cluster is used for
> structural biology, with the cryosparc and relion packages.)
>
> Thank you,
> Andrey
>
> ------------------------------
>
> Message: 4
> Date: Thu, 19 Aug 2021 21:05:28 -0400
> From: "Fulcomer, Samuel" <samuel_fulco...@brown.edu>
> To: Slurm User Community List <slurm-users@lists.schedmd.com>
> Subject: Re: [slurm-users] GPU jobs not running correctly
>
> What SLURM version are you running?
>
> What are the #SBATCH directives in the batch script? (or the sbatch
> arguments)
>
> When the single GPU jobs are pending, what's the output of 'scontrol show
> job JOBID'?
>
> What are the node definitions in slurm.conf, and the lines in gres.conf?
>
> Are the nodes all the same host platform (motherboard)?
>
> We have P100s, TitanVs, Titan RTXs, Quadro RTX 6000s, 3090s, V100s, DGX 1s,
> A6000s, and A40s, with a mix of single and dual-root platforms, and haven't
> seen this problem with SLURM 20.02.6 or earlier versions.
>
> On Thu, Aug 19, 2021 at 8:38 PM Andrey Malyutin <malyuti...@gmail.com>
> wrote:
>
> > Hello,
> >
> > We are in the process of finishing up the setup of a cluster with 3 nodes,
> > 4 GPUs each. One node has RTX3090s and the other 2 have RTX6000s. Any job
> > asking for 1 GPU in the submission script will wait to run on the 3090
> > node, regardless of resource availability. The same job requesting 2 or
> > more GPUs will run on any node. I don't even know where to begin
> > troubleshooting this issue; entries for the 3 nodes are effectively
> > identical in slurm.conf. Any help would be appreciated. (If helpful - this
> > cluster is used for structural biology, with the cryosparc and relion
> > packages.)
> >
> > Thank you,
> > Andrey
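For context, consistent definitions for a mixed-GPU cluster normally look something like the following. The node names, counts and memory figures here are hypothetical, not Andrey's actual configuration, which is what Samuel is asking to see:

    # slurm.conf (sketch; hypothetical node names and sizes)
    NodeName=gpu01      Gres=gpu:rtx3090:4 CPUs=32 RealMemory=192000
    NodeName=gpu[02-03] Gres=gpu:rtx6000:4 CPUs=32 RealMemory=192000

    # gres.conf on the RTX3090 node
    Name=gpu Type=rtx3090 File=/dev/nvidia[0-3]

    # gres.conf on each RTX6000 node
    Name=gpu Type=rtx6000 File=/dev/nvidia[0-3]

If jobs request a typed GPU (for example --gres=gpu:rtx3090:1), or if a type string sneaks in through a partition or QOS default or through an application's own submission template, single-GPU jobs can end up pinned to one node type, so the type names above are worth comparing against the actual sbatch arguments.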
>
> ------------------------------
>
> Message: 5
> Date: Fri, 20 Aug 2021 10:31:40 +0200
> From: Durai Arasan <arasan.du...@gmail.com>
> To: Slurm User Community List <slurm-users@lists.schedmd.com>
> Subject: [slurm-users] jobs stuck in "CG" state
>
> Hello!
>
> We have a huge number of jobs stuck in CG state from a user who probably
> wrote code with bad I/O. "scancel" does not make them go away. Is there a
> way for admins to get rid of these jobs without draining and rebooting the
> nodes? I read somewhere that killing the respective slurmstepd process will
> do the job. Is this possible? Any other solutions? Also, are there any
> parameters in slurm.conf one can set to manage such situations better?
>
> Best,
> Durai
> MPI Tübingen
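On the last question, the slurm.conf knobs usually mentioned for completing/unkillable jobs are sketched below; the values and the script path are illustrative assumptions, not recommendations from this thread:

    # slurm.conf (sketch; illustrative values)
    # How long slurmd waits for a job's processes to die before declaring
    # the step unkillable (the default is on the order of 60 seconds).
    UnkillableStepTimeout=180

    # Optional script to run when a step is declared unkillable, e.g. to
    # notify admins or capture debugging information (hypothetical path).
    UnkillableStepProgram=/usr/local/sbin/notify_unkillable.sh

On an affected node, the step daemon of a stuck job usually shows up in the process list as "slurmstepd: [<jobid>.<stepid>]", so something like pgrep -af "slurmstepd: \[<jobid>" can locate it; killing it by hand is a last resort rather than a routine procedure.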