[slurm-users] Re: Selecting only a subset of GPUs from all available GPUs

2024-02-10 Thread Minulakshmi S via slurm-users
Hi Loris,

I have different kinds of GPUs in the same node, and I believe a feature
applies to a particular node and can't be applied to only a few of the GPUs
connected to a single node.

On Mon, 18 Dec 2023, 12:14 Loris Bennett, 
wrote:

> Hi Minu,
>
> Minulakshmi S  writes:
>
> > I'm submitting jobs to a cluster via the SLURM scheduler. Let's say I
> > have access to 8 GPUs in my cluster, all in the same node, of types
> > A, B, C, D, E, F, G and H. I would like to submit a job that requests
> > the use of GPUs of type A or B or C but NOT of type D/E/F/G/H, so I
> > need some kind of OR logic with the --gres flag.
> >
> > E.g. when I request a GPU of type A, I can do 'sbatch --gres=gpu:TypeA:1';
> > I need to input a subset of the GPUs and let Slurm schedule the job
> > using one of the GPUs from this allowed list.
> >
> > Regards
> > Minu
>
> Assuming the GPUs within a node are all of the same type, could you define
> a feature for each GPU type, assign the features to the appropriate
> nodes, and then run the job with
>
>    --constraint=gpu_TypeA --gres=gpu:1
>
> ?
>
> This is obviously rather clunky, and it would be much nicer if
> multiple GPU types passed to '--gres' were ORed.
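>
> As a minimal sketch (the node names, GPU counts and feature names here
> are assumptions, not your actual configuration), the features would be
> defined per node in slurm.conf and can then be combined with OR ('|')
> in --constraint:
>
>   # slurm.conf: one feature per GPU type, on the nodes carrying that type
>   NodeName=node01 Gres=gpu:TypeA:1 Features=gpu_TypeA
>   NodeName=node02 Gres=gpu:TypeB:1 Features=gpu_TypeB
>   NodeName=node03 Gres=gpu:TypeC:1 Features=gpu_TypeC
>
>   # submit to any node whose feature is gpu_TypeA OR gpu_TypeB OR
>   # gpu_TypeC, requesting one GPU of whatever type that node carries
>   sbatch --constraint="gpu_TypeA|gpu_TypeB|gpu_TypeC" --gres=gpu:1 job.sh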
>
> Cheers,
>
> Loris
>
> --
> Dr. Loris Bennett (Herr/Mr)
> ZEDAT, Freie Universität Berlin
>

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: SLURM GRES reservation not working properly on 24.05.1

2024-10-07 Thread Minulakshmi S via slurm-users
I would appreciate any leads on the query below. Thanks in advance.

On Fri, 20 Sept 2024 at 14:31, Minulakshmi S 
wrote:

> Hello,
>
> Issue 1:
> I am using Slurm version 24.05.1. My cluster has a single slurmd node to
> which I connect multiple GRES devices, with oversubscription enabled.
> Advance reservation of a GRES only takes effect when jobs request it by
> name (tres=gres/gpu:SYSTEM12).
>
> I.e. during the reservation period, if another user submits a job that
> names SYSTEM12, Slurm places the job in the queue as expected:
>
> user1@host$ srun --gres=gpu:SYSTEM12:1 hostname
> srun: job 333 queued and waiting for resources
>
> But when another user submits a job without any system name, the job
> runs on that GRES immediately, even though it is reserved:
>
> user1@host$ srun --gres=gpu:1 hostname
> mylinux.wbi.com
>
> Also, I can see GresUsed marked busy in "scontrol show node -d", which
> means the job is running on the GPU GRES and not just on CPUs.
>
> In the same way, job submission based on a feature ("rev1" in my case)
> also goes through even though the feature is reserved for other users
> in a multi-partition setup.
>
> Snippet of slurm.conf:
> NodeName=cluster01 NodeAddr=cluster Port=6002 CPUs=8 Boards=1
> SocketsPerBoard=1 CoresPerSocket=8 ThreadsPerCore=2 Features="rev1"
> Gres=gpu:SYSTEM12:1 RealMemory=64171 State=IDLE
>
> Issue 2:
>
> During execution, srun emits some extra error messages in its output:
>
> user1@host$ srun --gres=gpu:1 hostname
> srun: error: extract_net_cred: net_cred not provided
> srun: error: Malformed RPC of type RESPONSE_NODE_ALIAS_ADDRS(3017) received
> srun: error: slurm_unpack_received_msg: [[mylinux.wbi.com]:41242] Header lengths are longer than data received
> mylinux.wbi.com
>
> Regards,
> MS
>

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] SLURM GRES reservation not working properly on 24.05.1

2024-09-20 Thread Minulakshmi S via slurm-users
Hello,

Issue 1:
I am using Slurm version 24.05.1. My cluster has a single slurmd node to
which I connect multiple GRES devices, with oversubscription enabled.
Advance reservation of a GRES only takes effect when jobs request it by
name (tres=gres/gpu:SYSTEM12).


I.e. during the reservation period, if another user submits a job that
names SYSTEM12, Slurm places the job in the queue as expected:

user1@host$ srun --gres=gpu:SYSTEM12:1 hostname
srun: job 333 queued and waiting for resources

But when another user submits a job without any system name, the job
runs on that GRES immediately, even though it is reserved:

user1@host$ srun --gres=gpu:1 hostname
mylinux.wbi.com
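
For reference, the reservation was created along these lines (a minimal
sketch; the reservation name, user and duration are assumptions, not my
exact command):

# reserve the named GPU GRES on the node for user2
scontrol create reservation ReservationName=gpu_res Users=user2 \
    Nodes=cluster01 StartTime=now Duration=120 \
    TRES=gres/gpu:SYSTEM12=1

# confirm the reservation and its TRES
scontrol show reservation gpu_res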


Also, I can see GresUsed marked busy in "scontrol show node -d", which
means the job is running on the GPU GRES and not just on CPUs.

In the same way, job submission based on a feature ("rev1" in my case)
also goes through even though the feature is reserved for other users
in a multi-partition setup.

Snippet of slurm.conf:
NodeName=cluster01 NodeAddr=cluster Port=6002 CPUs=8 Boards=1
SocketsPerBoard=1 CoresPerSocket=8 ThreadsPerCore=2 Features="rev1"
Gres=gpu:SYSTEM12:1 RealMemory=64171 State=IDLE
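
The feature-based reservation was set up roughly as follows (again a
sketch; the reservation name, user and node count are assumptions):

# reserve one node carrying feature rev1 for user2
scontrol create reservation ReservationName=rev1_res Users=user2 \
    NodeCnt=1 Features=rev1 StartTime=now Duration=120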

Issue 2:

During execution, srun emits some extra error messages in its output:

user1@host$ srun --gres=gpu:1 hostname
srun: error: extract_net_cred: net_cred not provided
srun: error: Malformed RPC of type RESPONSE_NODE_ALIAS_ADDRS(3017) received
srun: error: slurm_unpack_received_msg: [[mylinux.wbi.com]:41242] Header lengths are longer than data received
mylinux.wbi.com
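
For completeness, one thing worth ruling out here is a version mismatch
between srun and the daemons, since malformed-RPC errors can accompany
mixed versions; a quick check (a sketch of the commands, not output from
my system):

# client version
srun --version

# version the controller reports in its running configuration
scontrol show config | grep -i version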

Regards,
MS

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com

