[slurm-users] When running `salloc --gres=gpu:1` should I see all GPUs in nvidia-smi?
Hello,

When I run this command:

    salloc --nodelist=gpu03 -p A4500_Features --gres=gpu:1

and then automatically ssh to the job's node, what should I see when I run nvidia-smi? All the GPUs in the host, or just the one I requested?

Thanks
[slurm-users] Re: When running `salloc --gres=gpu:1` should I see all GPUs in nvidia-smi?
Hi James,

I am sort of the admin and am trying to understand what the goal should be.

Thanks Roberto, I'll have a look at ConstrainDevices <https://slurm.schedmd.com/cgroup.conf.html#OPT_ConstrainDevices>.

On Mon, 5 Aug 2024 at 18:14, Roberto Polverelli Monti via slurm-users <slurm-users@lists.schedmd.com> wrote:
> Hello Oren,
>
> On 8/5/24 3:20 PM, Oren via slurm-users wrote:
> > When I am running this command:
> > `salloc --nodelist=gpu03 -p A4500_Features --gres=gpu:1`
> > and then automatically ssh to the job, what should I see when I run
> > nvidia-smi? All the GPUs in the host or just a single one?
>
> That should depend on the ConstrainDevices parameter. [1] You can
> quickly verify this with:
>
>     $ scontrol show conf | grep Constr
>
> 1. https://slurm.schedmd.com/cgroup.conf.html#OPT_ConstrainDevices
>
> Best,
>
> --
> Roberto Polverelli Monti
> HPC Engineer
> Do IT Now | https://doit-now.tech/
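For reference, whether the allocation sees only its own GPU comes down to cgroup device confinement. Below is a minimal sketch of the configuration that restricts device visibility; exact settings vary by site and Slurm version, so treat it as illustrative rather than a drop-in answer:

    # slurm.conf -- run tasks under the cgroup plugins
    ProctrackType=proctrack/cgroup
    TaskPlugin=task/cgroup,task/affinity

    # cgroup.conf
    ConstrainDevices=yes

With ConstrainDevices=yes, nvidia-smi run inside `salloc --gres=gpu:1` should show a single GPU; with it unset or set to "no", all of the node's GPUs stay visible even though only one is allocated. Note also that if you reach the node via ssh rather than srun, the session only inherits the job's device cgroup when something like pam_slurm_adopt places it inside the job.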
[slurm-users] Re: How can I make sure my user has only one job per node (Job array --exclusive=user)
Thanks, nice workaround.
It would be great if there were a way to set this directly, so that a user's jobs are limited to one per node, a bit like --exclusive.
Thanks

On Tue, 3 Dec 2024 at 16:24, Renfro, Michael wrote:
> I’ve never done this myself, but others probably have. At the end of [1],
> there’s an example of making a generic resource for bandwidth. You could
> set that to any convenient units (bytes/second or bits/second, most
> likely), and assign your nodes a certain amount. Then any network-intensive
> job could reserve all the node’s bandwidth, without locking other
> less-intensive jobs off the node. It’s identical to reserving 1 or more
> GPUs per node, just without any hardware permissions.
>
> [1] https://slurm.schedmd.com/gres.conf.html#SECTION_EXAMPLES
>
> [...]
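To make the suggested workaround concrete, here is a rough sketch of the bandwidth GRES approach, following the examples in [1]. The node names, counts, and hardware values are made up for illustration and untested:

    # slurm.conf -- declare the GRES and add it to the existing node definitions
    GresTypes=bandwidth
    NodeName=node[01-20] CPUs=16 RealMemory=64000 Gres=bandwidth:1G State=UNKNOWN

    # gres.conf on each node -- CountOnly marks a bookkeeping resource with no device file
    Name=bandwidth Count=1G Flags=CountOnly

    # in the array job script -- each task claims the node's full bandwidth count,
    # so Slurm will place at most one such task per node
    #SBATCH --gres=bandwidth:1G

Jobs that do not request the bandwidth GRES can still share those nodes, which is what makes this gentler than plain --exclusive.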
[slurm-users] Re: How can I make sure my user has only one job per node (Job array --exclusive=user)
Thanks, but yeah, I do not want to use `--exclusive`; I just want the node to be exclusive for me.
Thanks

On Tue, 3 Dec 2024 at 16:40, Renfro, Michael wrote:
> As Thomas had mentioned earlier in the thread, there is --exclusive with
> no extra additions. But that’d prevent **every** other job from running
> on that node, which unless this is a cluster for you and you alone, sounds
> like wasting 90% of the resources. I’d be most perturbed at a user doing
> that here without some astoundingly good reasons.
>
> [...]
[slurm-users] How can I make sure my user has only one job per node (Job array --exclusive=user)
Hi,

I have a cluster of 20 nodes, and I want to run a job array on that cluster, but I want each node to get only one job.

When I do the following:

    #!/bin/bash
    #SBATCH --job-name=process_images_train   # Job name
    #SBATCH --time=50:00:00                   # Time limit hrs:min:sec
    #SBATCH --tasks=1
    #SBATCH --cpus-per-task=4
    #SBATCH --mem=5
    #SBATCH --array=0-19                      # Job array with 20 jobs (0 to 19)

I get 10 jobs on node #1 and 10 jobs on node #20; I want a job on each node.

I've tried:

    #SBATCH --exclusive=user

Also:

    #SBATCH --spread-job
    #SBATCH --distribution=cyclic

Nothing changes: node #1 got 10 jobs and node #2 got 10 jobs.

Thanks
[slurm-users] Re: How can I make sure my user has only one job per node (Job array --exclusive=user)
Thank you Michael,

Yeah, you guessed right: networking.
My jobs are mostly I/O (networking) intensive. My nodes connect to the network via a non-blocking switch, but the Ethernet cards are not the best, so I don't need many CPUs per node. I do, however, want to run on all nodes to fully utilize the network connection that each node has.

Assuming I don't want to change the scheduler, is there anything else I can do?

Thanks,
Oren

On Tue, 3 Dec 2024 at 15:10, Renfro, Michael wrote:
> I’ll start with the question of “why spread the jobs out more than
> required?” and move on to why the other items didn’t work:
>
>   1. exclusive only ensures that others’ jobs don’t run on a node with
>      your jobs, and does nothing about other jobs you own.
>   2. spread-job distributes the work of one job across multiple nodes,
>      but does nothing about multiple jobs.
>   3. distribution also distributes the work of one job.
>
> You might get something similar to what you want by changing the scheduler
> to use CR_LLN instead of CR_Core_Memory (or whatever you’re using), but
> that’ll potentially have serious side effects for others’ jobs.
>
> So back to the original question: why **not** pack 20 jobs onto fewer
> nodes if those nodes have the capacity to run the full set of jobs? You
> shouldn’t have a constraint with memory or CPUs. Are you trying to spread
> out an I/O load somehow? Networking?
>
> [...]
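For completeness, here is a minimal sketch of the scheduler-side change mentioned above. It affects placement for every user, so it is shown only for reference; the partition name and node list are hypothetical:

    # slurm.conf -- add CR_LLN so jobs go to the least-loaded node first
    SelectType=select/cons_tres
    SelectTypeParameters=CR_Core_Memory,CR_LLN

    # or, to limit the behaviour to a single partition:
    PartitionName=netjobs Nodes=node[01-20] LLN=yes State=UP

With least-loaded-node selection, 20 array tasks submitted to an otherwise idle 20-node cluster should land one per node instead of packing onto the first nodes that have free cores.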
[slurm-users] Slurm sends status emails with no details
Hi,

We set up a Slurm system with email notification; this is in our slurm.conf:

    MailProg=/usr/sbin/sendmail

But the email that I get has no status, just an empty message: no subject, no info. What are we missing?

Thanks
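No answer appears in this excerpt, but a likely cause is the choice of MailProg: Slurm calls it with a mail(1)-style command line (subject after -s, recipient as an argument, body on stdin), whereas sendmail expects the headers inside the message itself, so the subject and details are dropped. The usual fix is to point MailProg at /usr/bin/mail (mailx), or at a small wrapper like the sketch below; the wrapper path and header handling are assumptions, not a tested site configuration:

    #!/bin/bash
    # /usr/local/sbin/slurm-mail-wrapper (hypothetical path)
    # Translate Slurm's mail(1)-style invocation, e.g.
    #   slurm-mail-wrapper -s "Slurm Job_id=123 Ended" user@example.com
    # into a proper message with headers that sendmail can deliver.
    subject=""
    if [ "$1" = "-s" ]; then
        subject="$2"
        shift 2
    fi
    recipient="$*"
    {
        echo "To: ${recipient}"
        echo "Subject: ${subject}"
        echo ""
        cat    # pass through whatever body Slurm writes on stdin (often empty)
    } | /usr/sbin/sendmail -t

Then set MailProg=/usr/local/sbin/slurm-mail-wrapper in slurm.conf and run scontrol reconfigure.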