[slurm-users] when running `salloc --gres=gpu:1` should I see all GPUs in nvidia-smi?

2024-08-05 Thread Oren via slurm-users
Hello,
When I run this command:
`salloc --nodelist=gpu03 -p A4500_Features --gres=gpu:1`
and then ssh to the job's node, what should I see when I run
nvidia-smi: all the GPUs on the host, or just the one I requested?
Thanks

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: when running `salloc --gres=gpu:1` should I see all GPUs in nvidia-smi?

2024-08-05 Thread Oren via slurm-users
Hi James, I am sort of the admin here, and I'm trying to understand what the
goal should be.
Thanks Roberto, I'll have a look at ConstrainDevices
<https://slurm.schedmd.com/cgroup.conf.html#OPT_ConstrainDevices>.

On Mon, 5 Aug 2024 at 18:14, Roberto Polverelli Monti via slurm-users <
slurm-users@lists.schedmd.com> wrote:

> Hello Oren,
>
> On 8/5/24 3:20 PM, Oren via slurm-users wrote:
> > When I run this command:
> > `salloc --nodelist=gpu03 -p A4500_Features --gres=gpu:1`
> > and then ssh to the job's node, what should I see when I run
> > nvidia-smi: all the GPUs on the host, or just the one I requested?
>
> That should depend on the ConstrainDevices parameter. [1]  You can
> quickly verify this with:
>
> $ scontrol show conf | grep Constr
>
> 1. https://slurm.schedmd.com/cgroup.conf.html#OPT_ConstrainDevices
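>
> For reference, a minimal cgroup.conf sketch (illustrative only; the
> exact setup depends on your cgroup plugin and Slurm version):
>
> # cgroup.conf
> ConstrainDevices=yes
>
> With ConstrainDevices=yes, nvidia-smi inside the job should list only
> the GPU(s) allocated to it; with it unset or "no", every GPU on the
> node remains visible.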
>
> Best,
>
> --
> Roberto Polverelli Monti
> HPC Engineer
> Do IT Now | https://doit-now.tech/

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: How can I make sure my user has only one job per node (Job array --exclusive=user)

2024-12-03 Thread Oren via slurm-users
Thanks, nice workaround.
It would be great if there were a way to actually enforce this, so that each
user gets only one job per node, a bit like --exclusive.
Thanks

On Tue, 3 Dec 2024 at 16:24, Renfro, Michael  wrote:

> I’ve never done this myself, but others probably have. At the end of [1],
> there’s an example of making a generic resource for bandwidth. You could
> set that to any convenient units (bytes/second or bits/second, most
> likely), and assign your nodes a certain amount. Then any network-intensive
> job could reserve all the node’s bandwidth, without locking other
> less-intensive jobs off the node. It’s identical to reserving 1 or more
> GPUs per node, just without any hardware permissions.
>
>
>
> [1] https://slurm.schedmd.com/gres.conf.html#SECTION_EXAMPLES
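>
> As a rough sketch (node names and the 1G count are illustrative,
> following the CountOnly examples in [1]):
>
> # gres.conf on each node
> NodeName=node[01-20] Name=bandwidth Count=1G Flags=CountOnly
>
> # slurm.conf
> GresTypes=bandwidth
> NodeName=node[01-20] Gres=bandwidth:1G ...
>
> # job script: claim the node's entire bandwidth
> #SBATCH --gres=bandwidth:1G
>
> A job that grabs all 1G keeps other bandwidth-hungry jobs off that
> node while leaving its CPUs and memory free for everyone else.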
>
>
>
> *From: *Oren 
> *Date: *Tuesday, December 3, 2024 at 3:15 PM
> *To: *Renfro, Michael 
> *Cc: *slurm-us...@schedmd.com 
> *Subject: *Re: [slurm-users] How can I make sure my user has only one
> job per node (Job array --exclusive=user)
>
> Thank you Michael,
>
> Yeah, you guessed right: networking.
> My job is mostly I/O (networking) intensive. My nodes connect to the
> network via a non-blocking switch, but the ethernet cards are not the best,
> so I don't need many CPUs per node. I do want to run on all nodes, though,
> to fully utilize each node's network connection.
>
>
>
> Assuming I don't want to change the scheduler, is there anything else I
> can do?
>
> Thanks,
>
> Oren
>
>
>
> On Tue, 3 Dec 2024 at 15:10, Renfro, Michael  wrote:
>
> I’ll start with the question of “why spread the jobs out more than
> required?” and move on to why the other items didn’t work:
>
>
>
>1. exclusive only ensures that others’ jobs don’t run on a node with
>your jobs, and does nothing about other jobs you own.
>2. spread-job distributes the work of one job across multiple nodes,
>but does nothing about multiple jobs
>3. distribution also distributes the work of one job
>
>
>
> You might get something similar to what you want by changing the scheduler
> to use CR_LLN instead of CR_Core_Memory (or whatever you’re using), but
> that’ll potentially have serious side effects for others’ jobs.
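>
> For concreteness, that's the SelectTypeParameters knob in slurm.conf
> (a sketch; check what your site currently runs before touching it):
>
> SelectType=select/cons_tres
> SelectTypeParameters=CR_LLN
>
> LLN can also be enabled per partition (LLN=yes on the PartitionName
> line), which confines the change to one partition instead of the
> whole cluster.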
>
>
>
> So back to the original question: why **not** pack 20 jobs onto fewer
> nodes if those nodes have the capacity to run the full set of jobs? You
> shouldn't be constrained by memory or CPUs there. Are you trying to spread
> out an I/O load somehow? Networking?

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: How can I make sure my user has only one job per node (Job array --exclusive=user)

2024-12-03 Thread Oren via slurm-users
Thanks, but yeah, I do not want to use `--exclusive`; I just want it to be
exclusive for me.
Thanks

On Tue, 3 Dec 2024 at 16:40, Renfro, Michael  wrote:

> As Thomas had mentioned earlier in the thread, there is --exclusive with
> no extra additions. But that’d prevent **every** other job from running
> on that node, which unless this is a cluster for you and you alone, sounds
> like wasting 90% of the resources. I’d be most perturbed at a user doing
> that here without some astoundingly good reasons.

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] How can I make sure my user has only one job per node (Job array --exclusive=user)

2024-12-03 Thread Oren via slurm-users
Hi,
I have a cluster of 20 nodes, and I want to run a job array on that cluster,
but I want each node to get only one job.

When I do the following:
#!/bin/bash
#SBATCH --job-name=process_images_train   # Job name
#SBATCH --time=50:00:00                   # Time limit hrs:min:sec
#SBATCH --tasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=5
#SBATCH --array=0-19                      # Job array with 20 tasks (0 to 19)

I get 10 jobs on node #1 and 10 jobs on node #2; I want one job on each node.

I've tried:
#SBATCH --exclusive=user
and also:
#SBATCH --spread-job
#SBATCH --distribution=cyclic

Nothing changes: node #1 got 10 jobs and node #2 got 10 jobs.

Thanks

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: How can I make sure my user has only one job per node (Job array --exclusive=user)

2024-12-03 Thread Oren via slurm-users
Thank you Michael,
yeah, you guessed right: networking.
My job is mostly I/O (networking) intensive. My nodes connect to the network
via a non-blocking switch, but the ethernet cards are not the best,
so I don't need many CPUs per node. I do want to run on all nodes, though,
to fully utilize each node's network connection.

Assuming I don't want to change the scheduler, is there anything else I can
do?
Thanks,
Oren


-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Slurm sends email status with no details

2025-04-06 Thread Oren via slurm-users
Hi,
We set up a Slurm system with email notification; this is the relevant line
in slurm.conf:
`MailProg=/usr/sbin/sendmail`

But the email that I get has no status at all, just an empty message:
no subject, no body, no job info. What are we missing?
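
One likely culprit, for what it's worth: Slurm invokes MailProg with
mail(1)-style arguments (roughly `-s "<subject>" <recipients>`), and
sendmail does not understand `-s`, so the message comes out empty. A
minimal wrapper sketch (the path and argument handling here are
assumptions; adapt to your site):

#!/bin/bash
# /usr/local/sbin/slurm_mail.sh -- hypothetical MailProg wrapper.
# Assumes Slurm calls it as: slurm_mail.sh -s "subject" recipient...
subject=""
if [ "$1" = "-s" ]; then
    subject="$2"
    shift 2
fi
recipients="$*"
{
    echo "To: ${recipients// /, }"
    echo "Subject: $subject"
    echo ""
    cat    # append any body Slurm passes on stdin (often there is none)
} | /usr/sbin/sendmail -t

Point MailProg at the wrapper and make it executable. Alternatively, if
mailx is installed, MailProg=/usr/bin/mail (the usual Slurm default)
already understands -s.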
Thanks~

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com