[slurm-users] Re: srun weirdness
Looks more like a runtime environment issue. Check the binaries:

ldd /mnt/local/ollama/ollama

on both clusters and compare the output; it may give some hints.

Best,

Feng

On Tue, May 14, 2024 at 2:41 PM Dj Merrill via slurm-users wrote:
>
> I'm running into a strange issue and I'm hoping another set of brains
> looking at this might help. I would appreciate any feedback.
>
> I have two Slurm clusters. The first cluster is running Slurm 21.08.8
> on Rocky Linux 8.9 machines. The second cluster is running Slurm
> 23.11.6 on Rocky Linux 9.4 machines.
>
> This works perfectly fine on the first cluster:
>
> $ srun --mem=32G --pty /bin/bash
> srun: job 93911 queued and waiting for resources
> srun: job 93911 has been allocated resources
>
> and on the resulting shell on the compute node:
>
> $ /mnt/local/ollama/ollama help
>
> and the ollama help message appears as expected.
>
> However, on the second cluster:
>
> $ srun --mem=32G --pty /bin/bash
> srun: job 3 queued and waiting for resources
> srun: job 3 has been allocated resources
>
> and on the resulting shell on the compute node:
>
> $ /mnt/local/ollama/ollama help
> fatal error: failed to reserve page summary memory
> runtime stack:
> runtime.throw({0x1240c66?, 0x154fa39a1008?})
>         runtime/panic.go:1023 +0x5c fp=0x7ffe6be32648 sp=0x7ffe6be32618 pc=0x4605dc
> runtime.(*pageAlloc).sysInit(0x127b47e8, 0xf8?)
>         runtime/mpagealloc_64bit.go:81 +0x11c fp=0x7ffe6be326b8 sp=0x7ffe6be32648 pc=0x456b7c
> runtime.(*pageAlloc).init(0x127b47e8, 0x127b47e0, 0x128d88f8, 0x0)
>         runtime/mpagealloc.go:320 +0x85 fp=0x7ffe6be326e8 sp=0x7ffe6be326b8 pc=0x454565
> runtime.(*mheap).init(0x127b47e0)
>         runtime/mheap.go:769 +0x165 fp=0x7ffe6be32720 sp=0x7ffe6be326e8 pc=0x451885
> runtime.mallocinit()
>         runtime/malloc.go:454 +0xd7 fp=0x7ffe6be32758 sp=0x7ffe6be32720 pc=0x434f97
> runtime.schedinit()
>         runtime/proc.go:785 +0xb7 fp=0x7ffe6be327d0 sp=0x7ffe6be32758 pc=0x464397
> runtime.rt0_go()
>         runtime/asm_amd64.s:349 +0x11c fp=0x7ffe6be327d8 sp=0x7ffe6be327d0 pc=0x49421c
>
> If I ssh directly to the same node on that second cluster (skipping
> Slurm entirely) and run the same "/mnt/local/ollama/ollama help"
> command, it works perfectly fine.
>
> My first thought was that it might be related to cgroups. I switched
> the second cluster from cgroups v2 to v1 and tried again; no
> difference. I tried disabling cgroups on the second cluster by removing
> all cgroup references in the slurm.conf file, but that also made no
> difference.
>
> My guess is that something changed with regard to srun between these two
> Slurm versions, but I'm not sure what.
>
> Any thoughts on what might be happening and/or a way to get this to work
> on the second cluster? Essentially I need a way to request an
> interactive shell through Slurm that is associated with the requested
> resources. Should we be using something other than srun for this?
>
> Thank you,
>
> -Dj
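A quick sketch of how to act on that suggestion: a Go binary such as ollama reserves a large virtual address range at startup, so besides the linked libraries it is worth diffing the process limits and the environment between the srun-obtained shell and a plain ssh shell on the same node. The temporary file names below are arbitrary.

# Run in both the srun shell and the ssh shell, then compare:
ulimit -a > /tmp/limits.$$        # look especially at "virtual memory" (ulimit -v)
cat /proc/self/limits             # the kernel's view of the same limits
env | sort > /tmp/env.$$          # diff the two environment dumps afterwards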
[slurm-users] Re: srun weirdness
Not sure, very strange, though the two linux-vdso.so.1 lines do look different:

[deej@moose66 ~]$ ldd /mnt/local/ollama/ollama
        linux-vdso.so.1 (0x7ffde81ee000)

[deej@moose66 ~]$ ldd /mnt/local/ollama/ollama
        linux-vdso.so.1 (0x7fffa66ff000)

Best,

Feng

On Tue, May 14, 2024 at 3:43 PM Dj Merrill via slurm-users wrote:
>
> Hi Feng,
> Thank you for replying.
>
> It is the same binary on the same machine that fails.
>
> If I ssh to a compute node on the second cluster, it works fine.
>
> It fails when running in an interactive shell obtained with srun on that
> same compute node.
>
> I agree that it seems like a runtime environment difference between the
> SSH shell and the srun-obtained shell.
>
> This is the ldd from within the srun-obtained shell (and it gives the error
> when run):
>
> [deej@moose66 ~]$ ldd /mnt/local/ollama/ollama
>         linux-vdso.so.1 (0x7ffde81ee000)
>         libresolv.so.2 => /lib64/libresolv.so.2 (0x154f732cc000)
>         libpthread.so.0 => /lib64/libpthread.so.0 (0x154f732c7000)
>         libstdc++.so.6 => /lib64/libstdc++.so.6 (0x154f7300)
>         librt.so.1 => /lib64/librt.so.1 (0x154f732c2000)
>         libdl.so.2 => /lib64/libdl.so.2 (0x154f732bb000)
>         libm.so.6 => /lib64/libm.so.6 (0x154f72f25000)
>         libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x154f732a)
>         libc.so.6 => /lib64/libc.so.6 (0x154f72c0)
>         /lib64/ld-linux-x86-64.so.2 (0x154f732f8000)
>
> This is the ldd from the same exact node within an SSH shell, which runs
> fine:
>
> [deej@moose66 ~]$ ldd /mnt/local/ollama/ollama
>         linux-vdso.so.1 (0x7fffa66ff000)
>         libresolv.so.2 => /lib64/libresolv.so.2 (0x14a9d82da000)
>         libpthread.so.0 => /lib64/libpthread.so.0 (0x14a9d82d5000)
>         libstdc++.so.6 => /lib64/libstdc++.so.6 (0x14a9d800)
>         librt.so.1 => /lib64/librt.so.1 (0x14a9d82d)
>         libdl.so.2 => /lib64/libdl.so.2 (0x14a9d82c9000)
>         libm.so.6 => /lib64/libm.so.6 (0x14a9d7f25000)
>         libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x14a9d82ae000)
>         libc.so.6 => /lib64/libc.so.6 (0x14a9d7c0)
>         /lib64/ld-linux-x86-64.so.2 (0x14a9d8306000)
>
> -Dj
>
> On 5/14/24 15:25, Feng Zhang via slurm-users wrote:
> > Looks more like a runtime environment issue. Check the binaries:
> >
> > ldd /mnt/local/ollama/ollama
> >
> > on both clusters and compare the output; it may give some hints.
> >
> > Best,
> >
> > Feng
[slurm-users] Re: srun weirdness
Do you have any container settings configured?

On Tue, May 14, 2024 at 3:57 PM Feng Zhang wrote:
>
> Not sure, very strange, though the two linux-vdso.so.1 lines do look different:
>
> [deej@moose66 ~]$ ldd /mnt/local/ollama/ollama
>         linux-vdso.so.1 (0x7ffde81ee000)
>
> [deej@moose66 ~]$ ldd /mnt/local/ollama/ollama
>         linux-vdso.so.1 (0x7fffa66ff000)
>
> Best,
>
> Feng
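Another hedged angle for the same symptom, given that ssh works but srun does not: by default srun propagates the submitting shell's resource limits into the job step, so it can be worth checking what the step actually runs under and how limit propagation is configured. The slurm.conf path below assumes the standard location.

srun --mem=32G cat /proc/self/limits        # limits inside an allocation
grep -i propagate /etc/slurm/slurm.conf     # PropagateResourceLimits / ...Except settings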
[slurm-users] MaxRSS reported by sacct is wrong
Hi All,

I am having trouble calculating the real RSS memory usage of a certain kind of user job, for which sacct returns wrong numbers.

Rocky Linux release 8.5, Slurm 21.08

(slurm.conf)
ProctrackType=proctrack/cgroup
JobAcctGatherType=jobacct_gather/linux

The troubling jobs look like this:

1. Python spawns 96 threads via multithreading;
2. each thread uses scikit-learn, which again spawns 96 threads using OpenMP.

That obviously overruns the node, and I want to address it. The node has 300GB of RAM, but sacct (and seff) reports 1.2TB MaxRSS (and AveRSS), which does not look correct. I suspect that Slurm with jobacct_gather/linux repeatedly sums up the memory used by all of these threads, counting the same memory many times. For the OpenMP part, maybe Slurm handles it fine, while for Python multithreading maybe the memory accounting does not work well. If that is the case, would the real usage be roughly 1.2TB/96 = 12.5GB MaxRSS?

I want to get the right MaxRSS to report to users. Thanks!

Best,

Feng
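A hedged way to cross-check these numbers, assuming your Slurm build ships the cgroup accounting plugin: have accounting read memory from the job's cgroup instead of summing per-process RSS from /proc, then compare what sacct reports for the same workload. The cgroup path below is the usual v1 layout and may differ on your nodes.

# slurm.conf (all nodes, then restart slurmd/slurmctld)
JobAcctGatherType=jobacct_gather/cgroup

# After rerunning the test job:
sacct -j <jobid> --format=JobID,JobName,Elapsed,MaxRSS,AveRSS,MaxRSSTask

# While the job is still running, the cgroup's own peak counter on the node:
cat /sys/fs/cgroup/memory/slurm/uid_<uid>/job_<jobid>/memory.max_usage_in_bytes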
[slurm-users] Re: Can Not Use A Single GPU for Multiple Jobs
Yes, the algorithm works like that: 1 CPU (core) per job (task). As someone mentioned already, you need to enable oversubscription of the CPU cores in slurm.conf, e.g. OverSubscribe=FORCE:10 on the partition, meaning up to 10 jobs per core in your case.

Best,

Feng

On Fri, Jun 21, 2024 at 6:52 AM Arnuld via slurm-users wrote:
>
> > Every job will need at least 1 core just to run
> > and if there are only 4 cores on the machine,
> > one would expect a max of 4 jobs to run.
>
> I have 3500+ GPU cores available. You mean each GPU job requires at least one
> CPU? Can't we run a job with just GPU without any CPUs? This sbatch script
> requires 100 GPU cores; can't we run 35 in parallel?
>
> #! /usr/bin/env bash
>
> #SBATCH --output="%j.out"
> #SBATCH --error="%j.error"
> #SBATCH --partition=pgpu
> #SBATCH --gres=shard:100
>
> sleep 10
> echo "Current date and time: $(date +"%Y-%m-%d %H:%M:%S")"
> echo "Running..."
> sleep 10
>
> On Thu, Jun 20, 2024 at 11:23 PM Brian Andrus via slurm-users wrote:
>>
>> Well, if I am reading this right, it makes sense.
>>
>> Every job will need at least 1 core just to run, and if there are only 4
>> cores on the machine, one would expect a max of 4 jobs to run.
>>
>> Brian Andrus
>>
>> On 6/20/2024 5:24 AM, Arnuld via slurm-users wrote:
>> > I have a machine with a quad-core CPU and an Nvidia GPU with 3500+
>> > cores. I want to run around 10 jobs in parallel on the GPU (mostly
>> > CUDA-based jobs).
>> >
>> > PROBLEM: Each job asks for only 100 shards (and usually runs for a minute
>> > or so), so I should be able to run 3500/100 = 35 jobs in
>> > parallel, but Slurm runs only 4 jobs in parallel, keeping the rest in
>> > the queue.
>> >
>> > I have this in slurm.conf and gres.conf:
>> >
>> > # GPU
>> > GresTypes=gpu,shard
>> > # COMPUTE NODES
>> > PartitionName=pzero Nodes=ALL Default=YES MaxTime=INFINITE State=UP
>> > PartitionName=pgpu Nodes=hostgpu MaxTime=INFINITE State=UP
>> > NodeName=hostgpu NodeAddr=x.x.x.x Gres=gpu:gtx_1080_ti:1,shard:3500
>> > CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=1
>> > RealMemory=64255 State=UNKNOWN
>> > --
>> > Name=gpu Type=gtx_1080_ti File=/dev/nvidia0 Count=1
>> > Name=shard Count=3500 File=/dev/nvidia0
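A minimal sketch of that slurm.conf change, reusing the partition definition quoted in the original post; adjust the FORCE count to taste and run "scontrol reconfigure" afterwards:

PartitionName=pgpu Nodes=hostgpu MaxTime=INFINITE State=UP OverSubscribe=FORCE:10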
[slurm-users] Re: Print Slurm Stats on Login
You can also check https://github.com/prod-feng/slurm_tools

slurm_job_perf_show.py may be helpful.

I used to use slurm_job_perf_show_email.py to email users a summary of their usage, roughly monthly, but some users seemed to get confused by it, so I stopped.

Best,

Feng

On Fri, Aug 9, 2024 at 11:13 AM Paul Edmon via slurm-users wrote:
>
> We are working to make our users more aware of their usage. One of the
> ideas we came up with was to have some basic usage stats printed at
> login (usage over the past day, fairshare, job efficiency, etc.). Does anyone
> have any scripts or methods that they use to do this? Before baking my
> own I was curious what other sites do and whether they would be willing to
> share their scripts and methodology.
>
> -Paul Edmon-
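A rough sketch of the print-at-login approach, assuming a profile.d script on the login nodes and that sreport/sshare are in the default PATH; the file name and the fields printed are only placeholders:

# /etc/profile.d/slurm-usage.sh (hypothetical name)
# Skip non-interactive shells so scp, rsync, and batch jobs are unaffected.
[ -t 0 ] || return 0

echo "=== Slurm usage for $USER over the past day ==="
sreport -t hours cluster UserUtilizationByAccount user="$USER" \
    start=$(date -d yesterday +%Y-%m-%dT%H:%M:%S) end=$(date +%Y-%m-%dT%H:%M:%S) 2>/dev/null

echo "=== Fairshare ==="
sshare -U -u "$USER" 2>/dev/null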
[slurm-users] Re: /etc/passwd sync?
Keeping /etc/passwd and /etc/group synced to all the nodes should work. You will also need to set up SSH keys for MPI.

Best,

Feng

On Mon, Feb 10, 2025 at 10:29 PM mark.w.moorcroft--- via slurm-users <slurm-users@lists.schedmd.com> wrote:
>
> If you set up Slurm elastic cloud in EC2 without LDAP, what is the
> recommended method for syncing the passwd/group files? Is this necessary to
> get openmpi jobs to run? I would swear I had this working last week without
> synced passwd on two nodes, but thinking about it now I'm not sure how this
> could have worked. My home directories are on an NFS mount, but the user
> accounts don't exist on the node AMI. I'm using Ansible/Packer to manage
> the AMIs. When I ran OpenHPC/Slurm on bare metal there was a sync
> process. This is my first AWS Slurm cluster rodeo. I can't use the Amazon
> Parallel Computing tools because we are forced to be in GovCloud. I started
> with "ClusterInTheCloud", but it's all 4 years old and semi-broken out of
> the box. My manager had me ditch a lot of it (including LDAP), so I'm
> building out a fork that is getting heavily modded for our situation.
>
> An ORTE daemon has unexpectedly failed after launch and before
> communicating back to mpirun. This could be caused by a number
> of factors, including an inability to create a connection back
> to mpirun due to a lack of common network interfaces and/or no
> route found between them. Please check network connectivity
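A minimal sketch of one way to push the account files from the head node, assuming passwordless root SSH to every compute node and a hypothetical node list file; Ansible's copy module or pdcp/pdsh would do the same job within the existing Ansible/Packer setup:

# push-accounts.sh (hypothetical helper run from the head node)
for n in $(cat /etc/slurm/nodes.txt); do
    rsync -a /etc/passwd /etc/group "root@${n}:/etc/"
done
# Add /etc/shadow only if local password authentication on the nodes is required.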
[slurm-users] Re: Using more cores/CPUs that requested with sbatch
Also, in the cgroup.conf file you can add constraints on memory, devices (like GPUs), etc.

Best,

Feng

On Tue, Mar 25, 2025 at 3:20 AM megan4slurm--- via slurm-users <slurm-users@lists.schedmd.com> wrote:
>
> Hello Gestió,
>
> Yes, Slurm can restrict the resources that are available to the job using
> cgroups. I accidentally sent my first reply as a separate email in this
> mailing list, which you can find here:
>
> https://lists.schedmd.com/mailman3/hyperkitty/list/slurm-users@lists.schedmd.com/thread/IJHBUWOU5NPZQK7NYUZODTIZJRLLM3H4/
>
> Sorry about that,
> --Megan
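A minimal cgroup.conf sketch of the constraints being referred to; these are standard options, but which ones make sense depends on the site, and core/memory/device confinement also requires TaskPlugin=task/cgroup in slurm.conf:

# /etc/slurm/cgroup.conf
ConstrainCores=yes        # pin tasks to the cores they were allocated
ConstrainRAMSpace=yes     # enforce the job's memory request
ConstrainSwapSpace=yes    # also cap swap usage
ConstrainDevices=yes      # limit GPUs/devices to those granted via GRES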
[slurm-users] Re: Job running slower when using Slurm
Besides the Slurm options, you might also need to set the OpenMP environment variable:

export OMP_NUM_THREADS=32

(the number of cores, not hardware threads). Also check other similar environment variables if you use any Python libraries.

Best,

Feng

On Wed, Apr 23, 2025 at 3:22 PM Jeffrey Layton via slurm-users <slurm-users@lists.schedmd.com> wrote:
> Roger. It's the code that prints out the threads it sees - I bet it is the
> cgroups. I need to look at how that is configured as well.
>
> For the time, that comes from the code itself. I'm guessing it has a start
> time and an end time in the code and just takes the difference. But again,
> this is something in the code. Unfortunately, the code uses the time to
> compute Mop/s total and Mop/s/thread, so a longer time means slower
> performance.
>
> Thanks!
>
> Jeff
>
> On Wed, Apr 23, 2025 at 2:53 PM Michael DiDomenico via slurm-users <slurm-users@lists.schedmd.com> wrote:
>
>> the program probably says 32 threads because it's just looking at the
>> box, not what slurm cgroups allow (assuming you're using them) for cpu
>>
>> i think for an openmp program (not openmpi) you definitely want the
>> first command with --cpus-per-task=32
>>
>> are you measuring the runtime inside the program or outside it? if
>> the latter, the 10sec addition in time could be the slurm setup/node
>> allocation
>>
>> On Wed, Apr 23, 2025 at 2:41 PM Jeffrey Layton wrote:
>> >
>> > I tried using ntasks and cpus-per-task to get all 32 cores. So I added
>> > --ntasks=# --cpus-per-task=N to the sbatch command so that it now looks
>> > like:
>> >
>> > sbatch --nodes=1 --ntasks=1 --cpus-per-task=32
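A minimal sketch of the kind of job script being discussed, assuming a hypothetical OpenMP binary named ./app; tying OMP_NUM_THREADS to SLURM_CPUS_PER_TASK keeps the thread count in step with whatever the job actually requests:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32

# Match the OpenMP thread count to what Slurm granted this task.
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}

./app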