Hello,
In my cluster, every node has one instance of a GRES called ‘io_nic’. The
intent is to make it easy for users to ensure that jobs performing heavy
network I/O do not get scheduled on the same machine at the same time.
$ sinfo -N -o '%N %Gres'
NODELIST GRESres
chhq-supgcmp001
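For reference, the GRES is declared roughly along these lines (illustrative,
not the exact lines from our config):

GresTypes=gpu,io_nic                          # slurm.conf
NodeName=chhq-supgcmp001 Name=io_nic Count=1  # gres.conf (no device file needed)

with io_nic:1 appended to each node's Gres= list in slurm.conf, and users
request it with e.g. 'sbatch --gres=io_nic:1 job.sh'.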
I have a user who submitted an interactive srun job using:
srun --mem-per-gpu 64 --gpus 1 --nodes 1
From sacct for this job we see:
ReqTRES : billing=4,cpu=1,gres/gpu=1,mem=10G,node=1
AllocTRES : billing=4,cpu=1,gres/gpu=1,mem=64M,node=1
(where 10G I assume comes from th
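One thing I noticed while digging: according to the srun man page,
--mem-per-gpu without a unit suffix is taken as megabytes, so "--mem-per-gpu
64" becomes 64M in AllocTRES. If the user actually wanted 64 GB, the request
would need an explicit suffix, e.g.:

srun --mem-per-gpu=64G --gpus=1 --nodes=1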
Hi, Xand:
How does adding "ReqMem" to the sacct change the output?
E.g. on my cluster running Slurm 20.02.7 (on RHEL8), our GPU nodes have
TRESBillingWeights=CPU=0,Mem=0,GRES/gpu=43:
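(If I'm reading the TRESBillingWeights docs right, with those weights and the
default summing behaviour a 1-GPU job bills 0*cpus + 0*mem + 43*1 = 43,
independent of its CPU and memory request.)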
$ sacct --format=JobID%25,State,AllocTRES%50,ReqTRES,ReqMem,ReqCPUS|grep RUNNING
JobID
slurm.conf contains the following:
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
AccountingStorageTRES=gres/gpu
Could this be constraining CfgTRES=cpu=16 somehow?
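For comparison, my understanding is that the cpu count in CfgTRES comes from
the node definition in slurm.conf rather than from the hardware, so a
definition along either of these (purely illustrative) lines would produce
cpu=16 under CR_Core:

NodeName=node020 CPUs=16 RealMemory=257600 Gres=gpu:4
NodeName=node020 Sockets=2 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=257600 Gres=gpu:4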
David Guertin
From: Guertin, David S.
Sent: Wednesday, April 6, 2022 12:27 PM
To: Slurm
No, the user is submitting four jobs, each requesting 1/4 of the memory and 1/4
of the CPUs (i.e. 8 out of 32). But even though there are 32 physical cores,
Slurm only shows 16 as trackable resources:
From scontrol show node node020:
CfgTRES=cpu=16,mem=257600M,billing=16,gres/gpu=4
Why would
Are the jobs getting assigned memory amounts that would only allow 16
processors to be used when the jobs are running on the node?
Jeff
From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of
Guertin, David S.
Sent: Wednesday, April 6, 2022 9:21 AM
To: slurm-users@lists.sc
Thanks. That shows 32 cores, as expected:
# /cm/shared/apps/slurm/19.05.8/sbin/slurmd -C
NodeName=node020 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=16
ThreadsPerCore=1 RealMemory=257600
UpTime=0-22:39:36
But I can't understand why, when users submit jobs, the node only allocates
16.
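The obvious next step seems to be comparing the controller's view of the node
with slurmd -C, something like (read-only commands; the grep may need
adjusting if the node is defined as part of a range):

# scontrol show node node020 | grep -E 'CPUTot|CfgTRES'
# grep -i 'node020' /etc/slurm/slurm.conf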
Hi Sushil,
Try changing NodeName specification to:
NodeName=localhost CPUs=96 State=UNKNOWN Gres=gpu:8
Also:
TaskPlugin=task/cgroup
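For the cgroup side, a minimal cgroup.conf along these lines should keep each
job restricted to the GPUs it was actually allocated (a sketch; adjust to your
setup):

CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes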
Best,
Steve
On Wed, Apr 6, 2022 at 9:56 AM Sushil Mishra
wrote:
> Dear SLURM users,
>
> I am very new to slurm and need some help in configuring slurm on
Hello,
try to comment out the line:
AutoDetect=nvml
And then restart "slurmd" and "slurmctld".
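If nothing else is left in gres.conf after commenting that out, the GPUs have
to be listed by hand, something like (a sketch, assuming the usual
/dev/nvidia0..7 device paths):

Name=gpu File=/dev/nvidia[0-7]

# then, assuming systemd units:
systemctl restart slurmd
systemctl restart slurmctld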
Job allocations to the same GPU might be an effect of automatic MPS
configuration, though I'm not 100% sure:
https://slurm.schedmd.com/gres.html#MPS_Management
Kind Regards
--
Kamil Wilczek
Dear SLURM users,
I am very new to slurm and need some help in configuring slurm on a single-node
machine. This machine has 8x Nvidia GPUs and a 96-core CPU. The vendor has
set up a "LocalQ" but that somehow is running all the calculations on GPU
0. If I submit 4 independent jobs at a time, it starts ru
Thanks, Greg! This looks like the right way to do this. I will have to stop
putting off learning to use spank plugins :)
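If I'm reading the SPANK docs right, wiring it up is mostly a matter of
pointing /etc/slurm/plugstack.conf at the built shared object, along these
lines (path and file name illustrative):

optional /usr/lib64/slurm/private-tmpdir.so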
griznog
On Wed, Apr 6, 2022 at 1:40 AM Greg Wickham
wrote:
> Hi John, Mark,
>
>
>
> We use a spank plugin
> https://gitlab.com/greg.wickham/slurm-spank-private-tmpdir (this w