[slurm-users] Re: [External] Re: First setup of slurm with a GPU node

2024-11-13 Thread Henk Meij via slurm-users
Yes, I have noticed this change in behavior too, since v22 (I am testing v24 now).

Gres definitions that appear only in gres.conf are ignored; they must also be declared in slurm.conf.

My gres.conf file now contains only:

NodeName=n[79-90] AutoDetect=nvml
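
For readers unfamiliar with the bracket notation: n[79-90] is a Slurm hostlist expression covering nodes n79 through n90. On a real system, "scontrol show hostnames n[79-90]" prints the expansion. The Python sketch below is only an illustration; it handles a single bracket group with plain numeric ranges, and expand_hostlist is a hypothetical helper name, not part of Slurm.

```python
import re


def expand_hostlist(expr):
    """Expand a simple hostlist such as 'n[79-90]' into individual node names.

    Assumption: at most one [...] group containing comma-separated numbers
    or lo-hi ranges. Real Slurm hostlists allow more forms than this.
    """
    m = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", expr)
    if not m:
        return [expr]  # no bracket group: treat as a single host name
    prefix, body, suffix = m.groups()
    hosts = []
    for part in body.split(","):
        lo, _, hi = part.partition("-")
        hi = hi or lo              # a bare number is a one-element range
        width = len(lo)            # preserve zero padding such as [01-04]
        for i in range(int(lo), int(hi) + 1):
            hosts.append(f"{prefix}{i:0{width}d}{suffix}")
    return hosts


if __name__ == "__main__":
    print(expand_hostlist("n[79-90]"))
```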

-Henk

From: Benjamin Smith via slurm-users 
Sent: Wednesday, November 13, 2024 11:31 AM
To: Slurm User Community List 
Subject: [External] [slurm-users] Re: First setup of slurm with a GPU node


Hi Patrick,

You're missing a Gres= on your node in your slurm.conf:

Nodename=tenibre-gpu-0 RealMemory=257270 Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 State=UNKNOWN Gres=gpu:A100-40:1,gpu:A100-80:1
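
To make the structure of the Gres= field concrete, here is a minimal Python sketch that splits it into (name, type, count) entries. It is an illustration only, not how Slurm parses the file: parse_node_gres is a hypothetical helper, and it assumes plain integer counts (no suffixes such as "4G" and no socket annotations).

```python
import re
from collections import Counter


def parse_node_gres(node_line):
    """Return a Counter {(name, type): count} from the Gres= field of a node line.

    Assumption: entries look like 'gpu', 'gpu:2', 'gpu:A100-40' or
    'gpu:A100-40:1', with plain integer counts.
    """
    counts = Counter()
    match = re.search(r"\bGres=(\S+)", node_line, flags=re.IGNORECASE)
    if not match:
        return counts  # no Gres= on the node line: nothing is declared
    for entry in match.group(1).split(","):
        parts = entry.split(":")
        name = parts[0]
        if len(parts) == 1:                 # "gpu"
            gres_type, count = None, 1
        elif len(parts) == 2:               # "gpu:2" or "gpu:A100-40"
            if parts[1].isdigit():
                gres_type, count = None, int(parts[1])
            else:
                gres_type, count = parts[1], 1
        else:                               # "gpu:A100-40:1"
            gres_type, count = parts[1], int(parts[2])
        counts[(name, gres_type)] += count
    return counts


if __name__ == "__main__":
    line = ("Nodename=tenibre-gpu-0 RealMemory=257270 Sockets=2 "
            "CoresPerSocket=16 ThreadsPerCore=1 State=UNKNOWN "
            "Gres=gpu:A100-40:1,gpu:A100-80:1")
    print(dict(parse_node_gres(line)))
```

A node line without Gres= yields an empty result, which is the situation the original configuration was in.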

Ben

On 13/11/2024 16:00, Patrick Begou via slurm-users wrote:
On 13/11/2024 at 15:45, Roberto Polverelli Monti via slurm-users wrote:
Hello Patrick,

On 11/13/24 12:01 PM, Patrick Begou via slurm-users wrote:
As use of this GPU resource is increasing, I would like to manage it with 
Gres to avoid usage conflicts. But at this time my setup does not work, as I can 
reach a GPU without reserving it:

srun -n 1 -p tenibre-gpu ./a.out

can use a GPU even if the reservation does not specify this resource (checked 
by running nvidia-smi on the node). "tenibre-gpu" is a Slurm partition with 
only this GPU node.

I think what you're looking for is the ConstrainDevices parameter in 
cgroup.conf.

See here:
- https://slurm.schedmd.com/archive/slurm-20.11.7/cgroup.conf.html
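
ConstrainDevices is a yes/no setting in cgroup.conf and is off unless set. As a rough illustration, the Python sketch below reads Key=Value lines and reports whether the setting is enabled. The helper names and the sample file contents are hypothetical; the cgroup.conf actually used on the cluster is not shown in this thread.

```python
def read_conf(text):
    """Parse 'Key=Value' lines into a dict with lower-cased keys.

    Everything after '#' on a line is treated as a comment.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if "=" in line:
            key, value = line.split("=", 1)
            settings[key.strip().lower()] = value.strip()
    return settings


def devices_constrained(cgroup_conf_text):
    """True if ConstrainDevices is set to yes (it defaults to no)."""
    value = read_conf(cgroup_conf_text).get("constraindevices", "no")
    return value.lower() == "yes"


if __name__ == "__main__":
    # Hypothetical cgroup.conf contents, for illustration only.
    sample = (
        "ConstrainCores=yes\n"
        "ConstrainRAMSpace=yes\n"
        "ConstrainDevices=yes  # restrict jobs to their allocated devices\n"
    )
    print(devices_constrained(sample))
```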

Best,


Hi Roberto,

Thanks for pointing to this parameter. I set it, updated all the nodes, and 
restarted slurmd everywhere, but it does not change the behavior.
However, when looking at the slurmd log on the GPU node I noticed this 
information:


[2024-11-13T16:41:08.434] debug:  CPUs:32 Boards:1 Sockets:8 CoresPerSocket:4 ThreadsPerCore:1
[2024-11-13T16:41:08.434] debug:  gres/gpu: init: loaded
[2024-11-13T16:41:08.434] WARNING: A line in gres.conf for GRES gpu:A100-40 has 1 more configured than expected in slurm.conf. Ignoring extra GRES.
[2024-11-13T16:41:08.434] WARNING: A line in gres.conf for GRES gpu:A100-80 has 1 more configured than expected in slurm.conf. Ignoring extra GRES.
[2024-11-13T16:41:08.434] debug:  gpu/generic: init: init: GPU Generic plugin loaded
[2024-11-13T16:41:08.434] topology/none: init: topology NONE plugin loaded
[2024-11-13T16:41:08.434] route/default: init: route default plugin loaded
[2024-11-13T16:41:08.434] CPU frequency setting not configured for this node
[2024-11-13T16:41:08.434] debug:  Resource spec: No specialized cores configured by default on this node
[2024-11-13T16:41:08.434] debug:  Resource spec: Reserved system memory limit not configured for this node
[2024-11-13T16:41:08.434] debug:  Reading cgroup.conf file /etc/slurm/cgroup.conf
[2024-11-13T16:41:08.434] error: MaxSwapPercent value (0.0%) is not a valid number
[2024-11-13T16:41:08.436] debug:  task/cgroup: init: core enforcement enabled
[2024-11-13T16:41:08.437] debug:  task/cgroup: task_cgroup_memory_init: task/cgroup/memory: total:257281M allowed:100%(enforced), swap:0%(enforced), max:100%(257281M) max+swap:100%(514562M) min:30M kmem:100%(257281M permissive) min:30M swappiness:0(unset)
[2024-11-13T16:41:08.437] debug:  task/cgroup: init: memory enforcement enabled
[2024-11-13T16:41:08.438] debug:  task/cgroup: task_cgroup_devices_init: unable to open /etc/slurm/cgroup_allowed_devices_file.conf: No such file or directory
[2024-11-13T16:41:08.438] debug:  task/cgroup: init: device enforcement enabled
[2024-11-13T16:41:08.438] debug:  task/cgroup: init: task/cgroup: loaded
[2024-11-13T16:41:08.438] debug:  auth/munge: init: Munge authentication plugin loaded


So I think something is wrong in my gres.conf file, maybe because I am trying 
to configure 2 different devices on the node?

## GPU setup on tenibre-gpu-0
NodeName=tenibre-gpu-0 Name=gpu Type=A100-40 File=/dev/nvidia0 Flags=nvidia_gpu_env
NodeName=tenibre-gpu-0 Name=gpu Type=A100-80 File=/dev/nvidia1 Flags=nvidia_gpu_env
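
The "1 more configured than expected in slurm.conf" warnings in the log come from a comparison between what gres.conf declares and what the node line in slurm.conf declares. A minimal Python sketch of that kind of comparison follows. It only imitates the idea, not Slurm's actual implementation: count_gres_conf and extra_gres are hypothetical helpers, and the sketch assumes each gres.conf line with File= names exactly one device (no bracket ranges).

```python
import re
from collections import Counter


def count_gres_conf(lines, node):
    """Count devices per (Name, Type) declared for `node` in gres.conf lines.

    Assumption: one File= path per line means one device.
    """
    counts = Counter()
    for raw in lines:
        line = raw.split("#", 1)[0].strip()   # drop comments
        if not line:
            continue
        fields = {k.lower(): v for k, v in re.findall(r"(\w+)=(\S+)", line)}
        if fields.get("nodename") != node or "file" not in fields:
            continue
        counts[(fields.get("name"), fields.get("type"))] += 1
    return counts


def extra_gres(gres_conf_counts, slurm_conf_counts):
    """Return {(name, type): surplus} for GRES present in gres.conf
    beyond what slurm.conf declares for the node."""
    surplus = {}
    for key, found in gres_conf_counts.items():
        expected = slurm_conf_counts.get(key, 0)
        if found > expected:
            surplus[key] = found - expected
    return surplus


if __name__ == "__main__":
    gres_lines = [
        "## GPU setup on tenibre-gpu-0",
        "NodeName=tenibre-gpu-0 Name=gpu Type=A100-40 File=/dev/nvidia0 Flags=nvidia_gpu_env",
        "NodeName=tenibre-gpu-0 Name=gpu Type=A100-80 File=/dev/nvidia1 Flags=nvidia_gpu_env",
    ]
    found = count_gres_conf(gres_lines, "tenibre-gpu-0")
    # With no Gres= on the node line, slurm.conf declares nothing:
    print(extra_gres(found, {}))
```

With an empty slurm.conf declaration, each GPU type shows a surplus of 1, which matches the two warnings; once the node line declares one GPU of each type, the surplus is empty.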

Patrick






--
Benjamin Smith 
Computing Officer, AT-7.12a
Research and Teaching Unit
School of Informatics, University of Edinburgh


-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: [External] Re: InvalidAccount

2024-11-12 Thread Henk Meij via slurm-users
Ole, I had not made that connection yet ... the "required" part. It could be 
documented a bit more clearly, if true.

Small institutions like ours are not interested in managing Slurm accounts and 
projects. It is also weird that the job Reason changes from InvalidAccount to 
None within minutes but the job is not released, while sinfo reports that the 
partition is available to Group 'all'.

-Henk

From: Ole Holm Nielsen via slurm-users 
Sent: Tuesday, November 12, 2024 2:23 AM
To: slurm-users@lists.schedmd.com 
Subject: [External] [slurm-users] Re: InvalidAccount

On 11/11/24 21:39, Ole Holm Nielsen wrote:
> Hi Henk,
>
> On 11-11-2024 20:06, hmeij--- via slurm-users wrote:
> > Manual compilation of 24.05.4. slurmctld and slurmd run on same server.
> > All works ok but all test jobs end up pending with InvalidAccount message.
> > I do not use slurm database and have not enabled accounting. Can not find
> > an answer for this behavior or a misconfiguration. slurm.conf file was
> > generated using easy config tool. Any ideas how to fix this? Thx,
>
> In the slurm.conf manual page for 24.05 the accounting options are listed:
>
>> AccountingStorageType
>> The accounting storage mechanism type. Acceptable values at present
>> "accounting_storage/slurmdbd".  The "accounting_storage/slurmdbd" value
>> indicates that accounting records will be written to the Slurm DBD,
>> which manages an underlying MySQL database. See "man slurmdbd"  for
>> more  information.
>> When this is not set it indicates that account records are not maintained.
>
> In other words, the use of slurmdbd seems to be *required* as of Slurm
> 24.05!  The use of AccountingStorageType=accounting_storage/none seems to
> be deprecated, but I can't offhand find this to be documented.  Can anyone
> else help?

It seems that accounting_storage/none was removed (deprecated) from 23.11,
but it was still documented until 23.02:

https://github.com/SchedMD/slurm/blob/slurm-23.02/doc/man/man5/slurm.conf.5#L210

Note that in 22.05 the "accounting_storage/none" still implied that
account records would not work, as you have experienced when getting
InvalidAccount messages:

> The default value is "accounting_storage/none" and indicates that account
> records are not maintained.

IHTH,
Ole




[slurm-users] Re: [External] Re: InvalidAccount

2024-11-18 Thread Henk Meij via slurm-users
Ole, you wrote

"Perhaps you may find some usable Slurm setup guidance in this Wiki page:
https://wiki.fysik.dtu.dk/Niflheim_system/
"

Just want to put it out there: that documentation is awesome, with a high 
level of detail. Thanks! Our test slurmdbd is up.

-Henk