Hi Hermann,
Count doesn't make a difference, but I noticed that when I reconfigure
Slurm and reload afterwards, the error "gpu count lower than
configured" no longer appears. So maybe a reconfigure is simply needed
after reloading slurmctld, or maybe the error is no longer shown
because the node is still invalid. However, I still get the error:
error: _slurm_rpc_node_registration node=NName: Invalid argument
If I understand correctly, this is telling me that there's something
wrong with my slurm.conf. I know that all pre-existing parameters are
correct, so I assume it must be the gpus entry, but I don't see where
it's wrong:
NodeName=NName SocketsPerBoard=8 CoresPerSocket=1 RealMemory=8000
Gres=gpu:1 State=CLOUD # bibiserv
Thanks for all the help,
Xaver
On 19.07.23 15:04, Hermann Schwärzler wrote:
Hi Xaver,
I think you are missing the "Count=..." part in gres.conf.
It should read
NodeName=NName Name=gpu File=/dev/tty0 Count=1
in your case.
Regards,
Hermann
On 7/19/23 14:19, Xaver Stiensmeier wrote:
Okay,
thanks to S. Zhang I was able to figure out why nothing changed.
While I did restart slurmctld at the beginning of my tests, I didn't
do so later because it felt unnecessary. But it is right there in the
fourth line of the log that a restart is needed. Somehow I misread it
and thought slurmctld had been restarted automatically.
Given the setup:
slurm.conf
...
GresTypes=gpu
NodeName=NName SocketsPerBoard=8 CoresPerSocket=1 RealMemory=8000
GRES=gpu:1 State=UNKNOWN
...
gres.conf
NodeName=NName Name=gpu File=/dev/tty0
When restarting, I get the following error:
error: Setting node NName state to INVAL with reason:gres/gpu count
reported lower than configured (0 < 1)
So it is still not working, but at least I get a more helpful log
message. Because I know that this /dev/tty trick works, I am still
unsure where the current error lies, but I will try to investigate it
further. I am thankful for any ideas in that regard.
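
One way to cross-check the two files is to compare the GPU count that slurm.conf declares for a node with the count that gres.conf provides. The following Python sketch is only an illustration: `parse_line`, `configured_gpus` and `declared_gpus` are hypothetical helper names, the config text is passed in as strings rather than read from the real files, and the logic does not reproduce how slurmd actually counts devices (in particular, it never checks whether the File= path exists on the node).

```python
def parse_line(line):
    """Split a Slurm config line into a dict of lower-cased keys and values."""
    line = line.split("#", 1)[0]  # drop trailing comments
    pairs = (tok.split("=", 1) for tok in line.split() if "=" in tok)
    return {key.lower(): value for key, value in pairs}


def configured_gpus(slurm_conf, node):
    """GPU count declared for `node` via Gres=gpu[:type][:N] in slurm.conf text."""
    for raw in slurm_conf.splitlines():
        fields = parse_line(raw)
        if fields.get("nodename") != node:
            continue
        for gres in fields.get("gres", "").split(","):
            parts = gres.split(":")
            if parts[0] == "gpu":
                # "gpu" alone counts as 1; "gpu:2" or "gpu:tesla:2" end in the count
                if len(parts) > 1 and parts[-1].isdigit():
                    return int(parts[-1])
                return 1
    return 0


def declared_gpus(gres_conf, node):
    """Sum of gpu entries for `node` in gres.conf text.

    Simplification (assumption): an entry without Count= is counted as 1.
    """
    total = 0
    for raw in gres_conf.splitlines():
        fields = parse_line(raw)
        if fields.get("name") != "gpu":
            continue
        if fields.get("nodename", node) != node:
            continue
        total += int(fields.get("count", 1))
    return total


slurm_conf = (
    "GresTypes=gpu\n"
    "NodeName=NName SocketsPerBoard=8 CoresPerSocket=1 RealMemory=8000 "
    "GRES=gpu:1 State=UNKNOWN\n"
)
gres_conf = "NodeName=NName Name=gpu File=/dev/tty0\n"

want = configured_gpus(slurm_conf, "NName")
have = declared_gpus(gres_conf, "NName")
print(f"slurm.conf wants {want}, gres.conf declares {have}")
```

If both numbers agree, as they do for the setup above, the text of the two files is at least consistent with each other, and the mismatch would have to come from somewhere else.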
Best regards,
Xaver
On 19.07.23 10:23, Xaver Stiensmeier wrote:
Alright,
I tried a few more things, but I still wasn't able to get past:
srun: error: Unable to allocate resources: Invalid generic resource
(gres) specification.
I should mention that the node I am trying to test GPU scheduling with
doesn't really have a GPU, but Rob was kind enough to point out that
you do not need one as long as you link to a file in /dev/ in
gres.conf. As mentioned, this is just for testing purposes. In the
end we will run this on a node with a GPU, but it is not available
at the moment.
*The error isn't changing*
If I omit "GresTypes=gpu" and "Gres=gpu:1", I still get the same
error.
*Debug Info*
I added the gpu debug flag and logged the following:
[2023-07-18T14:59:45.026] restoring original state of nodes
[2023-07-18T14:59:45.026] select/cons_tres: part_data_create_array:
select/cons_tres: preparing for 2 partitions
[2023-07-18T14:59:45.026] error: GresPlugins changed from (null) to
gpu ignored
[2023-07-18T14:59:45.026] error: Restart the slurmctld daemon to
change GresPlugins
[2023-07-18T14:59:45.026] read_slurm_conf: backup_controller not
specified
[2023-07-18T14:59:45.026] error: GresPlugins changed from (null) to
gpu ignored
[2023-07-18T14:59:45.026] error: Restart the slurmctld daemon to
change GresPlugins
[2023-07-18T14:59:45.026] select/cons_tres: select_p_reconfigure:
select/cons_tres: reconfigure
[2023-07-18T14:59:45.027] select/cons_tres: part_data_create_array:
select/cons_tres: preparing for 2 partitions
[2023-07-18T14:59:45.027] No parameter for mcs plugin, default
values set
[2023-07-18T14:59:45.027] mcs: MCSParameters = (null). ondemand set.
[2023-07-18T14:59:45.028] _slurm_rpc_reconfigure_controller:
completed usec=5898
[2023-07-18T14:59:45.952]
SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
I am a bit unsure what to do next to further investigate this issue.
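
Because hints are easy to overlook in output like this, a small filter that shows only the error entries can help. The Python sketch below is an illustration under stated assumptions: `error_entries` is a hypothetical helper, and it assumes that every log entry starts with a bracketed timestamp as in the excerpt above, so that wrapped continuation lines can be joined back onto their entry.

```python
import re

# Matches the leading "[2023-07-18T14:59:45.026]" style timestamp.
TIMESTAMP = re.compile(r"^\[\d{4}-\d{2}-\d{2}T[\d:.]+\]")


def error_entries(log_text):
    """Return the distinct 'error:' messages from a slurmctld-style log excerpt."""
    entries = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if TIMESTAMP.match(line) or not entries:
            entries.append(line)        # a new log entry starts here
        else:
            entries[-1] += " " + line   # wrapped continuation of the previous entry

    seen, errors = set(), []
    for entry in entries:
        message = TIMESTAMP.sub("", entry).strip()
        if message.startswith("error:") and message not in seen:
            seen.add(message)
            errors.append(message)
    return errors


sample = """\
[2023-07-18T14:59:45.026] restoring original state of nodes
[2023-07-18T14:59:45.026] error: GresPlugins changed from (null) to
gpu ignored
[2023-07-18T14:59:45.026] error: Restart the slurmctld daemon to
change GresPlugins
[2023-07-18T14:59:45.026] read_slurm_conf: backup_controller not
specified
[2023-07-18T14:59:45.026] error: GresPlugins changed from (null) to
gpu ignored
"""

for message in error_entries(sample):
    print(message)
```

Duplicate messages are printed only once, which keeps the repeated GresPlugins entries from hiding the remaining error lines.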
Best regards,
Xaver
On 17.07.23 15:57, Groner, Rob wrote:
That would certainly do it. If you look at the slurmctld log when
it comes up, it will say that it's marking that node as invalid
because it has fewer (0) gres resources than you say it should
have. That's because slurmd on that node will come up and say
"What gres resources??"
For testing purposes, you can just create a dummy file on the
node, then in gres.conf, point to that file as the "graphics file"
interface. As long as you don't try to actually use it as a
graphics file, that should be enough for that node to think it has
gres/gpu resources. That's what I do in my vagrant slurm cluster.
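
As a rough sketch of that setup, the following Python snippet creates an empty placeholder file and builds a gres.conf line that points at it. The helper name `dummy_gpu_entry` and the file name `fake_gpu0` are made up for illustration, and the temporary directory is only there to keep the example self-contained; the `NodeName=... Name=gpu File=...` form is the one used elsewhere in this thread.

```python
from pathlib import Path
import tempfile


def dummy_gpu_entry(node, device_path):
    """Create an empty placeholder file and return a gres.conf line pointing at it."""
    path = Path(device_path)
    path.touch(exist_ok=True)  # an empty file is enough; it is never used as a device
    return f"NodeName={node} Name=gpu File={path}"


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical location; on a real node the file would live somewhere persistent.
    line = dummy_gpu_entry("NName", Path(tmp) / "fake_gpu0")
    print(line)
```

The printed line is what would go into gres.conf on that node, with the path adjusted to wherever the placeholder file really lives.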
Rob
------------------------------------------------------------------------
*From:* slurm-users <slurm-users-boun...@lists.schedmd.com> on
behalf of Xaver Stiensmeier <xaverstiensme...@gmx.de>
*Sent:* Monday, July 17, 2023 9:43 AM
*To:* slurm-users@lists.schedmd.com <slurm-users@lists.schedmd.com>
*Subject:* Re: [slurm-users] GRES and GPUs
Hi Hermann,
Good idea, but we are already using `SelectType=select/cons_tres`.
After setting everything up again (in case I made an unnoticed mistake),
I saw that the node got marked STATE=inval.
To be honest, I thought I could just claim that a node has a GPU even if
it doesn't have one, just for testing purposes. Could this be the
issue?
Best regards,
Xaver Stiensmeier
On 17.07.23 14:11, Hermann Schwärzler wrote:
> Hi Xaver,
>
> what kind of SelectType are you using in your slurm.conf?
>
> Per https://slurm.schedmd.com/gres.html you have to consider:
> "As for the --gpu* option, these options are only supported by Slurm's
> select/cons_tres plugin."
>
> So you can use "--gpus ..." only when you state
> SelectType = select/cons_tres
> in your slurm.conf.
>
> But "--gres=gpu:1" should always work.
>
> Regards
> Hermann
>
>
> On 7/17/23 13:43, Xaver Stiensmeier wrote:
>> Hey,
>>
>> I am currently trying to understand how I can schedule a job that
>> needs a GPU.
>>
>> I read about GRES (https://slurm.schedmd.com/gres.html) and tried to use:
>>
>> GresTypes=gpu
>> NodeName=test Gres=gpu:1
>>
>> But calling - after a 'sudo scontrol reconfigure':
>>
>> srun --gpus 1 hostname
>>
>> didn't work:
>>
>> srun: error: Unable to allocate resources: Invalid generic resource
>> (gres) specification
>>
>> so I read more (https://slurm.schedmd.com/gres.conf.html), but that
>> didn't really help me.
>>
>>
>> I am rather confused. GRES claims to be generic resources, but then it
>> comes with three defined resources (GPU, MPS, MIG) and using one of
>> those didn't work in my case.
>>
>> Obviously, I am misunderstanding something, but I am unsure where to
>> look.
>>
>>
>> Best regards,
>> Xaver Stiensmeier
>>
>