Hi Rob, thank you very much for that hint. I tried setting the MIG slices manually in gres.conf, and it works now.
Thank you very much.

Best regards,
Timon

--
Timon Vogt
Arbeitsgruppe "Computing"
Nationales Hochleistungsrechnen (NHR)
Scientific Employee NHR
Tel.: +49 551 39-30146, E-Mail: timon.v...@gwdg.de
-------------------------------------------------------------------------
Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG)
Burckhardtweg 4, 37077 Göttingen, URL: https://gwdg.de
Support: Tel.: +49 551 39-30000, URL: https://gwdg.de/support
Office: Tel.: +49 551 39-30001, E-Mail: g...@gwdg.de
Managing Director: Prof. Dr. Ramin Yahyapour
Chairman of the Supervisory Board: Prof. Dr. Christian Griesinger
Registered office: Göttingen
Register court: Göttingen, Commercial Register No. B 598
Certified to ISO 9001 and ISO 27001
-------------------------------------------------------------------------

On 19.07.23 at 21:21, Groner, Rob wrote:
At some point when we were experimenting with MIG, I was entirely frustrated trying to get it to work until I finally removed the autodetect from gres.conf and explicitly listed the devices instead. THEN it worked. I think you can find the list of device files using nvidia-smi.

Here is the entry we use in our gres.conf for one of the nodes:

NodeName=p-gc-3037 Name=gpu Type=1g.5gb File=/dev/nvidia-caps/nvidia-cap[66,75,84,102,111,120,129,201,210,219,228,237,246,255]

Something to TRY anyway. Odd that 3g.20gb works. You might try reconfiguring the node for that instead and see if it works then. We've used 3g.20gb and 1g.5gb on our nodes and it works fine; we never tried 2g.10gb.

Rob

------------------------------------------------------------------------
*From:* slurm-users on behalf of Vogt, Timon
*Sent:* Wednesday, July 19, 2023 3:08 PM
*To:* slurm-us...@schedmd.com
*Subject:* [slurm-users] MIG-Slice: Unavailable GRES

Dear Slurm Mailing List,

I am experiencing a problem which affects our cluster and for which I am completely out of ideas by now, so I would like to ask the community for hints or ideas.

We run a partition on our cluster containing multiple nodes with Nvidia A100 GPUs (40GB), which we have sliced up using Nvidia Multi-Instance GPU (MIG) into one 3g.20gb slice and two 2g.10gb slices per GPU.
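Spelled out for a layout like the one described above (one 3g.20gb and two 2g.10gb slices per GPU), an explicit gres.conf in the style Rob shows might look as follows. This is only a sketch: the node name, slice counts, and cap minor numbers are placeholders, not values from either cluster.

```
# Sketch of an explicit MIG listing (replaces AutoDetect=nvml for this node).
# NodeName and the nvidia-cap minor numbers are placeholders; read the real
# minors from /proc/driver/nvidia-caps/mig-minors on the node itself.
NodeName=gpu-node01 Name=gpu Type=3g.20gb File=/dev/nvidia-caps/nvidia-cap[21,30,39,48]
NodeName=gpu-node01 Name=gpu Type=2g.10gb File=/dev/nvidia-caps/nvidia-cap[57,66,75,84,93,102,111,120]
```

The Type= strings must match the GRES types declared for the node in slurm.conf, and each File= entry must name a device node that actually exists on the node.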
Now, when submitting a job to it and requesting the 3g.20gb slice (as with "srun -p mig-partition -G 3g.20gb:1 hostname"), the job runs fine. But when a job requests one of the 2g.10gb slices instead (as with "srun -p mig-partition -G 2g.10gb:1 hostname"), the job does not get scheduled and the controller repeatedly logs:

slurmctld[28945]: error: _set_job_bits1: job 4780824 failed to find any available GRES on node 1471
slurmctld[28945]: error: gres_select_filter_select_and_set job 4780824 failed to satisfy gres-per-job counter

Our cluster uses the AutoDetect=nvml feature for the nodes in gres.conf, and both slice types are defined in "AccountingStorageTRES" and in the GRES parameter of the node definition. The slurmd on the node also finds both types of slices and reports the correct amounts. They are also visible in the "Gres=" section of "scontrol show node", again in the correct amounts. I have also ensured that the nodes are not otherwise in use by creating a reservation on them accessible only to me, and I have restarted all slurmd's and the slurmctld.

By now, I am out of ideas. Does someone here have a suggestion on what else I can try? Has someone already seen this error and knows more about it?

Thank you very much in advance and best regards,
Timon
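Following up on Rob's pointer above: the cap minor numbers used in an explicit File= list can be read from the kernel's MIG capability table. A minimal shell sketch, assuming the "gpuX/giY/access <minor>" line format that /proc/driver/nvidia-caps/mig-minors uses on recent NVIDIA drivers (the sample table below is made up for illustration, since this has to run without a GPU):

```shell
#!/bin/sh
# Hedged sketch: turn MIG cap minors into gres.conf File= device paths.
# On a real node, replace $sample with the contents of
# /proc/driver/nvidia-caps/mig-minors; the line format assumed here is
# "gpuX/giY/access <minor>", which should be verified against your driver.
sample='gpu0/gi1/access 66
gpu0/gi1/ce0 67
gpu0/gi2/access 75
gpu1/gi1/access 84'

# Keep only the per-GPU-instance "access" entries; those are the device
# nodes Slurm needs in File=. Other entries (e.g. compute-engine caps)
# are filtered out.
printf '%s\n' "$sample" |
  awk '$1 ~ /access$/ { print "/dev/nvidia-caps/nvidia-cap" $2 }'
```

With the sample table above this prints the three access-cap paths (nvidia-cap66, nvidia-cap75, nvidia-cap84), which can then be collapsed into a bracketed File= list by hand.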