Hi Ryan,

Thank you very much for your reply. That is useful. We'll see how we get on.
Best regards,
David

________________________________
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Ryan Novosielski <novos...@rutgers.edu>
Sent: 11 September 2020 00:08
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] Slurm -- using GPU cards with NVLINK

I’m fairly sure that you set this up the same way you would for a peer-to-peer setup. Here’s ours:

[root@cuda001 ~]# nvidia-smi topo --matrix
        GPU0    GPU1    GPU2    GPU3    mlx4_0  CPU Affinity
GPU0      X     PIX     SYS     SYS     PHB     0-11
GPU1    PIX       X     SYS     SYS     PHB     0-11
GPU2    SYS     SYS       X     PIX     SYS     12-23
GPU3    SYS     SYS     PIX       X     SYS     12-23
mlx4_0  PHB     PHB     SYS     SYS       X

[root@cuda001 ~]# cat /etc/slurm/gres.conf
…
# 2 x K80 (perceval)
NodeName=cuda[001-008] Name=gpu File=/dev/nvidia[0-1] CPUs=0-11
NodeName=cuda[001-008] Name=gpu File=/dev/nvidia[2-3] CPUs=12-23

This also seems to be related: https://slurm.schedmd.com/SLUG19/GPU_Scheduling_and_Cons_Tres.pdf

--
 ____
|| \\UTGERS,     |---------------------------*O*---------------------------
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ  | Office of Advanced Research Computing - MSB C630, Newark
     `'

> On Sep 10, 2020, at 11:00 AM, David Baker <d.j.ba...@soton.ac.uk> wrote:
>
> Hello,
>
> We are installing a group of nodes which all contain 4 GPU cards. The GPUs are paired together using NVLINK, as described in the matrix below.
>
> We are familiar with using Slurm to schedule and run jobs on GPU cards, but this is the first time we have dealt with NVLINK-enabled GPUs. Could someone please advise us how to configure Slurm so that we can submit jobs to the cards and make use of the NVLINK? That is, what do we need to put in gres.conf or slurm.conf, and how should users use the sbatch command? I presume, for example, that a user could make use of a GPU card and potentially make use of memory on the paired card.
>
> Best regards,
> David
>
> [root@alpha51 ~]# nvidia-smi topo --matrix
>         GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity
> GPU0      X     NV2     SYS     SYS     0,2,4,6,8,10    0
> GPU1    NV2       X     SYS     SYS     0,2,4,6,8,10    0
> GPU2    SYS     SYS       X     NV2     1,3,5,7,9,11    1
> GPU3    SYS     SYS     NV2       X     1,3,5,7,9,11    1
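
For the NVLINK pairing itself, newer Slurm releases (19.05 onwards, with SelectType=select/cons_tres) can record the GPU interconnect in gres.conf, either automatically via AutoDetect=nvml (when slurmd is built against NVIDIA's NVML library) or by hand with the Links parameter. Below is a minimal hand-written sketch for the alpha51 topology quoted above; the device paths and the exact Links values are assumptions read off the NV2 entries in the matrix, so verify them against the gres.conf man page for your Slurm version:

    # Hypothetical gres.conf for one node with two NVLINKed GPU pairs (0<->1, 2<->3).
    # Links is an ordered list of connection counts from this GPU to GPU0..GPU3;
    # -1 marks the device itself, and NV2 in the matrix means two NVLink connections.
    NodeName=alpha51 Name=gpu File=/dev/nvidia0 CPUs=0,2,4,6,8,10 Links=-1,2,0,0
    NodeName=alpha51 Name=gpu File=/dev/nvidia1 CPUs=0,2,4,6,8,10 Links=2,-1,0,0
    NodeName=alpha51 Name=gpu File=/dev/nvidia2 CPUs=1,3,5,7,9,11 Links=0,0,-1,2
    NodeName=alpha51 Name=gpu File=/dev/nvidia3 CPUs=1,3,5,7,9,11 Links=0,0,2,-1

With Links populated, cons_tres prefers to give a multi-GPU allocation the better-connected devices, so a two-GPU job should land on an NVLINKed pair.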
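
On the user side, nothing NVLINK-specific should be needed in the submission script: GPUs are requested as usual and the scheduler handles placement. A sketch, with made-up partition and application names:

    #!/bin/bash
    #SBATCH --partition=gpu       # hypothetical partition name
    #SBATCH --gres=gpu:2          # two GPUs; with Links set, an NVLINKed pair is preferred
    #SBATCH --cpus-per-task=6
    srun ./my_cuda_app            # placeholder for the real application

One caveat on the memory question: a job can only reach the paired card's memory over NVLINK (via CUDA peer-to-peer access) if both GPUs are allocated to it; a job holding a single GPU cannot borrow memory from its neighbour.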