Any idea why pam_slurm_adopt would work on some nodes but not others? Here is
an excerpt from one of the nodes:
Jan 28 15:38:54 dgx1-1 sshd[1027640]: pam_sss(sshd:auth): authentication
success; logname= uid=0 euid=0 tty=ssh ruser= rhost=10.10.10.1 user=test.user
Jan 28 15:38:54 dgx1-1 pam_slurm_
configurations.
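When pam_slurm_adopt behaves differently across otherwise identical nodes, the usual first step is to diff the PAM stack and slurm.conf between a working and a failing node. As a sketch, the pieces that must match on every node look roughly like this (module arguments and ordering are assumptions; adjust to your site's stack):

```
# /etc/pam.d/sshd -- the pam_slurm_adopt entry must be present in the
# account stack on every node; a node missing this line (or with it
# ordered differently) will behave differently at login.
account    required     pam_slurm_adopt.so

# slurm.conf -- adoption requires job containment to be enabled
# cluster-wide so sshd sessions can be adopted into a job's cgroup:
PrologFlags=Contain
```

A quick `md5sum /etc/pam.d/sshd /etc/slurm/slurm.conf` on a working and a failing node will show whether the configurations actually diverge.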
On Sat, Jan 15, 2022 at 10:32 AM Wayne Hendricks wrote:
>
> The only thing that jumps out on the ctl logs is:
> error: step_layout_create: no usable CPUs
> The node logs were unremarkable.
>
> It doesn't make much sense to me that the same job with srun or an
On Sat, Jan 15, 2022 at 12:56 AM Sean Crosby wrote:
>
> Any error in slurmd.log on the node or slurmctld.log on the ctl?
>
> Sean
>
> From: slurm-users on behalf of Wayne Hendricks
> Sent: Saturday, 15 January 2022 16:04
> To: slurm-us...
Running test job with srun works:
wayneh@login:~$ srun -G16 -p v100 /home/wayne.hendricks/job.sh
179851
Linux dgx1-1 5.4.0-94-generic #106-Ubuntu SMP Thu Jan 6 23:58:14 UTC
2022 x86_64 x86_64 x86_64 GNU/Linux
179851
Linux dgx1-2 5.4.0-94-generic #106-Ubuntu SMP Thu Jan 6 23:58:14 UTC
2022 x86_64 x8
./configure --prefix=/admin/slurm/slurm-21.08.5
--with-pmix=/admin/slurm/pmix-4.0.0
configure: WARNING: unable to locate pmix installation
configure: error: unable to locate pmix installation
configure:17261: checking for pmix installation
configure:17299: gcc -o conftest -DNUMA_VERSION1_COMPATIB
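The error means configure's pmix probe (the conftest compile shown in config.log around line 17261) failed under the prefix passed to --with-pmix. A quick sanity check, before rerunning, is to confirm the headers and libraries actually landed under that prefix (paths below are taken from the excerpt; the exact layout of your pmix install is an assumption):

```
# Verify the pmix install really lives under the prefix handed to
# --with-pmix; configure's probe compiles against pmix.h and links
# against the pmix library found there.
ls /admin/slurm/pmix-4.0.0/include/pmix.h
ls /admin/slurm/pmix-4.0.0/lib

# Rerun configure once headers and libraries are confirmed in place:
./configure --prefix=/admin/slurm/slurm-21.08.5 \
            --with-pmix=/admin/slurm/pmix-4.0.0
```

If the files are present but the probe still fails, config.log shows the exact compiler/linker invocation and error for the conftest, which is usually more informative than the summary warning.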
When using 20.02/cons_tres and defining DefMemPerGPU, jobs that request GPUs without specifying "--mem" will not run more than one per node. I can see that it is allocating the correct amount of memory for the job based on the GPUs requested, but no other jobs will run on the node. If a value
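With cons_tres, memory is a consumable resource, so the symptom described is consistent with the default-derived memory allocation consuming (or appearing to consume) the node's schedulable memory. A way to isolate it is to compare a submission that relies on DefMemPerGPU against one with an explicit memory request; a minimal sketch, with illustrative values that are assumptions, not the poster's actual configuration:

```
# slurm.conf excerpt (sketch; sizes are illustrative assumptions):
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
DefMemPerGPU=32768        # 32 GB per requested GPU when --mem is absent

# Compare scheduling behavior with an explicit request:
#   srun -G1 --mem=32G job.sh
# If jobs co-schedule with --mem but not without it, the per-GPU
# default is what is blocking node sharing.
```

`scontrol show node <nodename>` (AllocMem vs RealMemory) after the first job starts will show exactly how much memory the scheduler believes is consumed.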