I finally solved the issue. My Slurm client on the compute node was built and
configured with pmix_v3, as follows:
```
$:/usr/local/lib/slurm$ ll | grep pmix
lrwxrwxrwx 1 root root 16 feb 23 15:57 mpi_pmix.so -> ./mpi_pmix_v3.so*
-rwxr-xr-x 1 root root 1003 feb 23 15:57 mpi_pmix_v3.la*
-r
```
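For anyone who wants to run the same check on their own nodes, here is a minimal sketch of a helper that reports which versioned plugin the `mpi_pmix.so` symlink resolves to. The function name `pmix_plugin_target` is hypothetical, and the default directory is simply the path from the listing above; adjust it to wherever your Slurm plugins are installed.

```shell
# Hypothetical helper (not part of Slurm): print the file name that the
# mpi_pmix.so symlink points to in a Slurm plugin directory.
# Assumption: the directory defaults to /usr/local/lib/slurm, as in the listing above.
pmix_plugin_target() {
    plugin_dir="${1:-/usr/local/lib/slurm}"
    link="$plugin_dir/mpi_pmix.so"
    if [ -L "$link" ]; then
        # readlink prints the symlink target, e.g. ./mpi_pmix_v3.so;
        # basename strips the leading ./ so only the file name remains.
        basename "$(readlink "$link")"
    else
        echo "no mpi_pmix.so symlink in $plugin_dir" >&2
        return 1
    fi
}
```

Running `pmix_plugin_target` on each node and comparing the results shows whether they all point at the same plugin version. Separately, `srun --mpi=list` prints the MPI plugin types that the local Slurm installation offers.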
Hi all,
I'm facing the following issue with a DGX A100 machine: I can allocate
resources, but the job fails when I try to execute srun. A detailed
analysis of the incident follows:
```
$ salloc -n1 -N1 -p DEBUG -w dgx001 --time=2:0:0
salloc: Granted job allocation 1278
salloc: Waiting