On 6/4/21 11:04 am, Ahmad Khalifa wrote:
> Because there are failing GPUs that I'm trying to avoid.
Could you remove them from your gres.conf and adjust slurm.conf to match?

If you're using cgroups enforcement for devices (ConstrainDevices=yes in cgroup.conf) then that should render them inaccessible to jobs.
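
As a rough sketch of what I mean, assuming a hypothetical node called node01 with four GPUs where /dev/nvidia2 is the failing one (the node name, device paths and counts are made up for illustration, so substitute your own):

```
# gres.conf: list only the healthy devices (node01 and paths are hypothetical)
NodeName=node01 Name=gpu File=/dev/nvidia[0-1]
NodeName=node01 Name=gpu File=/dev/nvidia3

# slurm.conf: lower the GPU count to match (3 instead of 4),
# keeping the rest of your existing NodeName line as it is
GresTypes=gpu
TaskPlugin=task/cgroup
NodeName=node01 Gres=gpu:3

# cgroup.conf: only devices allocated to a job are visible to it
ConstrainDevices=yes
```

Note that ConstrainDevices relies on the task/cgroup plugin being enabled, and a change to a node's Gres count will generally need slurmctld and that node's slurmd restarted before it takes effect.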
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA