Re: [slurm-users] GrpMEMRunMins equivalent?

2020-06-05 Thread Corey Keasling
Yes, GrpTRESRunMins is what I meant. And thank you also for the solution; I hadn't tried that syntax. Interesting that GrpCPURunMins works while GrpMemRunMins does not. I also noticed that if the limit is specified as GrpTRESRunMins=Memory=1000,Cpu=2000 only the CPU portion takes effect -- th

Re: [slurm-users] Intermittent problem at 32 CPUs

2020-06-05 Thread Riebs, Andy
Diego, I'm *guessing* that you are tripping over the use of "--tasks 32" on a heterogeneous cluster, though your comment about the node without InfiniBand troubles me. If you drain that node, or exclude it in your command line, that might correct the problem. I wonder if OMPI and PMIx have deci

[slurm-users] Intermittent problem at 32 CPUs

2020-06-05 Thread Diego Zuccato
Hello all. I have been trying for some weeks to debug this problem, but it seems I'm still missing something. I have a small, (very) heterogeneous cluster. After upgrading to Debian 10 and packaged versions of Slurm and IB drivers/tools, I noticed that *sometimes* jobs requesting 32 or more threads f

Re: [slurm-users] Change ExcNodeList on a running job

2020-06-05 Thread Ole Holm Nielsen
Hi Geoffrey, I'm just curious as to what causes a user to decide that a given node has an issue? If a node is healthy in all respects, why would a user decide not to use the node? We can certainly perform all sorts of node health checks from Slurm by configuring the use of LBNL Node Health