[slurm-users] How to partition nodes into smaller units

2019-02-05 Thread Ansgar Esztermann-Kirchner
Hello List, we're operating a large-ish cluster (about 900 nodes) with diverse hardware. It has been running with SGE for several years now, but the more we refine our configuration, the more we feel SGE's limitations. Therefore, we're considering switching to Slurm. The latest challenge ...

Re: [slurm-users] How to partition nodes into smaller units

2019-02-11 Thread Ansgar Esztermann-Kirchner
Hi,
> On 05.02.19 16:46, Ansgar Esztermann-Kirchner wrote:
> > [...] -- we'd like to have two "half nodes", where
> > jobs will be able to use one of the two GPUs, plus (at most) half of
> > the CPUs. With SGE, we've put two queues on the nodes, ...
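One possible Slurm-side sketch of that split (not from the thread; node and partition names are hypothetical) uses two overlapping partitions on the same nodes, each capped via MaxCPUsPerNode:

    # slurm.conf -- hypothetical 40-core node with 2 GPUs
    NodeName=node001 CPUs=40 Gres=gpu:2 State=UNKNOWN
    # Two partitions share the node; each may allocate at most half of
    # its cores, so a job in either partition can take one GPU plus
    # (at most) half of the CPUs.
    PartitionName=half_a Nodes=node001 MaxCPUsPerNode=20
    PartitionName=half_b Nodes=node001 MaxCPUsPerNode=20

Note that MaxCPUsPerNode only caps CPUs per partition per node; the GPU itself is assigned by the select plugin in response to the job's Gres request.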

Re: [slurm-users] Kinda Off-Topic: data management for Slurm clusters

2019-02-26 Thread Ansgar Esztermann-Kirchner
Hi, I'd like to share our set-up as well, even though it's very specialized and thus probably won't work in most places. However, it's also very budget-efficient when it does. Our users don't usually have shared data sets, so we don't need high bandwidth at any particular point -- the ...

[slurm-users] incompatible plugin version

2019-05-24 Thread Ansgar Esztermann-Kirchner
Hello List, I'm seeing a version clash when trying to start MPI jobs via srun. On stderr, my executable (mdrun) complains:
mdrun: /usr/lib/x86_64-linux-gnu/slurm/auth_munge.so: Incompatible Slurm plugin version (17.11.9)
I've checked my installation and found nothing that suggests there ...
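A few checks one might run in this situation (a sketch, not from the thread) to compare the versions of the client tools, the running daemons, and the plugin file actually being loaded:

    # version of the srun binary in PATH
    srun --version
    # version reported by the controller
    scontrol show config | grep -i SLURM_VERSION
    # which package owns the offending plugin (Debian-based system assumed)
    dpkg -S /usr/lib/x86_64-linux-gnu/slurm/auth_munge.so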

[slurm-users] Job flexibility with cons_tres

2021-02-08 Thread Ansgar Esztermann-Kirchner
Hello List, we're running a heterogeneous cluster (x86_64 only, but many different node types with 8 to 64 HW threads and 1 to 4 GPUs). Our processing power (for our main application, at least) is provided exclusively by the GPUs, so cons_tres looks quite promising: depending on the size of the ...
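A minimal slurm.conf sketch of such a cons_tres setup (node names and counts are illustrative, not from the thread):

    # slurm.conf -- enable the cons_tres selector and GPU tracking
    SelectType=select/cons_tres
    SelectTypeParameters=CR_Core_Memory
    GresTypes=gpu
    # heterogeneous node types, e.g.:
    NodeName=small[001-100] CPUs=8  Gres=gpu:1
    NodeName=big[001-020]   CPUs=64 Gres=gpu:4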

Re: [slurm-users] Job flexibility with cons_tres

2021-02-10 Thread Ansgar Esztermann-Kirchner
Hi Yair, thank you very much for your reply. I'll keep the points you make in mind while we're evolving our configuration toward something that can be called production-ready. A. -- Ansgar Esztermann Sysadmin Dep. Theoretical and Computational Biophysics http://www.mpibpc.mpg.de/grubmueller/esz

Re: [slurm-users] Job flexibility with cons_tres

2021-02-12 Thread Ansgar Esztermann-Kirchner
On Mon, Feb 08, 2021 at 12:36:06PM +0100, Ansgar Esztermann-Kirchner wrote:
> Of course, one could use different partitions for different nodes, and
> then submit individual jobs with CPU requests tailored to one such
> partition, but I'd prefer a more flexible approach where a giv...
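One way cons_tres can express that flexibility (a hedged example, not from the thread; values are illustrative) is to tie the CPU request to the GPU request, so a single submission fits any node type:

    # request one GPU and scale the CPU allocation with it,
    # whatever node type the scheduler picks
    sbatch --gres=gpu:1 --cpus-per-gpu=8 job.sh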

Re: [slurm-users] Job flexibility with cons_tres

2021-02-12 Thread Ansgar Esztermann-Kirchner
On Fri, Feb 12, 2021 at 09:47:56AM +0100, Ole Holm Nielsen wrote:
> Could you kindly say where you have found documentation of the
> DefaultCpusPerGpu (or DefCpusPerGpu?) parameter.

Humph, I shouldn't have written the message from memory. It's actually DefCpuPerGPU (singular).

> I'm unable t...
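For reference, DefCpuPerGPU appears in slurm.conf either globally or per partition (values illustrative):

    # global default: allocate 8 CPUs per requested GPU
    DefCpuPerGPU=8
    # or as a per-partition override:
    PartitionName=gpu Nodes=node[001-100] DefCpuPerGPU=8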

[slurm-users] DefCpuPerGPU and multiple partitions

2024-04-09 Thread Ansgar Esztermann-Kirchner via slurm-users
Hello List, does anyone have experience with DefCpuPerGPU and jobs requesting multiple partitions? I would expect Slurm to select a partition from those requested by the job, then assign CPUs based on that partition's DefCpuPerGPU. But according to my observations, it appears that (at least sometimes) ...
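A minimal sketch of the scenario being described (partition names and values are hypothetical):

    # slurm.conf: two partitions with different per-GPU CPU defaults
    PartitionName=small Nodes=node[001-050] DefCpuPerGPU=4
    PartitionName=big   Nodes=node[051-100] DefCpuPerGPU=16

    # job submitted to both partitions; the expectation stated above is
    # that the DefCpuPerGPU of whichever partition the job runs in applies
    sbatch --partition=small,big --gres=gpu:1 job.sh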

[slurm-users] Re: Implementing a "soft" wall clock limit

2025-06-12 Thread Ansgar Esztermann-Kirchner via slurm-users
Hi Davide, I think it should be possible to emulate this via preemption: if you set PreemptMode to CANCEL, a preempted job will behave just as if it had reached the end of its wall time. Then you can use PreemptExemptTime as your soft wall-time limit -- the job will not be preempted before PreemptExemptTime ...
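A slurm.conf sketch of this idea (the preempt type and the time value are illustrative assumptions, not from the thread):

    # cancel preempted jobs, as if their wall time had expired
    PreemptType=preempt/partition_prio
    PreemptMode=CANCEL
    # "soft" limit: jobs cannot be preempted during their first 24 hours
    PreemptExemptTime=24:00:00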

[slurm-users] Re: Implementing a "soft" wall clock limit

2025-06-12 Thread Ansgar Esztermann-Kirchner via slurm-users
On Thu, Jun 12, 2025 at 04:52:24AM -0600, Davide DelVento wrote:
> Hi Ansgar,
>
> This is indeed what I was looking for: I was not aware of PreemptExemptTime.
>
> From my cursory glance at the documentation, it seems
> that PreemptExemptTime is QOS-based and not job-based though. Is that
> correct?
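For reference, besides the global slurm.conf default, PreemptExemptTime can also be set per QOS via sacctmgr (a sketch; the QOS name and value are illustrative):

    # override the cluster-wide default for one QOS
    sacctmgr modify qos normal set PreemptExemptTime=12:00:00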