Hello List,
we're operating a large-ish cluster (about 900 nodes) with diverse
hardware. It has been running with SGE for several years now, but the
more we refine our configuration, the more we run up against SGE's
limitations.
Therefore, we're considering switching to Slurm.
The latest challenge i
Hi,
> On 05.02.19 16:46, Ansgar Esztermann-Kirchner wrote:
> > [...]-- we'd like to have two "half nodes", where
> > jobs will be able to use one of the two GPUs, plus (at most) half of
> > the CPUs. With SGE, we've put two queues on the nodes,
Hi,
I'd like to share our set-up as well, even though it's very
specialized and thus probably won't work in most places. Where it does
apply, though, it's also very efficient in terms of budget.
Our users don't usually have shared data sets, so we don't need high
bandwidth at any particular point -- the
Hello List,
I'm seeing a version clash when trying to start MPI jobs via srun.
In stderr, my executable (mdrun) complains about:
mdrun: /usr/lib/x86_64-linux-gnu/slurm/auth_munge.so: Incompatible Slurm plugin version (17.11.9)
I've checked my installation, and found nothing that suggests there
Hello List,
we're running a heterogeneous cluster (x86_64 only, but many
different node types, ranging from 8 to 64 hardware threads and 1 to 4
GPUs).
Our processing power (for our main application, at least) is
exclusively provided by the GPUs, so cons_tres looks quite promising:
depending on the size of the
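The message is cut off here, but the cons_tres direction it mentions can
be sketched in slurm.conf terms. A minimal, hypothetical fragment (node
names and GPU counts invented; real nodes would vary as described above):

```
# Sketch only -- enable the cons_tres selection plugin so GPUs are
# schedulable as trackable resources alongside cores and memory.
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
GresTypes=gpu
NodeName=n[001-900] Gres=gpu:4   # per-node GPU count differs in reality
```

With this in place, jobs request GPUs via --gres=gpu:N (or --gpus=N) and
the scheduler packs CPU and GPU requests onto the heterogeneous nodes.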
Hi Yair,
thank you very much for your reply. I'll keep the points you make in
mind while we're evolving our configuration toward something that can
be called production-ready.
A.
--
Ansgar Esztermann
Sysadmin Dep. Theoretical and Computational Biophysics
http://www.mpibpc.mpg.de/grubmueller/esz
On Mon, Feb 08, 2021 at 12:36:06PM +0100, Ansgar Esztermann-Kirchner wrote:
> Of course, one could use different partitions for different nodes, and
> then submit individual jobs with CPU requests tailored to one such
> partition, but I'd prefer a more flexible approach where a giv
On Fri, Feb 12, 2021 at 09:47:56AM +0100, Ole Holm Nielsen wrote:
>
> Could you kindly say where you have found documentation of the
> DefaultCpusPerGpu (or DefCpusPerGpu?) parameter.
Humph, I shouldn't have written the message from memory. It's actually
DefCpuPerGPU (singular).
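For reference, the correctly spelled parameter can be set either globally
or per partition in slurm.conf; a sketch with invented values and node
names:

```
# Global default: allocate 8 CPUs per requested GPU
DefCpuPerGPU=8
# Or as a per-partition override:
PartitionName=gpu Nodes=gpunode[01-04] DefCpuPerGPU=8
```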
> I'm unable t
Hello List,
does anyone have experience with DefCpuPerGPU and jobs requesting
multiple partitions? I would expect Slurm to select a partition from
those requested by the job, then assign CPUs based on that
partition's DefCpuPerGPU. But according to my observations, it appears
that (at least someti
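The snippet is truncated here, but the behaviour should be easy to probe
with a test job. A sketch, assuming two hypothetical partitions that are
configured with different DefCpuPerGPU values:

```shell
#!/bin/bash
#SBATCH --partition=parta,partb   # two partitions, different DefCpuPerGPU
#SBATCH --gres=gpu:1
#SBATCH --output=cpus_test.out
# No explicit --cpus-per-task: the CPU count should come from the
# DefCpuPerGPU of whichever partition Slurm actually selects.
echo "partition: $SLURM_JOB_PARTITION  cpus on node: $SLURM_CPUS_ON_NODE"
```

Comparing the partition Slurm picked against the CPU count it granted
would show whether the selected partition's default was really applied.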
Hi Davide,
I think it should be possible to emulate this via preemption: if you
set PreemptMode to CANCEL, a preempted job behaves just as if it had
reached its wall-time limit. Then, you can use PreemptExemptTime
as your soft wall-time limit -- the job will not be preempted before
PreemptExe
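The message is truncated, but the idea can be sketched in slurm.conf
terms (partition names, node names, and times invented for illustration):

```
# Jobs in 'scavenge' may be cancelled by jobs in 'prio', but only after
# running for at least the "soft" limit of 4 hours.
PreemptType=preempt/partition_prio
PreemptMode=CANCEL
PreemptExemptTime=04:00:00
PartitionName=prio     Nodes=n[001-010] PriorityTier=2
PartitionName=scavenge Nodes=n[001-010] PriorityTier=1
```

Jobs submitted to 'scavenge' thus get at least four hours of guaranteed
run time, after which higher-priority work can cancel them outright.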
On Thu, Jun 12, 2025 at 04:52:24AM -0600, Davide DelVento wrote:
> Hi Ansgar,
>
> This is indeed what I was looking for: I was not aware of PreemptExemptTime.
>
> From my cursory glance at the documentation, it seems
> that PreemptExemptTime is QOS-based and not job based though. Is that
> correc