[slurm-users] Allocating a subset of cores to each job

2018-06-11 Thread Nadav Toledo
Hello everyone, Sorry for what might be a trivial question for most of you. I am trying to understand CPU allocation in Slurm. The goal is to launch a batch job on one node, while the batch itself will run several jobs in parallel, each allocated a subset of the cpus g

Re: [slurm-users] Can't run jobs after upgrade to 17.11.5 due to memory?

2018-06-11 Thread Roberts, John E.
I see this in the debug logs: "memory per node set to 1M in partition bdwall" I seemingly can alleviate this if I set RealMemory=foo in the Node definitions, but this just seems like something that shouldn't be necessary. Did this become a required field after 16.05?? Thanks! John On 6/11/18,

Re: [slurm-users] Can't run jobs after upgrade to 17.11.5 due to memory?

2018-06-11 Thread Roberts, John E.
Nothing I assume isn't correct: DefMemPerNode = UNLIMITED MaxMemPerNode = UNLIMITED MemLimitEnforce = Yes PropagateResourceLimitsExcept = MEMLOCK CPU vars aren't set and never were. Thanks! John On 6/11/18, 4:09 PM, "slurm-users on behalf of Renfro, Michael" wrot

Re: [slurm-users] Can't run jobs after upgrade to 17.11.5 due to memory?

2018-06-11 Thread Renfro, Michael
Anything in particular set for DefMemPerCPU in your slurm.conf? > On Jun 11, 2018, at 3:50 PM, Roberts, John E. wrote: > > Hi, > > Seeing this after an upgrade today. I now can't get any jobs to run. > Things were fine before the upgrade. Any ideas? > > slurmstepd: error: Job 535721 exce

[slurm-users] Can't run jobs after upgrade to 17.11.5 due to memory?

2018-06-11 Thread Roberts, John E.
Hi, Seeing this after an upgrade today. I now can't get any jobs to run. Things were fine before the upgrade. Any ideas? slurmstepd: error: Job 535721 exceeded memory limit (1160 > 1024), being killed slurmstepd: error: Exceeded job memory limit ulimit shows: $ u

Re: [slurm-users] srun --x11 connection rejected because of wrong authentication

2018-06-11 Thread Hadrian Djohari
Yes. The x11 also worked for us outside of slurm. Well, good luck finding your issue. On Tue, Jun 12, 2018, 1:09 AM Christopher Benjamin Coffey < chris.cof...@nau.edu> wrote: > Hi Hadrian, > > Thank you, unfortunately that is not the issue. We can connect to the > nodes outside of slurm and have

Re: [slurm-users] srun --x11 connection rejected because of wrong authentication

2018-06-11 Thread Christopher Benjamin Coffey
Hi Hadrian, Thank you, unfortunately that is not the issue. We can connect to the nodes outside of slurm and have the X11 stuff work properly. Best, Chris — Christopher Coffey High-Performance Computing Northern Arizona University 928-523-1167 On 6/7/18, 6:49 PM, "slurm-users on behalf of H

[slurm-users] Poor scheduler performance with moderate number of jobs

2018-06-11 Thread Kevin M. Hildebrand
We're seeing some pretty bad performance with around 3000 jobs in queue. We're using sched/backfill, and I've been tweaking the bf_ parameters to try and improve some things, with limited results. But even before the backfill process starts, the main scheduling loop is taking so long per job that i