Re: [slurm-users] number of nodes varies for no reason?

2019-03-28 Thread Noam Bernstein
> On Mar 27, 2019, at 9:32 PM, Chris Samuel wrote: > On 27/3/19 2:43 pm, Noam Bernstein wrote: >> Hi fellow slurm users - I’ve been using slurm happily for a few months, but now I feel like it’s gone crazy, and I’m wondering if anyone can explain what

[slurm-users] number of nodes varies for no reason?

2019-03-27 Thread Noam Bernstein
Hi fellow slurm users - I’ve been using slurm happily for a few months, but now I feel like it’s gone crazy, and I’m wondering if anyone can explain what’s going on. I have a trivial batch script which I submit multiple times, and ends up with different numbers of nodes allocated. Does anyone h
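A minimal sketch of how the allocation size can be pinned down explicitly in such a batch script (the node/task counts and the program name below are placeholders, not taken from the original post):

    #!/bin/bash
    #SBATCH --job-name=test
    #SBATCH --nodes=4             # ask for exactly 4 nodes
    #SBATCH --ntasks-per-node=16  # and a fixed number of tasks on each
    #SBATCH --time=00:10:00

    srun ./my_program             # placeholder executable

If the script instead requests only a task count (e.g. --ntasks=64) without --nodes, the scheduler is free to pack those tasks onto however many nodes currently have free cores, which can make the node count appear to vary from one submission to the next.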

Re: [slurm-users] practical tips to budget cluster expansion for a research center with heterogeneous workloads?

2019-03-21 Thread Noam Bernstein
> On Mar 21, 2019, at 12:38 PM, Alex Chekholko wrote: > Hey Graziano, > To make your decision more "data-driven", you can pipe your SLURM accounting logs into a tool like XDMOD which will make you pie charts of usage by user, group, job, gres, etc. > https://open.xdmod.org/8.0/inde
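As a hedged sketch of pulling the same accounting data directly with Slurm's own tools (the date range and field list below are only examples; see the sacct and sreport man pages for the full set):

    # per-job records for all users in a date range, parsable output
    sacct -a -S 2019-01-01 -E 2019-03-21 -P \
          --format=User,Account,Partition,Elapsed,AllocCPUS,ReqMem,State

    # pre-aggregated per-user utilization for the same window
    sreport cluster AccountUtilizationByUser start=2019-01-01 end=2019-03-21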

[slurm-users] SlurmctlDebug=

2018-11-25 Thread Noam Bernstein
Hello fellow slurm users - can anyone explain what SlurmctlDebug=4 means? I see in the documentation a list of possible string level names, but I have a working slurm.conf which uses 3 and 4. Is it documented anywhere which named levels those numbers map to?
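As a sketch, the named levels listed in the documentation can be used directly in slurm.conf (the parameter is spelled SlurmctldDebug there); the numeric-to-name mapping noted below is the commonly cited one, but it is version dependent, so verify it against the slurm.conf man page for your release:

    # named levels: quiet, fatal, error, info, verbose, debug, debug2 ... debug5
    SlurmctldDebug=verbose   # numeric 3 is usually info and 4 verbose, but treat that as an assumption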

Re: [slurm-users] virtual memory limit exceeded

2018-11-09 Thread Noam Bernstein
> On Nov 9, 2018, at 3:14 AM, Bjørn-Helge Mevik wrote: > Noam Bernstein writes: >> Can anyone shed some light on where the _virtual_ memory limit comes from? > Perhaps it comes from a VSizeFactor setting in slurm.conf: > VSizeFactor
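A sketch of the setting being referred to: VSizeFactor caps a job's virtual memory at a percentage of its real memory limit, so for example a job that requested 64 GB of RAM under VSizeFactor=110 would get roughly a 70 GB address-space limit (the percentage below is only an example):

    # slurm.conf
    VSizeFactor=110   # virtual memory limit = 110% of the job's real memory limit; 0 disables the check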

[slurm-users] virtual memory limit exceeded

2018-11-08 Thread Noam Bernstein
Can anyone shed some light on where the _virtual_ memory limit comes from? We're getting jobs killed with the message "slurmstepd: error: Step 3664.0 exceeded virtual memory limit (79348101120 > 72638634393), being killed". Is this a limit that's dictated by cgroup.conf or by some srun option (like

Re: [slurm-users] epilog when job is killed for max time

2018-11-08 Thread Noam Bernstein
ground otherwise bash will not process the signal until this command finishes. wait   # < wait until all the background processes are finished. If a signal is received this will stop, process the signal and finish the script. > On 7/11/18 21:16, Noam Bernstein
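A minimal sketch of the pattern described in that reply, combining --signal with a bash trap (the signal choice, warning time, and program name are placeholders): Slurm delivers the signal shortly before the time limit, and because the real work runs in the background with a trailing wait, bash can handle the signal immediately and run the user-side cleanup.

    #!/bin/bash
    #SBATCH --time=01:00:00
    #SBATCH --signal=B:USR1@120   # signal the batch shell (B:) 120 s before the limit

    cleanup() {
        echo "time limit approaching, running user-side epilog actions" >&2
        # copy results back, write a marker file, etc.
    }
    trap cleanup USR1

    srun ./my_program &   # run the payload in the background ...
    wait                  # ... so bash can process the signal while it runs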

[slurm-users] epilog when job is killed for max time

2018-11-07 Thread Noam Bernstein
Hi slurm users - I’ve been looking through the slurm prolog/epilog manuals, but haven’t been able to figure out if there’s a way to get an epilog script (requested by the user) to run when a job is killed for running out of time, and have the epilog script be able to detect that (through an env

Re: [slurm-users] requesting entire vs. partial nodes

2018-10-23 Thread Noam Bernstein
> On Oct 23, 2018, at 5:35 PM, Noam Bernstein wrote: [...] Any ideas as to what might be happening? Could it be that the nodes are missing the RealMemory setting? Noam

Re: [slurm-users] requesting entire vs. partial nodes

2018-10-23 Thread Noam Bernstein
> On Oct 20, 2018, at 3:06 AM, Chris Samuel wrote: > On Saturday, 20 October 2018 9:57:16 AM AEDT Noam Bernstein wrote: >> If not, is there another way to do this? > You can use --exclusive for jobs that want whole nodes. > You will likely also want to
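As a hedged sketch of how the two kinds of jobs might be submitted (standard sbatch flags; the counts and script names are placeholders):

    # whole-node job: take the nodes exclusively regardless of core count
    sbatch --exclusive --nodes=2 whole_node_job.sh

    # partial-node job: request only the cores and memory it needs and share the node
    sbatch --ntasks=4 --mem-per-cpu=2G partial_job.sh

For node sharing to actually happen, the cluster also has to be configured for it, e.g. SelectType=select/cons_res with SelectTypeParameters such as CR_Core_Memory rather than select/linear.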

[slurm-users] requesting entire vs. partial nodes

2018-10-19 Thread Noam Bernstein
Hi - I have a slurm usage question that I haven't been able to figure out from the docs. We basically have two types of jobs - ones that require entire nodes, and ones that do not. An additional (minor) complication is that the nodes have hyperthreading enabled, but we want (usually) to use on
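On the hyperthreading side of the question, a sketch of how a job can ask for one task per physical core even though each core exposes two hardware threads (the task count and script name are placeholders):

    sbatch --ntasks=16 --hint=nomultithread job.sh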

Re: [slurm-users] node showing "Low socket*core count"

2018-10-10 Thread Noam Bernstein
> On Oct 10, 2018, at 12:07 PM, Noam Bernstein wrote: > slurmd -C confirms that indeed slurm understands the architecture, so that’s good. However, removing the CPUs entry from the node list doesn’t change anything. It still drains the node. If I just remov

Re: [slurm-users] node showing "Low socket*core count"

2018-10-10 Thread Noam Bernstein
ist item it just picks 1 cpu. Noam -- Noam Bernstein, Ph.D. Center for Materials Physics and Technology, U.S. Naval Research Laboratory, T +1 202 404 8628, F +1 202 404 7546, https://www.nrl.navy.mil

[slurm-users] node showing "Low socket*core count"

2018-10-10 Thread Noam Bernstein
Hi all - I’m new to slurm, and in many ways it’s been very nice to work with, but I’m having an issue trying to properly set up thread/core/socket counts on nodes. Basically, if I don’t specify anything except CPUs, the node is available, but doesn’t appear to know about cores and hyperthreadin
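A sketch of the kind of node definition involved (the counts below are placeholders for a two-socket, hyperthreaded machine; slurmd -C on the node prints the line Slurm actually detects):

    NodeName=node[01-16] Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=192000 State=UNKNOWN

The "Low socket*core count" drain reason generally indicates that the values configured in slurm.conf are higher than what slurmd detects on the node, so keeping this line consistent with the slurmd -C output is the usual fix.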