Re: [slurm-users] Compact scheduling strategy for small GPU jobs

2021-08-12 Thread Jack Chen
> …h node, you can give a preference to one over the other.
>
> That may be a way to do a "try this node first" method of job placement.
>
> Brian Andrus
>
> On 8/10/2021 9:19 AM, Jack Chen wrote:
> > Thanks for your reply! It's certain that slurm will not place smal…
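One way to express the kind of node preference described above is the Weight parameter in slurm.conf (a minimal sketch, not necessarily the exact mechanism Brian had in mind; node names and resource counts are illustrative, and nodes with the lowest Weight that satisfy a request are allocated first):

    NodeName=nodeA Gres=gpu:8 CPUs=32 Weight=1    # preferred: lowest weight, filled first
    NodeName=nodeB Gres=gpu:8 CPUs=32 Weight=10   # used once nodeA can no longer fit a job

With distinct weights, small jobs tend to land on nodeA until it is full, giving the "try this node first" placement mentioned above.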

Re: [slurm-users] Compact scheduling strategy for small GPU jobs

2021-08-10 Thread Jack Chen
> Brian Andrus
>
> On 8/10/2021 8:07 AM, Jack Chen wrote:
> > Does anyone have any ideas on this?
> >
> > On Fri, Aug 6, 2021 at 2:52 PM Jack Chen wrote:
> >> I'm using Slurm 15.08.11; when I submit several 1-GPU jobs, Slurm doesn't
> >> allocate node…

Re: [slurm-users] Compact scheduling strategy for small GPU jobs

2021-08-10 Thread Jack Chen
Does anyone have any ideas on this?

On Fri, Aug 6, 2021 at 2:52 PM Jack Chen wrote:
> I'm using Slurm 15.08.11; when I submit several 1-GPU jobs, Slurm doesn't
> allocate nodes using a compact strategy. Does anyone know how to solve this?
> Will upgrading to the latest Slurm version help?

[slurm-users] Compact scheduling strategy for small GPU jobs

2021-08-05 Thread Jack Chen
I'm using Slurm 15.08.11; when I submit several 1-GPU jobs, Slurm doesn't allocate nodes using a compact strategy. Does anyone know how to solve this? Will upgrading to the latest Slurm version help? For example, there are two nodes A and B with 8 GPUs per node; I submitted eight 1-GPU jobs, and Slurm will allocate fir…
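For concreteness, a sketch of the scenario described above, using a placeholder workload (sleep) in place of a real training job; squeue then shows which node each job landed on:

    # submit eight 1-GPU jobs (placeholder workload)
    for i in $(seq 1 8); do
        sbatch --gres=gpu:1 --wrap="sleep 600"
    done
    # check placement: with compact scheduling, all eight should land on one node
    squeue -u $USER -o "%.10i %.8T %N"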

Re: [slurm-users] ML Training task killed(SIGKILL) when cgroup cpu limit enabled in slurm15.08

2021-07-02 Thread Jack Chen
OK, thanks for your quick response. I will find a way to upgrade it.

On Fri, Jul 2, 2021 at 2:12 PM Ole Holm Nielsen wrote:
> On 7/2/21 7:34 AM, Jack Chen wrote:
> > Slurm is great to use; I've developed several plugins for it. Now I'm
> > working on an issue in Slurm…

[slurm-users] ML Training task killed(SIGKILL) when cgroup cpu limit enabled in slurm15.08

2021-07-01 Thread Jack Chen
Slurm is great to use; I've developed several plugins for it. Now I'm working on an issue in Slurm. I'm using Slurm 15.08-11; after I enabled cgroups, some training jobs' tasks are killed (SIGKILL) after a few hours. This can be reproduced repeatedly. After turning off cgroups, the problem disappears. Linux kernel: 3…
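For context, a minimal sketch of the cgroup setup being described, with illustrative values; the relevant pieces are TaskPlugin=task/cgroup in slurm.conf and the constraint options in cgroup.conf:

    # slurm.conf
    TaskPlugin=task/cgroup

    # cgroup.conf
    CgroupAutomount=yes
    ConstrainCores=yes       # the CPU limit referred to above
    ConstrainRAMSpace=yes    # if set, a task exceeding its memory limit is OOM-killed (seen as SIGKILL)

If ConstrainRAMSpace (or ConstrainSwapSpace) is also enabled, the kernel OOM killer is one common source of SIGKILLs like the one reported here; with the old 3.x kernel and Slurm 15.08 combination, upgrading is the usual advice.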