Re: [slurm-users] Job canceled after reaching QOS limits for CPU time.

2020-10-30 Thread Zacarias Benta
And also the DMTCP project. On 30/10/2020 14:10, Thomas M. Payerle wrote: On Fri, Oct 30, 2020 at 5:37 AM Loris Bennett <loris.benn...@fu-berlin.de> wrote: Hi Zacarias, Zacarias Benta <zacar...@lip.pt> writes: > Good morning everyone. > > I'm having a "is

Re: [slurm-users] Job canceled after reaching QOS limits for CPU time.

2020-10-30 Thread Zacarias Benta
Thanks Tom, you are right: it is suspended, not pending, that I would like the job state to go into. I'll take a look at the OverTimeLimit flag and see if it helps. On 30/10/2020 14:10, Thomas M. Payerle wrote: On Fri, Oct 30, 2020 at 5:37 AM Loris Bennett mailto:loris.benn...@fu-b

[slurm-users] How to see how much a job would be billed?

2020-10-30 Thread Diego Zuccato
Hello all. Is there a way to see how much a job, requesting the given set of TRES for the given time, would get billed according to the scheduler decisions? I tried srun --test-only -t 1-0 -p blade -n 32 --mem 1g but it only reports the expected start time. I thought I saw it, but can't find it

Re: [slurm-users] Job canceled after reaching QOS limits for CPU time.

2020-10-30 Thread Diego Zuccato
On 30/10/20 14:38, Zacarias Benta wrote: > I know it sounds kind of silly giving a limit and at the same time > allowing for exceptions, but we are trying to prevent the waste of > valuable cpu time. Then convince your users to use checkpointing. Then use shorter run times (we have 24h for 'nor

Re: [slurm-users] Job canceled after reaching QOS limits for CPU time.

2020-10-30 Thread Thomas M. Payerle
On Fri, Oct 30, 2020 at 5:37 AM Loris Bennett wrote: > Hi Zacarias, > > Zacarias Benta writes: > > > Good morning everyone. > > > > I'm having an "issue", I don't know if it is a "bug or a feature". > > I've created a QOS: "sacctmgr add qos myqos set GrpTRESMins=cpu=10 > > flags=NoDecay". I know

Re: [slurm-users] Job canceled after reaching QOS limits for CPU time.

2020-10-30 Thread Zacarias Benta
Hi Loris, Thanks for taking the time to reply to my message. We do indeed want to limit and not limit at the same time. I know that it is kind of tricky, but let me try to explain. Our HPC center currently limits jobs from running for more than 5 days straight when users submit single core

Re: [slurm-users] Job canceled after reaching QOS limits for CPU time.

2020-10-30 Thread Loris Bennett
Hi Zacarias, Zacarias Benta writes: > Good morning everyone. > > I'm having an "issue", I don't know if it is a "bug or a feature". > I've created a QOS: "sacctmgr add qos myqos set GrpTRESMins=cpu=10 > flags=NoDecay". I know the limit is too low, but I just wanted to > give you guys an example.

[slurm-users] Node random selection

2020-10-30 Thread Gestió Servidors
Hello, My student cluster has 12 computers that act as execution nodes. I have configured a partition in which these 12 computers are defined. When someone submits a job that requires only one computer and all 12 computers are available, the job always runs on the first computer defined in slurm.conf.