I would have thought partition QoS is the way to do this. We add partition
QoS to our partition definitions, and implement quotas on usage as well.
PartitionName=physical Nodes=... Default=YES MaxTime=30-0 DefaultTime=0:10:0 State=DOWN QoS=physical TRESBillingWeights=CPU=1.0,Mem=4.0G
We then define the corresponding QoS and its usage limits in the accounting database.
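For reference, a minimal sketch of defining the matching QoS with sacctmgr; the limit values below are placeholders, not our actual quotas:

    # Create the QoS named in the partition definition above, then attach
    # example limits (placeholder values; MaxWall mirrors the partition's
    # 30-day MaxTime).
    sacctmgr add qos name=physical
    sacctmgr modify qos where name=physical set MaxTRESPerUser=cpu=256 MaxWall=30-00:00:00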
Stack Korora writes:
> On 3/1/21 4:26 PM, Prentice Bisbal wrote:
>> Two things:
>>
>> 1. So your users are okay with specifying a partition, but specifying
>> a QOS is a bridge too far?
>
> *sigh* Yeah. It's been requested several times. I can't defend it, but
> if it makes them happy...then they will find something else to complain
> about.
Two things:
1. So your users are okay with specifying a partition, but specifying a
QOS is a bridge too far?
2. Have your job_submit.lua script filter the jobs into the correct QOS.
You can check the partition and set the QOS accordingly.
First, you need to have this set in your slurm.conf:

JobSubmitPlugins=lua
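A minimal job_submit.lua sketch of such a filter, assuming QOS names that mirror the compute/gpu/long partitions described in the original question:

    -- job_submit.lua: give each job the QOS matching its partition so
    -- users only have to specify the partition. Partition/QOS names
    -- below are illustrative.
    local partition_to_qos = {
       compute = "compute",
       gpu     = "gpu",
       long    = "long",
    }

    function slurm_job_submit(job_desc, part_list, submit_uid)
       -- Only set a QOS if the user did not request one explicitly.
       -- Note: job_desc.partition can be a comma-separated list; this
       -- sketch assumes a single partition was requested.
       if job_desc.qos == nil and job_desc.partition ~= nil then
          local qos = partition_to_qos[job_desc.partition]
          if qos ~= nil then
             job_desc.qos = qos
          end
       end
       return slurm.SUCCESS
    end

    function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
       return slurm.SUCCESS
    end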
Greetings,
We have different node classes that we've set up in different
partitions. For example, we have our standard compute nodes in compute;
our GPUs in a gpu partition; and jobs that need to run for months go
into a long partition with a different set of machines.
For each partition, we would like jobs to be assigned the matching QOS
automatically, without users having to request it.
Thanks for the info and link to your bug report. Unfortunately, my
GraceTime is already set to zero for that QOS:
$ sacctmgr show qos interruptible format=Name,gracetime
      Name  GraceTime
---------- ----------
interrupt+   00:00:00
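For reference, that value is set per QOS with sacctmgr, e.g.:

    sacctmgr modify qos where name=interruptible set GraceTime=0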
On 2/26/21 3:58 PM, Michael Robbert wrote:
We saw som
Hello
I have a ticket open with SchedMD, but this may be an issue the community
has seen and can answer quickly.
Slurmctld segfaulted (signal 11) on us and now segfaults on restart. I'm
not aware of an obvious trigger for this behavior.
We upgraded this cluster from 20.02.5 to 20.11.
Wow. I am definitely having a Monday.
I used that all the time and just could not even remember the word to
search for.
Thanks!
Brian
On 3/1/2021 9:23 AM, Sarlo, Jeffrey S wrote:
> Were you thinking of this?
> * Report current jobs that have been orphaned on the local cluster
>   and are now runaway:
>       sacctmgr show RunawayJobs
Were you thinking of this?
* Report current jobs that have been orphaned on the local cluster and are
  now runaway:
      sacctmgr show RunawayJobs
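When it finds any, sacctmgr will also offer to fix them by setting an end
time, which repairs the accounting records.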
From: slurm-users on behalf of Brian Andrus
Sent: Monday, March 1, 2021 11:14 AM
To: slurm-users@lists.schedmd.com
All,
IIRC, there was a command that would repair the accounting tables when a
job had no endtime.
I can't seem to find the info for that. Does anyone recall such a thing?
Brian Andrus