[slurm-users] pty jobs are killed when another job on the same node terminates

2024-02-22 Thread Thomas Hartmann via slurm-users

Hi,

when I start an interactive job like this:

srun --pty --mem=3G -c2 bash

I then schedule and run other jobs (interactive or non-interactive), and when 
one of the jobs running on the same node terminates, the interactive job 
gets killed with this message:


srun: error: node01.abc.at: task 0: Killed

I've attached our Slurm config. Does anybody have an idea what is going on 
here, or where I could look to debug this? I'm quite new to Slurm, so I don't 
know all the places to look...


Thanks a lot in advance!

Thomas
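
(A hedged sketch of where one could start digging, assuming slurmd and 
slurmctld run under systemd; the node name and epilog path below are taken 
from the attached config, which logs to syslog:)

  # on node01.abc.at, around the time the task was killed
  journalctl -u slurmd
  # on the controller host
  journalctl -u slurmctld
  # check which prolog/epilog scripts are configured
  scontrol show config | grep -i -E 'prolog|epilog'
  # this epilog runs whenever any job on the node finishes
  less /etc/slurm/slurm.epilog.clean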
#
# Example slurm.conf file. Please run configurator.html
# (in doc/html) to build a configuration file customized
# for your environment.
#
#
# slurm.conf file generated by configurator.html.
#
# See the slurm.conf man page for more information.
#
ClusterName=openhpc
SlurmctldHost=abc.at
#DisableRootJobs=NO
#EnforcePartLimits=NO
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=67043328
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=lua
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=1
#MaxStepCount=4
#MaxTasksPerNode=512
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurm # NB: not OpenHPC default!
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool/slurm
SwitchType=switch/none
#TaskEpilog=
#TaskPlugin=task/affinity
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=300
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
#
#
# JOB PRIORITY
#PriorityFlags=
PriorityType=priority/multifactor
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
PriorityWeightPartition=1000
#PriorityWeightQOS=
PreemptType=preempt/qos
PreemptMode=requeue
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
AccountingStorageHost=slurmdb.abc.at
AccountingStoragePort=6819
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageUser=slurm_db
#AccountingStoreFlags=
#JobCompHost=
JobCompLoc=/var/log/slurm_jobacct.log
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
#JobContainerType=job_container/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/cgroup

# By default, SLURM will log to syslog, which is what we want
SlurmctldSyslogDebug=info
SlurmdSyslogDebug=info
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#DebugFlags=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES - NOT SUPPORTED IN THIS APPLIANCE VERSION

# LOGIN-ONLY NODES
# Define slurmd nodes not in partitions for login-only nodes in "configless" mode:

# COMPUTE NODES
# OpenHPC default configuration
PropagateResourceLimitsExcept=MEMLOCK
Epilog=/etc/slurm/slurm.epilog.clean

# openhpc_slurm_partitions group: openhpc_interactive

NodeName=node01.abc.at State=UNKNOWN RealMemory=420202 Sockets=2 CoresPerSocket=30 ThreadsPerCore=1
NodeName=node02.abc.at State=UNKNOWN RealMemory=420202 Sockets=2 CoresPerSocket=30 ThreadsPerCore=1
NodeName=node03.abc.at State=UNKNOWN RealMemory=420202 Sockets=2 CoresPerSocket=30 ThreadsPerCore=1
NodeName=node04.abc.at State=UNKNOWN RealMemory=420202 Sockets=2 CoresPerSocket=30 ThreadsPerCore=1
NodeName=node05.abc.at State=UNKNOWN RealMemory=420202 Sockets=2 CoresPerSocket=30 ThreadsPerCore=1
NodeName=node06.abc.at State=UNKNOWN RealMemory=420202 Sockets=2 CoresPerSocket=30 ThreadsPerCore=1

PartitionName=interactive Default=YES MaxTime=2-08:00:00 State=UP Nodes=node01.abc.at,node02.abc.at,node03.abc.at,node04.abc.at,node05.abc.at,node06.abc.at Priority=100

# Define a non-existent node, in no partition, so that slurmctld starts even with all partitions empty
NodeName=nonesuch

SlurmctldParameters=enable_configless
ReturnToService=2
PrologFlags=contain,x11
TaskPlugin=task/cgroup,task/affinity
PriorityFavorSmall=YES
PriorityDecayHalfLife=14-0
PriorityWeightAge=1000
PriorityWeightFairshare=1
PriorityWeightJobSize=1000
PriorityWeightQOS=100




[slurm-users] Re: How to get usage data for a QOS

2024-03-01 Thread Thomas Hartmann via slurm-users

Thanks a lot!

On 01.03.24 at 20:58, Maciej Pawlik via slurm-users wrote:

Hello,

This information can be found in the output of "scontrol show 
assoc_mgr qos=<qos_name>".


best regards
Maciej Pawlik

On Wed, 28 Feb 2024 at 16:04, thomas.hartmann--- via slurm-users wrote:


Hi,
So, I figured out that I can give some users priority access for a
specific amount of TRES by creating a QOS with the GrpTRESMins
property and the DenyOnLimit,NoDecay flags. This works nicely.

However, I would like to know how much of this has already been
consumed, and I have not yet found a way to do this. Like: how can
I get the amount of TRES / TRES minutes consumed for a certain QOS?

Thanks a lot!
Thomas
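
(A hedged illustration of the command from the reply above; the QOS name 
"priority_access" is invented and the exact output layout differs between 
Slurm versions, but the consumed amount typically appears in parentheses 
next to each configured limit:)

  $ scontrol show assoc_mgr qos=priority_access flags=qos
  ...
      GrpTRESMins=cpu=100000(12345)
  ...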



[slurm-users] Re: [EXTERN] Re: scheduling according time requirements

2024-04-30 Thread Thomas Hartmann via slurm-users

Hi Dietmar,

I was facing requirements quite similar to yours. We ended up using QoS 
instead of partitions because that approach provides more flexibility 
and more features. The basic distinction between the two is that 
partitions are node-based while QoS are (essentially) resource-based. 
So, instead of saying "Long jobs can only run on nodes 9 and 10", 
you would say "Long jobs can only use X CPU cores in total".
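
(Purely as an illustration of that distinction, with an invented QOS name and 
made-up numbers; note that AccountingStorageEnforce generally has to include 
"limits", and usually "qos", for such limits to be enforced:)

  sacctmgr add qos long_jobs
  sacctmgr modify qos where name=long_jobs set GrpTRES=cpu=120 MaxWall=14-00:00:00
  # users then submit long jobs with e.g.:  sbatch --qos=long_jobs job.sh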


However, yes, your partition-based approach is going to do the job, as 
long as you do not need any QoS-based preemption.


Cheers,

Thomas

On 30.04.24 at 16:00, Dietmar Rieder via slurm-users wrote:

Hi Loris,

On 4/30/24 3:43 PM, Loris Bennett via slurm-users wrote:

Hi Dietmar,

Dietmar Rieder via slurm-users writes:


Hi Loris,

On 4/30/24 2:53 PM, Loris Bennett via slurm-users wrote:

Hi Dietmar,
Dietmar Rieder via slurm-users writes:


Hi,

is it possible to have Slurm automatically schedule jobs to a fitting
partition according to their "-t" time requirements?

e.g. 3 partitions

PartitionName=standard Nodes=c-[01-10] Default=YES MaxTime=04:00:00 DefaultTime=00:10:00 State=UP OverSubscribe=NO
PartitionName=medium Nodes=c-[04-08] Default=NO MaxTime=24:00:00 DefaultTime=04:00:00 State=UP OverSubscribe=NO
PartitionName=long Nodes=c-[09-10] Default=NO MaxTime=336:00:00 DefaultTime=24:00:00 State=UP OverSubscribe=NO


So in the standard partition, which is the default, we have all nodes
and a max time of 4h; in the medium partition we have 4 nodes with a
max time of 24h; and in the long partition we have 2 nodes with a max
time of 336h.

I was hoping that if I submit a job with -t 01:00:00 it can be run on
any node (standard partition), whereas when specifying -t 05:00:00 or
-t 48:00:00 the job will run on the nodes of the medium or long
partition respectively.

However, my job will not get scheduled at all when -t is greater than
01:00:00

i.e.

$ srun --cpus-per-task 1 -t 01:00:01 --pty bash
srun: Requested partition configuration not available now
srun: job 42095 queued and waiting for resources

It will wait forever because the standard partition is selected; I was
thinking that Slurm would automatically switch to the medium partition.

Do I misunderstand something there? Or can this be somehow configured?

You can specify multiple partitions, e.g.

  $ salloc --cpus-per-task=1 --time=01:00:01 --partition=standard,medium,long

Notice that, rather than using 'srun ... --pty bash', as far as I
understand the preferred method is to use 'salloc' as above, and to use
'srun' for starting MPI processes.


Thanks for the hint. This works nicely, but it would be nice if I
did not need to specify the partition at all. Any thoughts?


I am not aware that you can set multiple partitions as the default.


Diego suggested a possible way which seems to work after a quick test.
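
(For reference, and not necessarily the approach meant here: one common way 
to get such automatic routing is a small job_submit.lua plugin, enabled with 
JobSubmitPlugins=lua in slurm.conf. A rough, untested sketch using the 
partition names and time limits from the example above; job_desc.time_limit 
is given in minutes:)

  -- /etc/slurm/job_submit.lua (illustrative sketch)
  function slurm_job_submit(job_desc, part_list, submit_uid)
     -- only route jobs that did not request a partition explicitly
     if job_desc.partition == nil then
        local t = job_desc.time_limit        -- requested -t, in minutes
        if t ~= nil and t ~= slurm.NO_VAL then
           if t <= 4 * 60 then
              job_desc.partition = "standard"
           elseif t <= 24 * 60 then
              job_desc.partition = "medium"
           else
              job_desc.partition = "long"
           end
        end
     end
     return slurm.SUCCESS
  end

  function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
     return slurm.SUCCESS
  end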



The question is why you actually need partitions with different maximum
runtimes.


We would like to have only a subset of the nodes in a partition for 
long-running jobs, so that there are enough nodes available for short 
jobs.

The nodes of the long partition, however, are also part of the short 
partition, so they can also be utilized when no long jobs are running.


That's our idea




In our case, a university cluster with a very wide range of codes and
usage patterns, multiple partitions would probably lead to fragmentation
and waste of resources because the job mix does not always fit the
various partitions well.  Therefore, I am a member of the "as few
partitions as possible" camp, and so in our set-up we essentially have
only one partition with a DefaultTime of 14 days.  We do, however, let
users set a QOS to gain a priority boost in return for accepting a
shorter run-time and a reduced maximum number of cores.
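
(Again only a hedged sketch of such a QOS; the name, priority value and 
limits below are invented:)

  sacctmgr add qos short_boost
  sacctmgr modify qos where name=short_boost set Priority=1000 MaxWall=3-00:00:00 MaxTRESPerUser=cpu=128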


we didn't look into QOS yet, but this might also be a way to go, thanks.


Occasionally people complain about short jobs having to wait in the
queue for too long, but I have generally been successful in solving the
problem by having them estimate their resource requirements better or
bundle their work in order to increase the run-time-to-wait-time
ratio.



Dietmar

