It shouldn't impact running jobs; all it should really do is affect pending jobs, as it will order them by their relative priority scores.
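For example (just a rough sketch, and column layout varies a bit by version), once the multifactor plugin is active you can see that per-factor breakdown for pending jobs with sprio:

    # Show the weighted priority factors (age, fairshare, job size, etc.)
    # for all pending jobs
    sprio -l

    # Or for one particular job (12345 here is just a placeholder job ID)
    sprio -j 12345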

-Paul Edmon-

On 4/30/2021 12:39 PM, Walsh, Kevin wrote:
Hello everyone,

We wish to deploy a "fair share" scheduling configuration and would like to ask whether we should be aware of any effects this might have on jobs that are already running or already queued when the config is changed.

The proposed changes are from the example at https://slurm.schedmd.com/archive/slurm-18.08.9/priority_multifactor.html#config :

    # Activate the Multi-factor Job Priority Plugin with decay
    PriorityType=priority/multifactor
    # 2 week half-life
    PriorityDecayHalfLife=14-0
    # The larger the job, the greater its job size priority.
    PriorityFavorSmall=NO
    # The job's age factor reaches 1.0 after waiting in the
    # queue for 2 weeks.
    PriorityMaxAge=14-0
    # This next group determines the weighting of each of the
    # components of the Multi-factor Job Priority Plugin.
    # The default value for each of the following is 1.
    PriorityWeightAge=1000
    PriorityWeightFairshare=10000
    PriorityWeightJobSize=1000
    PriorityWeightPartition=1000
    PriorityWeightQOS=0 # don't use the qos factor
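After the change takes effect, the values slurmctld actually loaded could presumably be double-checked with something along these lines (a sketch; the exact output format varies by version):

    # Confirm the priority-related settings the controller is running with
    scontrol show config | grep -i "^Priority"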

We're running Slurm 18.08.8 on CentOS Linux 7.8.2003. The current slurm.conf uses the defaults as far as fair share is concerned:

    EnforcePartLimits=ALL
    GresTypes=gpu
    MpiDefault=pmix
    ProctrackType=proctrack/cgroup
    PrologFlags=x11,contain
    PropagateResourceLimitsExcept=MEMLOCK,STACK
    RebootProgram=/sbin/reboot
    ReturnToService=1
    SlurmctldPidFile=/var/run/slurmctld.pid
    SlurmctldPort=6817
    SlurmdPidFile=/var/run/slurmd.pid
    SlurmdPort=6818
    SlurmdSpoolDir=/var/spool/slurmd
    SlurmUser=slurm
    SlurmdSyslogDebug=verbose
    StateSaveLocation=/var/spool/slurm/ctld
    SwitchType=switch/none
    TaskPlugin=task/cgroup,task/affinity
    TaskPluginParam=Sched
    HealthCheckInterval=300
    HealthCheckProgram=/usr/sbin/nhc
    InactiveLimit=0
    KillWait=30
    MinJobAge=300
    SlurmctldTimeout=120
    SlurmdTimeout=300
    Waittime=0
    DefMemPerCPU=1024
    FastSchedule=1
    SchedulerType=sched/backfill
    SelectType=select/cons_res
    SelectTypeParameters=CR_Core_Memory
    AccountingStorageHost=sched-db.lan
    AccountingStorageLoc=slurm_acct_db
    AccountingStoragePass=/var/run/munge/munge.socket.2
    AccountingStoragePort=6819
    AccountingStorageType=accounting_storage/slurmdbd
    AccountingStorageUser=slurm
    AccountingStoreJobComment=YES
    AccountingStorageTRES=gres/gpu
    JobAcctGatherFrequency=30
    JobAcctGatherType=jobacct_gather/linux
    SlurmctldDebug=info
    SlurmdDebug=info
    SlurmSchedLogFile=/var/log/slurm/slurmsched.log
    SlurmSchedLogLevel=1

Node and partition configs are omitted above.
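Since accounting already goes through slurmdbd, the association tree the new fair-share factor would draw on can be inspected ahead of time with something like the following (a sketch; the columns shown depend on version and flags):

    # Show shares, raw usage, and the resulting fair-share factor
    # for every association known to the accounting database
    sshare -a -l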

Any and all advice will be greatly appreciated.

Best wishes,

~Kevin

Kevin Walsh
Senior Systems Administration Specialist
New Jersey Institute of Technology
Academic & Research Computing Systems

