Hi Paul, I'm wondering about this part in your SchedulerParameters:
### default_queue_depth should be some multiple of the partition_job_depth,
### ideally number_of_partitions * partition_job_depth, but typically the main
### loop exits prematurely if you go over about 400. A partition_job_depth of
### 10 seems to work well.

Do you remember if that's still the case, or if it was related to a reported
issue? That sure sounds like something that would need to be fixed if it
hasn't been already. (There's a quick back-of-the-envelope reading of those
numbers at the bottom of this mail.)

Cheers,
--
Kilian

On Wed, May 29, 2019 at 7:42 AM Paul Edmon <ped...@cfa.harvard.edu> wrote:

> For reference we are running 18.08.7
>
> -Paul Edmon-
>
> On 5/29/19 10:39 AM, Paul Edmon wrote:
>
> Sure. Here is what we have:
>
> ########################## Scheduling #####################################
> ### This section is specific to scheduling
>
> ### Tells the scheduler to enforce limits for all partitions
> ### that a job submits to.
> EnforcePartLimits=ALL
>
> ### Lets slurm know that we have a jobsubmit.lua script
> JobSubmitPlugins=lua
>
> ### When a job is launched this has slurmctld send the user information
> ### instead of having AD do the lookup on the node itself.
> LaunchParameters=send_gids
>
> ### Maximum sizes for Jobs.
> MaxJobCount=200000
> MaxArraySize=10000
> DefMemPerCPU=100
>
> ### Job Timers
> CompleteWait=0
>
> ### We set the EpilogMsgTime long so that Epilog Messages don't pile up all
> ### at one time due to forced exit which can cause problems for the master.
> EpilogMsgTime=3000000
> InactiveLimit=0
> KillWait=30
>
> ### This only applies to the reservation time limit, the job must still obey
> ### the partition time limit.
> ResvOverRun=UNLIMITED
> MinJobAge=600
> Waittime=0
>
> ### Scheduling parameters
> ### FastSchedule 2 lets slurm know not to auto detect the node config
> ### but rather follow our definition. We also use setting 2 as due to our
> ### geographic size nodes may drop out of slurm and then reconnect. If we
> ### had 1 they would be set to drain when they reconnect. Setting it to 2
> ### allows them to rejoin without issue.
> FastSchedule=2
> SchedulerType=sched/backfill
> SelectType=select/cons_res
> SelectTypeParameters=CR_Core_Memory
>
> ### Governs default preemption behavior
> PreemptType=preempt/partition_prio
> PreemptMode=REQUEUE
>
> ### default_queue_depth should be some multiple of the partition_job_depth,
> ### ideally number_of_partitions * partition_job_depth, but typically the main
> ### loop exits prematurely if you go over about 400. A partition_job_depth of
> ### 10 seems to work well.
> SchedulerParameters=\
> default_queue_depth=1150,\
> partition_job_depth=10,\
> max_sched_time=50,\
> bf_continue,\
> bf_interval=30,\
> bf_resolution=600,\
> bf_window=11520,\
> bf_max_job_part=0,\
> bf_max_job_user=10,\
> bf_max_job_test=10000,\
> bf_max_job_start=1000,\
> bf_ignore_newly_avail_nodes,\
> kill_invalid_depend,\
> pack_serial_at_end,\
> nohold_on_prolog_fail,\
> preempt_strict_order,\
> preempt_youngest_first,\
> max_rpc_cnt=8
>
> ################################ Fairshare ################################
> ### This section sets the fairshare calculations
>
> PriorityType=priority/multifactor
>
> ### Settings for fairshare calculation frequency and shape.
> FairShareDampeningFactor=1
> PriorityDecayHalfLife=28-0
> PriorityCalcPeriod=1
>
> ### Settings for fairshare weighting.
> PriorityMaxAge=7-0
> PriorityWeightAge=10000000
> PriorityWeightFairshare=20000000
> PriorityWeightJobSize=0
> PriorityWeightPartition=0
> PriorityWeightQOS=1000000000
>
> I'm happy to chat about any of the settings if you want, or share our full
> config.
>
> -Paul Edmon-
>
> On 5/29/19 10:17 AM, Julius, Chad wrote:
>
> > All,
> >
> > We rushed our Slurm install due to a short timeframe and missed some
> > important items. We are now looking to implement a better system than the
> > first in, first out we have now. My question: are the defaults listed in
> > the slurm.conf file a good start? Would anyone be willing to share the
> > Scheduling section of their .conf? Also, we are looking to increase the
> > maximum array size, but I don't see that in the slurm.conf in version 17.
> > Am I looking at an upgrade of Slurm in the near future, or can I just add
> > MaxArraySize=somenumber?
> >
> > The defaults as of 17.11.8 are:
> >
> > # SCHEDULING
> > #SchedulerAuth=
> > #SchedulerPort=
> > #SchedulerRootFilter=
> > #PriorityType=priority/multifactor
> > #PriorityDecayHalfLife=14-0
> > #PriorityUsageResetPeriod=14-0
> > #PriorityWeightFairshare=100000
> > #PriorityWeightAge=1000
> > #PriorityWeightPartition=10000
> > #PriorityWeightJobSize=1000
> > #PriorityMaxAge=1-0
> >
> > Chad Julius
> > Cyberinfrastructure Engineer Specialist
> >
> > Division of Technology & Security
> > SOHO 207, Box 2231
> > Brookings, SD 57007
> > Phone: 605-688-5767
> >
> > www.sdstate.edu
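
PS: if I read that comment literally, the 1150 in your config presumably comes
from the number_of_partitions * partition_job_depth rule of thumb, i.e. roughly:

  default_queue_depth = number_of_partitions * partition_job_depth
                      ~ 115 * 10
                      = 1150

(the 115 is just what default_queue_depth=1150 divided by partition_job_depth=10
implies, not a number I actually know for your site), which is well past the ~400
mark where the comment says the main loop starts exiting prematurely. That's what
prompted the question about whether that limit is still real.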
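
And a note on Chad's array-size question quoted above: as far as I know,
MaxArraySize is an ordinary slurm.conf option in 17.11 as well, so it should just
be a matter of setting it rather than upgrading. A minimal sketch, using the
values from Paul's config purely as an illustration and not as a recommendation:

  # slurm.conf
  MaxArraySize=10000     # highest usable array index is MaxArraySize - 1
  MaxJobCount=200000     # array tasks ultimately count against this limit

I believe at least MaxJobCount only takes effect on a slurmctld restart rather
than a plain "scontrol reconfigure", so it's worth double-checking the slurm.conf
man page for your exact version before rolling it out.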