We're seeing some pretty bad performance with around 3000 jobs in the queue. We're using sched/backfill, and I've been tweaking the bf_ parameters to try to improve things, with limited results. But even before the backfill process starts, the main scheduling loop takes so long per job that it times out before it finishes scheduling a single job. From watching the logs, when backfill does run it takes 17 seconds or more to test each job, so it doesn't get very far either.
One thing of note, most of the 3000 jobs are serial jobs in our scavenger queue, which is preemptible. We're currently running SLURM version 16.05.1, a bit behind, I know. Any pointers on what to look at would be appreciated.

Thanks,
Kevin

--
Kevin Hildebrand
University of Maryland
Division of IT
*******************************************************
sdiag output at Mon Jun 11 10:09:17 2018
Data since      Mon Jun 11 09:49:18 2018
*******************************************************
Server thread count: 7
Agent queue size:    0

Jobs submitted: 7
Jobs started:   1
Jobs completed: 3
Jobs canceled:  3
Jobs failed:    0

Main schedule statistics (microseconds):
        Last cycle:        20017144
        Max cycle:         20897866
        Total cycles:      41
        Mean cycle:        18957323
        Mean depth cycle:  1
        Cycles per minute: 2
        Last queue length: 11

Backfilling stats (WARNING: data obtained in the middle of backfilling execution.)
        Total backfilled jobs (since last slurm start): 1
        Total backfilled jobs (since last stats cycle start): 1
        Total cycles: 2
        Last cycle when: Mon Jun 11 10:07:15 2018
        Last cycle: 224511653
        Max cycle:  224511653
        Mean cycle: 213405214
        Last depth cycle: 3
        Last depth cycle (try sched): 3
        Depth Mean: 4
        Depth Mean (try depth): 4
        Last queue length: 11
        Queue length mean: 12

Remote Procedure Call statistics by message type
        MESSAGE_NODE_REGISTRATION_STATUS  ( 1002) count:843 ave_time:4903158  total_time:4133362640
        REQUEST_PARTITION_INFO            ( 2009) count:98  ave_time:3133     total_time:307125
        REQUEST_JOB_INFO                  ( 2003) count:94  ave_time:14811231 total_time:1392255803
        REQUEST_NODE_INFO                 ( 2007) count:86  ave_time:9364554  total_time:805351683
        REQUEST_NODE_INFO_SINGLE          ( 2040) count:61  ave_time:13399191 total_time:817350682
        REQUEST_UPDATE_NODE               ( 3002) count:42  ave_time:2736748  total_time:114943454
        REQUEST_JOB_STEP_INFO             ( 2005) count:38  ave_time:1480198  total_time:56247529
        MESSAGE_EPILOG_COMPLETE           ( 6012) count:31  ave_time:16408225 total_time:508655003
        REQUEST_JOB_INFO_SINGLE           ( 2021) count:28  ave_time:10754115 total_time:301115228
        REQUEST_COMPLETE_PROLOG           ( 6018) count:20  ave_time:18418352 total_time:368367040
        REQUEST_SUBMIT_BATCH_JOB          ( 4003) count:7   ave_time:12642347 total_time:88496435
        REQUEST_STATS_INFO                ( 2035) count:5   ave_time:213      total_time:1069
        REQUEST_STEP_COMPLETE             ( 5016) count:4   ave_time:13680174 total_time:54720699
        REQUEST_BUILD_INFO                ( 2001) count:4   ave_time:369      total_time:1478
        REQUEST_COMPLETE_BATCH_SCRIPT     ( 5018) count:3   ave_time:6582324  total_time:19746973
        REQUEST_JOB_STEP_CREATE           ( 5001) count:3   ave_time:6392575  total_time:19177726
        REQUEST_KILL_JOB                  ( 5032) count:3   ave_time:6208358  total_time:18625074
        REQUEST_JOB_ALLOCATION_INFO_LITE  ( 4016) count:3   ave_time:21989479 total_time:65968437
        REQUEST_SHARE_INFO                ( 2022) count:3   ave_time:1449     total_time:4348
        REQUEST_JOB_USER_INFO             ( 2039) count:1   ave_time:49483669 total_time:49483669
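
Since sdiag reports the scheduler cycle times in microseconds, here is a minimal Python sketch that converts the figures above to seconds so they are easier to read. The function name cycle_times_seconds and the embedded text are only illustrative; the numbers are copied from the sdiag output above, and the parsing assumes the "Last/Max/Mean cycle: <number>" wording shown there.

```python
import re

# Cycle-time lines copied from the sdiag output above (values in microseconds).
SDIAG_SNIPPET = """
Main schedule statistics (microseconds):
        Last cycle:   20017144
        Max cycle:    20897866
        Total cycles: 41
        Mean cycle:   18957323
Backfilling stats
        Last cycle: 224511653
        Max cycle:  224511653
        Mean cycle: 213405214
"""


def cycle_times_seconds(text):
    """Return {(section, field): seconds} for each Last/Max/Mean cycle line."""
    results = {}
    section = None
    for line in text.splitlines():
        stripped = line.strip()
        # Track which block of the report we are in.
        if stripped.startswith("Main schedule statistics"):
            section = "main"
        elif stripped.startswith("Backfilling stats"):
            section = "backfill"
        # Only match the plain cycle-time lines, not "Total cycles" etc.
        match = re.match(r"(Last|Max|Mean) cycle:\s+(\d+)$", stripped)
        if match and section:
            key = (section, match.group(1).lower())
            results[key] = int(match.group(2)) / 1_000_000
    return results


times = cycle_times_seconds(SDIAG_SNIPPET)
for (section, field), seconds in times.items():
    print(f"{section:8s} {field:4s} cycle: {seconds:8.1f} s")
```

Read this way, the main scheduling cycle averages about 19 seconds and the backfill cycle about 213 seconds.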
Attachment: slurm.conf (binary data)