We're seeing very poor scheduling performance with around 3000 jobs in
the queue. We're using sched/backfill, and I've been tweaking the bf_
parameters to try to improve things, with limited results.
But even before the backfill process starts, the main scheduling loop
takes so long per job that it times out before it finishes scheduling
even a single job.
From watching the logs, when backfill does run it's taking 17 seconds
or more to test each job, so it doesn't get very far either.

One thing of note: most of the 3000 jobs are serial jobs in our
scavenger queue, which is preemptible.

We're currently running SLURM version 16.05.1, which I know is a bit behind.

Any pointers on what to look at would be appreciated.

Thanks,
Kevin
--
Kevin Hildebrand
University of Maryland
Division of IT
*******************************************************
sdiag output at Mon Jun 11 10:09:17 2018
Data since      Mon Jun 11 09:49:18 2018
*******************************************************
Server thread count: 7
Agent queue size:    0

Jobs submitted: 7
Jobs started:   1
Jobs completed: 3
Jobs canceled:  3
Jobs failed:    0

Main schedule statistics (microseconds):
        Last cycle:   20017144
        Max cycle:    20897866
        Total cycles: 41
        Mean cycle:   18957323
        Mean depth cycle:  1
        Cycles per minute: 2
        Last queue length: 11

Backfilling stats (WARNING: data obtained in the middle of backfilling execution.)
        Total backfilled jobs (since last slurm start): 1
        Total backfilled jobs (since last stats cycle start): 1
        Total cycles: 2
        Last cycle when: Mon Jun 11 10:07:15 2018
        Last cycle: 224511653
        Max cycle:  224511653
        Mean cycle: 213405214
        Last depth cycle: 3
        Last depth cycle (try sched): 3
        Depth Mean: 4
        Depth Mean (try depth): 4
        Last queue length: 11
        Queue length mean: 12

Remote Procedure Call statistics by message type
        MESSAGE_NODE_REGISTRATION_STATUS        ( 1002) count:843    ave_time:4903158  total_time:4133362640
        REQUEST_PARTITION_INFO                  ( 2009) count:98     ave_time:3133     total_time:307125
        REQUEST_JOB_INFO                        ( 2003) count:94     ave_time:14811231 total_time:1392255803
        REQUEST_NODE_INFO                       ( 2007) count:86     ave_time:9364554  total_time:805351683
        REQUEST_NODE_INFO_SINGLE                ( 2040) count:61     ave_time:13399191 total_time:817350682
        REQUEST_UPDATE_NODE                     ( 3002) count:42     ave_time:2736748  total_time:114943454
        REQUEST_JOB_STEP_INFO                   ( 2005) count:38     ave_time:1480198  total_time:56247529
        MESSAGE_EPILOG_COMPLETE                 ( 6012) count:31     ave_time:16408225 total_time:508655003
        REQUEST_JOB_INFO_SINGLE                 ( 2021) count:28     ave_time:10754115 total_time:301115228
        REQUEST_COMPLETE_PROLOG                 ( 6018) count:20     ave_time:18418352 total_time:368367040
        REQUEST_SUBMIT_BATCH_JOB                ( 4003) count:7      ave_time:12642347 total_time:88496435
        REQUEST_STATS_INFO                      ( 2035) count:5      ave_time:213      total_time:1069
        REQUEST_STEP_COMPLETE                   ( 5016) count:4      ave_time:13680174 total_time:54720699
        REQUEST_BUILD_INFO                      ( 2001) count:4      ave_time:369      total_time:1478
        REQUEST_COMPLETE_BATCH_SCRIPT           ( 5018) count:3      ave_time:6582324  total_time:19746973
        REQUEST_JOB_STEP_CREATE                 ( 5001) count:3      ave_time:6392575  total_time:19177726
        REQUEST_KILL_JOB                        ( 5032) count:3      ave_time:6208358  total_time:18625074
        REQUEST_JOB_ALLOCATION_INFO_LITE        ( 4016) count:3      ave_time:21989479 total_time:65968437
        REQUEST_SHARE_INFO                      ( 2022) count:3      ave_time:1449     total_time:4348
        REQUEST_JOB_USER_INFO                   ( 2039) count:1      ave_time:49483669 total_time:49483669
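
In case it helps anyone reading the RPC numbers above, here is a rough Python sketch that ranks the message types by total time. It is only an illustration: the function name `rpc_totals_seconds` and the three-row sample are made up for this example, and it assumes the RPC times are in microseconds like the main schedule statistics.

```python
import re

# One RPC entry. Using \s+ between fields means the pattern matches
# whether or not the mail client wrapped the line after the count.
RPC_RE = re.compile(
    r"(?P<name>[A-Z_]+)\s*\(\s*(?P<id>\d+)\)\s+count:(?P<count>\d+)\s+"
    r"ave_time:(?P<ave>\d+)\s+total_time:(?P<total>\d+)"
)


def rpc_totals_seconds(sdiag_text):
    """Return (name, count, total seconds) tuples, largest total first.

    Assumes total_time is reported in microseconds.
    """
    rows = [
        (m["name"], int(m["count"]), int(m["total"]) / 1_000_000)
        for m in RPC_RE.finditer(sdiag_text)
    ]
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Three rows copied from the output above, in the wrapped form.
sample = """\
        MESSAGE_NODE_REGISTRATION_STATUS        ( 1002) count:843
ave_time:4903158 total_time:4133362640
        REQUEST_JOB_INFO                        ( 2003) count:94
ave_time:14811231 total_time:1392255803
        REQUEST_STATS_INFO                      ( 2035) count:5
ave_time:213    total_time:1069
"""

for name, count, secs in rpc_totals_seconds(sample):
    print(f"{name:34s} count={count:4d} total={secs:10.1f} s")
```

To run it against the full output, pass the saved `sdiag` text to `rpc_totals_seconds` instead of `sample`.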

Attachment: slurm.conf
Description: Binary data
