We do have one partition that uses AllowGroups instead of AllowAccounts. Testing with that partition closed did not change anything.

This started Dec 2nd or 3rd - I noticed it on the 3rd.
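A quick way to confirm which partitions restrict by group rather than by account (a minimal sketch; assumes scontrol's one-line output format, adjust the pattern if yours differs):

# List partitions whose AllowGroups is anything other than the default "ALL".
scontrol -o show partition | grep -v 'AllowGroups=ALL'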
From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Williams, Jenny Avis
Sent: Monday, December 7, 2020 11:43 PM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: [slurm-users] Extremely sluggish squeue -p partition

I have an interesting condition that has been going on for a few days and could use the feedback of those more familiar with how Slurm works under the hood.

Conditions:
Slurm v20.02.3. The cluster is relatively quiet given the time of year, and the commands are run on the host where slurmctld itself runs, so the queries should never leave that host. Sdiag output is included at the end.

Problem:
Partition queries through either squeue or sacct are very slow - even squeue on an empty partition! The exceptions are:
- sacct queries for running jobs
- squeue on a job ID
- squeue on a user

Question:
What is distinct about squeue on a partition, compared to sacct on running jobs or squeue on a user, that might account for this behavior?

# Setup
# Slow cases
squeue_running="squeue -p general -t running|wc -l"
squeue_pending="squeue -p general -t pending|wc -l"
sacct_pending="sacct -r general -s pending|wc -l"
empty_squeue="squeue -p staff |wc -l"
# Fast cases
sacct_running="sacct -r general -s running|wc -l"
squeue_user="squeue -u auser |wc -l"

# time eval $squeue_pending
17642

real    0m30.771s
user    0m0.392s
sys     0m0.835s

# time eval $squeue_running
5175

real    0m31.137s
user    0m0.304s
sys     0m0.831s

# time eval $empty_squeue
1

real    0m31.563s
user    0m0.279s
sys     0m0.844s

# time eval $sacct_pending
17529

real    0m25.901s
user    0m0.332s
sys     0m0.175s

# time eval $sacct_running
15541

real    0m1.135s
user    0m0.149s
sys     0m0.062s

# time eval $squeue_user
3086

real    0m0.422s
user    0m0.029s
sys     0m0.031s

# time sacct -r staff
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------

real    0m0.209s
user    0m0.007s
sys     0m0.002s
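One way to narrow down which controller-side RPC each of these commands maps to (a sketch, assuming sdiag --reset can be run as SlurmUser or root) is to zero the RPC counters, run a single slow query, and read the counters back:

sdiag --reset                                  # zero the RPC counters (privileged)
squeue -p general -t pending > /dev/null       # one slow partition query
sdiag | grep -E 'REQUEST_(JOB_INFO|PARTITION_INFO)'

The grep pattern also matches REQUEST_JOB_INFO_SINGLE, which is convenient here since all three are candidates.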
*******************************************************
sdiag output at Mon Dec 07 16:25:37 2020 (1607376337)
Data since      Mon Dec 07 15:09:07 2020 (1607371747)
*******************************************************
Server thread count:  16
Agent queue size:     0
Agent count:          0
Agent thread count:   0
DBD Agent queue size: 0

Jobs submitted: 3550
Jobs started:   5064
Jobs completed: 4683
Jobs canceled:  83
Jobs failed:    0

Job states ts:  Mon Dec 07 16:24:53 2020 (1607376293)
Jobs pending:   39099
Jobs running:   5915

Main schedule statistics (microseconds):
        Last cycle:   96597
        Max cycle:    1992372
        Total cycles: 338
        Mean cycle:   404511
        Mean depth cycle:  1195
        Cycles per minute: 4
        Last queue length: 31916

Backfilling stats (WARNING: data obtained in the middle of backfilling execution.)
        Total backfilled jobs (since last slurm start): 4610
        Total backfilled jobs (since last stats cycle start): 4610
        Total backfilled heterogeneous job components: 0
        Total cycles: 71
        Last cycle when: Mon Dec 07 16:24:25 2020 (1607376265)
        Last cycle: 23893637
        Max cycle:  43863303
        Mean cycle: 22141327
        Last depth cycle: 22840
        Last depth cycle (try sched): 1318
        Depth Mean: 23541
        Depth Mean (try depth): 1299
        Last queue length: 31917
        Queue length mean: 31864
        Last table size: 164
        Mean table size: 114

Latency for 1000 calls to gettimeofday(): 19 microseconds

Remote Procedure Call statistics by message type
        REQUEST_PARTITION_INFO           ( 2009) count:16568  ave_time:314      total_time:5202676
        REQUEST_FED_INFO                 ( 2049) count:7822   ave_time:194      total_time:1520744
        REQUEST_JOB_USER_INFO            ( 2039) count:7756   ave_time:70747    total_time:548714078
        REQUEST_JOB_INFO_SINGLE          ( 2021) count:5626   ave_time:119832   total_time:674179337
        REQUEST_COMPLETE_PROLOG          ( 6018) count:5068   ave_time:2126092  total_time:10775038166
        MESSAGE_EPILOG_COMPLETE          ( 6012) count:4774   ave_time:1044194  total_time:4984985393
        REQUEST_STEP_COMPLETE            ( 5016) count:4770   ave_time:1149196  total_time:5481666326
        REQUEST_COMPLETE_BATCH_SCRIPT    ( 5018) count:4746   ave_time:4107848  total_time:19495849569
        REQUEST_SUBMIT_BATCH_JOB         ( 4003) count:3180   ave_time:893436   total_time:2841126933
        REQUEST_NODE_INFO                ( 2007) count:2994   ave_time:61229    total_time:183321804
        MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:369    ave_time:365319   total_time:134802881
        REQUEST_JOB_INFO                 ( 2003) count:219    ave_time:5586738  total_time:1223495821
        REQUEST_KILL_JOB                 ( 5032) count:83     ave_time:3698948  total_time:307012762
        REQUEST_PRIORITY_FACTORS         ( 2026) count:48     ave_time:302008   total_time:14496396
        REQUEST_UPDATE_NODE              ( 3002) count:30     ave_time:234937   total_time:7048123
        REQUEST_BUILD_INFO               ( 2001) count:27     ave_time:102471   total_time:2766731
        REQUEST_JOB_READY                ( 4019) count:23     ave_time:128717   total_time:2960508
        REQUEST_PING                     ( 1008) count:15     ave_time:139      total_time:2088
        REQUEST_CANCEL_JOB_STEP          ( 5005) count:8      ave_time:1320084  total_time:10560677
        REQUEST_COMPLETE_JOB_ALLOCATION  ( 5017) count:7      ave_time:2417071  total_time:16919499
        REQUEST_NODE_INFO_SINGLE         ( 2040) count:6      ave_time:189125   total_time:1134750
        REQUEST_RESOURCE_ALLOCATION      ( 4001) count:6      ave_time:4771306  total_time:28627836
        REQUEST_UPDATE_PARTITION         ( 3005) count:6      ave_time:633112   total_time:3798672
        REQUEST_JOB_STEP_CREATE          ( 5001) count:5      ave_time:1015017  total_time:5075087
        REQUEST_STATS_INFO               ( 2035) count:0      ave_time:0        total_time:0
        ..

Pending RPC statistics
        No pending RPCs
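The ave_time column above is in microseconds. A quick way to rank RPC types by average service time from a live sdiag run (a sketch; the field positions assume the layout shown above):

# Rank RPC message types by average service time, converted to milliseconds.
# Field positions assume the sdiag layout above; adjust if yours differs.
sdiag | awk '/^[[:space:]]*(REQUEST|MESSAGE)_/ && /ave_time:/ {
    split($4, c, ":"); split($5, a, ":")
    printf "%-34s %7d calls  %12.3f ms avg\n", $1, c[2], a[2] / 1000
}' | sort -k4 -rn

On these numbers, REQUEST_JOB_INFO stands out at roughly 5.6 s per call, versus about 71 ms for REQUEST_JOB_USER_INFO and well under 1 ms for REQUEST_PARTITION_INFO.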