I can't directly answer your question, but I suspect there's a missing index somewhere. What I would do is turn on the MySQL general query log, capture the SQL that gets issued for the slow sacct queries, and look at the associated EXPLAIN plan. It's also possible that, since you're a few revs behind, this has already been fixed in a later version, so you could make a quick pass through the release notes.
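A rough sketch of what I mean, assuming the accounting database is MySQL or MariaDB and that you have an account allowed to change global variables. The helper names, the log path, and the `mycluster_job_table` table name are placeholders I made up for illustration (Slurm normally prefixes its job table with the cluster name, but check your own schema). The helpers only print SQL, so you can read it before piping it into the `mysql` client.

```shell
#!/bin/sh
# Helpers that print SQL for inspecting slow accounting queries.
# They only emit text; nothing touches the database until you pipe it to mysql.

# Print statements that send the general query log to the given file and enable it.
enable_query_log_sql() {
    printf "SET GLOBAL log_output = 'FILE';\n"
    printf "SET GLOBAL general_log_file = '%s';\n" "$1"
    printf "SET GLOBAL general_log = 'ON';\n"
}

# Print the statement that switches the general query log off again.
disable_query_log_sql() {
    printf "SET GLOBAL general_log = 'OFF';\n"
}

# Prefix a captured SELECT with EXPLAIN so the server reports its query plan.
explain_sql() {
    printf 'EXPLAIN %s\n' "$1"
}

# Print a statement listing the indexes on a table (table name is an assumption).
show_index_sql() {
    printf 'SHOW INDEX FROM %s;\n' "$1"
}
```

Something like `enable_query_log_sql /tmp/slurmdbd-queries.log | mysql -u root -p` would switch the log on. Then run one slow sacct command (for example with `--state=PD`), copy the SELECT out of the log file, and feed it through `explain_sql` to see whether it falls back to a full table scan. Turn the log off afterwards, since it records every statement and grows quickly.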

On Fri, Sep 1, 2023 at 4:02 AM John Snowdon <john.snow...@newcastle.ac.uk> wrote:
>
> Hi,
>
> I am attempting to pull some historical information from our HPC system to
> analyse some trends of our users over time.
>
> As part of this I am using sacct to make a number of queries for different
> jobs statuses (running, pending, completed, 'other') over particular time
> periods (hourly, daily, etc).
>
> I have noticed that most of my results with sacct return in the order of a
> few hundred milliseconds, regardless of rows (anywhere from none to several
> thousand).
>
> However there are two distinct job status codes that result in a huge delay
> of between 30 seconds to over 1 minute, irrespective of the number of rows
> returned.
>
> Any job status code in the list of R,CD,CA,DL,F,NF,PR,RS,RV,OOM,TO returns
> quickly, but PD and S queries are inordinately slow. Examples:
>
> # Jobs in running state:
>
> $ time sacct -X -v -p -a -S 2023-09-0100:00:00 -E 2023-09-0100:59:59
> --state=R | wc -l
> sacct: Jobs RUNNING in the time window from Fri Sep 01 00:00:00 2023 to Fri
> Sep 01 00:59:59 2023
> sacct: accounting_storage/slurmdbd: init: Accounting storage SLURMDBD plugin
> loaded
> 281
>
> real 0m0.095s
> user 0m0.032s
> sys 0m0.012s
>
> # Jobs with an 'abnormal' state:
>
> $ time sacct -X -v -p -a -S 2023-09-0100:00:00 -E 2023-09-0100:59:59
> --state=CA,DL,F,NF,PR,RS,RV,OOM,TO | wc -l
> sacct: Jobs
> CANCELLED,DEADLINE,FAILED,NODE_FAIL,PREEMPTED,PENDING,RESIZING,PENDING,REVOKED,OUT_OF_MEMORY,TIMEOUT
> in the time window from Fri Sep 01 00:00:00 2023 to Fri Sep 01 00:59:59 2023
> sacct: accounting_storage/slurmdbd: init: Accounting storage SLURMDBD plugin
> loaded
> 132
>
> real 0m0.088s
> user 0m0.033s
> sys 0m0.014s
>
> ... but looking at suspended or pending job states:
>
> $ time sacct -X -v -p -a -S 2023-09-0100:00:00 -E 2023-09-0100:59:59
> --state=PD | wc -l
> sacct: Jobs PENDING in the time window from Fri Sep 01 00:00:00 2023 to Fri
> Sep 01 00:59:59 2023
> sacct: accounting_storage/slurmdbd: init: Accounting storage SLURMDBD plugin
> loaded
> 2000
>
> real 0m45.712s
> user 0m0.041s
> sys 0m0.013s
>
> $ time sacct -X -v -p -a -S 2023-09-0100:00:00 -E 2023-09-0100:59:59
> --state=S | wc -l
> sacct: Jobs SUSPENDED in the time window from Fri Sep 01 00:00:00 2023 to Fri
> Sep 01 00:59:59 2023
> sacct: accounting_storage/slurmdbd: init: Accounting storage SLURMDBD plugin
> loaded
> 1
>
> real 1m20.490s
> user 0m0.033s
> sys 0m0.006s
>
> Our sacct version reports:
>
> $ sacct -V
> slurm 20.11.8
>
> The current performance makes my efforts to analyse the size of the tail of
> pending jobs (and thus one of the criteria we want to use to understand
> whether we are coping with user submission demand) impractical - it seems to
> be more than 100x slower than querying which jobs were running at any point
> in time.
>
> Some things which I've observed:
>
> - Use of start/end or the default time window doesn't matter
> - Size of time window set by start/end doesn't matter
> - Querying a list of status codes or single states doesn't matter (single or
> listed codes of everything but PD and S is fast)
>
> Is this likely to be behaviour of the sacct client, or is there a fundamental
> difference in the database schema that somehow would make queries for S and
> PD jobs slower by several factors?
>
> John Snowdon
> Advanced Computing Consultant
>
> Newcastle University IT Service
> The Elizabeth Barraclough Building
> 91 Sandyford Road
> Newcastle upon Tyne,
> NE1 8HW