i can't directly answer your question, but i suspect there's a
missing index somewhere.  what i would do is turn on the mysql query
log, capture the sql that sacct triggers for the slow states, and look
at the EXPLAIN plan for it.  it's also possible that, since you're a
few releases behind, it's already been fixed in a later version, so you
could make a quick pass through the release notes.
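
a rough sketch of the query-log step, in case it helps.  this assumes
slurmdbd is backed by mysql or mariadb and that you can connect as a
user allowed to set global variables; the log path and the helper
function names are made up for illustration, so adjust to taste.  the
functions only print the statements, so nothing changes until you pipe
them into the mysql client yourself.

```shell
#!/bin/sh
# Sketch only: assumes a MySQL/MariaDB backend for slurmdbd.
# LOG_FILE is a hypothetical path; pick one the server can write to.
LOG_FILE="/var/log/mysql/slurm_general.log"

# Print the SQL that switches the general query log on, logging to a file.
enable_log_sql() {
    printf "SET GLOBAL log_output = 'FILE';\n"
    printf "SET GLOBAL general_log_file = '%s';\n" "$1"
    printf "SET GLOBAL general_log = 'ON';\n"
}

# Print the SQL that switches the general query log off again.
disable_log_sql() {
    printf "SET GLOBAL general_log = 'OFF';\n"
}

# Intended use (not run automatically by this script):
#   enable_log_sql "$LOG_FILE" | mysql
#   sacct -X -a --state=PD > /dev/null     # reproduce the slow query
#   disable_log_sql | mysql
# Then copy the long-running SELECT out of the log file and run it in
# the mysql client prefixed with EXPLAIN to see which indexes it uses.
```

the general log records every statement, so leave it on only long
enough to reproduce one slow sacct call.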

On Fri, Sep 1, 2023 at 4:02 AM John Snowdon
<john.snow...@newcastle.ac.uk> wrote:
>
> Hi,
>
> I am attempting to pull some historical information from our HPC system to 
> analyse some trends of our users over time.
>
> As part of this I am using sacct to make a number of queries for different 
> job statuses (running, pending, completed, 'other') over particular time 
> periods (hourly, daily, etc).
>
> I have noticed that most of my results with sacct return in the order of a 
> few hundred milliseconds, regardless of rows (anywhere from none to several 
> thousand).
>
> However there are two distinct job status codes that result in a huge delay 
> of between 30 seconds and over 1 minute, irrespective of the number of rows 
> returned.
>
> Any job status code in the list of R,CD,CA,DL,F,NF,PR,RS,RV,OOM,TO returns 
> quickly, but PD and S queries are inordinately slow. Examples:
>
> # Jobs in running state:
>
> $ time sacct -X -v -p -a -S 2023-09-0100:00:00 -E 2023-09-0100:59:59 --state=R | wc -l
> sacct: Jobs RUNNING in the time window from Fri Sep 01 00:00:00 2023 to Fri Sep 01 00:59:59 2023
> sacct: accounting_storage/slurmdbd: init: Accounting storage SLURMDBD plugin loaded
> 281
>
> real    0m0.095s
> user    0m0.032s
> sys     0m0.012s
>
> # Jobs with an 'abnormal' state:
>
> $ time sacct -X -v -p -a -S 2023-09-0100:00:00 -E 2023-09-0100:59:59 --state=CA,DL,F,NF,PR,RS,RV,OOM,TO | wc -l
> sacct: Jobs CANCELLED,DEADLINE,FAILED,NODE_FAIL,PREEMPTED,PENDING,RESIZING,PENDING,REVOKED,OUT_OF_MEMORY,TIMEOUT in the time window from Fri Sep 01 00:00:00 2023 to Fri Sep 01 00:59:59 2023
> sacct: accounting_storage/slurmdbd: init: Accounting storage SLURMDBD plugin loaded
> 132
>
> real    0m0.088s
> user    0m0.033s
> sys     0m0.014s
>
> ... but looking at suspended or pending job states:
>
> $ time sacct -X -v -p -a -S 2023-09-0100:00:00 -E 2023-09-0100:59:59 --state=PD | wc -l
> sacct: Jobs PENDING in the time window from Fri Sep 01 00:00:00 2023 to Fri Sep 01 00:59:59 2023
> sacct: accounting_storage/slurmdbd: init: Accounting storage SLURMDBD plugin loaded
> 2000
>
> real    0m45.712s
> user    0m0.041s
> sys     0m0.013s
>
> $ time sacct -X -v -p -a -S 2023-09-0100:00:00 -E 2023-09-0100:59:59 --state=S | wc -l
> sacct: Jobs SUSPENDED in the time window from Fri Sep 01 00:00:00 2023 to Fri Sep 01 00:59:59 2023
> sacct: accounting_storage/slurmdbd: init: Accounting storage SLURMDBD plugin loaded
> 1
>
> real    1m20.490s
> user    0m0.033s
> sys     0m0.006s
>
> Our sacct version reports:
>
> $ sacct -V
> slurm 20.11.8
>
> The current performance makes my efforts to analyse the size of the tail of 
> pending jobs (and thus one of the criteria we want to use to understand 
> whether we are coping with user submission demand) impractical - it seems to 
> be more than 100x slower than querying which jobs were running at any point 
> in time.
>
> Some things which I've observed:
>
> - Use of start/end or the default time window doesn't matter
> - Size of time window set by start/end doesn't matter
> - Querying a list of status codes or single states doesn't matter (single or 
> listed codes of everything but PD and S is fast)
>
> Is this likely to be behaviour of the sacct client, or is there a fundamental 
> difference in the database schema that somehow would make queries for S and 
> PD jobs slower by several factors?
>
> John Snowdon
> Advanced Computing Consultant
>
> Newcastle University IT Service
> The Elizabeth Barraclough Building
> 91 Sandyford Road
> Newcastle upon Tyne,
> NE1 8HW
