Hello,

I replicated this issue on a different cluster and determined that the root
cause is that the job's time_eligible value in the underlying MySQL database
gets set to 0 when a running job is held.  Let me demonstrate.

1. Allocate a job and check that I can query it via `sacct -S YYYY-MM-DD`

        jess@bcm10-h01:~$ srun --pty bash
        jess@bcm10-n01:~$ squeue
           JOBID PARTITION       NAME    USER ST         TIME  NODES   CPUS MIN_M
             114      defq       bash    jess  R         1:13      1      1 2900M

        root@bcm10-h01:~# sacct -S 2026-01-06 -a
        JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
        ------------ ---------- ---------- ---------- ---------- ---------- --------
        114                bash       defq   allusers          1    RUNNING      0:0
        114.0              bash              allusers          1    RUNNING      0:0

        root@bcm10-h01:~# scontrol show jobid=114 | grep EligibleTime
           SubmitTime=2026-01-06T14:52:04 *EligibleTime=2026-01-06T14:52:04*



2. Hold and then release the job, confirm that it is no longer queryable via
`sacct -S YYYY-MM-DD`, and notice that EligibleTime changes to Unknown.

        jess@bcm10-n01:~$ scontrol hold 114
        jess@bcm10-n01:~$ scontrol release 114

        root@bcm10-h01:~# sacct -S 2026-01-06 -a
        JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
        ------------ ---------- ---------- ---------- ---------- ---------- --------

        root@bcm10-h01:~# scontrol show jobid=114 | grep EligibleTime
           SubmitTime=2026-01-06T14:52:04 *EligibleTime=Unknown*


3. Check time_eligible in the underlying MySQL database and confirm that
changing time_eligible makes it queryable via `sacct -S YYYY-MM-DD`.

        root@bcm10-h01:~# mysql --host=localhost --user=slurm --password=XYZ slurm_acct_db
        mysql> SELECT id_job  FROM slurm_job_table WHERE time_eligible = 0;
        +--------+
        | id_job |
        +--------+
        |    *114* |
        |    112 |
        |    113 |
        +--------+
        3 rows in set (0.00 sec)

        mysql> UPDATE slurm_job_table SET time_eligible = 1767733491 WHERE id_job = 114;
        Query OK, 1 row affected (0.01 sec)
        Rows matched: 1  Changed: 1  Warnings: 0

        mysql> SELECT time_eligible FROM slurm_job_table WHERE id_job = 114;
        +---------------+
        | time_eligible |
        +---------------+
        |    1767733491 |
        +---------------+
        1 row in set (0.00 sec)

        ### WORKS AGAIN
        root@bcm10-h01:~# sacct -S 2026-01-06 -a
        JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
        ------------ ---------- ---------- ---------- ---------- ---------- --------
        114                bash       defq   allusers          1    RUNNING      0:0
        114.0              bash              allusers          1    RUNNING      0:0

4. The man page for sacct says, for example:

         "For  example  jobs  submitted with the "--hold" option will have
"EligibleTime=Unknown" as they are pending indefinitely."

*Conclusion:*
This very much feels like a *bug*.  It does not seem like running jobs
should be able to be 'held' at all, since they cannot be pending
indefinitely while they are actively running.  And even if holding a
running job is allowed, the EligibleTime should not be reset when a user
tries to 'hold' it.

*Question:*
1. Identifying these problematic jobs via the underlying MySQL database
does not seem optimal.  Are there any better workarounds?  The best I have
come up with so far is sketched below.
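
   For what it's worth, a workaround that avoids touching MySQL is to take
   the normal `sacct -a -S ...` dump and then append the currently queued
   and running jobs queried explicitly by job ID, since `sacct -j` still
   returns them even when the `-S` window misses them.  A rough sketch
   (file names are placeholders and the duplicate handling is crude):

        #!/bin/bash
        # Dump accounting data as usual, then explicitly query every job
        # squeue still knows about, since `sacct -j <id>` returns held
        # running jobs that the -S time window misses.
        START=2025-12-12
        FMT="jobidraw,jobid,node,start,end,elapsed,state,submitline%30"

        sacct -a -n -P -S "$START" -o "$FMT" > jobs.dump

        # Collect the IDs of all jobs squeue currently reports and query
        # them by ID; -n/-P give headerless, parsable output.
        RUNNING_IDS=$(squeue -h -o %A | paste -sd, -)
        if [ -n "$RUNNING_IDS" ]; then
            sacct -n -P -j "$RUNNING_IDS" -o "$FMT" >> jobs.dump
        fi

        # Drop exact duplicate rows picked up by both queries.
        sort -u jobs.dump > jobs.report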

Best regards,
Lee

On Mon, Dec 15, 2025 at 2:33 PM Lee <[email protected]> wrote:

> Hello,
>
> I am using slurm 23.02.6.  I have a strange issue.  I periodically use
> sacct to dump job data.  I then generate reports based on the resource
> allocation of our users.
>
> Recently, I noticed some 'missing' jobs from my query. The missing jobs
> came from a user who had a large array job, who then 'held' all of the
> array jobs.  This included 'holding' the Running array jobs.
> Now, if I run `sacct -a -S YYYY-MM-DD --format="jobidraw,jobname"`, the
> job will be missing from that query.
>
> However, if I query specifically for that job, i.e. `sacct -j RAWJOBID -S
> YYYY-MM-DD --format="jobidraw,jobname"`, the job is present.
>
> *Question* :
> 1. How can I include the 'held' running job when I do my bulk query with
> `sacct -a`?  Finding these outliers and adding them ad-hoc to my dumped
> file is too laborious and isn't feasible.
>
>
> *Minimum working example *:
>     #. Submit a job :
>         myuser@clusterb01:~$ srun --pty bash # landed on dgx29
>
>     #. Hold job
>         myuser@clusterb01:~$ scontrol hold 120918
>         myuser@clusterb01:~$ scontrol show job=120918
>         JobId=120918 JobName=bash
>            UserId=myuser(123456) GroupId=rdusers(7000) MCS_label=N/A
>            Priority=0 Nice=0 Account=allusers QOS=normal
>            JobState=*RUNNING* Reason=*JobHeldUser* Dependency=(null)
>            Requeue=0 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
>            RunTime=00:00:29 TimeLimit=7-00:00:00 TimeMin=N/A
>            SubmitTime=2025-12-15T13:31:28 EligibleTime=Unknown
>            AccrueTime=Unknown
>            StartTime=2025-12-15T13:31:28 EndTime=2025-12-22T13:31:28 Deadline=N/A
>            SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-12-15T13:31:28 Scheduler=Main
>            Partition=defq AllocNode:Sid=clusterb01:4145861
>            ReqNodeList=(null) ExcNodeList=(null)
>            NodeList=dgx29
>            BatchHost=dgx29
>            NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>            ReqTRES=cpu=1,mem=9070M,node=1,billing=1
>            AllocTRES=cpu=2,mem=18140M,node=1,billing=2
>            Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>            MinCPUsNode=1 MinMemoryCPU=9070M MinTmpDiskNode=0
>            Features=(null) DelayBoot=00:00:00
>            OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
>            Command=bash
>            WorkDir=/home/myuser
>            Power=
>
>     #. Release job
>         myuser@clusterb01:~$ scontrol release 120918
>
>     #. Show job again
>         myuser@clusterb01:~$ scontrol show job=120918
>         JobId=120918 JobName=bash
>            UserId=myuser(123456) GroupId=rdusers(7000) MCS_label=N/A
>            Priority=1741 Nice=0 Account=allusers QOS=normal
>            JobState=*RUNNING* Reason=*None* Dependency=(null)
>            Requeue=0 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
>            RunTime=00:01:39 TimeLimit=7-00:00:00 TimeMin=N/A
>            SubmitTime=2025-12-15T13:31:28 EligibleTime=Unknown
>            AccrueTime=Unknown
>            StartTime=2025-12-15T13:31:28 EndTime=2025-12-22T13:31:28 Deadline=N/A
>            SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-12-15T13:31:28 Scheduler=Main
>            Partition=defq AllocNode:Sid=clusterb01:4145861
>            ReqNodeList=(null) ExcNodeList=(null)
>            NodeList=dgx29
>            BatchHost=dgx29
>            NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>            ReqTRES=cpu=1,mem=9070M,node=1,billing=1
>            AllocTRES=cpu=2,mem=18140M,node=1,billing=2
>            Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>            MinCPUsNode=1 MinMemoryCPU=9070M MinTmpDiskNode=0
>            Features=(null) DelayBoot=00:00:00
>            OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
>            Command=bash
>            WorkDir=/home/myuser/
>            Power=
>
>     #. In slurmctld, I see :
>             root@clusterb01:~# grep 120918 /var/log/slurmctld
>             [2025-12-15T13:31:28.706] sched: _slurm_rpc_allocate_resources JobId=120918 NodeList=dgx29 usec=1269
>             [2025-12-15T13:31:47.751] sched: _hold_job_rec: hold on JobId=120918 by uid 123456
>             [2025-12-15T13:31:47.751] sched: _update_job: set priority to 0 for JobId=120918
>             [2025-12-15T13:31:47.751] _slurm_rpc_update_job: complete JobId=120918 uid=123456 usec=189
>             [2025-12-15T13:32:48.081] sched: _release_job_rec: release hold on JobId=120918 by uid 123456
>             [2025-12-15T13:32:48.081] _slurm_rpc_update_job: complete JobId=120918 uid=123456 usec=268
>             [2025-12-15T13:33:20.552] _job_complete: JobId=120918 WEXITSTATUS 0
>             [2025-12-15T13:33:20.552] _job_complete: JobId=120918 done
>
>     #. Job is NOT missing, when identifying it by jobid
>             myuser@clusterb01:~$ sacct -j 120918 --starttime=2025-12-12 -o "jobidraw,jobid,node,start,end,elapsed,state,submitline%30"
>             JobIDRaw     JobID               NodeList               Start                 End    Elapsed      State                     SubmitLine
>             ------------ ------------ --------------- ------------------- ------------------- ---------- ---------- ------------------------------
>             120918       120918                 dgx29 2025-12-15T13:31:28 2025-12-15T13:33:20   00:01:52  COMPLETED                srun --pty bash
>             120918.0     120918.0               dgx29 2025-12-15T13:31:28 2025-12-15T13:33:20   00:01:52  COMPLETED                srun --pty bash
>
>     #. Job IS *missing* when using -a
>             myuser@clusterb01:~$ sacct -a --starttime=2025-12-12 -o "jobidraw,jobid,node,start,end,elapsed,state,submitline%30" | grep -i 120918    ## *MISSING*
>
> Best regards,
> Lee
>
-- 
slurm-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
