Try:

    sacctmgr list runawayjobs

Brian Andrus

On 12/20/2022 7:54 AM, Reed Dier wrote:
Hoping this is a fairly simple one.

This is a small internal cluster that we’ve been using for about 6 months now, and we’ve had some infrastructure instability in that time, which I suspect is the root cause of this weirdness, but hopefully someone can point me in the right direction to solve it.

I send a daily email of sreport output to show how busy the cluster was and who the top users were. Weirdly, I have a user who seems to show the exact same usage day after day after day, down to a hundredth of a percent, conspicuously even when they were on vacation and claimed they didn’t have any job submissions coming from cron/etc.
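
The daily report is essentially just cluster utilization plus top users for the previous day; roughly something like the following, give or take the exact ranges and flags:

    # roughly what the daily report runs; exact ranges/flags may differ
    sreport cluster utilization start=$(date -d yesterday +%F) end=$(date +%F) -t percent
    sreport user topusage start=$(date -d yesterday +%F) end=$(date +%F)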

So, taking the scom TUI <https://lists.schedmd.com/pipermail/slurm-users/2022-December/009514.html> posted this morning for a spin, I filtered by that user and noticed that even though I was only looking 2 days back in the job history, I was seeing a job from August.

Conspicuously, the job state is CANCELLED, but the job end time is exactly one year after the start time, which puts the end time in 2023. So something in the dbd is confused about this job (and others like it): they report as cancelled, yet they linger and stay “on the books” somehow until next August.

  Job ID         : 290742
  Job Name       : $jobname
  User           : $user
  Group          : $user
  Job Account    : $account
  Job Submission : 2022-08-08 08:44:52 -0400 EDT
  Job Start      : 2022-08-08 08:46:53 -0400 EDT
  Job End        : 2023-08-08 08:47:01 -0400 EDT
  Job Wait time  : 2m1s
  Job Run time   : 8760h0m8s
  Partition      : $part
  Priority       : 127282
  QoS            : $qos
Steps count: 0
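
For what it’s worth, I assume the same bogus record can be pulled straight from accounting with sacct, something along these lines:

    # query the accounting record for one of the affected job IDs
    sacct -j 290742 -S 2022-08-01 --format=JobID,State,Submit,Start,End,Elapsed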

Filter: $user         Items: 13

 Job ID      Job Name                             Part.  QoS Account     User             Nodes                 State
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 290714  $jobname                             $part  $qos  $acct       $user            node32  CANCELLED
 290716  $jobname                             $part  $qos  $acct       $user            node24  CANCELLED
 290736  $jobname                             $part  $qos  $acct       $user            node00  CANCELLED
 290742  $jobname                             $part  $qos  $acct       $user            node01  CANCELLED
 290770  $jobname                             $part  $qos  $acct       $user            node02  CANCELLED
 290777  $jobname                             $part  $qos  $acct       $user            node03  CANCELLED
 290793  $jobname                             $part  $qos  $acct       $user            node04  CANCELLED
 290797  $jobname                             $part  $qos  $acct       $user            node05  CANCELLED
 290799  $jobname                             $part  $qos  $acct       $user            node06  CANCELLED
 290801  $jobname                             $part  $qos  $acct       $user            node07  CANCELLED
 290814  $jobname                             $part  $qos  $acct       $user            node08  CANCELLED
 290817  $jobname                             $part  $qos  $acct       $user            node09  CANCELLED
 290819  $jobname                             $part  $qos  $acct       $user            node10  CANCELLED

I’d love to figure out the proper way to either purge these job IDs from the accounting database cleanly, or change the job end/run times to sane/correct values. Slurm is v21.08.8-2, and NTP is synced against a stratum 1 server, so time is in sync everywhere; not that multiple servers would all drift a year off like this anyway.
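
Worst case, I assume the rows could be inspected (and, carefully, corrected) directly in the slurmdbd MySQL database, something like the query below against what I believe is the <cluster>_job_table, though I’d much rather use a supported mechanism:

    # assumed database/table/column names; <cluster> is the cluster name in slurm_acct_db
    mysql slurm_acct_db -e "SELECT id_job, state, FROM_UNIXTIME(time_start), FROM_UNIXTIME(time_end) FROM <cluster>_job_table WHERE id_job = 290742"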

Thanks for any help,
Reed
