Seems like the time may have been off on the db server at the insert/update.

You may want to dump the database, find which tables/records need to be updated, and try updating them. If anything goes south, you can restore from the dump.
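A rough sketch of that approach, assuming MariaDB/MySQL behind slurmdbd, the default slurm_acct_db database, and the stock <cluster>_job_table layout (id_job, state, time_start, time_end); every name below should be checked against the live schema before running anything, and "cluster_job_table" stands in for the real <ClusterName>_job_table:

    # take a restorable dump of the accounting database first
    mysqldump slurm_acct_db > slurm_acct_db.$(date +%F).sql

    # inspect the suspect rows
    mysql slurm_acct_db -e "SELECT id_job, state,
        FROM_UNIXTIME(time_start), FROM_UNIXTIME(time_end)
        FROM cluster_job_table WHERE id_job IN (290710, 290742);"

    # if time_end really is time_start plus a year, rewrite it to a sane
    # value (here: a few seconds after the start), ideally with slurmdbd stopped
    mysql slurm_acct_db -e "UPDATE cluster_job_table
        SET time_end = time_start + 8
        WHERE id_job = 290742 AND time_end > UNIX_TIMESTAMP();"

If the update goes wrong, the dump can be loaded back with mysql slurm_acct_db < slurm_acct_db.<date>.sql.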

Brian Andrus

On 12/20/2022 11:51 AM, Reed Dier wrote:
Just to followup with some things I’ve tried:

scancel doesn’t want to touch it:
# scancel -v 290710
scancel: Terminating job 290710
scancel: error: Kill job error on job id 290710: Job/step already completing or completed

scontrol does see that these are all members of the same array, but doesn’t want to touch it:
# scontrol update JobID=290710 EndTime=2022-08-09T08:47:01
290710_4,6,26,32,60,67,83,87,89,91,...: Job has already finished

And trying to modify the job’s end time with sacctmgr fails, as expected, because EndTime is only a where spec, not a set spec (I also tried EndTime=now with the same result):
# sacctmgr modify job where JobID=290710 set EndTime=2022-08-09T08:47:01
 Unknown option: EndTime=2022-08-09T08:47:01
 Use keyword 'where' to modify condition
 You didn't give me anything to set

I was able to set a comment for the jobs/array, so the DBD can see/talk to them. One additional thing to mention is that there are 14 JIDs that are stuck like this, 1 is an Array JID, and 13 of them are array tasks on the original Array ID.
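For reference, the comment was set with something along these lines (the exact set spec and comment text here are illustrative):

    # Comment is one of the few fields sacctmgr will still set on a finished job
    sacctmgr modify job where JobID=290710 set Comment="stuck array, bogus 2023 end time"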

But I figured I would share the other steps I’ve tried, to rule those ideas out.

Thanks,
Reed

On Dec 20, 2022, at 10:08 AM, Reed Dier <reed.d...@focusvq.com> wrote:

Two votes for runawayjobs is a strong signal (and something I’m glad to learn exists for the future); however, it does not appear to be the case here.

# sacctmgr show runawayjobs
Runaway Jobs: No runaway jobs found on cluster $cluster

So unfortunately that doesn’t appear to be the culprit.

Appreciate the responses.

Reed

On Dec 20, 2022, at 10:03 AM, Brian Andrus <toomuc...@gmail.com> wrote:

Try:

    sacctmgr list runawayjobs

Brian Andrus

On 12/20/2022 7:54 AM, Reed Dier wrote:
Hoping this is a fairly simple one.

This is a small internal cluster that we’ve been using for about 6 months now, and we’ve had some infrastructure instability in that time, which I think may be the root culprit behind this weirdness, but hopefully someone can point me in the direction to solve the issue.

I do a daily email of sreport output to show how busy the cluster was and who the top users were. Weirdly, one user shows the exact same usage day after day, down to a hundredth of a percent, conspicuously even while they were on vacation and said they had no job submissions in cron/etc.
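For context, the daily email boils down to something like the following (report types, flags, and dates are illustrative, not the exact cron job):

    # cluster-wide per-user utilization for the previous day, as percentages
    sreport cluster UserUtilizationByAccount start=2022-12-19 end=2022-12-20 -t percent
    # top users over the same window
    sreport user TopUsage start=2022-12-19 end=2022-12-20 TopCount=10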

So, taking the scom TUI posted this morning <https://lists.schedmd.com/pipermail/slurm-users/2022-December/009514.html> for a spin, I filtered on that user and noticed that even though I was only looking 2 days back in the job history, I was seeing a job from August.

Conspicuously, the job state is CANCELLED, but the job end time is one year after the start time, putting it in 2023. So something in the dbd is confused about these jobs: they linger, report as cancelled, yet somehow stay “on the books” until next August.

╭──────────────────────────────────────────────────────────────────────────────────────────╮
│                                │
│  Job ID : 290742                               │
│  Job Name : $jobname                               │
│  User : $user                                │
│  Group  : $user                                │
│  Job Account  : $account                                 │
│  Job Submission : 2022-08-08 08:44:52 -0400 EDT                                │
│  Job Start  : 2022-08-08 08:46:53 -0400 EDT                                │
│  Job End  : 2023-08-08 08:47:01 -0400 EDT                                │
│  Job Wait time  : 2m1s                                 │
│  Job Run time : 8760h0m8s                                │
│  Partition  : $part                                │
│  Priority : 127282                               │
│  QoS  : $qos                                 │
│                                │
│                                │
╰──────────────────────────────────────────────────────────────────────────────────────────╯
Steps count: 0

Filter: $user       Items: 13

 Job ID      Job Name                           Part.  QoS         Account     User Nodes                 State
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 290714      $jobname                           $part  $qos        $acct       $user      node32                CANCELLED
 290716      $jobname                           $part  $qos        $acct       $user      node24                CANCELLED
 290736      $jobname                           $part  $qos        $acct       $user      node00                CANCELLED
 290742      $jobname                           $part  $qos        $acct       $user      node01                CANCELLED
 290770      $jobname                           $part  $qos        $acct       $user      node02                CANCELLED
 290777      $jobname                           $part  $qos        $acct       $user      node03                CANCELLED
 290793      $jobname                           $part  $qos        $acct       $user      node04                CANCELLED
 290797      $jobname                           $part  $qos        $acct       $user      node05                CANCELLED
 290799      $jobname                           $part  $qos        $acct       $user      node06                CANCELLED
 290801      $jobname                           $part  $qos        $acct       $user      node07                CANCELLED
 290814      $jobname                           $part  $qos        $acct       $user      node08                CANCELLED
 290817      $jobname                           $part  $qos        $acct       $user      node09                CANCELLED
 290819      $jobname                           $part  $qos        $acct       $user      node10                CANCELLED

I’d love to figure out the proper way to either cleanly purge these JIDs from the accounting database, or change the job end/run time to a sane/correct value. Slurm is v21.08.8-2, and the ntp source is a stratum 1 server, so time is in sync everywhere; not that multiple servers would all drift a year off like this anyway.
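For anyone checking the same thing, the bogus end times are visible straight from sacct (job IDs and dates below are just the ones from this thread):

    # show what slurmdbd has on record for a couple of the stuck JIDs
    sacct -j 290710,290742 --starttime=2022-08-01 --endtime=now \
          --format=JobID,State,Submit,Start,End,Elapsed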

Thanks for any help,
Reed
