Try:
sacctmgr list runawayjobs
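(show and list are synonyms there.) If it finds runaway jobs it should offer to fix them in the same session, which, as I understand it, stamps an end time on them so they stop accruing time in the reports. Before agreeing, it may be worth confirming the controller really has no record of one of them, e.g.:

   # a genuinely runaway job exists only in slurmdbd, not in the controller
   squeue --job=290742
   scontrol show job 290742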
Brian Andrus
On 12/20/2022 7:54 AM, Reed Dier wrote:
Hoping this is a fairly simple one.
This is a small internal cluster that we’ve been using for about 6
months now, and we’ve had some infrastructure instability in that
time. I suspect that instability is the root culprit behind this
weirdness, but hopefully someone can point me in the right direction
to solve the issue.
I send a daily email of sreport output to show how busy the cluster
was and who the top users were.
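(For context, the daily email is basically cluster utilization plus top
users out of sreport; the invocation below is just an illustration of
the kind of report, not the literal script:)

   # cluster-wide utilization and top users for the previous day (illustrative)
   sreport cluster utilization start=$(date -d yesterday +%F) end=$(date +%F) -t percent
   sreport user topusage start=$(date -d yesterday +%F) end=$(date +%F) TopCount=10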
Weirdly, one user seems to rack up the exact same usage day after day
after day, down to the hundredth of a percent, conspicuously even when
they were on vacation and claimed they had no job submissions in
cron/etc.
So, taking the scom TUI
<https://lists.schedmd.com/pipermail/slurm-users/2022-December/009514.html>
posted this morning for a spin, I filtered on that user and noticed
that even though I was only looking 2 days back in job history, I was
seeing a job from August.
Conspicuously, the job state is cancelled, but the job end time is
exactly 1 year after the start time, i.e. an end time in 2023 (the
8760h run time below is 365 days on the nose).
So something in the dbd is confused about this/these jobs: they
linger, report as cancelled, yet somehow stay “on the books” until
next August.
╭──────────────────────────────────────────────────────────────────────────────────────────╮
│ │
│ Job ID : 290742 │
│ Job Name : $jobname │
│ User : $user │
│ Group : $user │
│ Job Account : $account │
│ Job Submission : 2022-08-08 08:44:52 -0400 EDT │
│ Job Start : 2022-08-08 08:46:53 -0400 EDT │
│ Job End : 2023-08-08 08:47:01 -0400 EDT │
│ Job Wait time : 2m1s │
│ Job Run time : 8760h0m8s │
│ Partition : $part │
│ Priority : 127282 │
│ QoS : $qos │
│ │
│ │
╰──────────────────────────────────────────────────────────────────────────────────────────╯
Steps count: 0
Filter: $user Items: 13
Job ID    Job Name    Part.    QoS     Account    User     Nodes     State
───────────────────────────────────────────────────────────────────────────────
290714    $jobname    $part    $qos    $acct      $user    node32    CANCELLED
290716    $jobname    $part    $qos    $acct      $user    node24    CANCELLED
290736    $jobname    $part    $qos    $acct      $user    node00    CANCELLED
290742    $jobname    $part    $qos    $acct      $user    node01    CANCELLED
290770    $jobname    $part    $qos    $acct      $user    node02    CANCELLED
290777    $jobname    $part    $qos    $acct      $user    node03    CANCELLED
290793    $jobname    $part    $qos    $acct      $user    node04    CANCELLED
290797    $jobname    $part    $qos    $acct      $user    node05    CANCELLED
290799    $jobname    $part    $qos    $acct      $user    node06    CANCELLED
290801    $jobname    $part    $qos    $acct      $user    node07    CANCELLED
290814    $jobname    $part    $qos    $acct      $user    node08    CANCELLED
290817    $jobname    $part    $qos    $acct      $user    node09    CANCELLED
290819    $jobname    $part    $qos    $acct      $user    node10    CANCELLED
I’d love to figure out the proper way to either cleanly purge these
job IDs from the accounting database, or change the job end/run time
to a sane/correct value.
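(Presumably the same records can also be checked straight from the dbd
with sacct; the field list below is just what seemed relevant:)

   # one of the stuck jobs, as recorded in the accounting database
   sacct --jobs=290742 --format=JobID,State,Submit,Start,End,Elapsed

   # all of this user's recorded jobs since August, allocations only
   sacct --user=$user --starttime=2022-08-01 --endtime=now \
         --allocations --format=JobID,State,Start,End,Elapsed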
Slurm is v21.08.8-2, and our NTP source is a stratum 1 server, so time
is in sync everywhere; not that multiple servers would all drift a
year off like this anyway.
Thanks for any help,
Reed