Sounds suspiciously similar to a bug we reported a very long time ago, and that I'd submitted a patch for: https://bugs.schedmd.com/show_bug.cgi?id=1048
Which was then revisited here: https://bugs.schedmd.com/show_bug.cgi?id=2423

Though my fix handles a problem with a UsageFactor other than 1, I'm
wondering if the problem is the same with BillingWeight too. A few notes on
the configuration and the numbers follow below the quoted thread.

Kevin

On Mon, Nov 27, 2017 at 5:06 PM, John Roberts <roberts.johne...@gmail.com> wrote:
> Hoping someone will get eyes on this one. I ended up changing the
> partition in question to use only 1 thread per core to keep things
> simple, but it would still be nice to know why Slurm is looking at TRES
> hours instead of RawUsage.
>
> thanks.
> -John
>
> On Wed, Nov 15, 2017 at 10:55 AM, John Roberts <roberts.johne...@gmail.com> wrote:
>>
>> Hi,
>>
>> I'm having an issue with accounts in Slurm and I'm not sure whether I'm
>> missing something. Here's a quick breakdown:
>>
>> We have many accounts in Slurm (v16.05.10) / SlurmDBD. We recently set
>> one partition's billing weight to 0.25. Each node in this partition has
>> 64 cores with 4 threads per core. We chose 0.25 so that we bill for
>> core hours rather than thread hours. This part seems to be working fine.
>>
>> When querying the account balance via RawUsage (we use sbank to present
>> it to users in readable hours), the numbers look right: they come out
>> to a quarter of the full node allocation.
>>
>> However, when querying, say, "UserUtilizationByAccount", the number is
>> about 4 times as much. That also makes sense, because jobs are
>> technically allocated all cores and threads; we just only expect to
>> bill for a quarter of that time.
>>
>> The problem arose when a user of this account submitted a job and it
>> sat in the queue with the reason "AssocGrpCPUMinutesLimit".
>>
>> Turning up the debug logs showed this:
>>
>> "debug2: Job 161868 being held, the job is at or exceeds assoc
>> 2159(<foo>/(null)/(null)) group max tres(cpu) minutes of 150000000 of
>> which 27718972 are still available but request is for 94371840 (plus 0
>> already in use) tres minutes (request tres count 65536)"
>>
>> The "still available" figure of 27718972 matches the max CPU minutes
>> minus the usage reported by "UserUtilizationByAccount", rather than the
>> real balance, which is roughly 4x that number.
>>
>> Why would Slurm schedule jobs based on this number instead of RawUsage?
>> If we're billing at the lower rate, RawUsage should be the true balance,
>> but that doesn't seem to be the case.
>>
>> thanks!
>> -John
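
For anyone trying to reproduce this: the per-partition weight John
describes is set with TRESBillingWeights in slurm.conf. A minimal sketch
(the partition and node names here are made up; only the weight matters):

    # slurm.conf -- hypothetical partition and node list
    # CPU=0.25 charges a quarter of a CPU-minute per thread-minute,
    # i.e. one core-minute per 4 hardware threads
    PartitionName=quarter Nodes=node[001-256] TRESBillingWeights="CPU=0.25"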
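
The two numbers being compared come from different code paths. Something
like the following shows both (the account name "foo" and the date range
are just placeholders):

    # Fairshare usage with billing weights applied -- what sbank reads:
    sshare -A foo -l

    # Accounting usage in raw, unweighted CPU time:
    sreport -t Hours cluster UserUtilizationByAccount Accounts=foo \
        Start=2017-11-01 End=2017-11-28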
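
The limit being hit is a GrpTRESMins limit on the association, presumably
set with something like (account name again a placeholder):

    # 150,000,000 CPU-minutes across the account, enforced unweighted
    sacctmgr modify account foo set GrpTRESMins=cpu=150000000

If I remember right, in 16.05 TRESBillingWeights only feeds the
fairshare/priority side, while the cpu TRES limit is enforced against raw
allocated CPU minutes. Newer releases (17.02+) add a separate "billing"
TRES that honors the weights and can be limited directly, e.g.
GrpTRESMins=billing=150000000.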
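
The debug2 numbers are self-consistent if the limit counts unweighted
thread-minutes (the 1440 below is inferred from the log, 94371840 / 65536,
which would correspond to a 24-hour time limit):

    requested:  65536 CPUs * 1440 min     =  94,371,840 TRES minutes
    charged:    150,000,000 - 27,718,972  = 122,281,028 TRES minutes used
    weighted:   122,281,028 * 0.25        = ~30,570,257 TRES minutes

Against the weighted figure the account would have had roughly 119 million
minutes available and the job would have fit, which matches John's "about
4x" observation.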