[slurm-users] Setting up fairshare accounting

2024-09-10 Thread tluchko via slurm-users
Hello,

We have a new cluster and I'm trying to set up fairshare accounting. I'm trying 
to track CPU, MEM and GPU. It seems that billing for individual jobs is 
correct, but billing isn't being accumulated (TRESRunMins is always 0).

In my slurm.conf, I think the relevant lines are

AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageTRES=gres/gpu
PriorityFlags=MAX_TRES

PartitionName=gpu Nodes=node[1-7] MaxCPUsPerNode=384 MaxTime=7-0:00:00 State=UP TRESBillingWeights="CPU=1.0,MEM=0.125G,GRES/gpu=9.6"
PartitionName=cpu Nodes=node[1-7] MaxCPUsPerNode=182 MaxTime=7-0:00:00 State=UP TRESBillingWeights="CPU=1.0,MEM=0.125G,GRES/gpu=9.6"

I currently have one recently finished job and one running job. sacct gives

$ sacct --format=JobID,JobName,ReqTRES%50,AllocTRES%50,TRESUsageInAve%50,TRESUsageInMax%50
JobID JobName ReqTRES AllocTRES TRESUsageInAve TRESUsageInMax
154 interacti+ billing=9,cpu=1,gres/gpu=1,mem=1G,node=1 billing=9,cpu=2,gres/gpu=1,mem=2G,node=1
154.interac+ interacti+ cpu=2,gres/gpu=1,mem=2G,node=1 cpu=00:00:00,energy=0,fs/disk=2480503,mem=3M,page+ cpu=00:00:00,energy=0,fs/disk=2480503,mem=3M,page+
155 interacti+ billing=9,cpu=1,gres/gpu=1,mem=1G,node=1 billing=9,cpu=2,gres/gpu=1,mem=2G,node=1
155.interac+ interacti+ cpu=2,gres/gpu=1,mem=2G,node=1

billing=9 seems correct to me, since I have 1 GPU allocated, which has the 
largest score of 9.6. However, sshare doesn't show anything in TRESRunMins

sshare --format=Account,User,RawShares,FairShare,RawUsage,EffectvUsage,TRESRunMins%110
Account        User     RawShares  FairShare  RawUsage  EffectvUsage  TRESRunMins
root                                          21589714  1.00          cpu=0,mem=0,energy=0,node=0,billing=0,fs/disk=0,vmem=0,pages=0,gres/gpu=0,gres/gpumem=0,gres/gpuutil=0
abrol_group             2000                  0         0.00          cpu=0,mem=0,energy=0,node=0,billing=0,fs/disk=0,vmem=0,pages=0,gres/gpu=0,gres/gpumem=0,gres/gpuutil=0
luchko_group            2000                  21589714  1.00          cpu=0,mem=0,energy=0,node=0,billing=0,fs/disk=0,vmem=0,pages=0,gres/gpu=0,gres/gpumem=0,gres/gpuutil=0
 luchko_group  tluchko  1          0.33       21589714  1.00          cpu=0,mem=0,energy=0,node=0,billing=0,fs/disk=0,vmem=0,pages=0,gres/gpu=0,gres/gpumem=0,gres/gpuutil=0

Why is TRESRunMins all 0 but RawUsage is not for tluchko? I have checked and 
slurmdbd is running.
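
For reference, the billing=9 figure can be reproduced by hand. The sketch below is only an illustration in Python, not Slurm code. It assumes that with PriorityFlags=MAX_TRES the billing value is the largest weighted per-node TRES, and that sacct shows it truncated to an integer (the truncation is an assumption inferred from 9.6 being displayed as 9). The names WEIGHTS and max_tres_billing are made up for the example.

```python
import math

# Weights copied from TRESBillingWeights in the partition definitions;
# memory is expressed in GB because the weight is given as 0.125G.
WEIGHTS = {"cpu": 1.0, "mem_gb": 0.125, "gres/gpu": 9.6}


def max_tres_billing(alloc, weights=WEIGHTS):
    """Return the largest weighted TRES value for one allocation.

    Assumption: this mirrors what PriorityFlags=MAX_TRES does for
    per-node TRES. It is not taken from the Slurm source.
    """
    return max(weights[name] * amount for name, amount in alloc.items())


# AllocTRES reported by sacct for job 154: cpu=2, mem=2G, gres/gpu=1.
job_154 = {"cpu": 2, "mem_gb": 2, "gres/gpu": 1}

raw = max_tres_billing(job_154)   # the GPU term dominates
shown = math.floor(raw)           # assumed integer truncation

print(raw, shown)
```

Under these assumptions the GPU term (9.6) is the maximum and truncates to 9, which agrees with the sacct output.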

Thank you,

Tyler

Sent with [Proton Mail](https://proton.me/) secure email.
-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Setting up fairshare accounting

2024-09-24 Thread tluchko via slurm-users
Just following up on my own message in case someone else is trying to figure 
out RawUsage and Fair Share.

I ran some additional tests, this time running jobs for 10 min instead of 1 
min. The procedure was

1. Set the accounting stats to update every minute in slurm.conf

PriorityCalcPeriod=1

2. Reset the RawUsage stat

sacctmgr modify account luchko_group set RawUsage=0

3. Check the RawUsage every second

while sleep 1; do date; sshare -ao Account,User,RawShares,NormShares,RawUsage; done > watch.out

4. Run a 10 min job. The billing per CPU is 1, so the total RawUsage should be 
60,000 and the RawUsage should increase by 6,000 each minute

sbatch --account=luchko_group --wrap="sleep 600" -p cpu -n 100
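
The expected numbers in step 4 follow from simple arithmetic. A small Python sketch, assuming that RawUsage accumulates billing units multiplied by elapsed seconds and that the CPU term is the largest one under MAX_TRES for this job (the helper name expected_raw_usage is hypothetical):

```python
def expected_raw_usage(cpus, seconds, weight_per_cpu=1.0):
    """Billing units times elapsed seconds, ignoring decay (assumed model)."""
    return cpus * weight_per_cpu * seconds


# Values from the sbatch line: -n 100 and "sleep 600".
total = expected_raw_usage(cpus=100, seconds=600)
per_minute = expected_raw_usage(cpus=100, seconds=60)

print(total, per_minute)
```

With 100 CPUs at weight 1.0 this gives 60,000 for the whole job and 6,000 per one-minute update, the figures quoted in step 4.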

Scanning the output file, I can see that the RawUsage does update once every 
minute. Below are the updates. (I've removed irrelevant output.)

Tue Sep 24 10:14:24 AM PDT 2024
Account       User     RawShares  NormShares  RawUsage
luchko_group  tluchko  100        0.50        0

Tue Sep 24 10:14:25 AM PDT 2024
luchko_group  tluchko  100        0.50        4099

Tue Sep 24 10:15:24 AM PDT 2024
luchko_group  tluchko  100        0.50        10099

Tue Sep 24 10:16:25 AM PDT 2024
luchko_group  tluchko  100        0.50        16099

Tue Sep 24 10:17:24 AM PDT 2024
luchko_group  tluchko  100        0.50        22098

Tue Sep 24 10:18:25 AM PDT 2024
luchko_group  tluchko  100        0.50        28097

Tue Sep 24 10:19:24 AM PDT 2024
luchko_group  tluchko  100        0.50        34096

Tue Sep 24 10:20:25 AM PDT 2024
luchko_group  tluchko  100        0.50        40094

Tue Sep 24 10:21:24 AM PDT 2024
luchko_group  tluchko  100        0.50        46093

Tue Sep 24 10:22:25 AM PDT 2024
luchko_group  tluchko  100        0.50        52091

Tue Sep 24 10:23:24 AM PDT 2024
luchko_group  tluchko  100        0.50        58089

Tue Sep 24 10:24:25 AM PDT 2024
luchko_group           2000       0.133324    58087

Tue Sep 24 10:25:25 AM PDT 2024
luchko_group  tluchko  100        0.50        58085

So, the RawUsage does increase by the expected amount each minute, and the 
RawUsage does decay (I have the half-life set to 14 days). However, the update 
for the last part of a minute, which should be 1901, is not recorded. I suspect 
this is because the job is no longer running when the accounting update occurs.
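
The readings above can be checked numerically. The following Python sketch is an illustration only: it takes the RawUsage values from the listing, computes the per-update increments and the 1901 shortfall, and compares the drop after the job ended with what a plain exponential decay at a 14-day half-life would predict. The exponential model and all variable names are assumptions, not Slurm internals.

```python
HALF_LIFE_SECONDS = 14 * 24 * 3600   # half-life of 14 days, as stated above
PERIOD_SECONDS = 60                  # PriorityCalcPeriod=1 (one minute)

# RawUsage readings copied from watch.out, in chronological order.
readings = [0, 4099, 10099, 16099, 22098, 28097, 34096,
            40094, 46093, 52091, 58089, 58087, 58085]

# Change between consecutive readings.
deltas = [later - earlier for earlier, later in zip(readings, readings[1:])]

# Usage expected for one full minute of the 100-CPU job at weight 1.0.
full_minute = 100 * 1.0 * PERIOD_SECONDS

# Ten updates were recorded: one partial (4099) and nine nominally full ones.
nominal_recorded = deltas[0] + 9 * full_minute
missing = 100 * 1.0 * 600 - nominal_recorded

# Fraction of accumulated usage lost per update under exponential decay.
decay_loss_fraction = 1 - 0.5 ** (PERIOD_SECONDS / HALF_LIFE_SECONDS)
expected_decay = readings[10] * decay_loss_fraction

print(deltas)
print(missing)
print(round(expected_decay, 2))
```

The increments come out at 6000 or slightly less, the shortfall is 1901, and the predicted decay per minute is about 2, in line with the drops from 58089 to 58087 and then to 58085.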

For typical jobs that run for hours or days, this is a negligible error, but it 
does explain the results I got when I ran a 1 min job.
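
To put "negligible" into numbers, here is a rough Python sketch. It assumes that at most one PriorityCalcPeriod of usage goes unrecorded per job, which is only the hypothesis described above and not confirmed Slurm behaviour. The function name max_relative_loss is hypothetical.

```python
def max_relative_loss(job_seconds, period_seconds=60):
    """Upper bound on the unrecorded share of a job's usage.

    Assumption: no more than one update period is lost per job.
    """
    return min(1.0, period_seconds / job_seconds)


# Shortfall measured in the 10 min test: 1901 out of an expected 60,000.
observed_10min = 1901 / 60000

bound_10min = max_relative_loss(600)
bound_one_day = max_relative_loss(24 * 3600)

print(f"{observed_10min:.2%} observed, {bound_10min:.2%} bound (10 min job)")
print(f"{bound_one_day:.3%} bound (1 day job)")
```

For a 1 min job with a one-minute update period the bound is 100%, so almost any fraction of the usage could be missing. For a one-day job it is well below 0.1%.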

TRESRunMins is still not updating, but that is only an inconvenience.

Tyler


[slurm-users] Re: Setting up fairshare accounting

2024-09-19 Thread tluchko via slurm-users
Hello,

I'm hoping someone can offer some suggestions.

I went ahead and started the database from scratch and reinitialized it to see 
if that would help and to try to understand how RawUsage is calculated. I ran 
two jobs of

sbatch --account=luchko_group --wrap="sleep 60" -p cpu -n 100

With the partition defined as

PriorityFlags=MAX_TRES
PartitionName=cpu Nodes=node[1-7] MaxCPUsPerNode=182 MaxTime=7-0:00:00 State=UP TRESBillingWeights="CPU=1.0,MEM=0.125G,GRES/gpu=9.6"

I expected each job to contribute 6000 to the RawUsage (100 CPUs for 60 
seconds at a billing weight of 1.0). However, one job contributed 3100 and the 
other 2800, and TRESRunMins stayed at 0 for all categories.
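
One way to read these numbers is to convert each RawUsage contribution back into the number of seconds that would have been accounted for. The Python sketch below does that under the assumption that RawUsage is simply billing units multiplied by seconds, with 100 billing units for this job. The helper name implied_seconds is hypothetical, and the interpretation is a guess rather than an established explanation.

```python
def implied_seconds(raw_usage, cpus=100, weight_per_cpu=1.0):
    """Seconds of run time that a RawUsage contribution would correspond to."""
    return raw_usage / (cpus * weight_per_cpu)


expected = 100 * 1.0 * 60   # 100 CPUs for "sleep 60" at weight 1.0

for observed in (3100, 2800):
    secs = implied_seconds(observed)
    print(f"RawUsage {observed}: {secs:.0f} s of {expected / 100:.0f} s accounted")
```

Under that assumption the two jobs were charged for 31 and 28 seconds of their 60 seconds of run time.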

I'm at a loss as to what is going on.

Thank you,

Tyler
