[slurm-users] how does slurm schedule health checks when setting "HealthCheckNodeState=CYCLE"

2020-12-01 Thread taleintervenor
Hello,

 

Our Slurm cluster manages about 600+ nodes, and I tested setting
HealthCheckNodeState=CYCLE in slurm.conf. According to the slurm.conf manual,
setting this to CYCLE should cause Slurm to "cycle through running on all
compute nodes through the course of the HealthCheckInterval". So I set
"HealthCheckInterval=600" and expected the health-check start times to be
evenly distributed across the 600-second period.
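
(For reference, the relevant settings can be checked like this; the
HealthCheckProgram path below is only an assumption about a typical NHC setup,
the two other values are the ones described above:)

> scontrol show config | grep -i HealthCheck
HealthCheckInterval     = 600
HealthCheckNodeState    = CYCLE
HealthCheckProgram      = /usr/sbin/nhc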

But the test results showed that the earliest node was checked at about
14:19:35, while the latest was checked at about 14:20:39, so one round of
health checks was only spread across roughly 60 seconds. The previous round
ran from 14:08:10 to 14:09:26. It seems that HealthCheckInterval only controls
the interval between two rounds, not the time range over which the checks of
one round are spread.

So did I misread the description in the slurm.conf manual? And is there any
way to control how the health checks within one round are spread across the
different nodes?

 

Thanks.



[slurm-users] how are array jobs stored in the slurmdb database?

2021-01-28 Thread taleintervenor
Hello,

 

The question background is:

From a query command such as 'sacct -j 123456' I can see a series of jobs
named 123456_1, 123456_2, etc., and I need to delete these job records from
the MySQL database for some reason.

 

But in the job_table of slurmdb there is only one record with id_job=123456;
there is no record with an ID like 123456_2. After I delete the id_job=123456
record, sacct shows that job 123456_1 has disappeared, but the other jobs in
the array still exist. So how are these array jobs recorded in the database,
and how can I completely delete all the jobs in an array?
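
(My guess, which I have not verified, is that the per-task records carry the
array parent ID in separate columns. If so, a query along the lines below
would list them, but the database name, the <clustername>_job_table naming
pattern and the column names are all assumptions that should be checked
against the actual schema first:)

> mysql slurm_acct_db -e "SELECT id_job, id_array_job, id_array_task FROM mycluster_job_table WHERE id_array_job=123456;"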

 

Thanks.



Re: [slurm-users] how are array jobs stored in the slurmdb database?

2021-01-28 Thread taleintervenor
Thanks for the help. The doc page is useful and we can get the actual job IDs
now.

The reason we need to delete job records from the database is that our billing
system calculates user costs from these historical records, and after a Slurm
system fault there are some specific jobs that should not be charged. It seems
the most practical solution is to modify the database directly, since Slurm
does not provide a command to delete job records.


-----Original Message-----
From: Ole Holm Nielsen
Sent: January 29, 2021 0:14
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] how are array jobs stored in the slurmdb database?

On 1/28/21 11:59 AM, taleinterve...@sjtu.edu.cn wrote:
>  From query command such as ‘sacct -j 123456’ I can see a series of 
> jobs named 123456_1, 123456_2, etc. And I need to delete these job 
> records from mysql database for some reason.
> 
> But in job_table of slurmdb, there is only one record with id_job=123456. 
> not any record has a id like 123456_2. After I delete the 
> id_job=123456 record, sacct result show the 123456_1 job disappeared, 
> but other jobs in the array still exist. So how do these array job recorded 
> in the database?
> And how to completely delete all the jobs in a array?

I think you need to study how job arrays are implemented in Slurm, please read 
https://slurm.schedmd.com/job_array.html

You will discover that job arrays, when each individual job starts running, 
become independent jobs and obtain their own unique JobIDs.  It must be those 
JobIDs that will appear in the Slurm database.

This command illustrates the different JobID types (please read the squeue 
manual page about ArrayJobID,JobArrayID,JobID):

$ squeue  -j 3394902 -O ArrayJobID,JobArrayID,JobID
ARRAY_JOB_ID        JOBID               JOBID
3394902             3394902_[18-91]     3394902
3394902             3394902_17          3394919
3394902             3394902_16          3394918
3394902             3394902_15          3394917
3394902             3394902_14          3394916

The last 4 jobs are running, while the first job is still pending.
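
A related check: sacct can show the same mapping through its JobIDRaw field,
which should be the per-task JobID recorded in the accounting database (a
minimal sketch, assuming the job is still within sacct's default time window):

$ sacct -j 3394902 -X --format=JobID,JobIDRaw,State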

Perhaps you may find my "showjob" script useful:
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/jobs
In this script you can see how I work with array jobs.

I did not answer your question about how to delete array jobs in the Slurm 
database.  But in most cases manipulating the database directly is probably a 
bad idea.  I wonder why you want to delete jobs in the database at all?

Best regards,
Ole






Re: [slurm-users] how are array jobs stored in the slurmdb database?

2021-01-29 Thread taleintervenor
Well, maybe my example in the first mail caused some misunderstanding. We just
use sacct to check some job records manually during the maintenance process
after a system fault. Our accounting and billing system is a commercial product
which unfortunately does not provide the ability to adjust the billing rate for
an individual job either. I'm not sure how it gets the job data from Slurm, but
as long as sacct cannot find a job record, the billing system of course won't
generate billing for it.

-----Original Message-----
From: Ole Holm Nielsen
Sent: January 29, 2021 15:40
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] how are array jobs stored in the slurmdb database?

On 1/29/21 3:51 AM, taleinterve...@sjtu.edu.cn wrote:
> The reason we need to delete job record from database is our billing system 
> will calculate user cost from these historical records. But after a slurm 
> system faulty there will be some specific jobs which should not be charged. 
> it seems the best practical solution is to directly modify the database since 
> slurm does not provide commend to delete job records.

I think the sreport command is normally used to generate accounting reports.  I 
have described this in my Wiki page 
https://wiki.fysik.dtu.dk/niflheim/Slurm_accounting#accounting-reports
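
For example, a monthly per-account and per-user usage report could be produced
with something along these lines (the dates are placeholders):

$ sreport cluster AccountUtilizationByUser start=2021-01-01 end=2021-02-01 -t hours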

I would like to understand how you have chosen to calculate user cost of a 
given job using the sacct command?  The sacct command will report accounting 
for each individual job, so which sacct options do you use to get the total 
cost value for a user with many jobs?

/Ole


> -----Original Message-----
> From: Ole Holm Nielsen
> Sent: January 29, 2021 0:14
> To: slurm-users@lists.schedmd.com
> Subject: Re: [slurm-users] how are array jobs stored in the slurmdb database?
> 
> On 1/28/21 11:59 AM, taleinterve...@sjtu.edu.cn wrote:
>>   From query command such as ‘sacct -j 123456’ I can see a series of 
>> jobs named 123456_1, 123456_2, etc. And I need to delete these job 
>> records from mysql database for some reason.
>>
>> But in job_table of slurmdb, there is only one record with id_job=123456.
>> not any record has a id like 123456_2. After I delete the
>> id_job=123456 record, sacct result show the 123456_1 job disappeared, 
>> but other jobs in the array still exist. So how do these array job recorded 
>> in the database?
>> And how to completely delete all the jobs in a array?
> 
> I think you need to study how job arrays are implemented in Slurm, 
> please read https://slurm.schedmd.com/job_array.html
> 
> You will discover that job arrays, when each individual jobs start running, 
> become independent jobs and obtain their own unique JobIDs.  It must be those 
> JobIDs that will appear in the Slurm database.
> 
> This command illustrates the different JobID types (please read the squeue 
> manual page about ArrayJobID,JobArrayID,JobID):
> 
> $ squeue  -j 3394902 -O ArrayJobID,JobArrayID,JobID
> ARRAY_JOB_IDJOBID   JOBID
> 3394902 3394902_[18-91] 3394902
> 3394902 3394902_17  3394919
> 3394902 3394902_16  3394918
> 3394902 3394902_15  3394917
> 3394902 3394902_14  3394916
> 
> The last 4 jobs are running, while the first job i still pending.
> 
> Perhaps you may find my "showjob" script useful:
> https://github.com/OleHolmNielsen/Slurm_tools/tree/master/jobs
> In this script you can see how I work with array jobs.
> 
> I did not answer your question about how to delete array jobs in the Slurm 
> database.  But in most cases manipulating the database directly is probably a 
> bad idea.  I wonder why you want to delete jobs in the database at all?
> 
> Best regards,
> Ole






[slurm-users] how to print all the key-values of "job_desc" in job_submit.lua?

2021-03-29 Thread taleintervenor
Hello,

 

Because I'm not sure about the relations between the fields of the job_desc
structure and the sbatch parameters, I want to print all the fields and their
values of job_desc when testing job_submit.lua. But the following code added
to job_submit.lua fails to iterate over job_desc: the for loop prints nothing,
while referring to "job_desc.partition" explicitly does print the value:

 

-- dev:
if job_desc.user_name == "hpczty" then
    slurm.log_user("print job_desc>>>")
    slurm.log_user("job_desc.partition=%s", job_desc["partition"])
    for k, v in pairs(job_desc) do
        slurm.log_user("%s: %s", k, v)
    end
    return slurm.ERROR
end

 

Submitting a job prints the following messages:

sbatch testjob.sh
sbatch: error: print job_desc>>>
sbatch: error: job_desc.partition=debug
sbatch: error: Batch job submission failed: Unspecified error

 

Why can't the loop get at the contents of job_desc? And what is the correct
way to print all of its contents without specifying each key manually?

 

Thanks.



[slurm-users] how to check what slurm is doing when a job is pending with reason=none?

2021-06-16 Thread taleintervenor
Hello,

 

Recently we noticed a strange delay between job submission and job start, even
though the partition certainly has enough idle nodes to meet the job's demand.
To avoid interference, we used the 4-node debug partition for the test, which
had no other jobs to run, and the test job script is as simple as possible:

 

#!/bin/bash

#SBATCH --job-name=test
#SBATCH --partition=debug
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --output=%j.out
#SBATCH --error=%j.err

hostname
sleep 1000
echo end

 

But after submission, this job still stays in the PENDING state for about
30-60 s, and during that time sacct shows the REASON as "None". We have also
checked slurmctld.log on the server and slurmd.log on the compute node at
debug log level; neither contains anything useful for figuring out the pending
reason.

 

So is there any way to make Slurm explain in detail why a job did not start
immediately, or what it was doing during the job's pending time?

 

 

Thanks.



[slurm-users] Re: how to check what slurm is doing when a job is pending with reason=none?

2021-06-17 Thread taleintervenor
Thanks for the help. We tried reducing sched_interval and the pending time
decreased as expected.

But the influence of sched_interval is global, and setting it too small may
put pressure on the slurmctld server. Since we only want a quick response on
the debug partition (which is designed to let users frequently submit debug
jobs without waiting), is it possible to make Slurm schedule a specific
partition immediately, no matter how long the job queue is?
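
(For reference, the change we tested was roughly the following line in
slurm.conf, with the value only illustrative and any other SchedulerParameters
entries omitted, followed by an "scontrol reconfigure":)

SchedulerParameters=sched_interval=10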

-----Original Message-----
From: Gerhard Strangar
Sent: June 17, 2021 0:27
To: Slurm User Community List
Subject: Re: [slurm-users] how to check what slurm is doing when a job is pending with reason=none?

taleinterve...@sjtu.edu.cn wrote:

> But after submit, this job still stay at PENDING state for about 
> 30-60s and during the pending time sacct shows the REASON is "None".

It's the default sched_interval=60 in your slurm.conf.

Gerhard






[slurm-users] Is there a bug in the PrivateData=jobs option of slurmdbd?

2021-06-30 Thread taleintervenor
Hello,

 

We found a strange behavior involving sacct and the PrivateData option of
slurmdbd. Our original configuration sets "PrivateData =
accounts,jobs,usage,users,reservations" in slurm.conf and does not set
"PrivateData" in slurmdbd.conf. With that configuration, ordinary users can
see all other users' job information with sacct. Now we have added the option
"PrivateData=jobs" to slurmdbd.conf, and ordinary users cannot even see their
own jobs with sacct.

 

According to https://slurm.schedmd.com/slurmdbd.conf.html , setting "jobs" in
PrivateData should only prevent users from viewing other users' jobs. Why does
it also hide jobs submitted by the user themselves from sacct queries?

 

The test records are below:

 before adding "PrivateData=jobs" to slurmdbd.conf
==

[2021-06-30T18:18:07+0800][hpczty@login3] ~/downloads> sbatch testjob.sh

Submitted batch job 6944660

 

[2021-06-30T18:18:11+0800][hpczty@login3] ~/downloads> squeue

 JOBID PARTITION NAME USER ST   TIME  NODES
NODELIST(REASON)

   6944660 debug test   hpczty PD   0:00  1 (None)

 

[2021-06-30T18:18:16+0800][hpczty@login3] ~/downloads> sacct

   JobIDJobName  PartitionAccount  AllocCPUS  State ExitCode

 -- -- -- -- -- 

6944660test  debug   acct-hpc  1RUNNING  0:0

6944660.bat+  batch  acct-hpc  1RUNNING  0:0

6944660.ext+ extern  acct-hpc  1RUNNING  0:0

 

 

 after adding "PrivateData=jobs" to slurmdbd.conf
==

[2021-06-30T18:21:27+0800][hpczty@login3] ~/downloads> sbatch testjob.sh

Submitted batch job 6944665

 

[2021-06-30T18:21:30+0800][hpczty@login3] ~/downloads> squeue

 JOBID PARTITION NAME USER ST   TIME  NODES
NODELIST(REASON)

   6944665 debug test   hpczty PD   0:00  1 (None)

 

[2021-06-30T18:21:32+0800][hpczty@login3] ~/downloads> sacct

   JobIDJobName  PartitionAccount  AllocCPUS  State ExitCode

 -- -- -- -- -- 

(no jobs shown)

 

Thanks



[slurm-users] Re: Is there a bug in the PrivateData=jobs option of slurmdbd?

2021-07-01 Thread taleintervenor
I can make sure the test job is running (and of course within the default time
window) when doing the sacct query. Here is a new test record which describes
it more clearly:

 

[2021-07-01T16:02:42+0800][hpczty@cas013] ~/downloads> sbatch testjob.sh

Submitted batch job 6955371

 

[2021-07-01T16:02:48+0800][hpczty@cas013] ~/downloads> squeue

 JOBID PARTITION NAME USER ST   TIME  NODES
NODELIST(REASON)

   6955371 debug test   hpczty  R   0:02  1 cas011

 

[2021-07-01T16:02:50+0800][hpczty@cas013] ~/downloads> sacct

   JobIDJobName  PartitionAccount  AllocCPUS  State ExitCode

 -- -- -- -- -- 

 

[2021-07-01T16:02:52+0800][hpczty@cas013] ~/downloads> sacct --state=R
--starttime=2021-07-01T16:00:00 --endtime=now

   JobIDJobName  PartitionAccount  AllocCPUS  State ExitCode

 -- -- -- -- -- 

 

[2021-07-01T16:03:25+0800][hpczty@cas013] ~/downloads> squeue

 JOBID PARTITION NAME USER ST   TIME  NODES
NODELIST(REASON)

   6955371 debug test   hpczty  R   0:43  1 cas011

 

From: Brian Andrus
Sent: June 30, 2021 22:29
To: taleinterve...@sjtu.edu.cn
Subject: Re: [slurm-users] Is there a bug in the PrivateData=jobs option of slurmdbd?

 

I suspect your job fell out of the default time window for sacct.

Add a time window that you know includes when the job ran and you will
likely see it.

Brian Andrus

On 6/30/2021 3:53 AM, taleinterve...@sjtu.edu.cn
  wrote:

Hello,

 

We find a strange behavior about sacct and PrivateData option of slurmdbd.
Our original configuration is setting “PrivateData =
accounts,jobs,usage,users,reservations” in slurm.conf and not setting
“PrivateData” in slurmdbd.conf. At this point, common user can see all
others job information with sacct. Now we add option “PrivateData =jobs”
to slurmdbd.conf, then common users even can’t see their own jobs using
sacct.

 

According.to https://slurm.schedmd.com/slurmdbd.conf.html , setting “jobs”
in PrivateData should only prevent user from viewing others’ job. Why it
also hide jobs submit by user itself from sacct query?

 

The test records as below:

 before add option “PrivateData =jobs” to slurmdbd.conf
==

[2021-06-30T18:18:07+0800][hpczty@login3] ~/downloads> sbatch testjob.sh

Submitted batch job 6944660

 

[2021-06-30T18:18:11+0800][hpczty@login3] ~/downloads> squeue

 JOBID PARTITION NAME USER ST   TIME  NODES
NODELIST(REASON)

   6944660 debug test   hpczty PD   0:00  1 (None)

 

[2021-06-30T18:18:16+0800][hpczty@login3] ~/downloads> sacct

   JobIDJobName  PartitionAccount  AllocCPUS  State ExitCode

 -- -- -- -- -- 

6944660test  debug   acct-hpc  1RUNNING  0:0

6944660.bat+  batch  acct-hpc  1RUNNING  0:0

6944660.ext+ extern  acct-hpc  1RUNNING  0:0

 

 

 after add option “PrivateData =jobs” to slurmdbd.conf
==

[2021-06-30T18:21:27+0800][hpczty@login3] ~/downloads> sbatch testjob.sh

Submitted batch job 6944665

 

[2021-06-30T18:21:30+0800][hpczty@login3] ~/downloads> squeue

 JOBID PARTITION NAME USER ST   TIME  NODES
NODELIST(REASON)

   6944665 debug test   hpczty PD   0:00  1 (None)

 

[2021-06-30T18:21:32+0800][hpczty@login3] ~/downloads> sacct

   JobIDJobName  PartitionAccount  AllocCPUS  State ExitCode

 -- -- -- -- -- 

(no jobs shown)

 

Thanks



[slurm-users] Re: Re: Is there a bug in the PrivateData=jobs option of slurmdbd?

2021-07-02 Thread taleintervenor
Well, you got the point. We didn't configure LDAP on the Slurm database node.
After configuring LDAP authorization, the PrivateData option finally worked as
expected.
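
(For anyone hitting the same thing: a quick sanity check along these lines on
the slurmdbd host should expose this kind of problem. The username is just the
one from our earlier tests, and the sss_cache step only applies if sssd is in
use:)

> id hpczty
> sss_cache -E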

Thanks for the assistance.

 

From: Brian Andrus
Sent: July 1, 2021 21:57
To: taleinterve...@sjtu.edu.cn
Cc: slurm-users@lists.schedmd.com
Subject: Re: Re: [slurm-users] Is there a bug in the PrivateData=jobs option of slurmdbd?

 

Ok.

You may want to check your slurmdbd host(s) and ensure the users are known 
there. If it does not know who a user is, it will not allow access to the data.

If you are running sssd, clear the cache and such too.

Brian Andrus

 

On 7/1/2021 1:12 AM, taleinterve...@sjtu.edu.cn 
  wrote:

I can make sure the test job is running (of course in the default time window) 
when doing sacct query, and here is the new test record which describe it more 
clearly:

 

[2021-07-01T16:02:42+0800][hpczty@cas013] ~/downloads> sbatch testjob.sh

Submitted batch job 6955371

 

[2021-07-01T16:02:48+0800][hpczty@cas013] ~/downloads> squeue

 JOBID PARTITION NAME USER ST   TIME  NODES 
NODELIST(REASON)

   6955371 debug test   hpczty  R   0:02  1 cas011

 

[2021-07-01T16:02:50+0800][hpczty@cas013] ~/downloads> sacct

   JobIDJobName  PartitionAccount  AllocCPUS  State ExitCode

 -- -- -- -- -- 

 

[2021-07-01T16:02:52+0800][hpczty@cas013] ~/downloads> sacct --state=R 
--starttime=2021-07-01T16:00:00 --endtime=now

   JobIDJobName  PartitionAccount  AllocCPUS  State ExitCode

 -- -- -- -- -- 

 

[2021-07-01T16:03:25+0800][hpczty@cas013] ~/downloads> squeue

 JOBID PARTITION NAME USER ST   TIME  NODES 
NODELIST(REASON)

   6955371 debug test   hpczty  R   0:43  1 cas011

 

From: Brian Andrus
Sent: June 30, 2021 22:29
To: taleinterve...@sjtu.edu.cn
Subject: Re: [slurm-users] Is there a bug in the PrivateData=jobs option of slurmdbd?

 

I suspect your job fell out of the default time window for sacct.

Add a time window that you know includes when the job ran and you will likely 
see it.

Brian Andrus

On 6/30/2021 3:53 AM, taleinterve...@sjtu.edu.cn 
  wrote:

Hello,

 

We find a strange behavior about sacct and PrivateData option of slurmdbd. Our 
original configuration is setting “PrivateData = 
accounts,jobs,usage,users,reservations” in slurm.conf and not setting 
“PrivateData” in slurmdbd.conf. At this point, common user can see all others 
job information with sacct. Now we add option “PrivateData =jobs” to 
slurmdbd.conf, then common users even can’t see their own jobs using sacct.

 

According.to https://slurm.schedmd.com/slurmdbd.conf.html , setting “jobs” in 
PrivateData should only prevent user from viewing others’ job. Why it also hide 
jobs submit by user itself from sacct query?

 

The test records as below:

 before add option “PrivateData =jobs” to slurmdbd.conf 
==

[2021-06-30T18:18:07+0800][hpczty@login3] ~/downloads> sbatch testjob.sh

Submitted batch job 6944660

 

[2021-06-30T18:18:11+0800][hpczty@login3] ~/downloads> squeue

 JOBID PARTITION NAME USER ST   TIME  NODES 
NODELIST(REASON)

   6944660 debug test   hpczty PD   0:00  1 (None)

 

[2021-06-30T18:18:16+0800][hpczty@login3] ~/downloads> sacct

   JobIDJobName  PartitionAccount  AllocCPUS  State ExitCode

 -- -- -- -- -- 

6944660test  debug   acct-hpc  1RUNNING  0:0

6944660.bat+  batch  acct-hpc  1RUNNING  0:0

6944660.ext+ extern  acct-hpc  1RUNNING  0:0

 

 

 after add option “PrivateData =jobs” to slurmdbd.conf 
==

[2021-06-30T18:21:27+0800][hpczty@login3] ~/downloads> sbatch testjob.sh

Submitted batch job 6944665

 

[2021-06-30T18:21:30+0800][hpczty@login3] ~/downloads> squeue

 JOBID PARTITION NAME USER ST   TIME  NODES 
NODELIST(REASON)

   6944665 debug test   hpczty PD   0:00  1 (None)

 

[2021-06-30T18:21:32+0800][hpczty@login3] ~/downloads> sacct

   JobIDJobName  PartitionAccount  AllocCPUS  State ExitCode

 -- -- -- -- -- 

(no jobs shown)

 

Thanks



[slurm-users] What does the 'Root/Cluster association' level in the Resource Limits document mean?

2022-02-07 Thread taleintervenor
Hi all,

 

According to the Resource Limits page (
https://slurm.schedmd.com/resource_limits.html ), there is a Root/Cluster
association level below the account level that provides default limits. But
how can I check or modify this "cluster association"? Using the command
sacctmgr show association, I can only list all users' associations.

 

Consider the scenario in which we want to set a default node-number limit for
all users. A command such as sacctmgr modify user set grptres="node=8" can
indeed set the limit on all users at once, but it will overwrite the original
per-user limits on some specific accounts, so it may not be a satisfying
solution. If the "cluster association" exists, it may be exactly what we want.
So how do we set the "cluster association"?



[slurm-users] Re: What does the 'Root/Cluster association' level in the Resource Limits document mean?

2022-02-10 Thread taleintervenor
Well, 'sacctmgr modify cluster name=***' is exactly what we want, and,
inspired by this command, we found that 'sacctmgr show cluster' clearly lists
all the cluster associations.
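
(For example, with an explicit format, something like the following lists the
limits attached to the cluster-level association; the exact set of accepted
format fields may differ between Slurm versions:)

> sacctmgr show cluster format=cluster,grptres,maxjobs,qos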

 

But during testing we found another problem. When a limit is defined both at
the cluster level and at the user level, the smaller one takes effect; the
user association does not take precedence over the lower-level one. For
example:

> sacctmgr show association format=cluster,account,user,grptres,qos

   ClusterAccount   User   GrpTRES  QOS

-- -- -- - 

sjtupi   root   gres/gpu=1   normal

sjtupi   acct-hpcnormal

sjtupi   acct-hpc hpcztygres/gpu=2   normal

The cluster association defines a 1-GPU limit and the user association defines
a 2-GPU limit, and a 2-GPU job is then blocked:

> scontrol show job 6567880

JobId=6567880 JobName=test

   UserId=hpczty(3861) GroupId=hpczty(3861) MCS_label=N/A

   Priority=127 Nice=0 Account=acct-hpc QOS=normal

   JobState=PENDING Reason=AssocGrpGRES Dependency=(null)

   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0

   …

   NumNodes=1-1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*

   TRES=cpu=1,mem=7G,node=1,billing=1,gres/gpu=2

   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*

   MinCPUsNode=1 MinMemoryCPU=7G MinTmpDiskNode=0

   Features=(null) DelayBoot=00:00:00

   …

According to the official document
https://slurm.schedmd.com/resource_limits.html , a User association at
hierarchy level 3 should take precedence over a Cluster association at
hierarchy level 5. Is this a bug, or is the document wrong?

 

From: Paul Brunk
Sent: February 10, 2022 10:28
To: Slurm User Community List
Subject: Re: [slurm-users] What does the 'Root/Cluster association' level in the Resource Limits document mean?

 

Hi:

 

You can use e.g. 'sacctmgr show -s users', and you'll see each user's cluster
association as one of the output columns.  If the name were 'yourcluster',
then you could do: sacctmgr modify cluster name=yourcluster set grpTres="node=8".

 

== 

Paul Brunk, system administrator

Georgia Advanced Resource Computing Center

Enterprise IT Svcs, the University of Georgia

 

 

On 2/8/22, 2:33 AM, "slurm-users" <slurm-users-boun...@lists.schedmd.com> wrote:

…[H]ow to check or modify this “cluster association”? Using command
sacctmgr show association, I can only list all users’ association.

 

Considering the scene in which we want to set a default node number
limitation for all users, command such as sacctmgr modify user set
grptres="node=8" do can set the limitation on all users at once, but it will
cover the original per-user limitation on some specific account. So it may
not be an satisfying solution. If the “cluster association” exists, it may
be exactly what we want. So how to set the “cluster association”?



[slurm-users] why does sacct display the wrong username while the UID is right?

2022-03-12 Thread taleintervenor
Hi all:

 

We encountered a strange bug when querying job history using sacct. As shown
below, we try to list user hpczbzt's jobs, and sacct does filter the right
jobs belonging to this user, but their username is displayed as phywht.

 

> sacct -X --user=hpczbzt
--format=jobid%16,jobidraw,user,uid,partition,start,end,AllocCPUS,state%20

   JobID JobIDRaw  UserUID  Partition
Start End  AllocCPUSState

  - -- --
--- --- -- 

 9882328 9882328 phywht   6270   dgx2
2022-03-13T04:50:12 Unknown  6  RUNNING

 9882330 9882330 phywht   6270   dgx2
2022-03-13T04:50:12 Unknown  6  RUNNING

 9882332 9882332 phywht   6270   dgx2
2022-03-13T04:50:12 Unknown  6  RUNNING

 9882335 9882335 phywht   6270   dgx2
2022-03-13T04:50:12 Unknown  6  RUNNING

 9882337 9882337 phywht   6270   dgx2
2022-03-13T04:50:12 Unknown  6  RUNNING

 9884211 9884211 phywht   6270   a100
2022-03-12T23:56:02 2022-03-13T00:13:43  8CANCELLED by 6270

 9884265 9884265 phywht   6270   a100
2022-03-13T00:14:22 Unknown  8  RUNNING

 9884308 9884308 phywht   627064c512g
2022-03-13T01:18:44 2022-03-13T01:37:04  4CANCELLED by 6270

 9884413 9884413 phywht   627064c512g
2022-03-13T04:52:06 2022-03-13T05:59:49 40COMPLETED

 9884431 9884431 phywht   6270   a100
2022-03-13T06:09:02 2022-03-13T09:32:45  8COMPLETED

 9887011 9887011 phywht   6270 debug64c5+
2022-03-13T11:06:44 2022-03-13T11:07:41  1CANCELLED by 6270

 

The UID shown by sacct is right, and the actual UID of phywht is 6272, as
shown below:

 

> id phywht

uid=6272(phywht) gid=6272(phywht) groups=6272(phywht)

> id hpczbzt

uid=6270(hpczbzt) gid=6270(hpczbzt) groups=6270(hpczbzt)

 

Those two system accounts are both stored in LDAP. We have also checked that
they are consistent on both the slurmctld and slurmdbd nodes. What's more,
scontrol and squeue show the right username, hpczbzt:

 

> scontrol show job 9884265

JobId=9884265 JobName=af_test_session

   UserId=hpczbzt(6270) GroupId=hpczbzt(6270) MCS_label=N/A

   Priority=519 Nice=0 Account=acct-phywht QOS=normal

   JobState=RUNNING Reason=None Dependency=(null)

..

> squeue --user=hpczbzt

 JOBID PARTITION NAME USER ST   TIME  NODES
NODELIST(REASON)

   9884265  a100 af_test_  hpczbzt  R   11:43:46  1 gpu04

   9882328  dgx2 repeat_V  hpczbzt  R7:07:56  1 vol05

..

 

So is there any guess as to why only sacct displays the wrong username?



[slurm-users] how to locate the problem when slurm fails to restrict the gpu usage of user jobs

2022-03-23 Thread taleintervenor
Hi, all:

 

We found a problem: a Slurm job submitted with an argument such as --gres
gpu:1 is not restricted in its GPU usage, and the user can still see all GPU
cards on the allocated nodes.

Our GPU nodes have 4 cards each, with gres.conf as follows:

> cat /etc/slurm/gres.conf
Name=gpu Type=NVlink_A100_40GB File=/dev/nvidia0 CPUs=0-15
Name=gpu Type=NVlink_A100_40GB File=/dev/nvidia1 CPUs=16-31
Name=gpu Type=NVlink_A100_40GB File=/dev/nvidia2 CPUs=32-47
Name=gpu Type=NVlink_A100_40GB File=/dev/nvidia3 CPUs=48-63

 

And for testing, we submit a simple batch job like:

#!/bin/bash

#SBATCH --job-name=test
#SBATCH --partition=a100
#SBATCH --nodes=1
#SBATCH --ntasks=6
#SBATCH --gres=gpu:1
#SBATCH --reservation="gpu test"

hostname
nvidia-smi
echo end

 

Then in the output file nvidia-smi shows all 4 GPU cards, but we expected to
see only the 1 allocated GPU card.

 

The official Slurm documentation says it will set the CUDA_VISIBLE_DEVICES
environment variable to restrict the GPU cards available to the user, but we
did not find such a variable in the job environment. We only confirmed that it
does exist in the prolog script environment, by adding the debug command "echo
$CUDA_VISIBLE_DEVICES" to the Slurm prolog script.

 

So how does Slurm cooperate with the NVIDIA tools so that a job's user only
sees the allocated GPU cards? What is required of the NVIDIA GPU driver, CUDA
toolkit or any other component for Slurm to correctly restrict GPU usage?



[slurm-users] Re: how to locate the problem when slurm fails to restrict the gpu usage of user jobs

2022-03-24 Thread taleintervenor
Well, this is indeed the point. We didn't set ConstrainDevices=yes in
cgroup.conf. After adding it, the GPU restriction works as expected.

But what is the relation between the GPU restriction and cgroups? I never
heard that cgroups can limit GPU card usage. Isn't it a feature of CUDA or the
NVIDIA driver?
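
(Our current understanding, which is only an assumption on our side, is that
Slurm's cgroup device constraint simply whitelists the allocated /dev/nvidiaN
files for the job's cgroup. On a cgroup v1 node, something like the following,
run from inside a job, appears to show the allowed devices, though the exact
path depends on the cgroup setup:)

> cat /sys/fs/cgroup/devices/slurm/uid_$(id -u)/job_${SLURM_JOB_ID}/devices.list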

 

From: Sean Maxwell
Sent: March 23, 2022 23:05
To: Slurm User Community List
Subject: Re: [slurm-users] how to locate the problem when slurm fails to restrict the gpu usage of user jobs

 

Hi,

 

If you are using cgroups for task/process management, you should verify that 
your /etc/slurm/cgroup.conf has the following line:

 

ConstrainDevices=yes
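
For reference, a typical cgroup.conf might look something like the excerpt
below; the other Constrain lines are just common companions and depend on your
setup:

ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes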

 

I'm not sure about the missing environment variable, but the absence of the 
above in cgroup.conf is one way the GPU devices can be unconstrained in the 
jobs.

 

-Sean

 

 

 

On Wed, Mar 23, 2022 at 10:46 AM taleinterve...@sjtu.edu.cn wrote:

Hi, all:

 

We found a problem that slurm job with argument such as --gres gpu:1 didn’t be 
restricted with gpu usage, user still can see all gpu card on allocated nodes.

Our gpu node has 4 cards with their gres.conf to be:

> cat /etc/slurm/gres.conf

Name=gpu Type=NVlink_A100_40GB File=/dev/nvidia0 CPUs=0-15

Name=gpu Type=NVlink_A100_40GB File=/dev/nvidia1 CPUs=16-31

Name=gpu Type=NVlink_A100_40GB File=/dev/nvidia2 CPUs=32-47

Name=gpu Type=NVlink_A100_40GB File=/dev/nvidia3 CPUs=48-63

 

And for test, we submit simple job batch like:

#!/bin/bash

#SBATCH --job-name=test

#SBATCH --partition=a100

#SBATCH --nodes=1

#SBATCH --ntasks=6

#SBATCH --gres=gpu:1

#SBATCH --reservation="gpu test"

hostname

nvidia-smi

echo end

 

Then in the out file the nvidia-smi showed all 4 gpu cards. But we expect to 
see only 1 allocated gpu card.

 

Official document of slurm said it will set CUDA_VISIBLE_DEVICES env var to 
restrict the gpu card available to user. But we didn’t find such variable 
exists in job environment. We only confirmed it do exist in prolog script 
environment by adding debug command “echo $CUDA_VISIBLE_DEVICES” to slurm 
prolog script.

 

So how do slurm co-operate with nvidia tools to make job user only see its 
allocated gpu card? What is the requirement on nvidia gpu drivers, CUDA toolkit 
or any other part to help slurm correctly restrict the gpu usage?



[slurm-users] what is the elegant way to drain a node from the epilog with a self-defined reason?

2022-05-03 Thread taleintervenor
Hi, all:

 

We need to detect some problems at job-end time, so we put some detection
scripts in the Slurm epilog, which should drain the node if a check does not
pass.

I know that exiting the epilog with a non-zero code will make Slurm drain the
node automatically, but in that case the drain reason is always marked as
"Epilog error", and our auto-repair program then has trouble determining how
to repair the node.

Another way is to call scontrol directly from the epilog to drain the node,
but the official doc https://slurm.schedmd.com/prolog_epilog.html says:

"Prolog and Epilog scripts should be designed to be as short as possible and
should not call Slurm commands (e.g. squeue, scontrol, sacctmgr, etc). ...
Slurm commands in these scripts can potentially lead to performance issues and
should not be used."
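
(For concreteness, the scontrol variant we have in mind is roughly the
fragment below; the check script path and the reason text are placeholders:)

if ! /usr/local/sbin/post_job_check.sh; then
    scontrol update NodeName="$SLURMD_NODENAME" State=DRAIN Reason="post-job check failed"
fi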

So what is the best way to drain a node from the epilog with a self-defined
reason, or to tell Slurm to attach a more verbose message than the "Epilog
error" reason?



[slurm-users] what is the possible reason for the secondary slurmctld node not allocating jobs after takeover?

2022-06-03 Thread taleintervenor
Hi, all:

 

Our cluster has 2 Slurm control nodes, and scontrol show config shows:

> scontrol show config

.

SlurmctldHost[0]= slurm1

SlurmctldHost[1]= slurm2

StateSaveLocation   = /etc/slurm/state

.

Of course we have made sure that both nodes have the same Slurm configuration,
mount the same NFS share at StateSaveLocation and can read/write it (but their
operating systems differ: slurm1 is CentOS 7 and slurm2 is CentOS 8).

When slurm1 controls the cluster and slurm2 works in standby mode, the cluster
has no problem.

But when we use "scontrol takeover" on slurm2 to switch the primary role, we
find that newly submitted jobs all get stuck in the PD state.

No job is allocated resources by slurm2, no matter how long we wait. Meanwhile
old running jobs complete without problems, and query commands like "sinfo"
and "sacct" all work well.

The pending reason is first shown as "priority" in squeue, but after we
manually update the priority it becomes "none" and the job is still stuck in
the PD state.

While slurm2 is primary there is no significant error in slurmctld.log. Only
after we restart the slurm1 service, so that slurm2 returns to the standby
role, does it report lots of errors such as:

error: Invalid RPC received MESSAGE_NODE_REGISTRATION_STATUS while in
standby mode

error: Invalid RPC received REQUEST_COMPLETE_PROLOG while in standby mode

error: Invalid RPC received REQUEST_COMPLETE_JOB_ALLOCATION while in standby
mode

 

So is there any suggestion for finding out why slurm2 works abnormally as the
primary controller?



[slurm-users] Re: what is the possible reason for the secondary slurmctld node not allocating jobs after takeover?

2022-06-04 Thread taleintervenor
Well, after increasing the slurmctld log level to debug, we did find some
errors related to munge, like:

[2022-06-04T15:17:21.258] debug:  auth/munge: _decode_cred: Munge decode 
failed: Failed to connect to "/run/munge/munge.socket.2": Resource temporarily 
unavailable (retrying ...)

 

But when we test munge manually, it works well between slurm2 and the other
compute nodes.

> munge -n | ssh node010 unmunge

The authenticity of host 'node010 (192.168.1.10)' can't be established.

RSA key fingerprint is SHA256:/fx4zQPDDPHj7df6ml0Fd0kn8cIKkSO0OgKpF+qcRDI.

Are you sure you want to continue connecting (yes/no/[fingerprint])? yes

Warning: Permanently added 'node010,192.168.1.10' (RSA) to the list of known 
hosts.

Password:

STATUS:  Success (0)

ENCODE_HOST: slurm2 (192.168.0.33)

ENCODE_TIME: 2022-06-04 16:11:35 +0800 (1654330295)

DECODE_TIME: 2022-06-04 16:11:52 +0800 (1654330312)

TTL: 300

CIPHER:  aes128 (4)

MAC: sha256 (5)

ZIP: none (0)

UID: root (0)

GID: root (0)

LENGTH:  0

Of course munge on the compute nodes and unmunge on slurm2 also work well.

 

So what else does slurmctld require from munge? And how does Slurm's
auth/munge usage differ from a manual munge/unmunge test?

 

From: Brian Andrus
Sent: June 3, 2022 21:16
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] what is the possible reason for the secondary slurmctld node not allocating jobs after takeover?

 

Offhand, I would suggest double check munge and versions of slurmd/slurmctld.

Brian Andrus

On 6/3/2022 3:17 AM, taleinterve...@sjtu.edu.cn 
  wrote:

Hi, all:

 

Our cluster set up 2 slurm control node and scontrol show config as below:

> scontrol show config

…

SlurmctldHost[0]= slurm1

SlurmctldHost[1]= slurm2

StateSaveLocation   = /etc/slurm/state

…

Of course we have make sure both node has the some slurm conf and mount the 
same nfs on StateSaveLocation and can read/write it. (but there system is 
different, slurm1 is centos7 and slurm2 is centos8)

When slurm1 control the cluster and slurm2 work in standby mode, the cluster 
has no problem.

But when we use “scontrol takeover” on slurm2 to switch the primary role, we 
find new-submit jobs all stuck in PD state.

No job will be allocated resource by slurm2, no matter how long we wait. 
Meanwhile old running jobs can complete without problem, and query command like 
“sinfo”, “sacct” all work well.

The pending reason is firstly shown as “priority” in squeue, but after we 
manually update the priority, it become “none” reason and still stuck in PD 
state.

During slurm2 primary period, there is no significant error in slurmctld.log. 
Only after we restart the slurm1 service to let slurm2 return to standby role, 
it report lots of error as:

error: Invalid RPC received MESSAGE_NODE_REGISTRATION_STATUS while in standby 
mode

error: Invalid RPC received REQUEST_COMPLETE_PROLOG while in standby mode

error: Invalid RPC received REQUEST_COMPLETE_JOB_ALLOCATION while in standby 
mode

 

So is there any suggestion to find the reason why slurm2 work abnormally as 
primary controller?



[slurm-users] slurm continuously logs _remove_accrue_time_internal accrue_cnt underflow errors

2022-06-16 Thread taleintervenor
Hi all:

 

We found that slurmctld keeps logging error messages such as:

[2022-06-16T04:01:20.219] error: _remove_accrue_time_internal: QOS normal
accrue_cnt underflow

[2022-06-16T04:01:20.219] error: _remove_accrue_time_internal: QOS normal
acct acct-ioomj accrue_cnt underflow

[2022-06-16T04:01:20.219] error: _remove_accrue_time_internal: QOS normal
user 3901 accrue_cnt underflow

[2022-06-16T04:01:20.219] error: _remove_accrue_time_internal: assoc_id
2676(acct-ioomj/ioomj-stu3/(null)) accrue_cnt underflow

[2022-06-16T04:01:20.219] error: _remove_accrue_time_internal: assoc_id
2623(acct-ioomj/(null)/(null)) accrue_cnt underflow

[2022-06-16T04:01:20.219] error: _remove_accrue_time_internal: assoc_id
1(root/(null)/(null)) accrue_cnt underflow

But Slurm itself seems to work well, and when querying the reported
users/accounts with sacctmgr everything seems to be OK.

So what does the underflow mean? Does it imply some kind of mismatch between
slurmdbd database records?

How can we fix the problem and stop Slurm from reporting such messages?

 



[slurm-users] Is there a split-brain danger when using a backup slurmdbd?

2022-06-27 Thread taleintervenor
Hi, all:

 

We noticed that slurmdbd provides the configuration option DbdBackupHost for
setting up a secondary slurmdbd node. Since slurmdbd is closely tied to the
database, we wonder whether multiple slurmdbd instances introduce the
split-brain danger that is a common topic in database high-availability
discussions. Can there be a case in which slurmdbd_A and slurmdbd_B fail to
recognize each other's state and both work as the active node?

Another related question: when the primary slurmdbd node works well, will the
standby slurmdbd node write anything to the database? If the standby slurmdbd
won't write anything, is it safe to connect slurmdbd_A to mysql_A and
slurmdbd_B to mysql_B separately, and use multi-source replication to keep
mysql_A and mysql_B in sync?



[slurm-users] how does slurmctld determine whether a compute node is not responding?

2022-07-11 Thread taleintervenor
Hi, all:

 

Recently we found some strange log entries in slurmctld.log about nodes not
responding, such as:

[2022-07-09T03:23:10.692] error: Nodes node[128-168,170-178] not responding

[2022-07-09T03:23:58.098] Node node171 now responding

[2022-07-09T03:23:58.099] Node node165 now responding

[2022-07-09T03:23:58.099] Node node163 now responding

[2022-07-09T03:23:58.099] Node node172 now responding

[2022-07-09T03:23:58.099] Node node170 now responding

[2022-07-09T03:23:58.099] Node node175 now responding

[2022-07-09T03:23:58.099] Node node164 now responding

[2022-07-09T03:23:58.099] Node node178 now responding

[2022-07-09T03:23:58.099] Node node177 now responding

Meanwhile, slurmd.log and nhc.log on those nodes all look fine at the reported
time points.

So we guess that slurmctld launched some kind of probe towards those compute
nodes and did not get a response, which led it to consider those nodes not
responding.

Then the question is: what probe does slurmctld launch? How does it determine
whether a node is responsive or non-responsive?

And is it possible to customize slurmctld's behavior for this detection, for
example the wait timeout or the retry count before it declares a node not
responding?



[slurm-users] Re: how does slurmctld determine whether a compute node is not responding?

2022-07-11 Thread taleintervenor
Hello, Kamil Wilczek:

Well, I agree that the non-responding case may be caused by an unstable
network, since our Slurm cluster has two groups of nodes that are
geographically distant and linked only by Ethernet. The reported nodes are all
in one building while the slurmctld node is in another building.
But we can do nothing about the network infrastructure, so we are more
interested in adjusting Slurm so that it tolerates such short periods of
non-response.
Or is it possible to tell slurmctld to run this detection through a certain
proxy node? For example, we have a backup slurmctld node in the same building
as the reported compute nodes; if Slurm could use this backup controller node
to probe that part of the compute nodes, the result might be more stable.

-----Original Message-----
From: Kamil Wilczek
Sent: July 11, 2022 15:53
To: Slurm User Community List; taleinterve...@sjtu.edu.cn
Subject: Re: [slurm-users] how does slurmctld determine whether a compute node is not responding?

Hello,

I know that this is not quite the answer, but you could additionally (and maybe 
you already did this :)) check if this is not a network
problem:

* Are the nodes available outside of Slurm during that time? SSH, ping?
* If you have a monitoring system (Prometheus, Icinga, etc.), are
   there any issues reported?

And lastly, did you try to set log level to "debug" for "slurmd"
and "slurmctld"?

Kind Regards
-- 

W dniu 11.07.2022 o 09:32, taleinterve...@sjtu.edu.cn pisze:
> Hi, all:
> 
> Recently we found some strange log in slurmctld.log about node not 
> responding, such as:
> 
> [2022-07-09T03:23:10.692] error: Nodes node[128-168,170-178] not 
> responding
> 
> [2022-07-09T03:23:58.098] Node node171 now responding
> 
> [2022-07-09T03:23:58.099] Node node165 now responding
> 
> [2022-07-09T03:23:58.099] Node node163 now responding
> 
> [2022-07-09T03:23:58.099] Node node172 now responding
> 
> [2022-07-09T03:23:58.099] Node node170 now responding
> 
> [2022-07-09T03:23:58.099] Node node175 now responding
> 
> [2022-07-09T03:23:58.099] Node node164 now responding
> 
> [2022-07-09T03:23:58.099] Node node178 now responding
> 
> [2022-07-09T03:23:58.099] Node node177 now responding
> 
> Meanwhile, checking slurmd.log and nhc.log on those node all seem to 
> be ok at the reported timepoint.
> 
> So we guess it’s slurmctld launch some detection towards those compute 
> node and didn’t get response, thus lead to slurmctld thinking those 
> node to be not responding.
> 
> Then the question is what detect action do slurmctld launched? How did 
> it determine whether a node is responsive or non-responsive?
> 
> And is it possible to customize slurmctld’s behavior on such 
> detection, for example wait timeout or retry count before determine 
> the node to be not responding?
> 

--
Kamil Wilczek  [https://keys.openpgp.org/] 
[D415917E84B8DA5A60E853B6E676ED061316B69B]
Laboratorium Komputerowe
Wydział Matematyki, Informatyki i Mechaniki Uniwersytet Warszawski

ul. Banacha 2
02-097 Warszawa

Tel.: 22 55 44 392
https://www.mimuw.edu.pl/
https://www.uw.edu.pl/




[slurm-users] Can slurm be configured to count CG jobs into the MaxJobs or MaxSubmit limits?

2022-07-18 Thread taleintervenor
Hi all,

 

Recently we found a problem caused by too many CG (completing) jobs. When a
user continuously submits small jobs which complete quickly, the numbers of
RUNNING and PENDING jobs are indeed restricted by MaxJobs and MaxSubmit in the
user's association, but Slurm does not count the CG jobs. Because we use an
epilog to collect some job information onto shared storage (Lustre, GPFS), too
many CG jobs running the epilog at the same time jam up the shared storage.

So we want to restrict the CG jobs just like MaxJobs and MaxSubmit do. Is this
behavior supported by Slurm?



[slurm-users] What is the complete logic to calculate node number in job_submit.lua

2022-09-25 Thread taleintervenor


Hi all:

 

When designing restrictions in job_submit.lua, I found that there is no member
of the job_desc struct that can directly be used to determine the node number
finally allocated to a job. job_desc.min_nodes seems to be a close answer, but
it will be 0xFFFE when the user does not specify the --nodes option. In that
case we thought we could use job_desc.num_tasks and job_desc.ntasks_per_node
to calculate the node number, but again we found that ntasks_per_node may also
be the default value 0xFFFE if the user did not specify the related option.

So what is the complete and elegant way to predict a job's node number in
job_submit.lua in all cases, no matter how the user writes their submit
options?