Re: [slurm-users] associations, limits, qos

2021-01-28 Thread Diego Zuccato
On 25/01/21 14:46, Durai Arasan wrote: > Jobs submitted with sbatch cannot run on multiple partitions. The job will be submitted to the partition where it can start first. (from the sbatch reference) Did I misunderstand, or can heterogeneous jobs work around this limitation? -- Diego Zuccato
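The workaround Diego alludes to could be sketched as a heterogeneous job whose components each target a different partition. This is only an illustration: the partition names, task counts, and program names below are made up, and the `hetjob` separator applies to Slurm 20.02+ (older releases used `packjob`):

```shell
#!/bin/bash
#SBATCH --partition=cpu_part --ntasks=8
#SBATCH hetjob
#SBATCH --partition=gpu_part --ntasks=1
# Each heterogeneous component is scheduled within its own partition.
srun --het-group=0 ./cpu_work &
srun --het-group=1 ./gpu_work &
wait
```

The components are co-scheduled as one job, so this differs from simply listing several partitions with --partition, where the whole job lands in a single partition.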

Re: [slurm-users] Fairshare tree after SLURM upgrade

2021-01-28 Thread Ole Holm Nielsen
On 1/29/21 8:03 AM, Gestió Servidors wrote: I'm going to upgrade my SLURM version from 17.11.5 to 19.05.1. I know this is not the latest version, but I manage another cluster that is also running this version. My question is: during the process, I need to upgrade "slurmdbd". All the fairshare t
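A cautious slurmdbd upgrade typically backs up the accounting database first and starts the new slurmdbd before the other daemons so it can convert the tables. A minimal sketch, assuming the common default database name and user (not taken from the thread):

```shell
# Stop the daemon and back up the accounting DB before touching anything.
systemctl stop slurmdbd
mysqldump -u slurm -p slurm_acct_db > slurm_acct_db_17.11.sql

# Upgrade the slurmdbd package, then start it first; on first start it
# converts the tables, which preserves the accumulated usage data.
systemctl start slurmdbd
```

The fairshare state (raw usage etc.) lives in those tables, so a successful conversion carries it through the upgrade.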

Re: [slurm-users] how do array jobs stored in slurmdb database?

2021-01-28 Thread Ole Holm Nielsen
On 1/29/21 3:51 AM, taleinterve...@sjtu.edu.cn wrote: The reason we need to delete job records from the database is that our billing system calculates user cost from these historical records. But after a Slurm system fault there will be some specific jobs which should not be charged. It seems the b

Re: [slurm-users] how do array jobs stored in slurmdb database?

2021-01-28 Thread Ole Holm Nielsen
On 1/29/21 3:51 AM, taleinterve...@sjtu.edu.cn wrote: Thanks for the help. The doc page is useful and we can get the actual job id now. I'm glad that you solved the problem. The reason we need to delete job records from the database is that our billing system calculates user cost from these histo

[slurm-users] Fairshare tree after SLURM upgrade

2021-01-28 Thread Gestió Servidors
Hello, I'm going to upgrade my SLURM version from 17.11.5 to 19.05.1. I know this is not the latest version, but I manage another cluster that is also running this version. My question is: during the process, I need to upgrade "slurmdbd". All the fairshare tree (with rawusage, effectvusage, fai

Re: [slurm-users] [EXT]Re: only 1 job running

2021-01-28 Thread Chandler
Thanks for the explanation, Brian. Seems turning on IOMMU helped, as well as adding sharing to slurm.conf: SelectType=select/cons_res SelectTypeParameters=CR_CPU Now all the CPUs are being used on all the compute nodes, so things are working as expected. Thanks to everyone else on the list who
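After changing the select plugin and restarting the daemons, a quick check that the settings took effect might look like this (output columns are from sinfo's standard %C format, allocated/idle/other/total):

```shell
# Show the active consumable-resource settings as slurmctld sees them.
scontrol show config | grep -E '^Select'

# Per-partition CPU counts as Allocated/Idle/Other/Total; with CR_CPU
# working, Allocated should climb past one job's worth per node.
sinfo -o '%P %C'
```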

Re: [slurm-users] how do array jobs stored in slurmdb database?

2021-01-28 Thread taleintervenor
Thanks for the help. The doc page is useful and we can get the actual job id now. The reason we need to delete job records from the database is that our billing system calculates user cost from these historical records. But after a Slurm system fault there will be some specific jobs which should not

Re: [slurm-users] [EXT]Re: only 1 job running

2021-01-28 Thread Brian Andrus
Yep, looks like you are on the right track. If the CPU count does not make sense to Slurm, it will drain the node, and jobs will not be able to start on it. There does seem to be more to it, though. Detailed info about a job and a node would help. The 'priority' pending jobs you can ignore. Those

Re: [slurm-users] [EXT]Re: only 1 job running

2021-01-28 Thread Chandler
Brian Andrus wrote on 1/28/21 13:59: What are the specific requests for resources from a job? Nodes, Cores, Memory, threads, etc? Well, the jobs are only asking for 16 CPUs each. The 255 threads figure is weird, though; it seems to be related to this: https://askubuntu.com/questions/1182818/dual-amd-ep

Re: [slurm-users] [EXT]Re: only 1 job running

2021-01-28 Thread Brian Andrus
You are getting close :) You can see why n010 is able to have multiple jobs. It shows more resources available. What are the specific requests for resources from a job? Nodes, Cores, Memory, threads, etc? Brian Andrus On 1/28/2021 12:52 PM, Chandler wrote: OK I'm getting this same output on

Re: [slurm-users] [EXT]Re: only 1 job running

2021-01-28 Thread Chandler
OK I'm getting this same output on nodes n[011-013]: # slurmd -C NodeName=n011 slurmd: error: FastSchedule will be removed in 20.02, as will the FastSchedule=0 functionality. Please consider removing this from your configuration now. slurmd: Considering each NUMA node as a socket slurmd: error:

Re: [slurm-users] only 1 job running

2021-01-28 Thread Chandler
Christopher Samuel wrote on 1/28/21 12:50: Did you restart the Slurm daemons when you added the new node? Some internal data structures (bitmaps) are built based on the number of nodes, and they need to be rebuilt with a restart in this situation. https://slurm.schedmd.com/faq.html#add_nodes
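Restarting the daemons after adding a node, per the FAQ entry above, might look like the following sketch (the node list and the use of pdsh are assumptions, not from the thread):

```shell
# On the controller, after slurm.conf lists the new node:
systemctl restart slurmctld

# On every compute node, so each slurmd rebuilds its node bitmaps;
# pdsh is just one convenient fan-out tool.
pdsh -w n[001-013] systemctl restart slurmd
```

Note that `scontrol reconfigure` is not sufficient here; adding or removing nodes requires a real daemon restart.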

Re: [slurm-users] [EXT]Re: only 1 job running

2021-01-28 Thread Brian Andrus
Ahh. On one of the new nodes do: slurmd -C The output of that will tell you what those settings should be. I suspect they are off, which forces them into drain mode. Brian Andrus On 1/28/2021 12:25 PM, Chandler wrote: Andy Riebs wrote on 1/28/21 07:53: If the only changes to your system h
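Comparing what the hardware actually reports against what slurm.conf declares is the usual check here; a sketch, with the node name and config path assumed:

```shell
# What slurmd detects on this node (CPUs, sockets, cores, threads, memory):
slurmd -C

# What slurm.conf claims for the same node; any mismatch in CPU/core/thread
# counts is enough to drain the node.
grep 'NodeName=n011' /etc/slurm/slurm.conf
```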

Re: [slurm-users] [EXT]Re: only 1 job running

2021-01-28 Thread Ole Holm Nielsen
On 28-01-2021 21:21, Chandler wrote: Brian Andrus wrote on 1/28/21 12:07: scontrol update state=resume nodename=n[011-013] I tried that but got, slurm_update error: Invalid node state specified As Chris Samuel said, you must restart the Slurm daemons when adding (or removing) nodes! See a

Re: [slurm-users] [EXT]Re: only 1 job running

2021-01-28 Thread Chandler
Andy Riebs wrote on 1/28/21 07:53: If the only changes to your system have been the slurm.conf configuration and the addition of a new node, the easiest way to track this down is probably to show us the diffs between the previous and current versions of slurm.conf, and a note about what's differe

Re: [slurm-users] [EXT]Re: only 1 job running

2021-01-28 Thread Chandler
Brian Andrus wrote on 1/28/21 12:07: scontrol update state=resume nodename=n[011-013] I tried that but got, slurm_update error: Invalid node state specified

Re: [slurm-users] only 1 job running

2021-01-28 Thread Christopher Samuel
On 1/27/21 9:28 pm, Chandler wrote: Hi list, we have a new cluster setup with Bright Cluster Manager. Looking into a support contract there, but trying to get community support in the meantime. I'm sure things were working when the cluster was delivered, but I provisioned an additional node

Re: [slurm-users] only 1 job running

2021-01-28 Thread Brian Andrus
Heh. Your nodes are drained. Do: scontrol update state=resume nodename=n[011-013] If they go back into a drained state, you need to look into why. That will be in the slurmctld log. You can also see it with 'sinfo -R'. Brian Andrus On 1/27/2021 10:18 PM, Chandler wrote: Made a little bit of
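The resume-and-diagnose loop Brian describes, sketched out (the log path is a common default, not stated in the thread):

```shell
sinfo -R                                        # reason each node is down/drained
scontrol update state=resume nodename=n[011-013]
sinfo -N -l                                     # check whether the nodes stayed up
grep -E 'n01[123]' /var/log/slurmctld.log       # if they drain again, this says why
```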

[slurm-users] sbatch output logs get truncated

2021-01-28 Thread Timo Rothenpieler
This started happening after upgrading Slurm from 20.02 to the latest 20.11. It seems like something exits too early, before slurm, or whatever else is writing that file, has a chance to flush the final output buffer to disk. For example, take this very simple batch script, which gets submitted
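The actual script from the mail is cut off in the archive; a minimal reproducer in the same spirit might look like this (a guess at its shape, not the original):

```shell
#!/bin/bash
#SBATCH --output=test-%j.log
echo "start"
sleep 5
echo "end"   # with the truncation bug, the final line(s) may be missing from the log
```

Comparing the log against the script's expected output after the job completes shows whether the tail of the buffer was flushed.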

Re: [slurm-users] how do array jobs stored in slurmdb database?

2021-01-28 Thread Ole Holm Nielsen
On 1/28/21 11:59 AM, taleinterve...@sjtu.edu.cn wrote: From a query command such as 'sacct -j 123456' I can see a series of jobs named 123456_1, 123456_2, etc. And I need to delete these job records from the MySQL database for some reason. But in the job_table of slurmdb, there is only one record with
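The "actual job id" that the follow-up says was found is what sacct calls JobIDRaw, the per-task id that the database keys on, e.g.:

```shell
# Show both the array-style id (123456_1, 123456_2, ...) and the raw
# per-task job id that appears in the accounting database.
sacct -j 123456 --format=JobID,JobIDRaw,State
```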

Re: [slurm-users] only 1 job running

2021-01-28 Thread Andy Riebs
Hi Chandler, If the only changes to your system have been the slurm.conf configuration and the addition of a new node, the easiest way to track this down is probably to show us the diffs between the previous and current versions of slurm.conf, and a note about what's different about the new n

[slurm-users] how do array jobs stored in slurmdb database?

2021-01-28 Thread taleintervenor
Hello, The question background is: From a query command such as 'sacct -j 123456' I can see a series of jobs named 123456_1, 123456_2, etc. And I need to delete these job records from the MySQL database for some reason. But in the job_table of slurmdb, there is only one record with id_job=123456. n
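In the accounting database, array tasks are not stored one row per 123456_N name; each task row carries its own id_job plus id_array_job/id_array_task columns tying it back to the array. A hedged look at those rows (the cluster-prefixed table name, database name, and credentials below are assumptions for illustration):

```shell
mysql -u slurm -p slurm_acct_db <<'SQL'
-- One row per array task; id_array_job is the id shown before the underscore.
SELECT id_job, id_array_job, id_array_task
  FROM mycluster_job_table
 WHERE id_array_job = 123456;
SQL
```

Editing these tables by hand is risky; identifying the raw ids this way and handling the exclusions in the billing layer is the safer route.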