On 25/01/21 14:46, Durai Arasan wrote:
> Jobs submitted with sbatch cannot run on multiple partitions. The job
> will be submitted to the partition where it can start first. (from
> sbatch reference)
Did I misunderstand, or can heterogeneous jobs work around this limitation?
--
Diego Zuccato
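For reference, a heterogeneous job can place its components in different partitions, whereas a plain sbatch with --partition=p1,p2 only runs in whichever single partition starts the job first (as the sbatch reference quoted above says). A minimal sketch, assuming Slurm 20.02 or later; the partition names and programs are placeholders, not anything from this thread:
#!/bin/bash
#SBATCH --partition=cpu --ntasks=4
#SBATCH hetjob
#SBATCH --partition=gpu --ntasks=1
srun ./cpu_part : ./gpu_part
Each component is scheduled in its own partition, so a job that can be split into components is not limited to a single partition.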
On 1/29/21 8:03 AM, Gestió Servidors wrote:
I’m going to upgrade my SLURM version from 17.11.5 to 19.05.1. I know this
is not the latest version, but I manage another cluster that is also running
this version. My question is: during the process, I need to upgrade
“slurmdbd”. All the fairshare t
On 1/29/21 3:51 AM, taleinterve...@sjtu.edu.cn wrote:
The reason we need to delete job records from the database is that our billing
system will calculate user cost from these historical records. But after a
Slurm system fault there will be some specific jobs which should not be charged.
It seems the b
On 1/29/21 3:51 AM, taleinterve...@sjtu.edu.cn wrote:
Thanks for the help. The doc page is useful and we can get the actual job id
now.
I'm glad that you solved the problem.
The reason we need to delete job records from the database is that our billing
system will calculate user cost from these histo
Hello,
I'm going to upgrade my SLURM version from 17.11.5 to 19.05.1. I know this is
not the latest version, but I manage another cluster that is also running this
version. My question is: during the process, I need to upgrade "slurmdbd". All
the fairshare tree (with rawusage, effectvusage, fai
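For anyone following the same path, a rough sketch of the usual order (service names, file paths, and the database name assume a typical MySQL/MariaDB + systemd setup; take a backup before anything else):
mysqldump -u slurm -p slurm_acct_db > slurm_acct_db.pre-upgrade.sql
systemctl stop slurmdbd
# install the 19.05 slurmdbd package, then run the first start in the
# foreground so the database conversion can be watched:
slurmdbd -D -vvv
# once the conversion completes, stop it and start the service normally:
systemctl start slurmdbd
# then upgrade slurmctld on the controller, and finally slurmd on the nodes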
Thanks for the explanation, Brian. It seems turning on IOMMU helped, as well as
adding sharing to slurm.conf:
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
Now all the CPUs are being used on all the compute nodes so things are working
as expected.
Thanks to everyone else on the list who
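For the archives, the settings above in context of a full slurm.conf, plus an example NodeName line; the node names and CPU/socket/thread counts here are assumptions, and the real values should be taken from 'slurmd -C' on each node:
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
NodeName=n[010-013] CPUs=128 Sockets=2 CoresPerSocket=64 ThreadsPerCore=1 State=UNKNOWN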
Thanks for the help. The doc page is useful and we can get the actual job id
now.
The reason we need to delete job records from the database is that our billing
system will calculate user cost from these historical records. But after a
Slurm system fault there will be some specific jobs which should not
Yep, looks like you are on the right track.
If the CPU count does not make sense to Slurm, it will drain the node,
and jobs will not be able to start on it.
There does seem to be more to it, though. Detailed info about a job and a node
would help.
The 'priority' pending jobs you can ignore. Those
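A few commands that gather that detail (the job id is a placeholder):
scontrol show node n011    # configured vs. actually reported CPUs, plus any Reason= for the drain
scontrol show job 12345    # the node/CPU/memory request the pending job is making
sinfo -R                   # the drain/down reason for every affected node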
Brian Andrus wrote on 1/28/21 13:59:
What are the specific requests for resources from a job?
Nodes, Cores, Memory, threads, etc?
Well, the jobs are only asking for 16 CPUs each. The 255 threads figure is weird
though; it seems to be related to this:
https://askubuntu.com/questions/1182818/dual-amd-ep
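One quick sanity check, independent of Slurm, is to see what the OS itself reports:
lscpu | egrep '^(CPU\(s\)|Thread|Core|Socket|NUMA node\(s\))'
and then check that the slurm.conf NodeName line (or the output of 'slurmd -C') agrees with those numbers.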
You are getting close :)
You can see why n010 is able to have multiple jobs. It shows more
resources available.
What are the specific requests for resources from a job?
Nodes, Cores, Memory, threads, etc?
Brian Andrus
On 1/28/2021 12:52 PM, Chandler wrote:
OK I'm getting this same output on
OK I'm getting this same output on nodes n[011-013]:
# slurmd -C
NodeName=n011
slurmd: error: FastSchedule will be removed in 20.02, as will the FastSchedule=0 functionality. Please consider removing this from your configuration now.
slurmd: Considering each NUMA node as a socket
slurmd: error:
Christopher Samuel wrote on 1/28/21 12:50:
Did you restart the slurm daemons when you added the new node? Some internal
data structures (bitmaps) are built based on the number of nodes and they need
to be rebuilt with a restart in this situation.
https://slurm.schedmd.com/faq.html#add_nodes
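A sketch of that restart, assuming systemd units and a pdsh-style tool for the compute nodes (adjust the node list):
systemctl restart slurmctld                     # on the controller
pdsh -w n[001-013] systemctl restart slurmd     # on every compute node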
Ahh.
On one of the new nodes do:
slurmd -C
The output of that will tell you what those settings should be. I
suspect they are off, which forces them into drain mode.
Brian Andrus
On 1/28/2021 12:25 PM, Chandler wrote:
Andy Riebs wrote on 1/28/21 07:53:
If the only changes to your system h
On 28-01-2021 21:21, Chandler wrote:
Brian Andrus wrote on 1/28/21 12:07:
scontrol update state=resume nodename=n[011-013]
I tried that but got,
slurm_update error: Invalid node state specified
As Chris Samuel said, you must restart the Slurm daemons when adding (or
removing) nodes!
See a
Andy Riebs wrote on 1/28/21 07:53:
If the only changes to your system have been the slurm.conf
configuration and the addition of a new node, the easiest way to
track this down is probably to show us the diffs between the previous
and current versions of slurm.conf, and a note about what's differe
Brian Andrus wrote on 1/28/21 12:07:
scontrol update state=resume nodename=n[011-013]
I tried that but got,
slurm_update error: Invalid node state specified
On 1/27/21 9:28 pm, Chandler wrote:
Hi list, we have a new cluster setup with Bright Cluster Manager.
Looking into a support contract there, but trying to get community
support in the meantime. I'm sure things were working when the cluster
was delivered, but I provisioned an additional node
Heh. Your nodes are drained.
do:
scontrol update state=resume nodename=n[011-013]
If they go back into a drained state, you need to look into why. That
will be in the slurmctld log. You can also see it with 'sinfo -R'
Brian Andrus
On 1/27/2021 10:18 PM, Chandler wrote:
Made a little bit of
This started happening after upgrading Slurm from 20.02 to the latest 20.11.
It seems like something exits too early, before Slurm, or whatever else
is writing that file, has a chance to flush the final output buffer to disk.
For example, take this very simple batch script, which gets submitted
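The script itself is cut off in the archive; a reproducer in the same spirit (an assumption, not the poster's actual script) might be:
#!/bin/bash
#SBATCH --output=trunc-test.out
echo "first line"
sleep 10
echo "last line"    # with the behaviour described, this last line can be missing from trunc-test.out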
On 1/28/21 11:59 AM, taleinterve...@sjtu.edu.cn wrote:
From a query command such as ‘sacct -j 123456’ I can see a series of jobs
named 123456_1, 123456_2, etc., and I need to delete these job records from
the MySQL database for some reason.
But in job_table of slurmdb, there is only one record with
Hi Chandler,
If the only changes to your system have been the slurm.conf
configuration and the addition of a new node, the easiest way to track
this down is probably to show us the diffs between the previous and
current versions of slurm.conf, and a note about what's different about
the new n
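For example (the filenames are placeholders):
diff -u /etc/slurm/slurm.conf.previous /etc/slurm/slurm.conf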
Hello,
The question background is:
From a query command such as 'sacct -j 123456' I can see a series of jobs
named 123456_1, 123456_2, etc., and I need to delete these job records from
the MySQL database for some reason.
But in job_table of slurmdb, there is only one record with id_job=123456.
n
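For reference, slurmdbd stores array tasks that have run in the per-cluster job table, with id_array_job/id_array_task columns, so a query along these lines lists the real per-task records before anything is deleted (the cluster name, user, and database name here are assumptions):
mysql -u slurm -p slurm_acct_db -e "SELECT job_db_inx, id_job, id_array_job, id_array_task FROM mycluster_job_table WHERE id_array_job=123456;"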