A few things to look at: make sure DNS/hostname resolution works, disable
any firewalls for testing (you can lock them down again afterwards), and
make sure the slurm.conf file is the same on all nodes.
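A quick way to sanity-check those three things (a rough sketch; the host
names are just examples):

# hostname resolution: every node should resolve the others consistently
getent hosts controller-node compute-node01
# firewall: disable for testing, re-enable and open 6817/6818 later
systemctl stop firewalld
# slurm.conf consistency: compare checksums across nodes
md5sum /etc/slurm/slurm.conf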
I've just done a 20.11.9 to 24.05.2 upgrade along with a CentOS 7.9 to RHEL
9.10 upgrade on all my nodes.
Sid
r you end up
with a situation where the slurmd can't talk to the running slurmstepd and
the job(s) get lost (it shows up as a "Protocol Error").
Ole sent me a link to this guide which mostly worked.
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/#upgrade-slurmd-on-nodes
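The general shape of that kind of in-place slurmd upgrade is roughly as
follows (a sketch assuming an RPM-based install, not the guide's exact
steps; the running slurmstepd processes keep the jobs alive as long as the
old and new versions stay within Slurm's supported upgrade range):

systemctl stop slurmd     # jobs keep running under the old slurmstepd
dnf upgrade slurm\*       # or yum; install the new packages in place
systemctl daemon-reload
systemctl start slurmd    # new slurmd reconnects to the old slurmstepd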
G'Day all,
Can anyone shed light on the parameter "ResumeAfterTime" returned from the
command
"scontrol show node XXX"
Can it be used to automatically resume a "Down"ed node?
Sid
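I believe ResumeAfterTime gets populated when a node is downed or drained
with a resume timer. A hedged sketch (the node name is an example, and the
ResumeAfter option is an assumption to verify against your version):

# down a node and, on versions that support it, schedule an automatic resume
scontrol update NodeName=node001 State=DOWN Reason="cable check" ResumeAfter=3600
# manual resume, which works on any recent version
scontrol update NodeName=node001 State=RESUME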
G'Day all,
I have 3 years' worth of job records in the Slurm DB and we do not have any
need to actually track anything at this stage. I would like to keep 12
months' worth of jobs, so I need to purge 2 years' worth at some point.
Is there a command to issue via scontrol to kick off a SlurmPurge using
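As far as I know the purging is driven by slurmdbd rather than scontrol:
you set the Purge* options in slurmdbd.conf and slurmdbd trims old records
itself. A sketch keeping 12 months of data (adjust to taste, take a DB
backup first, and expect the first purge of a large database to be slow):

# slurmdbd.conf
PurgeEventAfter=12months
PurgeJobAfter=12months
PurgeResvAfter=12months
PurgeStepAfter=12months
PurgeSuspendAfter=12months

Then restart slurmdbd to pick up the new settings.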
G'Day all,
I've been upgrading my cluster from 20.11.0 in small steps to get to
24.05.2. Currently I have all nodes on 23.02.8, the controller on 24.05.2,
and a single test node on 24.05.2. All are CentOS 7.9 (an upgrade to Oracle
Linux 8.10 is Phase 2 of the upgrades).
When I check the slurmd statu
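For checking a node during a staged upgrade like this, the basics are
something like (node name is an example):

systemctl status slurmd
slurmd -V                                  # version actually installed on the node
scontrol show node node001 | grep -i version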
if it goes wrong? 😊
> Regards,
>
> Tim
>
> --
> *Tim Cutts*
> Scientific Computing Platform Lead
> AstraZeneca
>
> Find out more about R&D IT Data, Analytics & AI and how we can support you
> by visiting our Service
G'day all,
I've been waiting for nodes to become idle before upgrading them, however
some jobs take a long time. If I try to remove all the packages, I assume
that kills the slurmstepd process and with it the job.
Sid
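Rather than waiting for nodes to go idle on their own, draining them lets
running jobs finish while keeping new ones off (standard commands; the node
names are examples):

scontrol update NodeName=node[001-010] State=DRAIN Reason="slurm upgrade"
sinfo -R            # drained/draining nodes and the reason
squeue -w node001   # what is still running on a given node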
That's a very interesting design. Looking at the SD665 V3 documentation,
am I correct that each node has dual 25 Gb/s SFP28 interfaces?
If so, then despite the dual nodes in a 1U configuration, you actually have
2 separate servers?
Sid
On Fri, 23 Feb 2024, 22:40 Ole Holm Nielsen via slurm-users, <
slurm-u
Is there a direct upgrade path from 20.11.0 to 22.05.6, or does it have to
be done in multiple steps?
Sid Young
On Fri, Nov 11, 2022 at 7:53 AM Marshall Garey wrote:
> We are pleased to announce the availability of Slurm version 22.05.6.
>
> This includes a fix to core selection for steps which cou
Brian / Christopher, that looks like a good process. Thanks guys, I will do
some testing and let you know.
If I mark a partition down and it has running jobs, what happens to those
jobs? Do they keep running?
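For reference, the command in question (the partition name is an example);
my understanding is that a DOWN partition stops new jobs being scheduled
there but does not, by itself, kill jobs that are already running:

scontrol update PartitionName=batch State=DOWN
scontrol update PartitionName=batch State=UP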
Sid Young
W: https://off-grid-engineering.com
W: (personal) https://sidyoung.com/
W
Sid Young
W: https://off-grid-engineering.com
W: (personal) https://sidyoung.com/
W: (personal) https://z900collector.wordpress.com/
On Tue, Feb 1, 2022 at 3:02 PM Christopher Samuel wrote:
> On 1/31/22 4:41 pm, Sid Young wrote:
>
> > I need to replace a faulty DIMM chip in our log
20-30 minutes, scheduler is a separate node and I could email back any
users who try to SSH while the node is down.
Sid Young
W: https://off-grid-engineering.com
W: (personal) https://sidyoung.com/
W: (personal) https://z900collector.wordpress.com/
What's wrong with just using the tools as-is?
Sid Young
On Thu, Sep 16, 2021 at 5:54 AM Ondrej Valousek
wrote:
> Hi list,
> I am wondering if there is a plugin that allows submitting jobs via systemd
> (i.e. using systemd-run) on exec nodes.
>
> I have actually modified SGE source
00%|100.00%
#trihpc|energy|0.00%|0.00%|0.00%|0.00%|0.00%|0.00%
#trihpc|billing|14.62%|4.78%|0.00%|80.60%|0.00%|100.00%
#trihpc|fs/disk|0.00%|0.00%|0.00%|0.00%|0.00%|0.00%
#trihpc|vmem|0.00%|0.00%|0.00%|0.00%|0.00%|0.00%
#trihpc|pages|0.00%|0.00%|0.00%|0.00%|0.00%|0.00%
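Those rows look like per-TRES utilization percentages; for context, output
in that shape can be produced with sreport's parsable/percent options,
roughly like this (the dates are placeholders):

sreport -t percent -P cluster utilization \
    --tres=cpu,mem,energy,billing,fs/disk,vmem,pages \
    start=2021-07-01 end=2021-08-01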
Sid Young
W: https://off-grid-engin
Why not spin them up as virtual machines... then you could build real
(separate) clusters.
Sid Young
W: https://off-grid-engineering.com
W: (personal) https://sidyoung.com/
W: (personal) https://z900collector.wordpress.com/
On Wed, Jul 28, 2021 at 12:07 AM Brian Andrus wrote:
> You can
Hi Luis,
I have exactly the same issue with a user who needs the reported cores to
reflect the requested cores. If you find a solution that works please
share. :)
Thanks
Sid Young
Translational Research Institute
Sid Young
W: https://off-grid-engineering.com
W: (personal) https
Thanks for the reply... I will look into how to configure it.
Sid Young
Translational Research Institute
On Wed, Jun 23, 2021 at 7:06 AM Prentice Bisbal wrote:
> Yes,
>
> You need to use the cgroups plugin.
>
>
> On Fri, Jun 18, 2021, 12:29 AM Sid Young wrote:
>
>>
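For context, enabling cgroup-based containment generally involves something
like the following (a minimal sketch, not the exact settings from this
thread; the plugin choices depend on your setup):

# slurm.conf
ProctrackType=proctrack/cgroup
TaskPlugin=task/affinity,task/cgroup

# cgroup.conf
ConstrainCores=yes
ConstrainRAMSpace=yes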
Relevant lines from the slurm.conf file:
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
ReturnToService=1
CpuFreqGovernors=OnDemand,Performance,UserSpace
CpuFreqDef=Performance
Sid Young
Translational Research Institute
G'Day all,
Is there a tool that will extract the job counts in JSON format? Such as
#running, #pending, #onhold, etc.
I am trying to build some custom dashboards for our new cluster and
this would be a really useful set of metrics to gather and display.
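In the absence of a dedicated tool, a rough way to get those counts as JSON
from squeue (newer releases also have squeue --json if built with JSON
support):

squeue -h -o '%T' | sort | uniq -c |
  awk 'BEGIN{printf "{"} {printf "%s\"%s\": %d", (NR>1?", ":""), $2, $1} END{print "}"}'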
Sid Young
W: https://off
Hi all,
I'm interested in using slurmrestd, but it does not appear to be built
when you do an rpmbuild, and reading through the docs does not indicate a
switch needed to include it (unless I missed that)... any ideas on how the
RPM is built?
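If I remember right, slurmrestd sits behind a build conditional in
slurm.spec, so it has to be requested explicitly. Treat the exact flag and
dependencies as assumptions to verify against your version's spec file:

# http-parser-devel and json-c-devel (plus libjwt-devel for JWT auth) installed first
rpmbuild -ta slurm-*.tar.bz2 --with slurmrestd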
Sid Young
W: https://off-grid-engineering.
Yes, on reflection I should have said utilization rather than usage! I've
been researching which combination of metrics would best give me an overall
utilization figure for the HPC.
Sadly it's not as clear-cut as I would have hoped.
Does anyone have any ideas?
Sid Young
On Fri, May 14,
Hi All,
Is there a way to define an effective "usage rate" for an HPC cluster using
the data captured in the Slurm database?
Primarily I want to see if it can be helpful in presenting a case to the
business for buying more hardware for the HPC :)
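The closest I'm aware of is sreport against the accounting DB, e.g. overall
cluster utilization for a period plus the heaviest users (the dates and
counts are placeholders):

sreport cluster utilization start=2021-01-01 end=2021-04-01 -t percent
sreport user topusage start=2021-01-01 end=2021-04-01 TopCount=10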
Sid Young
You can push a new conf file and issue an "scontrol reconfigure" on the fly
as needed... I do it on our cluster: do the nodes first, then the login
nodes, then the slurm controller... you are making a huge issue of a very
basic task...
Sid
On Tue, 4 May 2021, 22:28 Tina Friedrich,
wrote:
>
Hi David,
I use SaltStack to push out the slurm.conf file to all nodes and do a
"scontrol reconfigure" of the slurmd; this makes management much easier
across the cluster. You can also do service restarts from one point, etc.
Avoid NFS mounts for the config: if the mount locks up, you're screwed.
htt
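A minimal version of that push-and-reconfigure workflow, using Salt's CLI
(the target pattern and paths are examples; a real setup would presumably
use proper Salt states):

# push the new config to every node, then ask slurmctld to re-read it everywhere
salt-cp '*' /etc/slurm/slurm.conf /etc/slurm/slurm.conf
scontrol reconfigure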