Hi Robert,
On 4/16/21 12:39 pm, Robert Peck wrote:
Please can anyone suggest how to instruct SLURM not to massacre ALL my
jobs because ONE (or a few) node(s) fails?
You will also probably want this for your srun: --kill-on-bad-exit=0
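For anyone scripting their submissions, here is a minimal sketch of how the two flags discussed in this thread could be combined on one srun command line. It assumes only the documented srun options --no-kill and --kill-on-bad-exit; the helper name build_srun_command and the program ./worker.sh are hypothetical, and whether this resolves the node-failure problem described above is not confirmed anywhere in the thread.

```python
import shlex


def build_srun_command(program, no_kill=True, kill_on_bad_exit=False):
    """Return an srun argv list for `program` (a list of strings).

    no_kill          -> adds --no-kill, asking Slurm not to terminate the
                        job automatically when an allocated node fails.
    kill_on_bad_exit -> sets --kill-on-bad-exit to 1 or 0; 0 means the step
                        is not terminated when one task exits non-zero.
    """
    command = ["srun"]
    if no_kill:
        command.append("--no-kill")
    command.append("--kill-on-bad-exit=1" if kill_on_bad_exit else "--kill-on-bad-exit=0")
    command.extend(program)
    return command


if __name__ == "__main__":
    # ./worker.sh and its arguments are placeholders for the real job step.
    print(shlex.join(build_srun_command(["./worker.sh", "--chunk", "7"])))
```

The list form can be handed to subprocess.run() unchanged, which avoids shell quoting problems when the program arguments contain spaces.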
What does the scontrol command below show?
scontrol show
Hi, Robert,
Robert Peck writes:
Michael, thanks for the tip. I can give that a go but don't know if it will
solve my issue.
Jess, sorry I have no knowledge of how the university handles the cluster
system in that sense.
Has anyone else been reporting bugs with the --no-kill flag recently on
your forum?
--no-kill isn't all that
Try https://github.com/clusterinthecloud
William
On Mon, 19 Apr 2021, 17:24 Nicholas Yue, wrote:
Hi,
I am looking for information on how it might be possible to spin up an
AWS SLURM cluster via Terraform.
Thank you in advance.
Cheers
--
Nicholas Yue
Graphics - Arnold, Alembic, RenderMan, OpenGL, HDF5
Custom Dev - C++ porting, OSX, Linux, Windows
http://au.linkedin.com/in/nicholasyue
ht
Just running 'id' provides info for the current user in the working
environment. Here, you are in a bash shell started by slurmd which is
running as root.
Running 'id <username>' returns the default settings for the user.
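The difference between the two forms can be reproduced outside Slurm. The sketch below is only an illustration, not something taken from the thread: it compares the group IDs attached to the running process (roughly what plain 'id' prints) with the group IDs the user database assigns to a name (roughly what 'id' followed by a user name prints). The function names are made up for this example.

```python
import os
import pwd


def process_groups():
    """Group IDs carried by the running process: effective GID plus
    supplementary groups. These are inherited from whatever started the
    process, so they can differ from the user database."""
    return sorted(set(os.getgroups()) | {os.getegid()})


def database_groups(username):
    """Group IDs the user database (passwd/group, LDAP, ...) assigns to
    `username`, independent of any running process."""
    entry = pwd.getpwnam(username)
    return sorted(set(os.getgrouplist(username, entry.pw_gid)))


if __name__ == "__main__":
    print("process groups: ", process_groups())
    try:
        name = pwd.getpwuid(os.geteuid()).pw_name
        print("database groups:", database_groups(name))
    except KeyError:
        # The effective UID has no entry in the user database.
        print("no database entry for uid", os.geteuid())
```

If the two lists differ inside a shell started by srun, the process inherited its groups from its parent rather than from a fresh lookup for that user.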
Try doing things like 'chgrp', then run 'id', and you will see the
changes c
You'll definitely need to get slurmd and slurmctld working before proceeding
further. slurmctld is the Slurm controller mentioned when you do the srun.
Though there are probably some other steps you can take to make the slurmd and
slurmctld system services available, it might be simpler to do the
Hi wenxia...@126.com,
I think it is safer to get some experience with Slurm *without* initially
using a High Availability setup for the slurmctld server.
I highly recommend that you study the SchedMD presentations available on the
page https://slurm.schedmd.com/publications.html. In particular
Hi Prentice,
I've just done that on one of my test systems - and it's not deleting a
no longer used QoS, but 'renaming' the most used one. So plenty of test
jobs that used it in the database :)
Removing it from the cluster has not modified the entries in sacct, as
far as I can tell - they st
Hi Florian,
Thanks for the valuable reply and help.
My answers to you are in green.
- Do you have an active support contract with SchedMD? AFAIK they only
offer paid support.
*I don't have an active support contract. I just started learning Slurm by
installing it on my Fedora machine. This is the
Sorry for my confusion, I shouldn't try to write emails before coffee!
On Mon, Apr 19, 2021 at 7:43 AM Bruno Gomes Pessanha <
bruno.pessa...@gmail.com> wrote:
That is showing that I'm in different groups depending on how I run the
command id.
PS: I'm running the controller and workers in docker containers using
privileged mode.
Bruno
On Mon, 19 Apr 2021 at 13:24, Dustin Lang wrote:
This is telling you you're root in the docker container, right?
On Mon, Apr 19, 2021 at 4:51 AM Bruno Gomes Pessanha <
bruno.pessa...@gmail.com> wrote:
> Could somebody help me with this?
> Pretty strange behaviour. If I run "id", it shows different groups than if I
> run "id myuser":
>
> [root@ctrl-
Hi wenxia...@126.com,
What is your full DNS domain name, and is /etc/resolv.conf consistent with
your DNS? It seems to me that your DNS server is named "slurmctld-source":
NS slurmctld-source.
so you may have an error in the DNS setup.
The DNS SRV record can be looked up by:
$ host -t S
Hi Johnsy,
1. Do you have an active support contract with SchedMD? AFAIK they only
offer paid support.
2. The error message is pretty straightforward: slurmctld is not running.
Did you start it (systemctl start slurmctld)?
3. slurmd needs to run on the node(s) you want to run on as wel
Could somebody help me with this?
Pretty strange behaviour. If I run "id", it shows different groups than if I
run "id myuser":
[root@ctrl-slurm /]# srun --pty -p local --uid myuser bash
[myuser@node-slurm /]$ id
uid=868295925(myuser) gid=0(root) groups=0(root),979(cgred)
[myuser@node-slurm /]$ id myu
Hi list,
I configured DNS as below, but found that Slurm could not find the IP address
of ctld. I added the following to the file /etc/named.conf:
zone "slurmctld-source" IN {
    type master;
    file "slurmctld-source.zone";
};
and added the file /var/named/slurmctld-source.zone below:
$TTL 1D
@ IN
Hi list,
There is a problem when dealing with Slurm's high availability.
In my environment, I store the state files on the local hard disk of each Ctld
node and use a shell script that checks the output of "scontrol ping" to sync
the files at a fixed interval (2s; if I make the interval shorter, then it wi