Re: [slurm-users] Grid engine slaughtering parallel jobs when any one of them fails (copy)

2021-04-19 Thread Christopher Samuel
Hi Robert, On 4/16/21 12:39 pm, Robert Peck wrote: Please can anyone suggest how to instruct SLURM not to massacre ALL my jobs because ONE (or a few) node(s) fails? You will also probably want this for your srun: --kill-on-bad-exit=0 What does the scontrol command below show? scontrol show

Re: [slurm-users] Grid engine slaughtering parallel jobs when any one of them fails (copy)

2021-04-19 Thread Loris Bennett
Hi, Robert, Robert Peck writes: > Michael, thanks for the tip. I can give that a go but don't know if it will > solve my issue. > > Jess, sorry I have no knowledge of how the university handles the cluster > system in that sense. > > Has anyone else been reporting bugs with the --no-kill flag

Re: [slurm-users] Grid engine slaughtering parallel jobs when any one of them fails (copy)

2021-04-19 Thread Robert Peck
Michael, thanks for the tip. I can give that a go but don't know if it will solve my issue. Jess, sorry I have no knowledge of how the university handles the cluster system in that sense. Has anyone else been reporting bugs with the --no-kill flag recently on your forum? --no-kill isn't all that

Re: [slurm-users] SLURM on AWS via Terraform

2021-04-19 Thread William Brown
Try https://github.com/clusterinthecloud William On Mon, 19 Apr 2021, 17:24 Nicholas Yue, wrote: > Hi, > > I am looking for information on how it might be possible to spin up an > AWS SLURM cluster via Terraform. > > Thank you in advance. > > Cheers > -- > Nicholas Yue > Graphics - Arnold,

[slurm-users] SLURM on AWS via Terraform

2021-04-19 Thread Nicholas Yue
Hi, I am looking for information on how it might be possible to spin up an AWS SLURM cluster via Terraform. Thank you in advance. Cheers -- Nicholas Yue Graphics - Arnold, Alembic, RenderMan, OpenGL, HDF5 Custom Dev - C++ porting, OSX, Linux, Windows http://au.linkedin.com/in/nicholasyue ht

Re: [slurm-users] User id inconsistency

2021-04-19 Thread Brian Andrus
Just running 'id' provides info for the current user in the working environment. Here, you are in a bash shell started by slurmd which is running as root. Running 'id <username>' returns the default settings for the user. Try doing things like 'chgrp' then run 'id' and you will see the changes c
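
The distinction described above can be checked from inside a job shell with a short script. This is only a minimal sketch, not something posted in the thread: the function name `compare_groups` is hypothetical, and it assumes a Unix system with Python 3 available on the compute node. It compares the groups attached to the running process (roughly what a bare 'id' reports) with the groups the user database assigns to the same uid (roughly what 'id' followed by a user name reports).

```python
import os
import pwd


def compare_groups():
    """Return (process_groups, database_groups) for the current uid.

    compare_groups is a hypothetical helper name used for illustration.
    """
    # Groups attached to this running process, inherited from whatever
    # started it (comparable to what a bare `id` reports).
    process_groups = sorted(set(os.getgroups()) | {os.getegid()})
    try:
        entry = pwd.getpwuid(os.getuid())
    except KeyError:
        # The uid has no entry in the user database visible on this host.
        return process_groups, None
    # Groups the user database assigns to that account (comparable to
    # what `id` followed by the user name reports).
    database_groups = sorted(set(os.getgrouplist(entry.pw_name, entry.pw_gid)))
    return process_groups, database_groups


if __name__ == "__main__":
    proc, db = compare_groups()
    print("process groups: ", proc)
    print("database groups:", db)
```

If the two lists differ, the process was started with credentials that do not match the account's entries in the user database, which is the situation reported in this thread.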

Re: [slurm-users] [External] Slurm Configuration assistance: Unable to use srun after installation (slurm on fedora 33)

2021-04-19 Thread Renfro, Michael
You'll definitely need to get slurmd and slurmctld working before proceeding further. slurmctld is the Slurm controller mentioned when you do the srun. Though there's probably some other steps you can take to make the slurmd and slurmctld system services available, it might be simpler to do the

Re: [slurm-users] In high availability scenario, what is the best way to synchronize state files with scontrol takeover command?

2021-04-19 Thread Ole Holm Nielsen
Hi wenxia...@126.com, I think it is safer to get some experience with Slurm *without* using initially a High Availability setup for the slurmctld server. I highly recommend you to study the SchedMD presentations available in the page https://slurm.schedmd.com/publications.html. In particular

Re: [slurm-users] safe to delete old QOSes?

2021-04-19 Thread Tina Friedrich
Hi Prentice, I've just done that on one of my test systems - and it's not deleting a no longer used QoS, but 'renaming' the most used one. So plenty of test jobs that used it in the database :) Removing it from the cluster has not modified the entries in sacct, as far as I can tell - they st

Re: [slurm-users] [External] Slurm Configuration assistance: Unable to use srun after installation (slurm on fedora 33)

2021-04-19 Thread Johnsy K. John
Hi Florian, Thanks for the valuable reply and help. My answers to you are in green. - Do you have an active support contract with SchedMD? AFAIK they only offer paid support. *I don't have an active support contact. I just started learning slurm by installing it on my fedora machine. This is the

Re: [slurm-users] User id inconsistency

2021-04-19 Thread Dustin Lang
Sorry for my confusion, I shouldn't try to write emails before coffee! On Mon, Apr 19, 2021 at 7:43 AM Bruno Gomes Pessanha < bruno.pessa...@gmail.com> wrote: > That is showing that I'm in different groups depending on how I run the > command id. > > PS: I'm running the controller and workers in

Re: [slurm-users] User id inconsistency

2021-04-19 Thread Bruno Gomes Pessanha
That is showing that I'm in different groups depending on how I run the command id. PS: I'm running the controller and workers in docker containers using privileged mode. Bruno On Mon, 19 Apr 2021 at 13:24, Dustin Lang wrote: > This is telling you you're root in the docker container, right? >

Re: [slurm-users] User id inconsistency

2021-04-19 Thread Dustin Lang
This is telling you you're root in the docker container, right? On Mon, Apr 19, 2021 at 4:51 AM Bruno Gomes Pessanha < bruno.pessa...@gmail.com> wrote: > Somebody could help me with this? > Pretty strange behaviour. If I run "id" it shows different groups if I run > "id myuser": > > [root@ctrl-

Re: [slurm-users] configless in Slurm, can not find the ip of ctld

2021-04-19 Thread Ole Holm Nielsen
Hi wenxia...@126.com, What is your full DNS domain name, and is /etc/resolv.conf consistent with your DNS? It seems to me that your DNS server is named "slurmctld-source": NS slurmctld-source. so you may have an error in the DNS setup. The DNS SRV record can be looked up by: $ host -t S

Re: [slurm-users] [External] Slurm Configuration assistance: Unable to use srun after installation (slurm on fedora 33)

2021-04-19 Thread Florian Zillner
Hi Johnsy, 1. Do you have an active support contract with SchedMD? AFAIK they only offer paid support. 2. The error message is pretty straightforward, slurmctld is not running. Did you start it (systemctl start slurmctld)? 3. slurmd needs to run on the node(s) you want to run on as wel

[slurm-users] User id inconsistency

2021-04-19 Thread Bruno Gomes Pessanha
Somebody could help me with this? Pretty strange behaviour. If I run "id" it shows different groups if I run "id myuser": [root@ctrl-slurm /]# srun --pty -p local --uid myuser bash [myuser@node-slurm /]$ id uid=868295925(myuser) gid=0(root) groups=0(root),979(cgred) [myuser@node-slurm /]$ id myu

[slurm-users] configless in Slurm, can not find the ip of ctld

2021-04-19 Thread 刘文晓
Hi list, I configured the DNS as below, found the Slurm could not find the IP of ctld. add below to the file :/etc/named.conf zone "slurmctld-source" IN { type master; file "slurmctld-source.zone"; }; and add below file /var/named/slurmctld-source.zone: $TTL 1D @ IN

[slurm-users] In high availability scenario, what is the best way to synchronize state files with scontrol takeover command?

2021-04-19 Thread 刘文晓
Hi list, There is a problem when dealing with Slurm's high availability. Now, In my env, I store the state file in the local hard disk for Ctld nodes, and use a shell script referencing the output of "scontrol ping" to sync files with interval time (2s, if making the time shorter then it wi
