Re: [slurm-users] How to force jobs to run next in queue

2019-03-12 Thread Bjørn-Helge Mevik
Sean Brisbane writes: > Does anyone have a feeling for why setting a high Priority on a partition > makes jobs run in that partition first, even when a job in a different > Partition may have a much higher overall priority? Perhaps because that is what it was designed to do? Did you try us

Re: [slurm-users] weight setting not working

2019-03-12 Thread Andy Leung Yin Sui
Thank you for your reply. I was running 18.08.1 and updated to 18.08.6; that solved the problem. Thank you. On Tue, 12 Mar 2019 at 20:23, Eli V wrote: > > On Tue, Mar 12, 2019 at 1:14 AM Andy Leung Yin Sui wrote: > > > > Hi, > > > > I am new to slurm and want to use the weight option to schedule the j

Re: [slurm-users] problems with slurm and openmpi

2019-03-12 Thread Gilles Gouaillardet
Rick, The issue is that SLURM can only provide PMI2 support, while it seems Open MPI only supports PMIx. One option is to rebuild SLURM with PMIx as explained by Daniel, and then srun --mpi=pmix ... If you do not want to (or cannot) rebuild SLURM, you can use the older pmi or pmi2. In that case,
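For reference, a minimal sketch of the two launch options described above (the binary name is an assumption, not from the thread):

    # Option 1: after rebuilding SLURM against PMIx
    srun --mpi=pmix ./my_openmpi_app

    # Option 2: keep the existing SLURM and fall back to PMI2;
    # this assumes Open MPI itself was built with PMI support
    srun --mpi=pmi2 ./my_openmpi_app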

Re: [slurm-users] problems with slurm and openmpi

2019-03-12 Thread Daniel Letai
Hi. On 12/03/2019 22:53:36, Riccardo Veraldi wrote: Hello, after trying hard for over 10 days I am forced to write to the list.

[slurm-users] Resolution! was Re: Mysterious job terminations on Slurm 17.11.10

2019-03-12 Thread Andy Riebs
It appears that we have gotten to the bottom of this problem! We discovered that we only seem to see this problem if our overnight test script is run with "nohup," as we have been doing for several years. Typically, we would see the mysterious cancellations about once every other day, or 3-4 ti

Re: [slurm-users] problems with slurm and openmpi

2019-03-12 Thread Cyrus Proctor
Both your Slurm and OpenMPI config.logs would be helpful in debugging here. Throw in your slurm.conf as well for good measure. Also, what type of system are you running, what type of high-speed fabric are you trying to run on, and what does your driver stack look like? I know the feeling and will

[slurm-users] problems with slurm and openmpi

2019-03-12 Thread Riccardo Veraldi
Hello, after trying hard for over 10 days I am forced to write to the list. I am not able to get SLURM to work with Open MPI. Open MPI-compiled binaries won't run under SLURM, while all non-Open MPI programs run just fine under "srun". I am using SLURM 18.08.5, building the rpm from the tarball: rpmbuild -ta s
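One quick diagnostic worth noting here (not part of the quoted message): srun can report which PMI plugins the installed SLURM build actually provides, which tells you whether pmix is even available before rebuilding anything.

    # list the MPI/PMI plugin types this SLURM build supports
    srun --mpi=list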

Re: [slurm-users] How to force jobs to run next in queue

2019-03-12 Thread Thomas M. Payerle
Are you using the priority/multifactor plugin? What are the values of the various Priority* weight factors? On Tue, Mar 12, 2019 at 12:42 PM Sean Brisbane wrote: > Hi, > > Thanks for your help. > > Neither setting qos nor setting priority works for me. However I > have found the cause if
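For context, the multifactor plugin and its weights live in slurm.conf; the values below are purely illustrative, not a recommendation:

    PriorityType=priority/multifactor
    PriorityWeightAge=1000
    PriorityWeightFairshare=10000
    PriorityWeightPartition=1000
    PriorityWeightQOS=10000

Running sprio -l then shows how each factor contributes to a pending job's priority.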

Re: [slurm-users] How do I impose a limit on the memory requested by a job?

2019-03-12 Thread Paul Edmon
Slurm should automatically block or reject jobs that can't run on that partition in terms of memory usage for a single node, so you shouldn't need to do anything. If you need something less than the max memory per node, then you will need to enforce some limits. We do this via a job_submit.lua
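A minimal sketch of such a job_submit.lua check (this is not the poster's actual script; the cap is hypothetical, and field names can differ between Slurm versions):

    function slurm_job_submit(job_desc, part_list, submit_uid)
        local max_mem_mb = 190000  -- hypothetical per-node cap
        local mem = job_desc.pn_min_memory
        -- the upper bound filters out the NO_VAL64 "unset" sentinel and
        -- values carrying the MEM_PER_CPU flag bit; a production script
        -- should handle both cases explicitly
        if mem ~= nil and mem > max_mem_mb and mem < 0x7FFFFFFFFFFF then
            slurm.log_user("job rejected: more than %u MB per node requested", max_mem_mb)
            return slurm.ERROR
        end
        return slurm.SUCCESS
    end

    function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
        return slurm.SUCCESS
    end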

Re: [slurm-users] How to force jobs to run next in queue

2019-03-12 Thread Sean Brisbane
Hi, Thanks for your help. Neither setting qos nor setting priority works for me. However, I have found the cause, if not the reason. Using a Priority setting on the partition called "Priority" in slurm.conf seems to force all jobs waiting in this queue to run first, regardless of any qos set
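A plausible explanation (not confirmed in this thread): the partition-level Priority parameter sets both PriorityJobFactor and PriorityTier, and the scheduler considers partitions in a higher tier first regardless of individual job priorities, which matches the behaviour described. Splitting the two keeps the priority boost without the tiering effect (node names taken from the thread, values assumed):

    # raise job priority without moving the partition into a higher
    # scheduling tier
    PartitionName=Priority Nodes=compute[01-02] PriorityJobFactor=10 PriorityTier=1
    PartitionName=normal   Nodes=compute[01-02] PriorityJobFactor=1  PriorityTier=1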

[slurm-users] How do I impose a limit on the memory requested by a job?

2019-03-12 Thread David Baker
Hello, I have set up a serial queue to run small jobs on the cluster. Actually, I route jobs to this queue using the job_submit.lua script. Any 1-node job using up to 20 CPUs is routed to this queue, unless a user submits their job with an exclusive flag. The partition is shared and so I def
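A minimal sketch of the kind of routing rule described (the partition name and thresholds are assumptions; job_desc field names vary between Slurm versions):

    function slurm_job_submit(job_desc, part_list, submit_uid)
        -- route single-node, non-exclusive jobs to the serial queue;
        -- a CPU-count check like the 20-CPU rule above is omitted for brevity
        if job_desc.max_nodes == 1 and job_desc.shared ~= 0 then
            job_desc.partition = "serial"
        end
        return slurm.SUCCESS
    end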

Re: [slurm-users] How to deal with jobs that need to be restarted several times

2019-03-12 Thread Renfro, Michael
If the failures happen right after the job starts (or close enough), I'd use an interactive session with srun (or some other wrapper that calls srun, such as fisbatch). Our hpcshell wrapper for srun is just a bash function: hpcshell () { srun --partition=interactive "$@" --pty bash -i; }
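Usage would then look something like (flags assumed; anything passed is forwarded to srun):

    hpcshell --nodes=1 --time=2:00:00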

[slurm-users] How to deal with jobs that need to be restarted several times

2019-03-12 Thread Selch, Brigitte (FIDF)
Hello, Some jobs have to be restarted several times until they run. Users start the job, it fails, they make some changes, they start the job again, it fails again ... and so on. So they want to keep the resources until the job runs properly. Is there a possibility to 'inherit' alloc
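One standard approach (a general suggestion, not from this thread) is salloc, which holds an allocation open while the user iterates:

    # grab the resources once...
    salloc --nodes=1 --time=2:00:00
    # ...then, inside the allocation shell, rerun until it works;
    # the nodes stay allocated between attempts
    srun ./my_job
    # edit, rerun, repeat; exit the shell to release the allocation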

Re: [slurm-users] weight setting not working

2019-03-12 Thread Eli V
On Tue, Mar 12, 2019 at 1:14 AM Andy Leung Yin Sui wrote: > > Hi, > > I am new to slurm and want to use the weight option to schedule jobs. > I have some machines with the same hardware configuration, with GPU cards. I > use QoS to force users to request at least 1 GPU gres when submitting > jobs. > The ma
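For reference, node weights are set per node in slurm.conf, and the scheduler allocates lower-weight nodes first; the names and values below are assumptions:

    # non-GPU nodes are preferred (lower weight), keeping GPU nodes
    # free for jobs that actually need the gres
    NodeName=cpu[01-04] Weight=1
    NodeName=gpu[01-02] Weight=10 Gres=gpu:2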

Re: [slurm-users] How to force jobs to run next in queue

2019-03-12 Thread Bjørn-Helge Mevik
Sean Brisbane writes: > I'm trying to troubleshoot why the highest-priority job is not next to run; > jobs in the partition called "Priority" seem to run first. > [...] > The partition called "Priority" has a priority boost assigned through qos. > > PartitionName=Priority Nodes=compute[01-02] De