Re: [slurm-users] Using LSPSuite with SBATCH

2017-11-13 Thread Andy Riebs
Paul, it would be incredibly helpful to reveal
* What version of Slurm you are using
* What Slurm commands you are using
* The mpirun command(s) that do effect what you desire
* Your slurm configuration -- preferably a copy of slurm.conf (with node names and IP addresses obscured for security re

[slurm-users] Using LSPSuite with SBATCH

2017-11-13 Thread Banks, Paul
Hi, On my cluster I normally run LSP programs across multiple nodes with mpirun (MVAPICH2) and can do that successfully; however, I have always had trouble getting them to run successfully with srun. Either it will error out or the program will instead run multiple instances of the same program ac

Re: [slurm-users] [slurm-dev] Re: Installing SLURM locally on Ubuntu 16.04

2017-11-13 Thread E V
On Mon, Nov 13, 2017 at 10:15 AM, Benjamin Redling <benjamin.ra...@uni-jena.de> wrote:
> On 11/12/17 4:52 PM, Gennaro Oliva wrote:
>> On Sun, Nov 12, 2017 at 10:03:18AM -0500, Will L wrote:
>>> I just tried `sudo apt-get remove --purge munge`, etc., and munge itself
>> this should have

Re: [slurm-users] Priority wait

2017-11-13 Thread Roe Zohar
Hi guys, Thanks for the reply. I will try to add my slurm.conf tomorrow. Sadly, it's a bit of a problem since it's on a cluster disconnected from the net and with no easy way of getting it out :( I will try tomorrow with the hope that anybody could catch some bad parameter. Thanks, Roy On Nov 13,

Re: [slurm-users] Priority wait

2017-11-13 Thread Douglas Jacobsen
Assuming you are using backfill, I suspect this is caused by using the default SchedulerParameters, specifically the bf maxjobs or other similar limits that would prevent jobs from being reviewed. Setting DebugFlags=Backfill will help greatly in debugging these issues. There are analogous parameters

Re: [slurm-users] Priority wait

2017-11-13 Thread A
I'm guessing you should have sent them to cluster Decepticon, instead. In all seriousness though, provide the conf file. You might have accidentally set a maximum number of running jobs somewhere. On Nov 13, 2017 7:28 AM, "Benjamin Redling" wrote: > Hi Roy, > > On 11/13/17 2:37 PM, Roe Zohar

[slurm-users] Graphing job metrics

2017-11-13 Thread Nicholas McCollum
Now that there is a slurm-users mailing list, I thought I would share something with the community that I have been working on to see if anyone else is interested in it. I have a lot of students on my cluster and I really wanted a way to show my users how efficient their jobs are, or let them know

Re: [slurm-users] Priority wait

2017-11-13 Thread Benjamin Redling
Hi Roy, On 11/13/17 2:37 PM, Roe Zohar wrote: [...] I sent 3000 jobs with feature Optimus and part are running while part are pending. Which is ok. But I have sent 1000 jobs to Megatron and they are all pending, stating they wait because of priority. Why is that? B.t.w if I change their pr

Re: [slurm-users] [slurm-dev] Re: Installing SLURM locally on Ubuntu 16.04

2017-11-13 Thread Benjamin Redling
On 11/12/17 4:52 PM, Gennaro Oliva wrote:
> On Sun, Nov 12, 2017 at 10:03:18AM -0500, Will L wrote:
>> I just tried `sudo apt-get remove --purge munge`, etc., and munge itself
> this should have uninstalled slurm-wlm also, did you reinstall it with apt?
>> seems to be working fine. But I still g

[slurm-users] partition with several nodes not following name pattern

2017-11-13 Thread Vincent Berenz
Hi, For example, this configuration in slurm.conf works fine:
  NodeName=kilimanjaro CPUs=16 RealMemory=80419 Sockets=1 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
  PartitionName=slurmtest Nodes=kilimanjaro Default=YES MaxTime=INFINITE State=UP
This configuration works also:
  NodeName

[slurm-users] Priority wait

2017-11-13 Thread Roe Zohar
Hello all, I have a cluster named Autobot. In this cluster I have servers: Optimus[1-10] and Megatron[1-10]. I sent 3000 jobs with feature Optimus and part are running while part are pending. Which is ok. But I have sent 1000 jobs to Megatron and they are all pending, stating they wait because of

[slurm-users] No response to persist_init

2017-11-13 Thread Ulf Markwardt
Dear all, we have tried to update to Slurm 17.02.9. What do these messages mean (and how can we get rid of them):
slurmstepd error: slurm_persist_conn_open: No response to persist_init
slurmstepd error: slurmdbd: Sending PersistInit msg: No error
Thanks, Ulf -- __