Re: [slurm-users] Maxjobs to accrue age priority points

2019-12-13 Thread Chris Samuel
On Friday, 13 December 2019 7:01:48 AM PST Christopher Benjamin Coffey wrote: > Maybe because that setting is just not included in the default list of > settings shown? That is counterintuitive to this in the man page for > sacctmgr: > > show [] > Display information about the specified …
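
For illustration (editorial sketch, not part of the archived message): the accrue limit can be requested explicitly instead of relying on sacctmgr's default column list, using the format= syntax Christopher uses later in this thread (billybob is his test QOS):

    sacctmgr show qos where name=billybob format=Name,MaxJobsAccruePerUser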

Re: [slurm-users] error: persistent connection experienced an error

2019-12-13 Thread Chris Samuel
On 13/12/19 12:19 pm, Christopher Benjamin Coffey wrote: "error: persistent connection experienced an error" Looking at the source code, that comes from here: if (ufds.revents & POLLERR) { error("persistent connection experienced an error"); …
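
Editorial note, not part of the archived message: two standard commands that can help confirm the slurmctld <-> slurmdbd persistent connection is healthy (their relevance to this particular error is an assumption):

    scontrol show config | grep -i accountingstorage              # where slurmctld expects to reach slurmdbd
    sacctmgr show cluster format=Cluster,ControlHost,ControlPort  # whether the cluster is registered with slurmdbd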

[slurm-users] error: persistent connection experienced an error

2019-12-13 Thread Christopher Benjamin Coffey
Hi All, I wonder if any of you have seen these errors in slurmdbd.log: error: persistent connection experienced an error  When we see these errors, we also see job errors related to accounting in Slurm, like: slurmstepd: error: _prec_extra: Could not find task_memory_cg, this should never happen …

Re: [slurm-users] srun: job steps and generic resources

2019-12-13 Thread Brian W. Johanson
If those sruns are wrapped in salloc, they work correctly. The first srun can be eliminated by adding SallocDefaultCommand for salloc (disabled in this example with --no-shell): SallocDefaultCommand="srun -n1 -N1 --mem-per-cpu=0 --gres=gpu:0 --mpi=none --pty $SHELL" [user@login005 ~]$ salloc …
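
A sketch of how that setting and a session fit together; the SallocDefaultCommand line is the one Brian quotes, while the salloc resource values merely echo the srun options used elsewhere in this thread:

    # slurm.conf
    SallocDefaultCommand="srun -n1 -N1 --mem-per-cpu=0 --gres=gpu:0 --mpi=none --pty $SHELL"

    # interactive session: the spawned shell step holds no GPU or memory,
    # so later job steps inside the allocation can still claim them
    salloc -p partition -N1 -n4 --gres=gpu:1 --time=00:30:00 --mem=1G
    srun --gres=gpu:1 -l hostname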

[slurm-users] Efficiency of the profile influxdb plugin for graphing live job stats

2019-12-13 Thread Lech Nieroda
Hi, I’ve been tinkering with the acct_gather_profile/influxdb plugin a bit in order to visualize the CPU and memory usage of live jobs. Both the influxdb backend and Grafana dashboards seem like a perfect fit for our needs. I’ve run into an issue though and made a crude workaround for it, may…
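
For context (editorial sketch, not taken from Lech's message): the plugin is enabled roughly as below; the parameter names come from the slurm.conf/acct_gather.conf documentation, while the host, database and credentials are placeholders, and JobAcctGatherType is assumed to be configured already:

    # slurm.conf
    AcctGatherProfileType=acct_gather_profile/influxdb
    JobAcctGatherFrequency=task=30

    # acct_gather.conf
    ProfileInfluxDBHost=influx.example.org:8086
    ProfileInfluxDBDatabase=slurm_profile
    ProfileInfluxDBDefault=ALL
    ProfileInfluxDBUser=slurm
    ProfileInfluxDBPass=changeme

    # request task profiling for a job
    sbatch --profile=task job.sh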

Re: [slurm-users] srun: job steps and generic resources

2019-12-13 Thread Kraus, Sebastian
Dear Valantis, thanks for the explanation. But I have to correct you about the second alternate approach: srun -ppartition -N1 -n4 --gres=gpu:0 --time=00:30:00 --mem=1G -Jjobname --pty /bin/bash -il srun --gres=gpu:1 -l hostname Naturally, this does not work, and in consequence the "inner" srun …

Re: [slurm-users] [External] Is that possible to submit jobs to a Slurm cluster right from a developer's PC

2019-12-13 Thread Prentice Bisbal
Does Slurm provide an option to allow developers to submit jobs right from their own PCs? Yes. They just need to have the relevant Slurm packages installed, and the necessary configuration file(s). Prentice On 12/11/19 11:39 PM, Victor (Weikai) Xie wrote: Hi, We are trying to set up a tiny …
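
A rough sketch of the client-only setup Prentice describes; package names and paths vary by distribution, and the shared munge key assumes auth/munge (an assumption, not stated in the message):

    # on the developer's PC
    apt-get install slurm-client munge                          # or your distro's equivalent packages
    scp cluster:/etc/slurm/slurm.conf /etc/slurm/slurm.conf     # same slurm.conf as the cluster
    scp cluster:/etc/munge/munge.key /etc/munge/munge.key       # same key, owned by munge, mode 0400
    systemctl restart munge
    sbatch --wrap "hostname"                                    # submits straight to the cluster
    squeue -u $USER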

Re: [slurm-users] Maxjobs to accrue age priority points

2019-12-13 Thread Christopher Benjamin Coffey
Hey Chris, Thanks! Ya, my qos name is billybob for testing. I believe I was setting it right, but was not able to confirm it correctly. sacctmgr update qos name=billybob set maxjobsaccrueperuser=8 -i [ddd@radar ~ ]$ sacctmgr show qos where name=billybob format=MaxJobsAccruePerUser MaxJobsAccruePU…

Re: [slurm-users] srun: job steps and generic resources

2019-12-13 Thread Brian W. Johanson
The gres resource is allocated by the first srun; the second srun is waiting for the gres allocation to be free. If you were to replace that second srun with 'srun -l --gres=gpu:0 hostname' it will complete, but it will not have access to the GPUs. You can use salloc instead of the srun to create the allocation …
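
A sketch of the salloc approach Brian describes, reusing the resource values from Sebastian's srun lines elsewhere in this thread (partition and GPU count are assumptions):

    salloc -p partition -N1 -n4 --gres=gpu:1 --time=00:30:00 --mem=1G -J jobname
    srun --gres=gpu:1 -l hostname    # runs immediately: no other step is holding the GPU
    srun -l --gres=gpu:0 hostname    # also runs, but without access to the GPUs (as noted above)
    exit                             # release the allocation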

[slurm-users] srun: job steps and generic resources

2019-12-13 Thread Kraus, Sebastian
Dear all, I am facing the following nasty problem. I usually start interactive batch jobs via: srun -ppartition -N1 -n4 --time=00:30:00 --mem=1G -Jjobname --pty /bin/bash -il Then, explicitly starting a job step within such a session via: srun -l hostname works fine. But as soon as I add a generic resource …

Re: [slurm-users] sched

2019-12-13 Thread Steve Brasier
Thanks Alex - that is mostly how I understand it too. However, my understanding from the docs (and the GCP example actually) is that the cluster isn't reconfigured in the sense of rewriting slurm.conf and restarting the daemons (i.e. how you might manually resize a cluster); it's just that nodes are marked …
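
For reference (editorial sketch, not from Steve's message): the power-saving/cloud mechanism being discussed is driven by slurm.conf entries along these lines, with site-specific scripts and node definitions as placeholders; Slurm then only flips the state of such nodes (sinfo shows a "~" suffix while they are powered down) rather than rewriting slurm.conf:

    SuspendProgram=/usr/local/sbin/suspend_node.sh    # site script: stop/destroy the cloud instance
    ResumeProgram=/usr/local/sbin/resume_node.sh      # site script: create/start it again
    SuspendTime=300
    ResumeTimeout=600
    NodeName=cloud[001-010] CPUs=8 RealMemory=30000 State=CLOUD
    PartitionName=cloud Nodes=cloud[001-010] MaxTime=24:00:00 State=UP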