Re: [slurm-users] NoDecay on accounts (or on GrpTRESMins in general)

2020-11-23 Thread Yair Yarom
On Fri, Nov 20, 2020 at 12:11 AM Sebastian T Smith wrote:

> Hi,
>
> We're setting GrpTRESMins on the account association and have NoDecay
> QOSes for different user classes. All user associations with a
> GrpTRESMins-limited account are assigned a NoDecay QOS. I'm not sure if
> it's a better approach... but it's an option.
>

If I follow correctly, your GrpTRESMins usage on the accounts will still
get decayed. From tests I ran here, when running with a NoDecay QOS the
GrpTRESMins usage of the account still decays, while that of the QOS
doesn't.
So do you also have GrpTRESMins set on the QOS itself? And if so, why do
you need it both on the QOS and on the account? Or am I missing something?
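
For reference, this is the kind of setup I'm describing (a sketch only;
the account name, QOS name, and cpu limit below are made up):

$ sacctmgr modify account where name=myacct set GrpTRESMins=cpu=100000
$ sacctmgr modify qos where name=nodecay_qos set Flags=NoDecay GrpTRESMins=cpu=100000

In my tests, usage counted against the account limit still decays with
PriorityDecayHalfLife, while usage counted against the QOS limit does not.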

Thanks,
Yair.


[slurm-users] MinJobAge

2020-11-23 Thread Brian Andrus

All,

I always thought that MinJobAge controlled how long a finished job would
continue to show up when doing 'squeue'.

That does not seem to be the case for me.

I have MinJobAge=900, but if I do 'squeue --me' as soon as I finish an
interactive job, there is nothing in the queue.

I swear I used to see jobs in a completed state for a period of time,
but they are not showing up at all on our cluster.

How does one get completed jobs to show up?
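
For reference, this is roughly how I'm checking (a sketch; the first
command confirms the live setting, and --states=all should list jobs in
every state, completed included):

$ scontrol show config | grep -i MinJobAge
MinJobAge               = 900
$ squeue --me --states=all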


Brian Andrus





[slurm-users] Simultaneously running multiple jobs on same node

2020-11-23 Thread Jan van der Laan

Hi,

I am having issues getting Slurm to run multiple jobs in parallel on the
same machine.

Most of our jobs are either (relatively) low on CPU and high on memory
(data processing) or low on memory and high on CPU (simulations). The
server we have is generally big enough (256 GB memory; 16 cores) to
accommodate multiple jobs running at the same time, and we would like to
use Slurm to schedule these jobs. However, testing on a small (4 CPU)
Amazon server, I am unable to get this working. As far as I know, I have
to use `SelectType=select/cons_res` and
`SelectTypeParameters=CR_CPU_Memory`. However, when I start multiple
jobs that each use a single CPU, they run sequentially rather than in
parallel.


My `slurm.conf`

===
ControlMachine=ip-172-31-37-52

MpiDefault=none
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
SlurmUser=slurm
StateSaveLocation=/var/lib/slurm-llnl/slurmctld
SwitchType=switch/none
TaskPlugin=task/none

# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory

# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
ClusterName=cluster
JobAcctGatherType=jobacct_gather/none
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log

# COMPUTE NODES
NodeName=ip-172-31-37-52 CPUs=4 RealMemory=7860 CoresPerSocket=2 ThreadsPerCore=2 State=UNKNOWN
PartitionName=test Nodes=ip-172-31-37-52 Default=YES MaxTime=INFINITE State=UP
===

`job.sh`
===
#!/bin/bash
sleep 30
env
===

Output when running jobs:
===
ubuntu@ip-172-31-37-52:~$ sbatch -n1 -N1 job.sh
Submitted batch job 2
ubuntu@ip-172-31-37-52:~$ sbatch -n1 -N1 job.sh
Submitted batch job 3
ubuntu@ip-172-31-37-52:~$ sbatch -n1 -N1 job.sh
Submitted batch job 4
ubuntu@ip-172-31-37-52:~$ sbatch -n1 -N1 job.sh
Submitted batch job 5
ubuntu@ip-172-31-37-52:~$ sbatch -n1 -N1 job.sh
Submitted batch job 6
ubuntu@ip-172-31-37-52:~$ sbatch -n1 -N1 job.sh
Submitted batch job 7
ubuntu@ip-172-31-37-52:~$ squeue
 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
     3      test   job.sh   ubuntu PD       0:00      1 (Resources)
     4      test   job.sh   ubuntu PD       0:00      1 (Priority)
     5      test   job.sh   ubuntu PD       0:00      1 (Priority)
     6      test   job.sh   ubuntu PD       0:00      1 (Priority)
     7      test   job.sh   ubuntu PD       0:00      1 (Priority)
     2      test   job.sh   ubuntu  R       0:03      1 ip-172-31-37-52
===

The jobs run sequentially, while in principle it should be possible to
run four of them in parallel. I am probably missing something simple.
How do I get this to work?
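
For reference, in case it helps to diagnose this, the memory request of
one of the queued jobs can be inspected like this (a sketch, using job
id 3 from the squeue output above):

$ scontrol show job 3 | grep -i mem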


Best,
Jan



Re: [slurm-users] Simultaneously running multiple jobs on same node

2020-11-23 Thread Alex Chekholko
Hi,

Your job does not request any specific amount of memory, so it gets the
default request.  I believe the default request is all the RAM in the node.

Try something like:
$ scontrol show config | grep -i defmem
DefMemPerNode   = 64000
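
If the default does turn out to be the whole node's memory, two possible
ways around it (a sketch; the 1G request and the 1965 MB per-CPU default
are only examples, the latter being your RealMemory=7860 divided over 4
CPUs):

$ sbatch -n1 -N1 --mem=1G job.sh

or, in slurm.conf:

DefMemPerCPU=1965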

Regards,
Alex


On Mon, Nov 23, 2020 at 12:33 PM Jan van der Laan wrote:

> Hi,
>
> I am having issues getting Slurm to run multiple jobs in parallel on the
> same machine.
>
> [...]