Hi Alex,

Thanks a lot. I suspected it was something trivial.

ubuntu@ip-172-31-12-211:~$ scontrol show config | grep -i defmem
DefMemPerNode           = UNLIMITED


Specifying `sbatch --mem=1M job.sh` works. I will probably set a default value in slurm.conf (just tried; that also helps).
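
For reference, a minimal sketch of what that default could look like in slurm.conf (the value below is only an illustration for our 4-CPU test node; DefMemPerCPU would be the per-CPU alternative):

===
# Default memory request in MB for jobs that do not pass --mem,
# so a single job no longer claims all of the node's RAM.
DefMemPerNode=1900
===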

Best,
Jan

On 23-11-2020 22:15, Alex Chekholko wrote:
Hi,

Your job does not request any specific amount of memory, so it gets the default request.  I believe the default request is all the RAM in the node.

Try something like:
$ scontrol show config | grep -i defmem
DefMemPerNode           = 64000
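
You can also override the default per job with an explicit memory request, e.g. (1G is just an example value):

$ sbatch --mem=1G job.sh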

Regards,
Alex


On Mon, Nov 23, 2020 at 12:33 PM Jan van der Laan <sl...@eoos.dds.nl> wrote:

    Hi,

    I am having issues getting slurm to run multiple jobs in parallel on
    the same machine.

    Most of our jobs are either (relatively) low on CPU and high on
    memory (data processing) or low on memory and high on CPU
    (simulations). The server we have is generally big enough (256 GB
    memory; 16 cores) to accommodate multiple jobs running at the same
    time, and we would like to use slurm to schedule these jobs.
    However, testing on a small (4 CPU) Amazon server, I am unable to
    get this working. As far as I know, I would have to use
    `SelectType=select/cons_res` and
    `SelectTypeParameters=CR_CPU_Memory`. However, when I start multiple
    jobs that each use a single CPU, they run sequentially rather than
    in parallel.

    My `slurm.conf`

    ===
    ControlMachine=ip-172-31-37-52

    MpiDefault=none
    ProctrackType=proctrack/pgid
    ReturnToService=1
    SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
    SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
    SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
    SlurmUser=slurm
    StateSaveLocation=/var/lib/slurm-llnl/slurmctld
    SwitchType=switch/none
    TaskPlugin=task/none

    # SCHEDULING
    FastSchedule=1
    SchedulerType=sched/backfill
    SelectType=select/cons_res
    SelectTypeParameters=CR_CPU_Memory

    # LOGGING AND ACCOUNTING
    AccountingStorageType=accounting_storage/none
    ClusterName=cluster
    JobAcctGatherType=jobacct_gather/none
    SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
    SlurmdLogFile=/var/log/slurm-llnl/slurmd.log

    # COMPUTE NODES
    NodeName=ip-172-31-37-52 CPUs=4 RealMemory=7860 CoresPerSocket=2 ThreadsPerCore=2 State=UNKNOWN
    PartitionName=test Nodes=ip-172-31-37-52 Default=YES MaxTime=INFINITE State=UP
    ===

    `job.sh`
    ===
    #!/bin/bash
    sleep 30
    env
    ===

    Output when running jobs:
    ===
    ubuntu@ip-172-31-37-52:~$ sbatch -n1 -N1 job.sh
    Submitted batch job 2
    ubuntu@ip-172-31-37-52:~$ sbatch -n1 -N1 job.sh
    Submitted batch job 3
    ubuntu@ip-172-31-37-52:~$ sbatch -n1 -N1 job.sh
    Submitted batch job 4
    ubuntu@ip-172-31-37-52:~$ sbatch -n1 -N1 job.sh
    Submitted batch job 5
    ubuntu@ip-172-31-37-52:~$ sbatch -n1 -N1 job.sh
    Submitted batch job 6
    ubuntu@ip-172-31-37-52:~$ sbatch -n1 -N1 job.sh
    Submitted batch job 7
    ubuntu@ip-172-31-37-52:~$ squeue
                 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                     3      test   job.sh   ubuntu PD       0:00      1 (Resources)
                     4      test   job.sh   ubuntu PD       0:00      1 (Priority)
                     5      test   job.sh   ubuntu PD       0:00      1 (Priority)
                     6      test   job.sh   ubuntu PD       0:00      1 (Priority)
                     7      test   job.sh   ubuntu PD       0:00      1 (Priority)
                     2      test   job.sh   ubuntu  R       0:03      1 ip-172-31-37-52
    ===

    The jobs are run sequentially, while in principle it should be possible
    to run 4 jobs in parallel. I am probably missing something simple. How
    do I get this to work?

    Best,
    Jan
