Hello Durai,

You did not specify the amount of memory in your node configuration.

Perhaps it defaults to 1 MB, and so your 1 MB job already uses all the memory that the scheduler thinks the node has...?

What does "scontrol show node slurm-gpu-1" say? Look for the "RealMemory" field in the output.

Best,
Christoph


On 26/08/2020 11.35, Durai Arasan wrote:
Hello,

this is my node configuration:

NodeName=slurm-gpu-1 NodeAddr=192.168.0.200  Procs=16 Gres=gpu:2 State=UNKNOWN
NodeName=slurm-gpu-2 NodeAddr=192.168.0.124  Procs=1 Gres=gpu:0 State=UNKNOWN
PartitionName=gpu Nodes=slurm-gpu-1 Default=NO MaxTime=INFINITE AllowAccounts=whitelist,gpu_users State=UP
PartitionName=compute Nodes=slurm-gpu-1,slurm-gpu-2 Default=YES MaxTime=INFINITE AllowAccounts=whitelist State=UP


And this is one of the job scripts. You can see that --mem is set to 1M, so it is very minimal.

#!/bin/bash
#SBATCH -J Test1
#SBATCH --nodelist=slurm-gpu-1
#SBATCH --mem=1M
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH -o /home/centos/Test1-%j.out
#SBATCH -e /home/centos/Test1-%j.err
srun sleep 60

Thanks,
Durai

On Wed, Aug 26, 2020 at 2:49 AM Jacqueline Scoggins <jscogg...@lbl.gov> wrote:

    What is the OverSubscribe variable set to for your partitions?
    By default OverSubscribe=NO, which means that none of your cores will
    be shared with other jobs. With OverSubscribe set to YES or FORCE you
    can append a count after FORCE to set the number of jobs that can run
    on each core of each node in the partition (see the sketch below).
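
    As a rough sketch (the count of 4 is only an illustration, not a
    recommendation), the partition definition in slurm.conf could look like:

    PartitionName=compute Nodes=slurm-gpu-1,slurm-gpu-2 Default=YES MaxTime=INFINITE AllowAccounts=whitelist OverSubscribe=FORCE:4 State=UP

    With FORCE:4, each core can be allocated to up to 4 jobs at once.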
    Look at this page for a better understanding:
    
    https://slurm.schedmd.com/cons_res_share.html

    You can also check OverSubscribe on a partition using the sinfo -o
    "%h" format option:
    sinfo -o '%P %.5a %.10h %N ' | head

    PARTITION AVAIL OVERSUBSCR NODELIST


    Look at the sinfo options for further details.


    Jackie


    On Tue, Aug 25, 2020 at 9:58 AM Durai Arasan <arasan.du...@gmail.com> wrote:

        Hello,

        On our cluster we have SelectTypeParameters set to "CR_Core_Memory".

        Under these conditions multiple jobs should be able to run on
        the same node. But they refuse to be allocated to the same node:
        only one job runs on the node and the rest of the jobs stay in
        pending state.

        When we changed SelectTypeParameters to "CR_Core", however, this
        issue was resolved and multiple jobs were successfully allocated
        to the same node and ran concurrently.

        Does anyone know why such behavior is seen? Why does including
        memory as consumable resource lead to node exclusive behavior?

        Thanks,
        Durai


--
Dr. Christoph Brüning
Universität Würzburg
Rechenzentrum
Am Hubland
D-97074 Würzburg
Tel.: +49 931 31-80499
