Chris

Upon further testing this morning I see the job is assigned two different
job IDs, something I wasn't expecting.  That led me to think the output
was incorrect.

scontrol on a heterogeneous (pack) job shows multiple job IDs, one per
component, so the output just wasn't what I was expecting.
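
For anyone who finds this thread later, the submission was roughly of the
following form.  This is a reconstruction from the allocation shown below
rather than the exact check_nodes.sbatch, so treat the resource numbers as
illustrative; "packjob" is the component separator used by the pack-job
releases of Slurm:

    #!/bin/bash
    #SBATCH --job-name=CHECK_NODE
    #SBATCH --account=arcc
    #SBATCH --time=01:00:00
    # component 0: a single task on the hugemem partition
    #SBATCH --partition=teton-hugemem
    #SBATCH --nodes=1 --ntasks=1
    #SBATCH packjob
    # component 1: nine full nodes on the regular teton partition
    #SBATCH --partition=teton
    #SBATCH --nodes=9 --ntasks-per-node=32

Each component then gets its own JobId (2611773 and 2611774 here), tied
together by the PackJobId / PackJobIdSet fields in the scontrol output.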

Jeff

[jrlang@tlog1 TEST_CODE]$ sbatch check_nodes.sbatch
Submitted batch job 2611773
[jrlang@tlog1 TEST_CODE]$ squeue | grep jrlang
         2611773+1     teton CHECK_NO   jrlang  R       0:10      9 t[439-447]
         2611773+0 teton-hug CHECK_NO   jrlang  R       0:10      1 thm03
[jrlang@tlog1 TEST_CODE]$ pestat | grep jrlang
    t439           teton    alloc  32  32    0.02*   128000   119594  2611774 jrlang
    t440           teton    alloc  32  32    0.02*   128000   119542  2611774 jrlang
    t441           teton    alloc  32  32    0.01*   128000   119760  2611774 jrlang
    t442           teton    alloc  32  32    0.01*   128000   121491  2611774 jrlang
    t443           teton    alloc  32  32    0.02*   128000   119893  2611774 jrlang
    t444           teton    alloc  32  32    0.02*   128000   119607  2611774 jrlang
    t445           teton    alloc  32  32    0.03*   128000   119626  2611774 jrlang
    t446           teton    alloc  32  32    0.01*   128000   119882  2611774 jrlang
    t447           teton    alloc  32  32    0.01*   128000   120037  2611774 jrlang
   thm03   teton-hugemem      mix   1  32    0.01*  1024000  1017845  2611773 jrlang
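
The per-component JobIds are what pestat is reporting there (2611774 for the
teton piece, 2611773 for the hugemem piece).  Asking scontrol about a
component directly, e.g. something like

    scontrol show job 2611774 | grep -i pack

should print the PackJobId=2611773 / PackJobIdSet=2611773-2611774 fields
visible in the full listing below.
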
[jrlang@tlog1 TEST_CODE]$ scontrol show job 2611773
JobId=2611773 PackJobId=2611773 PackJobOffset=0 JobName=CHECK_NODE
   PackJobIdSet=2611773-2611774
   UserId=jrlang(10024903) GroupId=jrlang(10024903) MCS_label=N/A
   Priority=1004 Nice=0 Account=arcc QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:01:59 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2019-04-24T09:03:00 EligibleTime=2019-04-24T09:03:00
   AccrueTime=2019-04-24T09:03:00
   StartTime=2019-04-24T09:03:20 EndTime=2019-04-24T10:03:20 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2019-04-24T09:03:20
   Partition=teton-hugemem AllocNode:Sid=tlog1:24498
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=thm03
   BatchHost=thm03
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=1000M,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=1000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/pfs/tsfs1/home/jrlang/TEST_CODE/check_nodes.sbatch
   WorkDir=/pfs/tsfs1/home/jrlang/TEST_CODE
   StdErr=/pfs/tsfs1/home/jrlang/TEST_CODE/slurm-2611773.out
   StdIn=/dev/null
   StdOut=/pfs/tsfs1/home/jrlang/TEST_CODE/slurm-2611773.out
   Power=

JobId=2611774 PackJobId=2611773 PackJobOffset=1 JobName=CHECK_NODE
   PackJobIdSet=2611773-2611774
   UserId=jrlang(10024903) GroupId=jrlang(10024903) MCS_label=N/A
   Priority=1086 Nice=0 Account=arcc QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:01:59 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2019-04-24T09:03:00 EligibleTime=2019-04-24T09:03:00
   AccrueTime=2019-04-24T09:03:00
   StartTime=2019-04-24T09:03:20 EndTime=2019-04-24T10:03:20 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2019-04-24T09:03:20
   Partition=teton AllocNode:Sid=tlog1:24498
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=t[439-447]
   BatchHost=t439
   NumNodes=9 NumCPUs=288 NumTasks=288 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=288,mem=288000M,node=9,billing=288
   Socks/Node=* NtasksPerN:B:S:C=32:0:*:* CoreSpec=*
   MinCPUsNode=32 MinMemoryCPU=1000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/pfs/tsfs1/home/jrlang/TEST_CODE/check_nodes.sbatch
   WorkDir=/pfs/tsfs1/home/jrlang/TEST_CODE
   StdErr=/pfs/tsfs1/home/jrlang/TEST_CODE/slurm-2611774.out
   StdIn=/dev/null
   StdOut=/pfs/tsfs1/home/jrlang/TEST_CODE/slurm-2611774.out
   Power=
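
One thing that may save the next person some time: inside the batch script
the two components can be addressed individually with srun's pack-group
option.  A minimal sketch (option name as in the pack-job releases of Slurm;
worth double-checking against the srun man page for the version installed
here):

    # step on the single hugemem node (component offset 0)
    srun --pack-group=0 hostname

    # step across the nine teton nodes (component offset 1)
    srun --pack-group=1 hostname

    # a list such as 0,1 should launch one step spanning both components
    # (see the heterogeneous-jobs section of the srun docs)
    srun --pack-group=0,1 hostname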


-----Original Message-----
From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Chris 
Samuel
Sent: Tuesday, April 23, 2019 7:39 PM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] scontrol for a heterogenous job appears incorrect

On 23/4/19 3:02 pm, Jeffrey R. Lang wrote:

> Looking at the nodelist and the NumNodes they are both incorrect.   They
> should show the first node and then the additional nodes assigned.

You're only looking at the second of the two pack jobs for your submission;
could they be assigned in the first of the two instead?

All the best,
Chris
--
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA
