After you restart slurmctld, do "scontrol reconfigure".
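For example, a minimal sketch of the sequence (assuming systemd-managed daemons and that the nodes are still drained with the "Low RealMemory" reason shown later in this thread):

systemctl restart slurmctld                             # on the head node
scontrol reconfigure                                    # tell running daemons to re-read slurm.conf
scontrol update NodeName=node[001-003] State=RESUME     # clear the DRAIN state once the config is consistent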
Brian Andrus
On 8/30/2019 6:57 AM, Robert Kudyba wrote:
I had set RealMemory to a really high number because I misinterpreted the recommendation.
NodeName=node[001-003] CoresPerSocket=12 RealMemory=196489092 Sockets=2 Gres=gpu:1
But now I set it to:
RealMemory=191000
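(For what it's worth, a quick way to see the value slurmd itself detects is to run slurmd -C on one of the compute nodes; the output below is illustrative only, not taken from these nodes:)

slurmd -C
# NodeName=node001 CPUs=24 Boards=1 SocketsPerBoard=2 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=191840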
I restarted slurmctld. And according to the Bright Cluster support team:
"Unless it has been overridden in the image, the nodes will have a
symlink directly to the slurm.conf on the head node. This means that
any changes made to the file on the head node will automatically be
available to the compute nodes. All they would need in that case is to
have slurmd restarted"
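A quick sanity check of that claim might look like this (a sketch, assuming pdsh and systemd-managed slurmd, as used elsewhere in this thread):

pdsh -w node00[1-3] "ls -l /etc/slurm/slurm.conf"    # confirm the symlink is what Bright describes
pdsh -w node00[1-3] "systemctl restart slurmd"       # restart slurmd on the compute nodes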
But now I see these errors:
mcs: MCSParameters = (null). ondemand set.
[2019-08-30T09:22:41.700] error: Node node001 appears to have a
different slurm.conf than the slurmctld. This could cause issues with
communication and functionality. Please review both files and make
sure they are the same. If this is expected ignore, and set
DebugFlags=NO_CONF_HASH in your slurm.conf.
[2019-08-30T09:22:41.700] error: Node node002 appears to have a
different slurm.conf than the slurmctld. This could cause issues with
communication and functionality. Please review both files and make
sure they are the same. If this is expected ignore, and set
DebugFlags=NO_CONF_HASH in your slurm.conf.
[2019-08-30T09:22:41.701] error: Node node003 appears to have a
different slurm.conf than the slurmctld. This could cause issues with
communication and functionality. Please review both files and make
sure they are the same. If this is expected ignore, and set
DebugFlags=NO_CONF_HASH in your slurm.conf.
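One way to check whether the files genuinely differ, rather than silencing the warning with NO_CONF_HASH, is to compare checksums (a sketch, assuming pdsh):

md5sum /etc/slurm/slurm.conf                         # on the head node
pdsh -w node00[1-3] "md5sum /etc/slurm/slurm.conf"   # on the compute nodes; all sums should match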
[2019-08-30T09:23:16.347] update_node: node node001 state set to IDLE
[2019-08-30T09:23:16.347] got (nil)
[2019-08-30T09:23:16.766]
SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2019-08-30T09:23:19.082] update_node: node node002 state set to IDLE
[2019-08-30T09:23:19.082] got (nil)
[2019-08-30T09:23:20.929] update_node: node node003 state set to IDLE
[2019-08-30T09:23:20.929] got (nil)
[2019-08-30T09:45:46.314] _slurm_rpc_submit_batch_job: JobId=449
InitPrio=4294901759 usec=355
[2019-08-30T09:45:46.430] sched: Allocate JobID=449
NodeList=node[001-003] #CPUs=30 Partition=defq
[2019-08-30T09:45:46.670] prolog_running_decr: Configuration for
JobID=449 is complete
[2019-08-30T09:45:46.772] _job_complete: JobID=449 State=0x1 NodeCnt=3
WEXITSTATUS 127
[2019-08-30T09:45:46.772] _job_complete: JobID=449 State=0x8005
NodeCnt=3 done
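An exit status of 127 from a shell normally means "command not found", so one possible check is whether mpirun and hello resolve on the nodes (a sketch; assumes the module command is set up by a login shell):

pdsh -w node00[1-3] 'bash -lc "module add shared openmpi/gcc/64/1.10.7 slurm; which mpirun hello"'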
Is this another option that needs to be set?
On Thu, Aug 29, 2019 at 3:27 PM Alex Chekholko <a...@calicolabs.com
<mailto:a...@calicolabs.com>> wrote:
Sounds like maybe you didn't correctly roll out / update your
slurm.conf everywhere, since your RealMemory value is back to the
large, wrong number. You need to update your slurm.conf everywhere
and restart all the Slurm daemons.
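Roughly, a sketch of that (assuming the file is not already symlinked from the head node, and that pdcp/pdsh and systemd are available):

pdcp -w node00[1-3] /etc/slurm/slurm.conf /etc/slurm/slurm.conf   # copy the updated file to the nodes
pdsh -w node00[1-3] "systemctl restart slurmd"                    # restart slurmd on every node
systemctl restart slurmctld                                       # restart slurmctld on the head node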
I recommend the "safe procedure" from here:
https://wiki.fysik.dtu.dk/niflheim/SLURM#add-and-remove-nodes
Your Bright manual may have a similar process for updating SLURM
config "the Bright way".
On Thu, Aug 29, 2019 at 12:20 PM Robert Kudyba
<rkud...@fordham.edu <mailto:rkud...@fordham.edu>> wrote:
I thought I had taken care of this a while back, but it appears
the issue has returned. A very simple sbatch script, slurmhello.sh:
cat slurmhello.sh
#!/bin/sh
#SBATCH -o my.stdout
#SBATCH -N 3
#SBATCH --ntasks=16
module add shared openmpi/gcc/64/1.10.7 slurm
mpirun hello
sbatch slurmhello.sh
Submitted batch job 419
squeue
JOBID PARTITION     NAME USER ST  TIME NODES NODELIST(REASON)
  419      defq slurmhel root PD  0:00     3 (Resources)
In /etc/slurm/slurm.conf:
# Nodes
NodeName=node[001-003] CoresPerSocket=12 RealMemory=196489092
Sockets=2 Gres=gpu:1
Logs show:
[2019-08-29T14:24:40.025] error: _slurm_rpc_node_registration
node=node001: Invalid argument
[2019-08-29T14:24:40.025] error: Node node002 has low
real_memory size (191840 < 196489092)
[2019-08-29T14:24:40.025] error: _slurm_rpc_node_registration
node=node002: Invalid argument
[2019-08-29T14:24:40.026] error: Node node003 has low
real_memory size (191840 < 196489092)
[2019-08-29T14:24:40.026] error: _slurm_rpc_node_registration
node=node003: Invalid argument
scontrol show jobid -dd 419
JobId=419 JobName=slurmhello.sh
UserId=root(0) GroupId=root(0) MCS_label=N/A
Priority=4294901759 Nice=0 Account=root QOS=normal
JobState=PENDING Reason=Resources Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
DerivedExitCode=0:0
RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2019-08-28T09:54:22 EligibleTime=2019-08-28T09:54:22
StartTime=Unknown EndTime=Unknown Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
LastSchedEval=2019-08-28T09:57:22
Partition=defq AllocNode:Sid=ourcluster:194152
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
NumNodes=3-3 NumCPUs=16 NumTasks=16 CPUs/Task=1
ReqB:S:C:T=0:0:*:*
TRES=cpu=16,node=3
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
Gres=(null) Reservation=(null)
OverSubscribe=YES Contiguous=0 Licenses=(null) Network=(null)
Command=/root/slurmhello.sh
WorkDir=/root
StdErr=/root/my.stdout
StdIn=/dev/null
StdOut=/root/my.stdout
Power=
scontrol show nodes node001
NodeName=node001 Arch=x86_64 CoresPerSocket=12
CPUAlloc=0 CPUErr=0 CPUTot=24 CPULoad=0.06
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=gpu:1
NodeAddr=node001 NodeHostName=node001 Version=17.11
OS=Linux 3.10.0-862.2.3.el7.x86_64 #1 SMP Wed May 9
18:05:47 UTC 2018
RealMemory=196489092 AllocMem=0 FreeMem=99923 Sockets=2
Boards=1
State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1
Owner=N/A MCS_label=N/A
Partitions=defq
BootTime=2019-07-18T12:08:41
SlurmdStartTime=2019-07-18T12:09:44
CfgTRES=cpu=24,mem=196489092M,billing=24
AllocTRES=
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Reason=Low RealMemory [slurm@2019-07-18T10:17:24]
[root@ciscluster ~]# scontrol show nodes| grep -i mem
RealMemory=196489092 AllocMem=0 FreeMem=99923 Sockets=2
Boards=1
CfgTRES=cpu=24,mem=196489092M,billing=24
Reason=Low RealMemory [slurm@2019-07-18T10:17:24]
RealMemory=196489092 AllocMem=0 FreeMem=180969 Sockets=2
Boards=1
CfgTRES=cpu=24,mem=196489092M,billing=24
Reason=Low RealMemory [slurm@2019-07-18T10:17:24]
RealMemory=196489092 AllocMem=0 FreeMem=178999 Sockets=2
Boards=1
CfgTRES=cpu=24,mem=196489092M,billing=24
Reason=Low RealMemory [slurm@2019-07-18T10:17:24]
sinfo -R
REASON USER TIMESTAMP NODELIST
Low RealMemory slurm 2019-07-18T10:17:24 node[001-003]
sinfo -N
NODELIST NODES PARTITION STATE
node001 1 defq* drain
node002 1 defq* drain
node003 1 defq* drain
pdsh -w node00[1-3] "lscpu | grep -iE 'socket|core'"
node002: Thread(s) per core: 1
node002: Core(s) per socket: 12
node002: Socket(s): 2
node001: Thread(s) per core: 1
node001: Core(s) per socket: 12
node001: Socket(s): 2
node003: Thread(s) per core: 2
node003: Core(s) per socket: 12
node003: Socket(s): 2
scontrol show nodes| grep -i mem
RealMemory=196489092 AllocMem=0 FreeMem=100054 Sockets=2
Boards=1
CfgTRES=cpu=24,mem=196489092M,billing=24
Reason=Low RealMemory [slurm@2019-07-18T10:17:24]
RealMemory=196489092 AllocMem=0 FreeMem=181101 Sockets=2
Boards=1
CfgTRES=cpu=24,mem=196489092M,billing=24
Reason=Low RealMemory [slurm@2019-07-18T10:17:24]
RealMemory=196489092 AllocMem=0 FreeMem=179004 Sockets=2
Boards=1
CfgTRES=cpu=24,mem=196489092M,billing=24
Reason=Low RealMemory
Does anything look off?