Has anyone else encountered a problem where slurmctld crashes when a job's hold is released?

I am wondering if there is something unique to our configurations that is 
leading to this crash.

Here is what I have found so far:

There appears to be a bug in slurmctld: placing a hold on a job and then
releasing the hold causes slurmctld to core dump due to an arithmetic
exception.

Version of slurm:

    hpc-sched2# rpm -q --info slurm
    Name        : slurm
    Version     : 17.11.6
    Release     : 1usc.el7.centos
    Architecture: x86_64


To reproduce this error:

  $ sbatch --hold printenv.BATCH
  Submitted batch job 934654


The specs for the job show:

  $ scontrol show job 934654
  JobId=934654 JobName=printenv.BATCH
     UserId=avalonjo(...) GroupId=... MCS_label=N/A
     Priority=0 Nice=0 Account=lc_hpcc QOS=lc_hpcc_maxcpumins
     JobState=PENDING Reason=JobHeldUser Dependency=(null)
     Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
     RunTime=00:00:00 TimeLimit=01:20:00 TimeMin=N/A
     SubmitTime=2018-06-07T17:21:21 EligibleTime=Unknown
     StartTime=Unknown EndTime=Unknown Deadline=N/A
     PreemptTime=None SuspendTime=None SecsPreSuspend=0
     LastSchedEval=2018-06-07T17:21:21
     Partition=main AllocNode:Sid=...:61228
     ReqNodeList=(null) ExcNodeList=(null)
     NodeList=(null)
     NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
     TRES=cpu=1,mem=1G,node=1
     Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
     MinCPUsNode=1 MinMemoryCPU=1G MinTmpDiskNode=0
     Features=(null) DelayBoot=00:00:00
     Gres=(null) Reservation=(null)
     OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
     Command=...../printenv.BATCH
     WorkDir=..../Infiniband
     StdErr=..../Infiniband/./OutputDir/%x.934654
     StdIn=/dev/null
     StdOut=.../Infiniband/./OutputDir/%x.934654
     Power=

Now release the job:

  $ scontrol release job 931432
  Invalid job id specified for job job
  slurm_suspend error: Invalid job id specified
  Unexpected message received for job 931432
  slurm_suspend error: Unexpected message received


At which point slurmctld core dumps. Using gdb to analyze the core file:


  # gdb /usr/sbin/slurmctld ./core.28720
  GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-100.el7_4.1
  Copyright (C) 2013 Free Software Foundation, Inc.

  [Thread debugging using libthread_db enabled]
  Using host libthread_db library "/lib64/libthread_db.so.1".
  Core was generated by `/usr/sbin/slurmctld'.
  Program terminated with signal 8, Arithmetic exception.
  #0 0x00000000004173ff in _validate_time_limit (time_limit_in=time_limit_in@entry=0x7f3338196568,
      part_max_time=part_max_time@entry=60, tres_req_cnt=0, max_limit=2000000000,
      out_max_limit=out_max_limit@entry=0x7f33380864e0, limit_set_time=limit_set_time@entry=0x7f3384ccb472,
      strict_checking=strict_checking@entry=true, is64=is64@entry=true) at acct_policy.c:1120
  #1 0x00000000004174b9 in _validate_tres_time_limits (tres_pos=tres_pos@entry=0x7f3384ccad14,
      time_limit_in=time_limit_in@entry=0x7f3338196568, part_max_time=60, job_tres_array=0x7f3384ccb3a8,
      max_tres_array=0xeeead0, out_max_tres_array=0x7f33380864e0,
      limit_set_time=limit_set_time@entry=0x7f3384ccb472,
      strict_checking=strict_checking@entry=true) at acct_policy.c:1174
  #2 0x0000000000418635 in _qos_policy_validate (job_desc=job_desc@entry=0x7f3338196370,
      assoc_ptr=assoc_ptr@entry=0x1287450, part_ptr=part_ptr@entry=0x1965410, qos_ptr=qos_ptr@entry=0xf06f80,
      qos_out_ptr=qos_out_ptr@entry=0x7f3384ccadf0, reason=reason@entry=0x0,
      acct_policy_limit_set=acct_policy_limit_set@entry=0x7f3384ccb470, update_call=update_call@entry=true,
      user_name=user_name@entry=0x1287610 "avalonjo", job_cnt=job_cnt@entry=1,
      strict_checking=strict_checking@entry=true) at acct_policy.c:1522
  #3 0x0000000000418da9 in _acct_policy_validate (job_desc=job_desc@entry=0x7f3338196370,
      part_ptr=part_ptr@entry=0x1965410, assoc_in=assoc_in@entry=0x1287450, qos_ptr_1=0xf065c0,
      qos_ptr_2=0xf06f80, reason=reason@entry=0x0,
      acct_policy_limit_set=acct_policy_limit_set@entry=0x7f3384ccb470,
      update_call=update_call@entry=true) at acct_policy.c:2660
  #4 0x000000000041b3fa in acct_policy_validate (job_desc=job_desc@entry=0x7f3338196370, part_ptr=0x1965410,
      assoc_in=0x1287450, qos_ptr=0xf06f80, reason=reason@entry=0x0,
      acct_policy_limit_set=acct_policy_limit_set@entry=0x7f3384ccb470,
      update_call=update_call@entry=true) at acct_policy.c:2976
  #5 0x0000000000457333 in _update_job (job_ptr=job_ptr@entry=0x2428270,
      job_specs=job_specs@entry=0x7f3338196370, uid=uid@entry=203387) at job_mgr.c:11717
  #6 0x000000000045ad71 in update_job_str (msg=msg@entry=0x7f3384ccbe50, uid=uid@entry=203387)
      at job_mgr.c:13447
  #7 0x000000000048da6c in _slurm_rpc_update_job (msg=0x7f3384ccbe50) at proc_req.c:4366
  ---Type <return> to continue, or q <return> to quit---
  #8 slurmctld_req (msg=msg@entry=0x7f3384ccbe50, arg=arg@entry=0x7f33b4029480) at proc_req.c:447
  #9 0x0000000000424f28 in _service_connection (arg=0x7f33b4029480) at controller.c:1125
  #10 0x00007f33d26c8e25 in start_thread () from /lib64/libpthread.so.0
  #11 0x00007f33d23f634d in clone () from /lib64/libc.so.6


This shows that it died in _validate_time_limit at acct_policy.c:1120.

_validate_time_limit has the following line:

                max_time_limit = (uint32_t)(max_limit / tres_req_cnt);


Using gdb to print max_limit and tres_req_cnt, we get:

  (gdb) p max_limit
  $8 = 2000000000
  (gdb) p tres_req_count
  No symbol "tres_req_count" in current context.
  (gdb) p tres_req_cnt
  $9 = 0

This results in the suspected divide by zero, which would explain the
arithmetic exception (signal 8).
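
As a rough illustration of the kind of guard that would avoid this crash, here
is a minimal C sketch. It is not Slurm's code and not a proposed patch:
derive_max_time_limit is a hypothetical helper, and only the names max_limit,
tres_req_cnt and max_time_limit are borrowed from the line and gdb output
above. It assumes that a zero count should be reported to the caller as "no
limit can be derived" instead of being used as a divisor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of Slurm): derive a per-job time limit
 * from a TRES-minutes style limit and the requested TRES count.
 *
 * Returns false, leaving *max_time_limit untouched, when tres_req_cnt
 * is 0, so the division is never attempted with a zero divisor.
 */
static bool derive_max_time_limit(uint64_t max_limit, uint64_t tres_req_cnt,
                                  uint32_t *max_time_limit)
{
    assert(max_time_limit != NULL);

    /* Integer division by zero raises SIGFPE on x86 Linux, so check first. */
    if (tres_req_cnt == 0)
        return false;

    uint64_t quotient = max_limit / tres_req_cnt;

    /* Clamp so the narrowing cast cannot silently wrap around. */
    *max_time_limit = (quotient > UINT32_MAX) ? UINT32_MAX : (uint32_t)quotient;
    return true;
}
```

With the values from the core file (max_limit = 2000000000, tres_req_cnt = 0)
this returns false without dividing, which leaves it to the caller to decide
how such a job should be handled.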


After tracing back, it appears that the value originated in the 'msg' variable in:

     src/slurmctld/controller.c

As shown by gdb:

  (gdb) frame 8
  #8 slurmctld_req (msg=msg@entry=0x7f3384ccbe50, arg=arg@entry=0x7f33b4029480) at proc_req.c:447


  (gdb) p ((job_desc_msg_t *) msg->data)->tres_req_cnt[0]
  $27 = 0


So tres_req_cnt[0] was already set to 0 in the incoming message.

Perhaps, since no one else has encountered this, it is somehow related to how we have Slurm configured, but nonetheless slurmctld probably shouldn't be dividing by zero.


Avalon Johnson

Systems Programmer
Information Technology Services
CAL 365-104B, University of Southern California
Los Angeles, California 90089-2812

e-mail: avalo...@usc.edu
        It takes a village ..."
