These log lines about the prolog script look very suspicious to me:

[2020-11-18T10:19:35.388] debug:  [job 110] attempting to run prolog [/cm/local/apps/cmd/scripts/prolog]

then

[2020-11-18T10:21:10.121] debug:  Waiting for job 110's prolog to complete
[2020-11-18T10:21:10.121] debug:  Finished wait for job 110's prolog to complete

Even after two minutes have passed, the logs are still reporting a wait on the prolog script. Could you check the prolog?
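A quick way to rule the prolog in or out is to run it by hand with a job-like environment and look at its exit status. This is only a rough sketch; the exact variables your cmd prolog expects may differ:

# Run the site prolog manually with a job-like environment and report its exit code.
# SLURM_JOB_ID/SLURM_JOB_UID values are taken from job 110 in the logs; adjust as needed.
sudo env SLURM_JOB_ID=110 SLURM_JOB_UID=0 bash -x /cm/local/apps/cmd/scripts/prolog
echo "prolog exit code: $?"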

Ahmet M.



On 18.11.2020 21:33, William Markuske wrote:

The epilog script does have exit 0 set at the end. Epilogs exit cleanly when run.

With the log level set to debug5 I get the following results for any scancel call.

Submit host slurmctld.log

[2020-11-18T10:19:34.944] _slurm_rpc_submit_batch_job: JobId=110 InitPrio=110503 usec=191
[2020-11-18T10:19:35.231] debug:  sched: Running job scheduler
[2020-11-18T10:19:35.231] sched: Allocate JobId=110 NodeList=martin-cn-01 #CPUs=1 Partition=martin-compute
[2020-11-18T10:19:35.387] prolog_running_decr: Configuration for JobId=110 is complete
[2020-11-18T10:19:39.769] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=110 uid 0
[2020-11-18T10:19:39.770] job_signal: 9 of running JobId=110 successful 0x8004
[2020-11-18T10:20:59.282] Resending TERMINATE_JOB request JobId=110 Nodelist=martin-cn-01
[2020-11-18T10:21:10.000] update_node: node martin-cn-01 reason set to: Kill task failed
[2020-11-18T10:21:10.000] update_node: node martin-cn-01 state set to DRAINING
[2020-11-18T10:21:10.001] error: slurmd error running JobId=110 on node(s)=martin-cn-01: Kill task failed
[2020-11-18T10:21:10.124] cleanup_completing: JobId=110 completion process took 91 seconds
Compute node slurmd.log

[2020-11-18T10:19:35.388] task_p_slurmd_batch_request: 110
[2020-11-18T10:19:35.388] task/affinity: job 110 CPU input mask for node: 0x00000000000000000000000000000003
[2020-11-18T10:19:35.388] task/affinity: job 110 CPU final HW mask for node: 0x00000000000000010000000000000001
[2020-11-18T10:19:35.388] debug:  [job 110] attempting to run prolog [/cm/local/apps/cmd/scripts/prolog]
[2020-11-18T10:19:35.392] _run_prolog: run job script took usec=4085
[2020-11-18T10:19:35.392] _run_prolog: prolog with lock for job 110 ran for 0 seconds
[2020-11-18T10:19:35.393] Launching batch job 110 for UID 0
[2020-11-18T10:19:35.398] debug:  AcctGatherEnergy NONE plugin loaded
[2020-11-18T10:19:35.398] debug:  AcctGatherProfile NONE plugin loaded
[2020-11-18T10:19:35.398] debug:  AcctGatherInterconnect NONE plugin loaded
[2020-11-18T10:19:35.398] debug:  AcctGatherFilesystem NONE plugin loaded
[2020-11-18T10:19:35.398] debug:  switch NONE plugin loaded
[2020-11-18T10:19:35.398] debug:  init: Gres GPU plugin loaded
[2020-11-18T10:19:35.400] [110.batch] Considering each NUMA node as a socket
[2020-11-18T10:19:35.400] [110.batch] debug:  CPUs:128 Boards:1 Sockets:8 CoresPerSocket:8 ThreadsPerCore:2
[2020-11-18T10:19:35.400] [110.batch] debug:  Job accounting gather cgroup plugin loaded
[2020-11-18T10:19:35.401] [110.batch] debug:  cont_id hasn't been set yet not running poll
[2020-11-18T10:19:35.401] [110.batch] debug:  laying out the 1 tasks on 1 hosts martin-cn-01 dist 2
[2020-11-18T10:19:35.401] [110.batch] debug:  Message thread started pid = 267029
[2020-11-18T10:19:35.401] [110.batch] task affinity plugin loaded with CPU mask 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ffffffffffffffffffffffffffffffff
[2020-11-18T10:19:35.402] [110.batch] debug:  task/cgroup: now constraining jobs allocated cores
[2020-11-18T10:19:35.402] [110.batch] debug: task/cgroup/memory: total:515440M allowed:100%(enforced), swap:0%(enforced), max:100%(515440M) max+swap:100%(1030880M) min:30M kmem:100%(515440M enforced) min:30M swappiness:100(set)
[2020-11-18T10:19:35.402] [110.batch] debug:  task/cgroup: now constraining jobs allocated memory
[2020-11-18T10:19:35.402] [110.batch] debug:  task/cgroup: unable to open /cm/shared/apps/slurm/var/etc/flamingo/cgroup_allowed_devices_file.conf: No such file or directory
[2020-11-18T10:19:35.402] [110.batch] debug:  task/cgroup: now constraining jobs allocated devices
[2020-11-18T10:19:35.402] [110.batch] debug:  task/cgroup: loaded
[2020-11-18T10:19:35.402] [110.batch] debug:  Checkpoint plugin loaded: checkpoint/none
[2020-11-18T10:19:35.402] [110.batch] Munge credential signature plugin loaded
[2020-11-18T10:19:35.402] [110.batch] debug:  job_container none plugin loaded
[2020-11-18T10:19:35.402] [110.batch] debug: xcgroup_instantiate: cgroup '/sys/fs/cgroup/freezer/slurm' already exists
[2020-11-18T10:19:35.402] [110.batch] debug: xcgroup_instantiate: cgroup '/sys/fs/cgroup/freezer/slurm/uid_0' already exists
[2020-11-18T10:19:35.403] [110.batch] debug:  spank: opening plugin stack /etc/slurm/plugstack.conf
[2020-11-18T10:19:35.403] [110.batch] debug: xcgroup_instantiate: cgroup '/sys/fs/cgroup/cpuset/slurm' already exists
[2020-11-18T10:19:35.403] [110.batch] debug:  task/cgroup: job abstract cores are '0'
[2020-11-18T10:19:35.403] [110.batch] debug:  task/cgroup: step abstract cores are '0'
[2020-11-18T10:19:35.403] [110.batch] debug:  task/cgroup: job physical cores are '0,64'
[2020-11-18T10:19:35.403] [110.batch] debug:  task/cgroup: step physical cores are '0,64'
[2020-11-18T10:19:35.403] [110.batch] debug: xcgroup_instantiate: cgroup '/sys/fs/cgroup/cpuset/slurm/uid_0' already exists
[2020-11-18T10:19:35.406] [110.batch] debug: xcgroup_instantiate: cgroup '/sys/fs/cgroup/memory/slurm' already exists
[2020-11-18T10:19:35.406] [110.batch] debug: xcgroup_instantiate: cgroup '/sys/fs/cgroup/memory/slurm/uid_0' already exists
[2020-11-18T10:19:35.406] [110.batch] debug: xcgroup_instantiate: cgroup '/sys/fs/cgroup/memory/slurm/uid_0/job_110' already exists
[2020-11-18T10:19:35.406] [110.batch] task/cgroup: /slurm/uid_0/job_110: alloc=4096MB mem.limit=4096MB memsw.limit=4096MB
[2020-11-18T10:19:35.406] [110.batch] debug: xcgroup_instantiate: cgroup '/sys/fs/cgroup/memory/slurm/uid_0/job_110/step_batch' already exists
[2020-11-18T10:19:35.406] [110.batch] task/cgroup: /slurm/uid_0/job_110/step_batch: alloc=4096MB mem.limit=4096MB memsw.limit=4096MB
[2020-11-18T10:19:35.406] [110.batch] debug: _oom_event_monitor: started.
[2020-11-18T10:19:35.406] [110.batch] debug: xcgroup_instantiate: cgroup '/sys/fs/cgroup/devices/slurm' already exists
[2020-11-18T10:19:35.406] [110.batch] debug: xcgroup_instantiate: cgroup '/sys/fs/cgroup/devices/slurm/uid_0' already exists
[2020-11-18T10:19:35.438] [110.batch] debug level = 2
[2020-11-18T10:19:35.438] [110.batch] starting 1 tasks
[2020-11-18T10:19:35.438] [110.batch] task 0 (267034) started 2020-11-18T10:19:35
[2020-11-18T10:19:35.438] [110.batch] debug:  Setting slurmstepd oom_adj to -1000
[2020-11-18T10:19:35.438] [110.batch] debug: jobacct_gather_cgroup_cpuacct_attach_task: jobid 110 stepid 4294967294 taskid 0 max_task_id 0
[2020-11-18T10:19:35.439] [110.batch] debug: xcgroup_instantiate: cgroup '/sys/fs/cgroup/cpuacct/slurm' already exists
[2020-11-18T10:19:35.439] [110.batch] debug: xcgroup_instantiate: cgroup '/sys/fs/cgroup/cpuacct/slurm/uid_0' already exists
[2020-11-18T10:19:35.439] [110.batch] debug: xcgroup_instantiate: cgroup '/sys/fs/cgroup/cpuacct/slurm/uid_0/job_110' already exists
[2020-11-18T10:19:35.439] [110.batch] debug: xcgroup_instantiate: cgroup '/sys/fs/cgroup/cpuacct/slurm/uid_0/job_110/step_batch' already exists
[2020-11-18T10:19:35.439] [110.batch] debug: jobacct_gather_cgroup_memory_attach_task: jobid 110 stepid 4294967294 taskid 0 max_task_id 0
[2020-11-18T10:19:35.439] [110.batch] debug: xcgroup_instantiate: cgroup '/sys/fs/cgroup/memory/slurm' already exists
[2020-11-18T10:19:35.439] [110.batch] debug: xcgroup_instantiate: cgroup '/sys/fs/cgroup/memory/slurm/uid_0' already exists
[2020-11-18T10:19:35.439] [110.batch] debug: xcgroup_instantiate: cgroup '/sys/fs/cgroup/memory/slurm/uid_0/job_110' already exists
[2020-11-18T10:19:35.439] [110.batch] debug: xcgroup_instantiate: cgroup '/sys/fs/cgroup/memory/slurm/uid_0/job_110/step_batch' already exists
[2020-11-18T10:19:35.439] [110.batch] debug: xcgroup_instantiate: cgroup '/sys/fs/cgroup/memory/slurm/uid_0/job_110/step_batch/task_0' already exists
[2020-11-18T10:19:35.440] [110.batch] debug:  task_p_pre_launch: affinity jobid 110.4294967294, task:0 bind:256
[2020-11-18T10:19:35.440] [110.batch] task_p_pre_launch: Using sched_affinity for tasks
[2020-11-18T10:19:39.770] debug:  _rpc_terminate_job, uid = 450
[2020-11-18T10:19:39.770] debug: task_p_slurmd_release_resources: affinity jobid 110
[2020-11-18T10:19:39.770] debug:  credential for job 110 revoked
[2020-11-18T10:19:39.771] [110.batch] debug:  Handling REQUEST_SIGNAL_CONTAINER
[2020-11-18T10:19:39.771] [110.batch] debug: _handle_signal_container for step=110.4294967294 uid=450 signal=18
[2020-11-18T10:19:39.771] [110.batch] Sent signal 18 to 110.4294967294
[2020-11-18T10:19:39.771] [110.batch] debug:  Handling REQUEST_SIGNAL_CONTAINER
[2020-11-18T10:19:39.771] [110.batch] debug: _handle_signal_container for step=110.4294967294 uid=450 signal=15
[2020-11-18T10:19:39.771] [110.batch] error: *** JOB 110 ON martin-cn-01 CANCELLED AT 2020-11-18T10:19:39 ***
[2020-11-18T10:19:39.771] [110.batch] Sent signal 15 to 110.4294967294
[2020-11-18T10:19:39.771] [110.batch] debug:  Handling REQUEST_STATE
[2020-11-18T10:19:39.792] [110.batch] debug:  Handling REQUEST_STATE
[2020-11-18T10:19:39.842] [110.batch] debug:  Handling REQUEST_STATE
[2020-11-18T10:19:39.942] [110.batch] debug:  Handling REQUEST_STATE
[2020-11-18T10:19:40.442] [110.batch] debug:  Handling REQUEST_STATE
[2020-11-18T10:19:41.442] [110.batch] debug:  Handling REQUEST_STATE
[2020-11-18T10:19:42.443] [110.batch] debug:  Handling REQUEST_STATE
[2020-11-18T10:19:43.443] [110.batch] debug:  Handling REQUEST_STATE
[2020-11-18T10:19:44.443] [110.batch] debug:  Handling REQUEST_STATE
[2020-11-18T10:19:45.443] [110.batch] debug:  Handling REQUEST_STATE
[2020-11-18T10:19:46.443] [110.batch] debug:  Handling REQUEST_STATE
[2020-11-18T10:19:47.443] [110.batch] debug:  Handling REQUEST_STATE
[2020-11-18T10:19:48.444] [110.batch] debug:  Handling REQUEST_STATE
[2020-11-18T10:19:49.444] [110.batch] debug:  Handling REQUEST_STATE
[2020-11-18T10:19:50.444] [110.batch] debug:  Handling REQUEST_STATE
[2020-11-18T10:20:00.444] [110.batch] debug:  Handling REQUEST_STATE
[2020-11-18T10:20:09.445] [110.batch] debug:  Handling REQUEST_STEP_TERMINATE
[2020-11-18T10:20:09.445] [110.batch] debug:  _handle_terminate for step=110.4294967294 uid=0
[2020-11-18T10:20:09.445] [110.batch] Sent SIGKILL signal to 110.4294967294
[2020-11-18T10:20:09.445] [110.batch] debug:  Handling REQUEST_STATE
[2020-11-18T10:20:09.465] [110.batch] debug:  Handling REQUEST_STATE
[2020-11-18T10:20:09.515] [110.batch] debug:  Handling REQUEST_STATE
[2020-11-18T10:20:09.616] [110.batch] debug:  Handling REQUEST_STATE
[2020-11-18T10:20:10.116] [110.batch] debug:  Handling REQUEST_STATE
[2020-11-18T10:20:11.116] [110.batch] debug:  Handling REQUEST_STATE
[2020-11-18T10:20:11.116] [110.batch] debug:  Handling REQUEST_STEP_TERMINATE
[2020-11-18T10:20:11.116] [110.batch] debug:  _handle_terminate for step=110.4294967294 uid=0
[2020-11-18T10:20:11.116] [110.batch] Sent SIGKILL signal to 110.4294967294
[2020-11-18T10:20:12.116] [110.batch] debug:  Handling REQUEST_STATE
[2020-11-18T10:20:12.116] [110.batch] debug:  Handling REQUEST_STEP_TERMINATE
[2020-11-18T10:20:12.116] [110.batch] debug:  _handle_terminate for step=110.4294967294 uid=0
SIGKILL signals continue until the UnkillableStepTimeout is reached, at which point Slurm closes the job, leaving the processes running, with the following output:

[2020-11-18T10:20:59.282] debug: _rpc_terminate_job, uid = 450
[2020-11-18T10:20:59.282] debug: task_p_slurmd_release_resources: affinity jobid 110
[2020-11-18T10:21:00.121] [110.batch] debug:  Handling REQUEST_STATE
[2020-11-18T10:21:00.121] [110.batch] debug:  Handling REQUEST_STEP_TERMINATE
[2020-11-18T10:21:00.121] [110.batch] debug:  _handle_terminate for step=110.4294967294 uid=0
[2020-11-18T10:21:00.121] [110.batch] Sent SIGKILL signal to 110.4294967294
[2020-11-18T10:21:10.000] [110.batch] error: *** JOB 110 STEPD TERMINATED ON martin-cn-01 AT 2020-11-18T10:21:09 DUE TO JOB NOT ENDING WITH SIGNALS ***
[2020-11-18T10:21:10.000] [110.batch] debug:  get_exit_code task 0 killed by cmd
[2020-11-18T10:21:10.000] [110.batch] job 110 completed with slurm_rc = 4001, job_rc = -1
[2020-11-18T10:21:10.000] [110.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4001 status:-1
[2020-11-18T10:21:10.001] [110.batch] debug:  Message thread exited
[2020-11-18T10:21:10.001] [110.batch] done with job
[2020-11-18T10:21:10.121] debug:  Waiting for job 110's prolog to complete
[2020-11-18T10:21:10.121] debug:  Finished wait for job 110's prolog to complete
[2020-11-18T10:21:10.121] debug:  [job 110] attempting to run epilog [/cm/local/apps/cmd/scripts/epilog]
[2020-11-18T10:21:10.124] debug:  completed epilog for jobid 110
[2020-11-18T10:21:10.124] debug:  Job 110: sent epilog complete msg: rc = 0
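
For completeness, the kill-related timeouts in play can be read straight from the running configuration, roughly like this (assuming a standard Slurm install):

# Show the kill-related timeouts slurmctld/slurmd are actually using
scontrol show config | grep -iE 'killwait|unkillable'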

Willy Markuske

HPC Systems Engineer


Research Data Services

P: (858) 246-5593


On 11/18/20 9:41 AM, mercan wrote:
Hi;

Check the epilog's return value, which is the return value of the last command in the epilog script. For testing, you can also add an "exit 0" line at the end of the epilog script to make sure it returns zero.
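
For example, a minimal test epilog along these lines (purely illustrative, not your actual cmd epilog) guarantees a zero return value:

#!/bin/bash
# Minimal test epilog: record that it ran, then force a zero exit status.
logger -t slurm-epilog "epilog ran for job ${SLURM_JOB_ID:-unknown}"
exit 0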

Ahmet M.


On 18.11.2020 20:00, William Markuske wrote:

Hello,

I am having an odd problem where users are unable to kill their jobs with scancel. Users can submit jobs just fine, and when a task completes the job closes correctly. However, if a user attempts to cancel a job via scancel, the SIGKILL signals are sent to the step but never take effect. Slurmd then keeps resending SIGKILL requests until the UnkillableTimeout is hit, the Slurm job exits with an error, the node enters a draining state, and the spawned processes continue to run on the node.

I'm at a loss, because jobs can complete without issue, which seems to rule out a networking or permissions problem with Slurm's job accounting. A user can ssh to the node once a job has been submitted and kill the subprocesses manually, at which point Slurm completes the epilog and the node returns to idle.
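
For reference, the manual cleanup looks roughly like this (the cgroup path is an assumption for a typical cgroup v1 layout with proctrack/cgroup; the job id and uid shown are just examples):

JOBID=110      # substitute the stuck job's id
JOBUID=0       # substitute the job owner's uid
CG=/sys/fs/cgroup/memory/slurm/uid_${JOBUID}/job_${JOBID}/step_batch
cat "${CG}/cgroup.procs"                  # PIDs still alive in the batch step
xargs -r kill -9 < "${CG}/cgroup.procs"   # manual cleanup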

Does anyone know what may be causing this behavior? Please let me know which slurm.conf or cgroup.conf settings would be helpful for diagnosing this issue. I'm quite stumped by this one.
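
I'm happy to pull whatever is useful; I'd start with something like the following (assuming the standard /etc/slurm layout):

# Dump the effective settings plus the kill/cgroup-related parts of the config
scontrol show config > slurm_config.txt
grep -iE 'prolog|epilog|killwait|unkillable|proctrack|taskplugin' /etc/slurm/slurm.conf
cat /etc/slurm/cgroup.conf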

--

Willy Markuske

HPC Systems Engineer



Research Data Services

P: (858) 246-5593

