These log lines about the prolog script look very suspicious to me:

[2020-11-18T10:19:35.388] debug:  [job 110] attempting to run prolog [/cm/local/apps/cmd/scripts/prolog]

then

[2020-11-18T10:21:10.121] debug:  Waiting for job 110's prolog to complete
[2020-11-18T10:21:10.121] debug:  Finished wait for job 110's prolog to complete

Even after two minutes have passed, the logs are still reporting a wait on the prolog script. Could you check the prolog?
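A quick way to rule the prolog in or out is to run it by hand with a job-like environment and look at its exit status. This is only a rough sketch; the exact variables your cmd prolog expects may differ:

# Run the site prolog manually with a job-like environment and report its exit code.
# SLURM_JOB_ID/SLURM_JOB_UID values are taken from job 110 in the logs; adjust as needed.
sudo env SLURM_JOB_ID=110 SLURM_JOB_UID=0 bash -x /cm/local/apps/cmd/scripts/prolog
echo "prolog exit code: $?"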

Ahmet M.



On 18.11.2020 21:33, William Markuske wrote:

The epilog script does have exit 0 set at the end. Epilogs exit cleanly when run.

With the log level set to debug5 I get the following results for any scancel call.

Submit host slurmctld.log

[2020-11-18T10:19:34.944] _slurm_rpc_submit_batch_job: JobId=110 InitPrio=110503 usec=191
[2020-11-18T10:19:35.231] debug:  sched: Running job scheduler
[2020-11-18T10:19:35.231] sched: Allocate JobId=110 NodeList=martin-cn-01 #CPUs=1 Partition=martin-compute
[2020-11-18T10:19:35.387] prolog_running_decr: Configuration for JobId=110 is complete
[2020-11-18T10:19:39.769] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=110 uid 0
[2020-11-18T10:19:39.770] job_signal: 9 of running JobId=110 successful 0x8004
[2020-11-18T10:20:59.282] Resending TERMINATE_JOB request JobId=110 Nodelist=martin-cn-01
[2020-11-18T10:21:10.000] update_node: node martin-cn-01 reason set to: Kill task failed
[2020-11-18T10:21:10.000] update_node: node martin-cn-01 state set to DRAINING
[2020-11-18T10:21:10.001] error: slurmd error running JobId=110 on node(s)=martin-cn-01: Kill task failed
[2020-11-18T10:21:10.124] cleanup_completing: JobId=110 completion process took 91 seconds
Compute node slurmd.log

[2020-11-18T10:19:35.388] task_p_slurmd_batch_request: 110
[2020-11-18T10:19:35.388] task/affinity: job 110 CPU input mask for node: 0x00000000000000000000000000000003
[2020-11-18T10:19:35.388] task/affinity: job 110 CPU final HW mask for node: 0x00000000000000010000000000000001
[2020-11-18T10:19:35.388] debug:  [job 110] attempting to run prolog [/cm/local/apps/cmd/scripts/prolog]
[2020-11-18T10:19:35.392] _run_prolog: run job script took usec=4085
[2020-11-18T10:19:35.392] _run_prolog: prolog with lock for job 110 ran for 0 seconds
[2020-11-18T10:19:35.393] Launching batch job 110 for UID 0
[2020-11-18T10:19:35.398] debug:  AcctGatherEnergy NONE plugin loaded
[2020-11-18T10:19:35.398] debug:  AcctGatherProfile NONE plugin loaded
[2020-11-18T10:19:35.398] debug:  AcctGatherInterconnect NONE plugin loaded
[2020-11-18T10:19:35.398] debug:  AcctGatherFilesystem NONE plugin loaded
[2020-11-18T10:19:35.398] debug:  switch NONE plugin loaded
[2020-11-18T10:19:35.398] debug:  init: Gres GPU plugin loaded
[2020-11-18T10:19:35.400] [110.batch] Considering each NUMA node as a socket
[2020-11-18T10:19:35.400] [110.batch] debug:  CPUs:128 Boards:1 Sockets:8 CoresPerSocket:8 ThreadsPerCore:2
[2020-11-18T10:19:35.400] [110.batch] debug:  Job accounting gather cgroup plugin loaded
[2020-11-18T10:19:35.401] [110.batch] debug:  cont_id hasn't been set yet not running poll
[2020-11-18T10:19:35.401] [110.batch] debug:  laying out the 1 tasks on 1 hosts martin-cn-01 dist 2
[2020-11-18T10:19:35.401] [110.batch] debug:  Message thread started pid = 267029
[2020-11-18T10:19:35.401] [110.batch] task affinity plugin loaded with CPU mask 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ffffffffffffffffffffffffffffffff
[2020-11-18T10:19:35.402] [110.batch] debug:  task/cgroup: now constraining jobs allocated cores
[2020-11-18T10:19:35.402] [110.batch] debug: task/cgroup/memory: total:515440M allowed:100%(enforced), swap:0%(enforced), max:100%(515440M) max+swap:100%(1030880M) min:30M kmem:100%(515440M enforced) min:30M swappiness:100(set)
[2020-11-18T10:19:35.402] [110.batch] debug:  task/cgroup: now constraining jobs allocated memory
[2020-11-18T10:19:35.402] [110.batch] debug:  task/cgroup: unable to open /cm/shared/apps/slurm/var/etc/flamingo/cgroup_allowed_devices_file.conf: No such file or directory
[2020-11-18T10:19:35.402] [110.batch] debug:  task/cgroup: now constraining jobs allocated devices
[2020-11-18T10:19:35.402] [110.batch] debug:  task/cgroup: loaded
[2020-11-18T10:19:35.402] [110.batch] debug:  Checkpoint plugin loaded: checkpoint/none
[2020-11-18T10:19:35.402] [110.batch] Munge credential signature plugin loaded
[2020-11-18T10:19:35.402] [110.batch] debug:  job_container none plugin loaded
[2020-11-18T10:19:35.402] [110.batch] debug: xcgroup_instantiate: cgroup '/sys/fs/cgroup/freezer/slurm' already exists
[2020-11-18T10:19:35.402] [110.batch] debug: xcgroup_instantiate: cgroup '/sys/fs/cgroup/freezer/slurm/uid_0' already exists
[2020-11-18T10:19:35.403] [110.batch] debug:  spank: opening plugin stack /etc/slurm/plugstack.conf
[2020-11-18T10:19:35.403] [110.batch] debug: xcgroup_instantiate: cgroup '/sys/fs/cgroup/cpuset/slurm' already exists
[2020-11-18T10:19:35.403] [110.batch] debug:  task/cgroup: job abstract cores are '0'
[2020-11-18T10:19:35.403] [110.batch] debug:  task/cgroup: step abstract cores are '0'
[2020-11-18T10:19:35.403] [110.batch] debug:  task/cgroup: job physical cores are '0,64'
[2020-11-18T10:19:35.403] [110.batch] debug:  task/cgroup: step physical cores are '0,64'
[2020-11-18T10:19:35.403] [110.batch] debug: xcgroup_instantiate: cgroup '/sys/fs/cgroup/cpuset/slurm/uid_0' already exists
[2020-11-18T10:19:35.406] [110.batch] debug: xcgroup_instantiate: cgroup '/sys/fs/cgroup/memory/slurm' already exists
[2020-11-18T10:19:35.406] [110.batch] debug: xcgroup_instantiate: cgroup '/sys/fs/cgroup/memory/slurm/uid_0' already exists
[2020-11-18T10:19:35.406] [110.batch] debug: xcgroup_instantiate: cgroup '/sys/fs/cgroup/memory/slurm/uid_0/job_110' already exists
[2020-11-18T10:19:35.406] [110.batch] task/cgroup: /slurm/uid_0/job_110: alloc=4096MB mem.limit=4096MB memsw.limit=4096MB
[2020-11-18T10:19:35.406] [110.batch] debug: xcgroup_instantiate: cgroup '/sys/fs/cgroup/memory/slurm/uid_0/job_110/step_batch' already exists
[2020-11-18T10:19:35.406] [110.batch] task/cgroup: /slurm/uid_0/job_110/step_batch: alloc=4096MB mem.limit=4096MB memsw.limit=4096MB
[2020-11-18T10:19:35.406] [110.batch] debug: _oom_event_monitor: started.
[2020-11-18T10:19:35.406] [110.batch] debug: xcgroup_instantiate: cgroup '/sys/fs/cgroup/devices/slurm' already exists
[2020-11-18T10:19:35.406] [110.batch] debug: xcgroup_instantiate: cgroup '/sys/fs/cgroup/devices/slurm/uid_0' already exists
[2020-11-18T10:19:35.438] [110.batch] debug level = 2
[2020-11-18T10:19:35.438] [110.batch] starting 1 tasks
[2020-11-18T10:19:35.438] [110.batch] task 0 (267034) started 2020-11-18T10:19:35
[2020-11-18T10:19:35.438] [110.batch] debug:  Setting slurmstepd oom_adj to -1000
[2020-11-18T10:19:35.438] [110.batch] debug: jobacct_gather_cgroup_cpuacct_attach_task: jobid 110 stepid 4294967294 taskid 0 max_task_id 0
[2020-11-18T10:19:35.439] [110.batch] debug: xcgroup_instantiate: cgroup '/sys/fs/cgroup/cpuacct/slurm' already exists
[2020-11-18T10:19:35.439] [110.batch] debug: xcgroup_instantiate: cgroup '/sys/fs/cgroup/cpuacct/slurm/uid_0' already exists
[2020-11-18T10:19:35.439] [110.batch] debug: xcgroup_instantiate: cgroup '/sys/fs/cgroup/cpuacct/slurm/uid_0/job_110' already exists
[2020-11-18T10:19:35.439] [110.batch] debug: xcgroup_instantiate: cgroup '/sys/fs/cgroup/cpuacct/slurm/uid_0/job_110/step_batch' already exists
[2020-11-18T10:19:35.439] [110.batch] debug: jobacct_gather_cgroup_memory_attach_task: jobid 110 stepid 4294967294 taskid 0 max_task_id 0
[2020-11-18T10:19:35.439] [110.batch] debug: xcgroup_instantiate: cgroup '/sys/fs/cgroup/memory/slurm' already exists
[2020-11-18T10:19:35.439] [110.batch] debug: xcgroup_instantiate: cgroup '/sys/fs/cgroup/memory/slurm/uid_0' already exists
[2020-11-18T10:19:35.439] [110.batch] debug: xcgroup_instantiate: cgroup '/sys/fs/cgroup/memory/slurm/uid_0/job_110' already exists
[2020-11-18T10:19:35.439] [110.batch] debug: xcgroup_instantiate: cgroup '/sys/fs/cgroup/memory/slurm/uid_0/job_110/step_batch' already exists
[2020-11-18T10:19:35.439] [110.batch] debug: xcgroup_instantiate: cgroup '/sys/fs/cgroup/memory/slurm/uid_0/job_110/step_batch/task_0' already exists
[2020-11-18T10:19:35.440] [110.batch] debug:  task_p_pre_launch: affinity jobid 110.4294967294, task:0 bind:256
[2020-11-18T10:19:35.440] [110.batch] task_p_pre_launch: Using sched_affinity for tasks
[2020-11-18T10:19:39.770] debug:  _rpc_terminate_job, uid = 450
[2020-11-18T10:19:39.770] debug: task_p_slurmd_release_resources: affinity jobid 110
[2020-11-18T10:19:39.770] debug:  credential for job 110 revoked
[2020-11-18T10:19:39.771] [110.batch] debug:  Handling REQUEST_SIGNAL_CONTAINER
[2020-11-18T10:19:39.771] [110.batch] debug: _handle_signal_container for step=110.4294967294 uid=450 signal=18
[2020-11-18T10:19:39.771] [110.batch] Sent signal 18 to 110.4294967294
[2020-11-18T10:19:39.771] [110.batch] debug:  Handling REQUEST_SIGNAL_CONTAINER
[2020-11-18T10:19:39.771] [110.batch] debug: _handle_signal_container for step=110.4294967294 uid=450 signal=15
[2020-11-18T10:19:39.771] [110.batch] error: *** JOB 110 ON martin-cn-01 CANCELLED AT 2020-11-18T10:19:39 ***
[2020-11-18T10:19:39.771] [110.batch] Sent signal 15 to 110.4294967294
[2020-11-18T10:19:39.771] [110.batch] debug:  Handling REQUEST_STATE
[2020-11-18T10:19:39.792] [110.batch] debug:  Handling REQUEST_STATE
[2020-11-18T10:19:39.842] [110.batch] debug:  Handling REQUEST_STATE
[2020-11-18T10:19:39.942] [110.batch] debug:  Handling REQUEST_STATE
[2020-11-18T10:19:40.442] [110.batch] debug:  Handling REQUEST_STATE
[2020-11-18T10:19:41.442] [110.batch] debug:  Handling REQUEST_STATE
[2020-11-18T10:19:42.443] [110.batch] debug:  Handling REQUEST_STATE
[2020-11-18T10:19:43.443] [110.batch] debug:  Handling REQUEST_STATE
[2020-11-18T10:19:44.443] [110.batch] debug:  Handling REQUEST_STATE
[2020-11-18T10:19:45.443] [110.batch] debug:  Handling REQUEST_STATE
[2020-11-18T10:19:46.443] [110.batch] debug:  Handling REQUEST_STATE
[2020-11-18T10:19:47.443] [110.batch] debug:  Handling REQUEST_STATE
[2020-11-18T10:19:48.444] [110.batch] debug:  Handling REQUEST_STATE
[2020-11-18T10:19:49.444] [110.batch] debug:  Handling REQUEST_STATE
[2020-11-18T10:19:50.444] [110.batch] debug:  Handling REQUEST_STATE
[2020-11-18T10:20:00.444] [110.batch] debug:  Handling REQUEST_STATE
[2020-11-18T10:20:09.445] [110.batch] debug:  Handling REQUEST_STEP_TERMINATE
[2020-11-18T10:20:09.445] [110.batch] debug:  _handle_terminate for step=110.4294967294 uid=0
[2020-11-18T10:20:09.445] [110.batch] Sent SIGKILL signal to 110.4294967294
[2020-11-18T10:20:09.445] [110.batch] debug:  Handling REQUEST_STATE
[2020-11-18T10:20:09.465] [110.batch] debug:  Handling REQUEST_STATE
[2020-11-18T10:20:09.515] [110.batch] debug:  Handling REQUEST_STATE
[2020-11-18T10:20:09.616] [110.batch] debug:  Handling REQUEST_STATE
[2020-11-18T10:20:10.116] [110.batch] debug:  Handling REQUEST_STATE
[2020-11-18T10:20:11.116] [110.batch] debug:  Handling REQUEST_STATE
[2020-11-18T10:20:11.116] [110.batch] debug:  Handling REQUEST_STEP_TERMINATE
[2020-11-18T10:20:11.116] [110.batch] debug:  _handle_terminate for step=110.4294967294 uid=0
[2020-11-18T10:20:11.116] [110.batch] Sent SIGKILL signal to 110.4294967294
[2020-11-18T10:20:12.116] [110.batch] debug:  Handling REQUEST_STATE
[2020-11-18T10:20:12.116] [110.batch] debug:  Handling REQUEST_STEP_TERMINATE
[2020-11-18T10:20:12.116] [110.batch] debug:  _handle_terminate for step=110.4294967294 uid=0
SIGKILL signals continue until the UnkillableStepTimeout is reached, at which point Slurm closes the job, leaving the processes running, with the following output:

[2020-11-18T10:20:59.282] debug: _rpc_terminate_job, uid = 450
[2020-11-18T10:20:59.282] debug: task_p_slurmd_release_resources: affinity jobid 110
[2020-11-18T10:21:00.121] [110.batch] debug:  Handling REQUEST_STATE
[2020-11-18T10:21:00.121] [110.batch] debug:  Handling REQUEST_STEP_TERMINATE
[2020-11-18T10:21:00.121] [110.batch] debug:  _handle_terminate for step=110.4294967294 uid=0
[2020-11-18T10:21:00.121] [110.batch] Sent SIGKILL signal to 110.4294967294
[2020-11-18T10:21:10.000] [110.batch] error: *** JOB 110 STEPD TERMINATED ON martin-cn-01 AT 2020-11-18T10:21:09 DUE TO JOB NOT ENDING WITH SIGNALS ***
[2020-11-18T10:21:10.000] [110.batch] debug:  get_exit_code task 0 killed by cmd
[2020-11-18T10:21:10.000] [110.batch] job 110 completed with slurm_rc = 4001, job_rc = -1
[2020-11-18T10:21:10.000] [110.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4001 status:-1
[2020-11-18T10:21:10.001] [110.batch] debug:  Message thread exited
[2020-11-18T10:21:10.001] [110.batch] done with job
[2020-11-18T10:21:10.121] debug:  Waiting for job 110's prolog to complete
[2020-11-18T10:21:10.121] debug:  Finished wait for job 110's prolog to complete
[2020-11-18T10:21:10.121] debug:  [job 110] attempting to run epilog [/cm/local/apps/cmd/scripts/epilog]
[2020-11-18T10:21:10.124] debug:  completed epilog for jobid 110
[2020-11-18T10:21:10.124] debug:  Job 110: sent epilog complete msg: rc = 0
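
For completeness, the kill-related timeouts in play can be read straight from the running configuration, roughly like this (assuming a standard Slurm install):

# Show the kill-related timeouts slurmctld/slurmd are actually using
scontrol show config | grep -iE 'killwait|unkillable'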

Willy Markuske

HPC Systems Engineer


Research Data Services

P: (858) 246-5593


On 11/18/20 9:41 AM, mercan wrote:
Hi;

Check the epilog's return value, which is the return value of the last command in the epilog script. For testing, you can also add an "exit 0" line at the end of the epilog script to make sure it returns zero.
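
For example, a minimal test epilog along these lines (purely illustrative, not your actual cmd epilog) guarantees a zero return value:

#!/bin/bash
# Minimal test epilog: record that it ran, then force a zero exit status.
logger -t slurm-epilog "epilog ran for job ${SLURM_JOB_ID:-unknown}"
exit 0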

Ahmet M.


On 18.11.2020 20:00, William Markuske wrote:

Hello,

I am having an odd problem where users are unable to kill their jobs with scancel. Users can submit jobs just fine, and when a task completes the job closes correctly. However, if a user attempts to cancel a job via scancel, the SIGKILL signals are sent to the step but never take effect. Slurmd then keeps resending SIGKILL requests until the UnkillableTimeout is hit, the Slurm job exits with an error, the node enters a draining state, and the spawned processes continue to run on the node.

I'm at a loss, because jobs can complete without issue, which seems to rule out a networking or permissions problem with Slurm's job accounting. A user can ssh to the node once a job has been submitted and kill the subprocesses manually, at which point Slurm completes the epilog and the node returns to idle.
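
For reference, the manual cleanup looks roughly like this (the cgroup path is an assumption for a typical cgroup v1 layout with proctrack/cgroup; the job id and uid shown are just examples):

JOBID=110      # substitute the stuck job's id
JOBUID=0       # substitute the job owner's uid
CG=/sys/fs/cgroup/memory/slurm/uid_${JOBUID}/job_${JOBID}/step_batch
cat "${CG}/cgroup.procs"                  # PIDs still alive in the batch step
xargs -r kill -9 < "${CG}/cgroup.procs"   # manual cleanup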

Does anyone know what may be causing this behavior? Please let me know which slurm.conf or cgroup.conf settings would be helpful for diagnosing this issue. I'm quite stumped by this one.
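
I'm happy to pull whatever is useful; I'd start with something like the following (assuming the standard /etc/slurm layout):

# Dump the effective settings plus the kill/cgroup-related parts of the config
scontrol show config > slurm_config.txt
grep -iE 'prolog|epilog|killwait|unkillable|proctrack|taskplugin' /etc/slurm/slurm.conf
cat /etc/slurm/cgroup.conf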

--

Willy Markuske

HPC Systems Engineer



Research Data Services

P: (858) 246-5593

