Hi,

I'm diagnosing an application that isn't terminating correctly.
While reviewing the slurmd logs I found two lines I need help decoding
(logs are normal):

Line 45: [2020-10-23T14:30:22.610] [2547451.batch] Sent signal 18 to 2547451.4294967294
Line 46: [2020-10-23T14:30:22.654] [2547451.extern] Sent signal 18 to 2547451.4294967295

What are step ids 4294967294 (2^32-2) and 4294967295 (2^32-1)?  Do they belong 
to batch and extern steps respectively?  I see the same ids in the logs of 
other jobs.
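
To make the arithmetic concrete, here is a minimal Python sketch that pairs the step label in each log prefix with the numeric id it signalled, and shows that id as hex and as a signed 32-bit value. The regular expression and the helper names are my own, written only for this illustration; they are not part of Slurm. The pairing it prints is only what these two log lines suggest, not a confirmed statement about how Slurm defines these ids.

```python
import re

# The two log lines quoted above, rejoined onto single lines.
LOG_LINES = [
    "[2020-10-23T14:30:22.610] [2547451.batch] Sent signal 18 to 2547451.4294967294",
    "[2020-10-23T14:30:22.654] [2547451.extern] Sent signal 18 to 2547451.4294967295",
]

# Hypothetical pattern written for this illustration; not part of Slurm.
SENT_SIGNAL = re.compile(
    r"\[(?P<job>\d+)\.(?P<label>\w+)\] Sent signal (?P<sig>\d+) "
    r"to (?P=job)\.(?P<step>\d+)"
)


def describe_step_id(step_id: int) -> dict:
    """Show a step id as hex and as a signed 32-bit integer."""
    signed = step_id - 2**32 if step_id >= 2**31 else step_id
    return {"hex": f"0x{step_id:08x}", "signed32": signed}


def pair_labels_with_ids(lines):
    """Map each step label in the log prefix to the numeric id it signalled."""
    pairs = {}
    for line in lines:
        match = SENT_SIGNAL.search(line)
        if match:
            pairs[match.group("label")] = int(match.group("step"))
    return pairs


pairs = pair_labels_with_ids(LOG_LINES)
for label, step_id in pairs.items():
    print(label, step_id, describe_step_id(step_id))
```

Both ids sit at the very top of the unsigned 32-bit range, which is where reserved sentinel values usually live, so they look like special markers rather than ordinary step numbers.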

I've attached the full slurmd log of job 2547451.  The line numbers referenced 
above are from this file.

Thanks,

Sebastian

--

[University of Nevada, Reno]<http://www.unr.edu/>
Sebastian Smith
High-Performance Computing Engineer
Office of Information Technology
1664 North Virginia Street
MS 0291

work-phone: 775-682-5050
email: stsm...@unr.edu
website: http://rc.unr.edu

[2020-10-23T14:17:35.519] task_p_slurmd_batch_request: 2547451
[2020-10-23T14:17:35.519] task/affinity: job 2547451 CPU input mask for node: 0x0000000000000003
[2020-10-23T14:17:35.519] task/affinity: job 2547451 CPU final HW mask for node: 0x0000000100000001
[2020-10-23T14:17:35.525] _run_prolog: run job script took usec=6094
[2020-10-23T14:17:35.525] _run_prolog: prolog with lock for job 2547451 ran for 0 seconds
[2020-10-23T14:17:35.565] [2547451.extern] task affinity plugin loaded with CPU 
mask 
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ffffffffffffffff
[2020-10-23T14:17:35.569] [2547451.extern] Munge cryptographic signature plugin loaded
[2020-10-23T14:17:35.594] [2547451.extern] task/cgroup: /slurm/uid_3176402/job_2547451: alloc=65536MB mem.limit=65536MB memsw.limit=65536MB
[2020-10-23T14:17:35.594] [2547451.extern] task/cgroup: /slurm/uid_3176402/job_2547451/step_extern: alloc=65536MB mem.limit=65536MB memsw.limit=65536MB
[2020-10-23T14:17:35.595] Launching batch job 2547451 for UID 3176402
[2020-10-23T14:17:35.611] [2547451.batch] task affinity plugin loaded with CPU 
mask 
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ffffffffffffffff
[2020-10-23T14:17:35.612] [2547451.batch] Munge cryptographic signature plugin loaded
[2020-10-23T14:17:35.621] [2547451.batch] task/cgroup: /slurm/uid_3176402/job_2547451: alloc=65536MB mem.limit=65536MB memsw.limit=65536MB
[2020-10-23T14:17:35.621] [2547451.batch] task/cgroup: /slurm/uid_3176402/job_2547451/step_batch: alloc=65536MB mem.limit=65536MB memsw.limit=65536MB
[2020-10-23T14:17:35.776] [2547451.batch] debug level = 2
[2020-10-23T14:17:35.776] [2547451.batch] starting 1 tasks
[2020-10-23T14:17:35.777] [2547451.batch] task 0 (19889) started 2020-10-23T14:17:35
[2020-10-23T14:17:35.778] [2547451.batch] task_p_pre_launch: Using sched_affinity for tasks
[2020-10-23T14:17:35.822] sbcast req_uid=3176402 job_id=2547451 fname=/data/gpfs/home/shubhamp/resDB/ifconfig.txt block_no=1
[2020-10-23T14:17:40.925] launch task 2547451.0 request from UID:3176402 GID:3176402 HOST:172.19.4.4 PORT:18832
[2020-10-23T14:17:40.925] lllp_distribution jobid [2547451] implicit auto binding: cores, dist 1
[2020-10-23T14:17:40.925] _task_layout_lllp_cyclic
[2020-10-23T14:17:40.925] _lllp_generate_cpu_bind jobid [2547451]: mask_cpu, 0x0000000100000001
[2020-10-23T14:17:40.941] [2547451.0] task affinity plugin loaded with CPU mask 
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ffffffffffffffff
[2020-10-23T14:17:40.942] [2547451.0] Munge cryptographic signature plugin loaded
[2020-10-23T14:17:40.951] [2547451.0] task/cgroup: /slurm/uid_3176402/job_2547451: alloc=65536MB mem.limit=65536MB memsw.limit=65536MB
[2020-10-23T14:17:40.952] [2547451.0] task/cgroup: /slurm/uid_3176402/job_2547451/step_0: alloc=65536MB mem.limit=65536MB memsw.limit=65536MB
[2020-10-23T14:17:40.965] [2547451.0] debug level = 2
[2020-10-23T14:17:40.965] [2547451.0] starting 1 tasks
[2020-10-23T14:17:40.965] [2547451.0] task 0 (19934) started 2020-10-23T14:17:40
[2020-10-23T14:17:40.967] [2547451.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-10-23T14:17:42.969] [2547451.0] task 0 (19934) exited with exit code 0.
[2020-10-23T14:17:42.980] [2547451.0] done with job
[2020-10-23T14:17:42.998] launch task 2547451.1 request from UID:3176402 GID:3176402 HOST:172.19.4.4 PORT:22416
[2020-10-23T14:17:42.998] lllp_distribution jobid [2547451] auto binding off: mask_cpu
[2020-10-23T14:17:43.016] [2547451.1] task affinity plugin loaded with CPU mask 
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ffffffffffffffff
[2020-10-23T14:17:43.018] [2547451.1] Munge cryptographic signature plugin loaded
[2020-10-23T14:17:43.025] [2547451.1] task/cgroup: /slurm/uid_3176402/job_2547451: alloc=65536MB mem.limit=65536MB memsw.limit=65536MB
[2020-10-23T14:17:43.026] [2547451.1] task/cgroup: /slurm/uid_3176402/job_2547451/step_1: alloc=65536MB mem.limit=65536MB memsw.limit=65536MB
[2020-10-23T14:17:43.038] [2547451.1] debug level = 2
[2020-10-23T14:17:43.038] [2547451.1] starting 1 tasks
[2020-10-23T14:17:43.038] [2547451.1] task 0 (19956) started 2020-10-23T14:17:43
[2020-10-23T14:17:43.039] [2547451.1] task_p_pre_launch: Using sched_affinity for tasks
[2020-10-23T14:30:22.573] [2547451.1] Sent signal 18 to 2547451.1
[2020-10-23T14:30:22.610] [2547451.batch] Sent signal 18 to 2547451.4294967294
[2020-10-23T14:30:22.654] [2547451.extern] Sent signal 18 to 2547451.4294967295
[2020-10-23T14:30:22.694] [2547451.1] error: *** STEP 2547451.1 ON cpu-3 CANCELLED AT 2020-10-23T14:30:22 ***
[2020-10-23T14:30:22.694] [2547451.1] Sent signal 15 to 2547451.1
[2020-10-23T14:30:22.700] [2547451.batch] error: *** JOB 2547451 ON cpu-3 CANCELLED AT 2020-10-23T14:30:22 ***
[2020-10-23T14:30:22.700] [2547451.batch] Sent signal 15 to 2547451.4294967294
[2020-10-23T14:30:22.701] [2547451.batch] task 0 (19889) exited. Killed by signal 15.
[2020-10-23T14:30:22.703] [2547451.extern] Sent signal 15 to 2547451.4294967295
[2020-10-23T14:30:23.787] [2547451.extern] _oom_event_monitor: oom-kill event count: 1
[2020-10-23T14:30:23.791] [2547451.extern] done with job
[2020-10-23T14:30:23.800] [2547451.batch] job 2547451 completed with slurm_rc = 0, job_rc = 15
[2020-10-23T14:30:23.800] [2547451.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 15
[2020-10-23T14:30:23.802] [2547451.batch] done with job
[2020-10-23T14:30:52.384] [2547451.1] Sent SIGKILL signal to 2547451.1
[2020-10-23T14:30:52.392] [2547451.1] task 0 (19956) exited. Killed by signal 9.
[2020-10-23T14:30:54.058] [2547451.1] Sent SIGKILL signal to 2547451.1
[2020-10-23T14:30:55.059] [2547451.1] Sent SIGKILL signal to 2547451.1
[2020-10-23T14:30:56.060] [2547451.1] Sent SIGKILL signal to 2547451.1
[2020-10-23T14:30:57.061] [2547451.1] Sent SIGKILL signal to 2547451.1
[2020-10-23T14:30:58.061] [2547451.1] Sent SIGKILL signal to 2547451.1
[2020-10-23T14:30:59.062] [2547451.1] Sent SIGKILL signal to 2547451.1
[2020-10-23T14:31:00.063] [2547451.1] Sent SIGKILL signal to 2547451.1
[2020-10-23T14:31:01.064] [2547451.1] Sent SIGKILL signal to 2547451.1
[2020-10-23T14:31:02.065] [2547451.1] Sent SIGKILL signal to 2547451.1
[2020-10-23T14:31:03.066] [2547451.1] Sent SIGKILL signal to 2547451.1
[2020-10-23T14:31:13.067] [2547451.1] Sent SIGKILL signal to 2547451.1
[2020-10-23T14:31:23.068] [2547451.1] Sent SIGKILL signal to 2547451.1
[2020-10-23T14:31:33.069] [2547451.1] Sent SIGKILL signal to 2547451.1
[2020-10-23T14:31:43.070] [2547451.1] Sent SIGKILL signal to 2547451.1
[2020-10-23T14:31:53.000] [2547451.1] error: *** STEP 2547451.1 STEPD TERMINATED ON cpu-3 AT 2020-10-23T14:31:52 DUE TO JOB NOT ENDING WITH SIGNALS ***
[2020-10-23T14:31:58.000] [2547451.1] done with job
