We have a user who keeps encountering this error with one type of her jobs. Sometimes the job is cancelled, and other times it runs fine.

slurmstepd: error: _is_a_lwp: open() /proc/195420/status failed: No such file or directory
slurmstepd: error: *** JOB 17534 ON pe2dc5-0007 CANCELLED AT 2020-01-23T14:11:36 ***

[root@pe2dc5-0007 ~]# grep 17534 /var/log/slurmd.log
[2020-01-23T14:10:12.789] task_p_slurmd_batch_request: 17534
[2020-01-23T14:10:12.789] task/affinity: job 17534 CPU input mask for node: 0x03000000000000
[2020-01-23T14:10:12.789] task/affinity: job 17534 CPU final HW mask for node: 0x02000000200000
[2020-01-23T14:10:12.790] _run_prolog: prolog with lock for job 17534 ran for 0 seconds
[2020-01-23T14:10:12.875] Launching batch job 17534 for UID 50321
[2020-01-23T14:10:16.937] [17534.batch] task_p_pre_launch: Using sched_affinity for tasks
[2020-01-23T14:10:42.895] [17534.batch] error: _is_a_lwp: open() /proc/195420/status failed: No such file or directory
[2020-01-23T14:11:36.386] [17534.batch] error: *** JOB 17534 ON pe2dc5-0007 CANCELLED AT 2020-01-23T14:11:36 ***
[2020-01-23T14:11:37.394] [17534.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:15
[2020-01-23T14:11:37.396] [17534.batch] done with job

I'm also seeing lots of spam in slurmd.log on the compute nodes themselves whenever this user's jobs land on them.

[2020-02-04T15:29:11.073] [43816.batch] error: _is_a_lwp: 1 read() attempts on /proc/234796/status failed: No such process
[2020-02-04T15:37:24.238] [43682.batch] error: _is_a_lwp: open() /proc/74338/status failed: No such file or directory
[2020-02-04T15:40:42.064] [43916.batch] error: _is_a_lwp: open() /proc/87034/status failed: No such file or directory
[2020-02-04T15:41:11.304] [43840.batch] error: _is_a_lwp: open() /proc/151191/status failed: No such file or directory

Has anyone seen this issue before?

Regards,
Luis Huang | Systems Administrator II, Research Computing
New York Genome Center
101 Avenue of the Americas
New York, NY 10013
O: (646) 977-7291
lhu...@nygenome.org