We have a user who keeps hitting the error below with one type of job. 
Sometimes the job gets cancelled, and other times it runs fine.

slurmstepd: error: _is_a_lwp: open() /proc/195420/status failed: No such file or directory
slurmstepd: error: *** JOB 17534 ON pe2dc5-0007 CANCELLED AT 2020-01-23T14:11:36 ***

[root@pe2dc5-0007 ~]# grep 17534  /var/log/slurmd.log
[2020-01-23T14:10:12.789] task_p_slurmd_batch_request: 17534
[2020-01-23T14:10:12.789] task/affinity: job 17534 CPU input mask for node: 0x03000000000000
[2020-01-23T14:10:12.789] task/affinity: job 17534 CPU final HW mask for node: 0x02000000200000
[2020-01-23T14:10:12.790] _run_prolog: prolog with lock for job 17534 ran for 0 seconds
[2020-01-23T14:10:12.875] Launching batch job 17534 for UID 50321
[2020-01-23T14:10:16.937] [17534.batch] task_p_pre_launch: Using sched_affinity for tasks
[2020-01-23T14:10:42.895] [17534.batch] error: _is_a_lwp: open() /proc/195420/status failed: No such file or directory
[2020-01-23T14:11:36.386] [17534.batch] error: *** JOB 17534 ON pe2dc5-0007 CANCELLED AT 2020-01-23T14:11:36 ***
[2020-01-23T14:11:37.394] [17534.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:15
[2020-01-23T14:11:37.396] [17534.batch] done with job

I'm also seeing a lot of spam in slurmd.log on the compute nodes whenever 
this user's jobs land on them.

[2020-02-04T15:29:11.073] [43816.batch] error: _is_a_lwp: 1 read() attempts on /proc/234796/status failed: No such process
[2020-02-04T15:37:24.238] [43682.batch] error: _is_a_lwp: open() /proc/74338/status failed: No such file or directory
[2020-02-04T15:40:42.064] [43916.batch] error: _is_a_lwp: open() /proc/87034/status failed: No such file or directory
[2020-02-04T15:41:11.304] [43840.batch] error: _is_a_lwp: open() /proc/151191/status failed: No such file or directory
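
To get a rough sense of how many of these messages each job produces, a small 
script along these lines could tally them per job ID. This is only a sketch: 
the regular expression is an assumption based on the log lines quoted above, 
and count_lwp_errors and tally_file are hypothetical helper names, not part 
of Slurm.

```python
import re
from collections import Counter

# Assumed line shape, taken from the slurmd.log excerpts above:
#   [timestamp] [JOBID.batch] error: _is_a_lwp: ...
LWP_ERROR = re.compile(r"\[(\d+)\.batch\] error: _is_a_lwp:")


def count_lwp_errors(lines):
    """Return a Counter mapping job ID (as a string) to its _is_a_lwp error count."""
    counts = Counter()
    for line in lines:
        match = LWP_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def tally_file(path):
    """Read a slurmd log file and return per-job _is_a_lwp error counts."""
    # errors="replace" keeps the scan going if the log has stray bytes.
    with open(path, errors="replace") as log:
        return count_lwp_errors(log)
```

Calling tally_file("/var/log/slurmd.log") on a compute node (the same path 
used in the grep above) and printing the result of .most_common() would list 
the noisiest jobs first.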

Has anyone seen this issue before?

Regards,


Luis Huang | Systems Administrator II, Research Computing
New York Genome Center
101 Avenue of the Americas
New York, NY 10013
O: (646) 977-7291
lhu...@nygenome.org



