We had this issue recently. Some googling led me to the NERSC FAQs, which state:

> _is_a_lwp is a function called internally for Slurm job accounting. The
> message indicates a rare error situation with a function call. But the error
> shouldn't affect anything in the user job. Please ignore the message.

After looking into our logfiles, it seems that this error appears more or less at random but does not cause any jobs to fail (every job that logged it ran perfectly fine). In your case, the job was cancelled about a minute after that message (14:10:42 versus 14:11:36 in your log). It is curious, though, that it seems to affect only one user in your case.

Best,
Marcus

On 20-02-04 20:50, Luis Huang wrote:
> We have a user that keeps encountering this error with one type of her jobs.
> Sometimes her jobs are cancelled and other times they run fine.
>
> slurmstepd: error: _is_a_lwp: open() /proc/195420/status failed: No such file or directory
> slurmstepd: error: *** JOB 17534 ON pe2dc5-0007 CANCELLED AT 2020-01-23T14:11:36 ***
>
> [root@pe2dc5-0007 ~]# grep 17534 /var/log/slurmd.log
> [2020-01-23T14:10:12.789] task_p_slurmd_batch_request: 17534
> [2020-01-23T14:10:12.789] task/affinity: job 17534 CPU input mask for node: 0x03000000000000
> [2020-01-23T14:10:12.789] task/affinity: job 17534 CPU final HW mask for node: 0x02000000200000
> [2020-01-23T14:10:12.790] _run_prolog: prolog with lock for job 17534 ran for 0 seconds
> [2020-01-23T14:10:12.875] Launching batch job 17534 for UID 50321
> [2020-01-23T14:10:16.937] [17534.batch] task_p_pre_launch: Using sched_affinity for tasks
> [2020-01-23T14:10:42.895] [17534.batch] error: _is_a_lwp: open() /proc/195420/status failed: No such file or directory
> [2020-01-23T14:11:36.386] [17534.batch] error: *** JOB 17534 ON pe2dc5-0007 CANCELLED AT 2020-01-23T14:11:36 ***
> [2020-01-23T14:11:37.394] [17534.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:15
> [2020-01-23T14:11:37.396] [17534.batch] done with job
>
> I'm also seeing lots of spam in the slurmd.log files on the compute nodes
> themselves whenever this user's jobs land on them.
>
> [2020-02-04T15:29:11.073] [43816.batch] error: _is_a_lwp: 1 read() attempts on /proc/234796/status failed: No such process
> [2020-02-04T15:37:24.238] [43682.batch] error: _is_a_lwp: open() /proc/74338/status failed: No such file or directory
> [2020-02-04T15:40:42.064] [43916.batch] error: _is_a_lwp: open() /proc/87034/status failed: No such file or directory
> [2020-02-04T15:41:11.304] [43840.batch] error: _is_a_lwp: open() /proc/151191/status failed: No such file or directory
>
> Has anyone seen this issue before?
>
> Regards,
>
> Luis Huang | Systems Administrator II, Research Computing
> New York Genome Center

--
Marcus Vincent Boden, M.Sc.
Arbeitsgruppe eScience
Gesellschaft fuer wissenschaftliche Datenverarbeitung mbH Goettingen (GWDG)
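
For anyone who wants to repeat the logfile check described in the reply (do jobs that log _is_a_lwp errors actually end up cancelled?), here is a rough Python sketch. It is not part of Slurm. The names `LINE_RE`, `summarize`, and `SAMPLE` are made up for illustration, and the line format is only assumed from the slurmd.log excerpts quoted in this thread, so it may need adjusting for other Slurm versions or log settings.

```python
import re
from collections import defaultdict

# Line format assumed from the excerpts in this thread, e.g.
# [2020-01-23T14:10:42.895] [17534.batch] error: _is_a_lwp: open() ...
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] \[(?P<job>\d+)\.(?P<step>[^\]]+)\] (?P<msg>.*)$"
)


def summarize(lines):
    """Map job id -> error count and cancel flag, for jobs that logged _is_a_lwp."""
    stats = defaultdict(lambda: {"lwp_errors": 0, "cancelled": False})
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # lines without a [job.step] tag are skipped
        job = int(match.group("job"))
        msg = match.group("msg")
        if "_is_a_lwp" in msg:
            stats[job]["lwp_errors"] += 1
        elif "CANCELLED AT" in msg:
            stats[job]["cancelled"] = True
    # Only report jobs that actually logged the error in question.
    return {job: s for job, s in stats.items() if s["lwp_errors"] > 0}


# Three lines taken from the logs quoted above (unwrapped).
SAMPLE = [
    "[2020-01-23T14:10:42.895] [17534.batch] error: _is_a_lwp: open() "
    "/proc/195420/status failed: No such file or directory",
    "[2020-01-23T14:11:36.386] [17534.batch] error: *** JOB 17534 ON "
    "pe2dc5-0007 CANCELLED AT 2020-01-23T14:11:36 ***",
    "[2020-02-04T15:37:24.238] [43682.batch] error: _is_a_lwp: open() "
    "/proc/74338/status failed: No such file or directory",
]

if __name__ == "__main__":
    for job, s in sorted(summarize(SAMPLE).items()):
        print(job, s["lwp_errors"], "cancelled" if s["cancelled"] else "not cancelled")
```

To run it against a real node log, pass an open file instead of the sample list, for example `summarize(open("/var/log/slurmd.log"))`. A cancelled job that also logged the error only shows that the two coincide, not that one caused the other.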