[slurm-users] LRMS error: (-1) Job missing from SLURM.

2024-08-06 Thread Felix via slurm-users

Hello

At site RO-14-ITIM, after a power failure, I get the following problem:

2024-08-06 15:53:04 Finished - job id: 
c9INDmclYv5ngvuSSqSAreymYz3jwmOETUEmV71LDmABFKDm7KNpMn, unix user: 
1900:1900, name: "org.nordugrid.ARC-CE-result-ops", owner: 
"/dc=eu/dc=egi/c=hr/o=robots/o=srce/cn=robot:argo-...@cro-ngi.hr", lrms: 
SLURM, queue: debug, lrmsid: 274399, failure: "LRMS error: (-1) Job 
missing from SLURM."
2024-08-06 15:53:04 Finished - job id: 
tjJNDmclYv5ngvuSSqSAreymYz3jwmOETUEmd71LDmABFKDmePf7To, unix user: 
1900:1900, name: "org.nordugrid.ARC-CE-result-ops", owner: 
"/dc=eu/dc=egi/c=hr/o=robots/o=srce/cn=robot:argo-...@cro-ngi.hr", lrms: 
SLURM, queue: debug, lrmsid: 274400, failure: "LRMS error: (-1) Job 
missing from SLURM."
2024-08-06 15:53:04 Finished - job id: 
kiJNDmclYv5ngvuSSqSAreymYz3jwmOETUEml71LDmABFKDmCmwifm, unix user: 
1900:1900, name: "org.nordugrid.ARC-CE-result-ops", owner: 
"/dc=eu/dc=egi/c=hr/o=robots/o=srce/cn=robot:argo-...@cro-ngi.hr", lrms: 
SLURM, queue: debug, lrmsid: 274398, failure: "LRMS error: (-1) Job 
missing from SLURM."


The jobs cannot be seen in sinfo or squeue.
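
For reference, squeue only lists jobs that are still in the scheduler's queue, and sinfo reports partitions and nodes rather than jobs, so a job that has already left the queue will not show up in either. A minimal sketch of one way to follow up, assuming the "Finished" entries look like the ones quoted above and that Slurm accounting is enabled on the cluster (both are assumptions): pull the lrmsid values out of the entries that report "Job missing from SLURM" and print a sacct query for each. The names missing_jobs and SAMPLE are made up for this illustration.

```python
import re

# One "Finished" entry copied from the log excerpt above (wrapped by the mailer).
SAMPLE = '''2024-08-06 15:53:04 Finished - job id: 
c9INDmclYv5ngvuSSqSAreymYz3jwmOETUEmV71LDmABFKDm7KNpMn, unix user: 
1900:1900, name: "org.nordugrid.ARC-CE-result-ops", owner: 
"/dc=eu/dc=egi/c=hr/o=robots/o=srce/cn=robot:argo-...@cro-ngi.hr", lrms: 
SLURM, queue: debug, lrmsid: 274399, failure: "LRMS error: (-1) Job 
missing from SLURM."
'''

# Split just before each timestamped "Finished" entry (zero-width split, Python 3.7+).
ENTRY_START = re.compile(r"(?=\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} Finished)")
ARC_ID = re.compile(r"job id: (\S+?),")
LRMS_ID = re.compile(r"lrmsid: (\d+)")


def missing_jobs(log_text):
    """Return (arc_job_id, lrmsid) pairs for entries reporting a missing job."""
    found = []
    for chunk in ENTRY_START.split(log_text):
        # Undo line wrapping so that phrases broken across lines match again.
        entry = " ".join(chunk.split())
        if "Job missing from SLURM" not in entry:
            continue
        arc_id = ARC_ID.search(entry)
        lrms_id = LRMS_ID.search(entry)
        if arc_id and lrms_id:
            found.append((arc_id.group(1), lrms_id.group(1)))
    return found


if __name__ == "__main__":
    for arc_id, lrms_id in missing_jobs(SAMPLE):
        # Only prints the command; run it by hand on the Slurm controller.
        print(f"sacct -j {lrms_id} --format=JobID,State,ExitCode,Start,End")
```

If sacct knows nothing about these job IDs either, that would suggest the jobs were lost from the controller's state rather than finished normally, but that is only a guess from this excerpt.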

Any indication of where to look for the cause of the problem?

Thank you

Felix

--
Dr. Eng. Farcas Felix
National Institute of Research and Development of Isotopic and Molecular 
Technology,
IT - Department - Cluj-Napoca, Romania
Mobile: +40742195323


--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] missing jobs from queue

2025-03-14 Thread Felix via slurm-users

Hello

at site RO-14-ITIM, we have a very strange problem.

A ticket was raised with the following:

"Hi,
The new ticket (Pilots disappearing from local batch system at ITIM) has 
been created by "*Timo Wilken*".

Group: NGIs › NGI_RO
Owner: -
State: in progress
Information:
Hi,
Pilots seem to be lost by the local batch system at ITIM. >50% of pilots 
fail:

https://apfmon.lancs.ac.uk/CERN_central_A:RO-14-ITIM-arcn-node?state=fault
Example log: 
https://aipanda156.cern.ch/condor_logs_2/25-03-12_13/grid.15478691.1.log

Error message: "ARC job failed: LRMS error: (-1) Job missing from SLURM"
Cheers,

Timo"

The problem is that I cannot find any of the failing jobs on the site.

for example:

the last job:

https://aipanda023.cern.ch/condor_logs_2/25-03-14_09/grid.19602200.0.log

with stdlog:

000 (19602200.000.000) 2025-03-14 09:28:03 Job submitted from host: 
<137.138.31.125:38090?addrs=137.138.31.125-38090+[2001-1458-d00-19--75]-38090&alias=aipanda023.cern.ch>
...
027 (19602200.000.000) 2025-03-14 09:28:13 Job submitted to grid resource
GridResource: arc arcn-node.itim-cj.ro:443
GridJobId: arc arcn-node.itim-cj.ro:443 
PPVNDm5VGD7ngvuSSqSAreymYz3jwmOETUEm1dOSDmbfXLDmHXmz9n
...
001 (19602200.000.000) 2025-03-14 09:40:51 Job executing on host: arc 
arcn-node.itim-cj.ro:443
...
012 (19602200.000.000) 2025-03-14 09:40:58 Job was held.
ARC job failed: LRMS error: (-1) Job missing from SLURM
Code 0 Subcode 0
...
009 (19602200.000.000) 2025-03-14 11:03:03 Job was aborted.
Python-initiated action. (by user atlpan)
...
009 (19602200.000.000) 2025-03-14 11:03:17 Job was aborted.
Python-initiated action. (by user atlpan)

This job ID:
PPVNDm5VGD7ngvuSSqSAreymYz3jwmOETUEm1dOSDmbfXLDmHXmz9n
is nowhere to be found on my server:
[root@arcn-node log]# updatedb
[root@arcn-node log]# locate 
PPVNDm5VGD7ngvuSSqSAreymYz3jwmOETUEm1dOSDmbfXLDmHXmz9n
[root@arcn-node log]#
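
Note that locate matches file and directory names only, not file contents, so an empty result does not rule out the ID appearing inside a log file. A minimal sketch of a content search, assuming the ARC logs are uncompressed text under a directory such as /var/log/arc (the path is an assumption; the real location depends on the local ARC configuration). The function name files_mentioning is made up for this illustration.

```python
import os


def files_mentioning(needle, root):
    """Return sorted paths of files under root whose contents contain needle."""
    needle_bytes = needle.encode()
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # Read as bytes so odd encodings in log files do not raise.
                with open(path, "rb") as fh:
                    if any(needle_bytes in line for line in fh):
                        hits.append(path)
            except OSError:
                # Unreadable or vanished file: skip it and carry on.
                continue
    return sorted(hits)
```

For example, files_mentioning("PPVNDm5VGD7ngvuSSqSAreymYz3jwmOETUEm1dOSDmbfXLDmHXmz9n", "/var/log/arc") would list every log file that mentions the job. Rotated logs that have been compressed are not covered by this sketch.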

I also see this kind of error on my site:

The ENDPOINT affected is arcn-node.itim-cj.ro (ARC-CE) It became Critical *at* 
2025-03-14T09:16:44Z due to *METRIC* org.nordugrid.ARC-CE-sw-gcc

*Summary:* Script exited with code 127. Could not match GCC version. See 
/etc/arc/nagios/20-dist.ini for debugging hints. *Message:* ''


which goes back to OK in the next message, about one minute later:

The ENDPOINT affected is arcn-node.itim-cj.ro (ARC-CE) It became Ok *at* 
2025-03-14T09:17:53Z due to *METRIC* org.nordugrid.ARC-CE-sw-gcc

*Summary:* Found GCC version 11.5.0. *Message:* ''

Can you please advise where to look for any clue to solving this mystery?

Thank you
Felix

--

Dr. Eng. Farcas Felix
National Institute of Research and Development of Isotopic and Molecular 
Technology,
IT - Department - Cluj-Napoca, Romania
Mobile: +40742195323
