Interesting idea, thanks. I don't think this is the likely cause, though:

# lsof | wc -l
20675
# cat /proc/sys/fs/file-max
52325451

This is on one of the nodes which had failures. The number of open files is tiny compared to the limit. I know there's also a per-process limit, but since the jobs are all identical, I'd expect them to fail consistently if that were the cause.

Simon.

From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of William Brown
Sent: 29 March 2021 19:13
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] R jobs crashing when run in parallel

Maybe you have run out of file handles.

William

On Mon, 29 Mar 2021, 17:36 Patrick Goetz, <pgo...@math.utexas.edu> wrote:

Could this be a function of the R script you're trying to run, or are you saying you get this error running the same script which works at other times?

On 3/29/21 7:47 AM, Simon Andrews wrote:
> I've got a weird problem on our slurm cluster. If I submit lots of R
> jobs to the queue then as soon as I've got more than about 7 of them
> running at the same time I start to get failures, saying:
>
> /bi/apps/R/4.0.4/lib64/R/bin/exec/R: error while loading shared
> libraries: libpcre2-8.so.0: cannot open shared object file: No such file
> or directory
>
> ...which makes no sense because that library is definitely there, and
> other jobs on the same nodes worked both before and after the failed
> jobs. I recently ran 500 identical jobs and 152 of them failed in this way.
>
> There are no errors in the log files on the compute nodes where this
> failed and it happens across multiple nodes so it's not a single one
> being strange. The R binary is on an isilon network share, but the
> libpcre2 library is on the local disk for the node.
>
> Anyone come across anything like this before? Any suggestions for fixes?
>
> Thanks
>
> Simon.