On 28/10/20 13:38, Diego Zuccato wrote:
>> What I'm looking for is some way to avoid having to do that.
> Now trying UnkillableStepTimeout=300 ... Fingers crossed...

OK, it seems to be working. The problem was that writing a big (2.2 GB) core file over NFS took too long, and the default of 60 s was not enough.
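As a rough illustration of why 60 s was too short, here is a back-of-the-envelope sketch (not part of the original setup) of the minimum sustained NFS write rate needed to dump a 2.2 GB core before UnkillableStepTimeout (set in slurm.conf) expires. It assumes decimal gigabytes and that the whole timeout window is available for writing; any other work done during step teardown eats into that window, so the real requirement is higher.

```python
# Back-of-the-envelope check: how fast must NFS accept the core dump
# for the write to finish before UnkillableStepTimeout expires?
# Assumption: 2.2 GB means 2.2 * 10**9 bytes, and the full timeout is spent writing.
CORE_SIZE_BYTES = 2.2e9


def min_throughput_mb_s(size_bytes: float, timeout_s: float) -> float:
    """Minimum sustained write rate in MB/s (10**6 bytes/s) to finish within timeout_s."""
    return size_bytes / timeout_s / 1e6


# 60 s is the default mentioned above, 300 s the value that fixed the problem.
for timeout in (60, 300):
    rate = min_throughput_mb_s(CORE_SIZE_BYTES, timeout)
    print(f"timeout {timeout:>3} s -> needs >= {rate:.1f} MB/s")
```

Under these assumptions it prints roughly 36.7 MB/s for the 60 s default and 7.3 MB/s for 300 s, so a busy or slow NFS server can easily miss the first target while still meeting the second.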
Strangely, the core file seems to be corrupted (maybe because it comes from a 4-node job and all the nodes try to write to the same file?):

-8<--
# file core
core: ELF 64-bit LSB core file, x86-64, version 1 (SYSV), too many program headers (2533)
-8<--

And gdb can't produce a backtrace :(
So the core file takes a long time to be created and is useless. Perfect :)

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
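To look at what `file` is complaining about, a minimal sketch that reads the program header count (`e_phnum`) straight from the ELF64 header and checks that the program header table at least fits inside the file. The function names and the script name are hypothetical, and it only handles 64-bit little-endian ELF, as in the `file` output above. A table that fits is no proof that the core is intact (for example if several nodes wrote to the same file); it only rules out the crudest truncation.

```python
import os
import struct
import sys

ELF_MAGIC = b"\x7fELF"
ELFCLASS64 = 2   # e_ident[EI_CLASS] for 64-bit objects
ELFDATA2LSB = 1  # e_ident[EI_DATA] for little-endian
ET_CORE = 4      # e_type value for core files
PN_XNUM = 0xFFFF # e_phnum sentinel: real count stored elsewhere (not handled here)


def parse_elf64_header(header: bytes, file_size: int) -> dict:
    """Extract core-file-related fields from a 64-byte little-endian ELF64 header."""
    if len(header) < 64 or header[:4] != ELF_MAGIC:
        raise ValueError("not an ELF file (or truncated header)")
    if header[4] != ELFCLASS64 or header[5] != ELFDATA2LSB:
        raise ValueError("this sketch only handles 64-bit little-endian ELF")
    (e_type,) = struct.unpack_from("<H", header, 16)   # object file type
    (e_phoff,) = struct.unpack_from("<Q", header, 32)  # program header table offset
    e_phentsize, e_phnum = struct.unpack_from("<HH", header, 54)
    table_end = e_phoff + e_phentsize * e_phnum
    return {
        "is_core": e_type == ET_CORE,
        "phnum": e_phnum,
        "phentsize": e_phentsize,
        "phnum_overflow": e_phnum == PN_XNUM,
        # Only a sanity check: does the program header table fit in the file?
        "table_fits": table_end <= file_size,
    }


def elf64_core_info(path: str) -> dict:
    """Read the ELF64 header of the file at path and parse it."""
    with open(path, "rb") as f:
        header = f.read(64)  # the ELF64 header is exactly 64 bytes
    return parse_elf64_header(header, os.path.getsize(path))


if __name__ == "__main__" and len(sys.argv) > 1:
    print(elf64_core_info(sys.argv[1]))
```

Usage, with a hypothetical script name: `python3 elf_phnum.py core`. Large processes with many memory mappings can legitimately produce thousands of program headers, so the count on its own says little; the size check is the more telling part.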