You may have free space, but do you have enough free inodes?
Those are two different things to check when you cannot write
to a disk.
Also verify that the directory is writable by SlurmUser.
If the filesystem hit an error and was automatically remounted
read-only, that can cause it too.
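
The three checks above (free space versus free inodes, write access, read-only mount) can be gathered in one pass. This is only a minimal sketch: the function name check_state_dir is made up for illustration, and the write-access result applies to whichever user runs it, so it would need to be run as SlurmUser to be meaningful. Some NFS servers do not report inode counts, in which case the inode fields may read 0 without that indicating a problem.

```python
import os


def check_state_dir(path):
    """Collect basic health facts about the filesystem that holds `path`.

    Hypothetical helper for illustration; not part of Slurm.
    """
    st = os.statvfs(path)
    return {
        # Space available to unprivileged users, in bytes.
        "free_bytes": st.f_bavail * st.f_frsize,
        # Inodes available to unprivileged users (may be 0 on some NFS exports).
        "free_inodes": st.f_favail,
        "total_inodes": st.f_files,
        # True if the filesystem is currently mounted read-only.
        "read_only": bool(st.f_flag & os.ST_RDONLY),
        # Whether the *current* user may create entries in the directory.
        "writable_by_me": os.access(path, os.W_OK | os.X_OK),
    }
```

For example, running print(check_state_dir("/mnt/nfs/lobo/IMM-NFS/slurm")) as SlurmUser would show all five values for the path from the log message.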
Brian Andrus
On 10/28/2021 11:57 AM, Pedro Luiz de Castro wrote:
Hello all
Since yesterday we've been having trouble with Slurm: slurmctld
crashes and isn't able to recover.
I've managed to track the fault to a zero-size file by launching
slurmctld -Dvvvv, which reports:
slurmctld: File /mnt/nfs/lobo/IMM-NFS/slurm/hash.4/job.2044004/environment has zero size
That path is inside the StateSaveLocation, so the environment file for
this particular job is not being created correctly.
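
To see whether other jobs are affected, the state directory can be scanned for further zero-size files. A rough sketch, assuming read access to the directory; find_empty_files is a hypothetical helper, not a Slurm tool:

```python
import os


def find_empty_files(root):
    """Yield paths of files under `root` whose size is zero bytes."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                if os.path.getsize(full) == 0:
                    yield full
            except OSError:
                # The file vanished or is unreadable, e.g. because
                # slurmctld is rewriting its state at this moment.
                continue
```

Calling list(find_empty_files("/mnt/nfs/lobo/IMM-NFS/slurm")) would list every empty file under that path, not only environment files.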
I don’t believe it’s a space issue as there’s about 2TB of free space
on this mountpoint.
Shouldn’t be permissions either, as other jobs run fine and get completed.
For now I've been launching slurmctld -i to work around this issue,
which kills the job in question.
That way Slurm keeps running for our users.
Any ideas where I should look next to try and troubleshoot this issue?
Thanks for all the help in advance.
Best regards,
*Pedro Luiz de Castro*
IT Support & System Administrator
Information Systems
Faculdade de Medicina, Universidade de Lisboa
Avenida Professor Egas Moniz, 1649-028, Lisboa, Portugal
iMM Lisboa general contact (+351) 217 999 411 - ext: 47356
imm.medicina.ulisboa.pt