You may have space, but do you have enough inodes?

Free space and free inodes are two different things to check when you cannot write to a disk.

Also verify that it is writeable by SlurmUser.

If the filesystem hit a problem and was automatically remounted read-only, that can do it too.
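
The checks above (free space, free inodes, writability, read-only mount) can be sketched in Python roughly as follows. This is only a minimal sketch and not part of Slurm: `check_state_dir` is a hypothetical helper name, and the path passed in is assumed to be your StateSaveLocation. It has to run as SlurmUser, because `os.access` tests the calling user's permissions, and some NFS servers do not report meaningful inode counts, so a zero there is not conclusive.

```python
import os
import sys


def check_state_dir(path):
    """Report space, inode, and writability facts for the filesystem holding path.

    Hypothetical helper for illustration; it only wraps os.statvfs and os.access.
    """
    st = os.statvfs(path)
    return {
        # Bytes available to unprivileged users.
        "free_bytes": st.f_bavail * st.f_frsize,
        # Inodes available to unprivileged users, and the total on the filesystem.
        "free_inodes": st.f_favail,
        "total_inodes": st.f_files,
        # True if the filesystem is currently mounted read-only.
        "read_only": bool(st.f_flag & os.ST_RDONLY),
        # True if the *calling* user may create entries in this directory.
        "writable_by_me": os.access(path, os.W_OK | os.X_OK),
    }


if __name__ == "__main__":
    # Assumption: the first argument is the StateSaveLocation directory.
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    for key, value in check_state_dir(target).items():
        print(f"{key}: {value}")
```

Run it as the SlurmUser account against the StateSaveLocation directory, so that the writability result reflects what slurmctld itself would see.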

Brian Andrus

On 10/28/2021 11:57 AM, Pedro Luiz de Castro wrote:

Hello all

Since yesterday we’ve been having some trouble with Slurm, where it crashes and isn’t able to recover. By launching slurmctld -Dvvvv, I’ve managed to track the fault to a zero-size file:

slurmctld: File /mnt/nfs/lobo/IMM-NFS/slurm/hash.4/job.2044004/environment has zero size

That path is under the StateSaveLocation, so the environment file for this particular job is not being created correctly. I don’t believe it’s a space issue, as there’s about 2TB of free space on this mountpoint.
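
To see whether other jobs are affected, one option is to scan the StateSaveLocation for any other zero-size files. The following is a minimal sketch under that assumption: `find_zero_size_files` is a hypothetical helper, it simply walks the directory tree, and it knows nothing about Slurm's state format beyond the `hash.N/job.ID/` layout visible in the log line above.

```python
import os


def find_zero_size_files(root):
    """Return a sorted list of regular files under root whose size is zero bytes."""
    empty = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                if os.path.getsize(full) == 0:
                    empty.append(full)
            except OSError:
                # The file may vanish while slurmctld is running; skip it.
                continue
    return sorted(empty)


if __name__ == "__main__":
    import sys

    # Assumption: the first argument is the StateSaveLocation directory.
    for path in find_zero_size_files(sys.argv[1] if len(sys.argv) > 1 else "."):
        print(path)
```

If the scan turns up more than the one job directory, that points toward a filesystem-level cause rather than something specific to that job.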

It shouldn’t be a permissions issue either, as other jobs run fine and complete.

For now I’ve been launching slurmctld -i to work around this issue, which kills the job in question.

This way Slurm can keep running for our users.

Any ideas where I should look next to try and troubleshoot this issue?

Thanks for all the help in advance.

Best regards,

*Pedro Luiz de Castro*

IT Support & System Administrator
Information Systems

Faculdade de Medicina, Universidade de Lisboa
Avenida Professor Egas Moniz, 1649-028, Lisboa, Portugal
iMM Lisboa general contact (+351) 217 999 411 - ext: 47356

*imm.medicina.ulisboa.pt*
