i have a strange problem, and honestly i'm not sure where the issue is. we have users running LLMs through pytorch. part of the process saves checkpoints at periodic intervals. while the checkpoint files are being written, we can see in the logs that pytorch is writing out the save files from each of the processes to lustre.
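for context, the save step is essentially each rank calling torch.save into the same output directory on lustre, roughly like the sketch below (names and structure here are illustrative, not our actual training script):

    import os
    import torch
    import torch.distributed as dist

    def save_checkpoint(model, optimizer, step, out_dir):
        # each rank writes its own shard straight into the shared directory
        rank = dist.get_rank()
        path = os.path.join(out_dir, f"ckpt_step{step}_rank{rank}.pt")
        torch.save(
            {"step": step,
             "model": model.state_dict(),
             "optimizer": optimizer.state_dict()},
            path,
        )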
it chugs along for a little bit, but then comes to a grinding halt. no error from pytorch is logged, and no errors can be found on the lustre clients or servers. the problem is also not transient; it happens every time the process runs.

the weird part is, if we switch the output directory from lustre to nfs (netapp backed), the pytorch run works perfectly fine.

has anyone seen anything like this? any suggestions on troubleshooting the issue? given that we have a 10x performance difference between netapp and lustre, i'm pretty keen on getting this fixed.
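in case it helps with suggestions: to see where the ranks actually wedge, we can register faulthandler in the training script and poke the stuck processes with a signal, something like this (a minimal sketch, assuming the hang surfaces in a python frame):

    import faulthandler
    import signal

    # dump every thread's python traceback to stderr when the process
    # receives SIGUSR1, e.g. `kill -USR1 <pid>` on a stuck rank
    faulthandler.register(signal.SIGUSR1, all_threads=True)

if a rank is parked in an uninterruptible write to lustre, the dump won't fire until the syscall returns, which would itself point at the filesystem rather than at pytorch.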