i have a strange problem, and honestly i'm not sure where the issue is. we have users running LLMs through pytorch. part of the process saves checkpoints at periodic intervals. while the checkpoint files are being written, we can see in the logs that pytorch is writing out the save files from each of the processes to lustre.
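for context, the save step is essentially each rank calling torch.save into the same output directory on lustre, roughly like the sketch below (names and structure here are illustrative, not our actual training script):

    import os
    import torch
    import torch.distributed as dist

    def save_checkpoint(model, optimizer, step, out_dir):
        # each rank writes its own shard straight into the shared directory
        rank = dist.get_rank()
        path = os.path.join(out_dir, f"ckpt_step{step}_rank{rank}.pt")
        torch.save(
            {"step": step,
             "model": model.state_dict(),
             "optimizer": optimizer.state_dict()},
            path,
        )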
it chugs along for a little bit, but then comes to a grinding halt. no error from pytorch is logged, and no errors can be found on the lustre clients or servers. the problem is also not transient; it happens every time the process runs.

the weird part is, if we switch the output directory from lustre to nfs (netapp backed), the pytorch run works perfectly fine.

has anyone seen anything like this? any suggestions on troubleshooting the issue? given that we have a 10x performance difference between netapp and lustre, i'm pretty keen on getting this fixed.
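in case it helps with suggestions: to see where the ranks actually wedge, we can register faulthandler in the training script and poke the stuck processes with a signal, something like this (a minimal sketch, assuming the hang surfaces in a python frame):

    import faulthandler
    import signal

    # dump every thread's python traceback to stderr when the process
    # receives SIGUSR1, e.g. `kill -USR1 <pid>` on a stuck rank
    faulthandler.register(signal.SIGUSR1, all_threads=True)

if a rank is parked in an uninterruptible write to lustre, the dump won't fire until the syscall returns, which would itself point at the filesystem rather than at pytorch.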