I've seen this issue when running distributed training and RANK isn't established. All the workers think they're rank 0, none of them can acquire the file lock to write, and eventually it just times out.
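If that's what's happening here, a quick check is to log the rank each process sees and guard the write on it. Roughly something like this (just a sketch, assuming the job uses torch.distributed and the usual RANK env var; the helper name is made up):

    import os
    import torch
    import torch.distributed as dist

    def save_checkpoint(state, path):
        # Fall back to the RANK env var if the process group isn't initialized;
        # if neither is set, every worker reports rank 0 and they all race to
        # write the same checkpoint file.
        if dist.is_available() and dist.is_initialized():
            rank = dist.get_rank()
        else:
            rank = int(os.environ.get("RANK", 0))
        print(f"[rank {rank}] saving checkpoint to {path}", flush=True)
        if rank == 0:
            torch.save(state, path)
        if dist.is_available() and dist.is_initialized():
            dist.barrier()  # keep the other ranks from running ahead of the save

If every process prints rank 0, the launcher isn't setting up the environment the way the training script expects.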
On Fri, Jul 12, 2024, 1:47 PM plegr...@gmail.com <plegr...@gmail.com> wrote:

> I've never seen any difficulties with PyTorch saving checkpoint files to
> Lustre. Is it a special file format or just torch.save()? When the
> processes hang, have you tried using something like py-spy and/or gdb to
> get a stack trace of where in the software stack it's hung?
>
> > Date: Thu, 11 Jul 2024 12:25:05 -0400
> > From: Michael DiDomenico <mdidomeni...@gmail.com>
> > To: Beowulf Mailing List <Beowulf@beowulf.org>
> > Subject: [Beowulf] lustre / pytorch
> > Message-ID:
> >   <cabosp2p7l4j8kjqrqxc9u_yj3mljhj68z6fy17o5+e0weey...@mail.gmail.com>
> > Content-Type: text/plain; charset="UTF-8"
> >
> > i have a strange problem, but honestly i'm not sure where the issue
> > is. we have users running LLM models through pytorch. part of the
> > process saves off checkpoints at periodic intervals. when the
> > checkpoint files are being written we can see in the logs pytorch
> > writing out the save files from each of the processes to lustre.
> >
> > it chugs along for a little bit, but then comes to a grinding halt.
> > no error from pytorch is logged and no errors can be found on the
> > lustre clients or servers. the problem is also not transient; it
> > happens every time the process runs.
> >
> > the weird part is, if we switch the output directory from lustre to
> > nfs (netapp backed), the pytorch run works perfectly fine.
> >
> > has anyone seen anything like this? any suggestions on
> > troubleshooting the issue?
> >
> > given that we have a 10x performance difference between netapp and
> > lustre, i'm pretty keen on getting this fixed.
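On the stack-trace suggestion above: besides attaching py-spy or gdb from the outside, the job can dump its own Python stacks when a save stalls, using the standard library's faulthandler. A rough sketch (the wrapper and timeout are just illustrative, not anything the original poster is running):

    import faulthandler
    import signal
    import sys
    import torch

    # Arm a watchdog before the checkpoint write: if torch.save() hasn't
    # finished within `timeout` seconds, every thread's stack is dumped to
    # stderr instead of the job stalling silently.
    def save_with_watchdog(state, path, timeout=600):
        faulthandler.dump_traceback_later(timeout, exit=False, file=sys.stderr)
        try:
            torch.save(state, path)
        finally:
            faulthandler.cancel_dump_traceback_later()

    # Also handy: let an admin send SIGUSR1 to a hung rank to get a
    # Python-level stack trace without attaching a debugger.
    faulthandler.register(signal.SIGUSR1, all_threads=True)

That at least tells you whether the processes are stuck inside torch.save()/the file lock on the Lustre mount or somewhere else in the stack.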