unfortunately, so far the lustre system isn't producing any errors on the MGS/MDS/OST or the client. i'm going to work with the dev this afternoon and see if we can pull a lustre debug trace from the systems and see if that turns up anything.
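(the usual recipe for that, roughly: raise the debug mask with lctl set_param debug=-1, clear the buffer with lctl clear, reproduce the hang, then dump the buffer on the affected client or server with lctl dk <output file> -- the full mask is very noisy, so it can be narrowed once we know what we're looking for)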
also unfortunate is that we need 8 nodes to kick the error; the model won't fit on anything smaller :(

On Mon, Jul 15, 2024 at 11:19 AM Ellis Wilson <el...@ellisv3.com> wrote:
>
> Looks like you cross-posted on the Lustre list, which is a great spot to
> ask. The things I would usually do here are:
>
> 1. If I can manage to reproduce this with a single process from a
> single client, then I strace with numerous flags and see what syscall or
> similar it's stuck on when it comes to a halt. Alternatively you can
> attach to a seemingly hung process and you may see the last syscall it
> issued and is waiting on (or issuing and timing out on), but that's not
> always been my experience. If you can only repro this with lots of
> clients and processes, attaching to a couple and waiting until they
> time out should give you a decent idea of what they are timing out on.
>
> 2. On Lustre, if you have access to the MGS node you should be able to
> register changelogs and enable a sufficiently broad changelog mask to
> capture all calls to the system. Then trigger your problematic
> workload, and finally read the changelogs out and look for what the hung
> client(s) were doing around the time the hang occurred. This is
> expensive, and you'll need to make sure you disable your changelogs after
> the fact or you'll drive your MDS out of space in the long term.
>
> Best,
>
> ellis
>
> On 7/15/24 11:01, Michael DiDomenico wrote:
> > that's interesting on two counts, one being that file locks are in play.
> > i've tried with both flock and noflock on the clients, but neither
> > seemed to make a difference (i presumed file locks weren't taking
> > place)
> >
> > is there something we should put in the code to ensure all the RANKs
> > are established at the beginning, or maybe throughout the run (perhaps
> > something odd happens later on)
> >
> > On Sat, Jul 13, 2024 at 3:47 AM Josh Catana <jcat...@gmail.com> wrote:
> >>
> >> I've seen this issue when running distributed and RANK isn't established.
> >> All workers think they are rank 0 and none of them can get a file lock to
> >> write. Eventually it just times out.
> >>
> >> On Fri, Jul 12, 2024, 1:47 PM plegr...@gmail.com <plegr...@gmail.com>
> >> wrote:
> >>>
> >>> I've never seen any difficulties with PyTorch saving checkpoint files to
> >>> Lustre. Is it a special file format or just torch.save()? When the
> >>> processes hang, have you tried using something like py-spy and/or gdb to
> >>> get a stack trace of where in the software stack it's hung?
> >>>
> >>>> Date: Thu, 11 Jul 2024 12:25:05 -0400
> >>>> From: Michael DiDomenico <mdidomeni...@gmail.com>
> >>>> To: Beowulf Mailing List <Beowulf@beowulf.org>
> >>>> Subject: [Beowulf] lustre / pytorch
> >>>> Message-ID:
> >>>> <cabosp2p7l4j8kjqrqxc9u_yj3mljhj68z6fy17o5+e0weey...@mail.gmail.com>
> >>>> Content-Type: text/plain; charset="UTF-8"
> >>>>
> >>>> i have a strange problem, but honestly i'm not sure where the issue
> >>>> is. we have users running LLM models through pytorch. part of the
> >>>> process saves off checkpoints at periodic intervals. when the
> >>>> checkpoint files are being written, we can see in the logs pytorch
> >>>> writing out the save files from each of the processes to lustre.
> >>>>
> >>>> it chugs along for a little bit, but then comes to a grinding halt.
> >>>> no error from pytorch is logged and no errors can be found on the
> >>>> lustre clients or servers.
> >>>> the problem is also not transient; it happens every time the process runs.
> >>>>
> >>>> the weird part is, if we switch the output directory from lustre to
> >>>> nfs (netapp backed), the pytorch run works perfectly fine.
> >>>>
> >>>> has anyone seen anything like this? any suggestions on
> >>>> troubleshooting the issue?
> >>>>
> >>>> given that we have a 10x performance difference between netapp and
> >>>> lustre, i'm pretty keen on getting this fixed
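a footnote on the RANK theory above: a minimal sanity check around the checkpoint step, assuming the job is launched with torchrun (or anything else that exports RANK/WORLD_SIZE) and uses torch.distributed, might look something like the sketch below. the rank-0-only save, the SIGUSR1 stack-dump hook, and all of the names in it are illustrative, not a description of the users' actual training code; py-spy dump --pid <pid> on a hung worker gets a similar stack trace without touching the code at all.

    # sketch only -- names, messages, and structure are illustrative
    import os
    import sys
    import signal
    import faulthandler
    import torch
    import torch.distributed as dist

    # dump a python stack trace of this process to stderr on SIGUSR1,
    # so a hung rank can be inspected with: kill -USR1 <pid>
    faulthandler.register(signal.SIGUSR1, all_threads=True)

    def save_checkpoint(state, path):
        if not dist.is_initialized():
            # if this fires on more than one worker, the "everyone thinks
            # they are rank 0" theory is probably right
            print("dist not initialized, RANK env =", os.environ.get("RANK"),
                  file=sys.stderr, flush=True)
            rank = 0
        else:
            rank = dist.get_rank()
            print(f"rank {rank}/{dist.get_world_size()} entering checkpoint",
                  file=sys.stderr, flush=True)
            dist.barrier()              # everyone arrives before anyone writes

        if rank == 0:
            torch.save(state, path)     # single writer, no lock contention

        if dist.is_initialized():
            dist.barrier()              # nobody moves on until the save is done
            print(f"rank {rank} past checkpoint", file=sys.stderr, flush=True)

if every worker logs a distinct rank and both barriers clear but the torch.save itself is what stalls, that points back at the lustre client rather than the job setup.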