unfortunately, so far the lustre system isn't producing any errors on the MGS/MDS/OST or the client. i'm going to work with the dev this afternoon and see if we can pull a lustre debug trace from the systems and see if that turns up anything.
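(the usual recipe for that, roughly: raise the debug mask with lctl set_param debug=-1, clear the buffer with lctl clear, reproduce the hang, then dump the buffer on the affected client or server with lctl dk <output file> -- the full mask is very noisy, so it can be narrowed once we know what we're looking for)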
also unfortunate is that we need 8 nodes to kick the error; the model won't fit on anything smaller :(

On Mon, Jul 15, 2024 at 11:19 AM Ellis Wilson <el...@ellisv3.com> wrote:
>
> Looks like you cross-posted on the Lustre list, which is a great spot to
> ask. The things I would usually do here are:
>
> 1. If I can manage to reproduce this with a single process from a
> single client, then I strace with numerous flags and see what syscall or
> similar it's stuck on when it comes to a halt. Alternatively you can
> attach to a seemingly hung process and you may see the last syscall it
> issued and is waiting on (or issuing and timing out on), but that's not
> always been my experience. If you can only repro this with lots of
> clients and processes, attaching to a couple and waiting until they
> time out should give you a decent idea of what they are timing out on.
>
> 2. On Lustre, if you have access to the MGS node you should be able to
> register changelogs and enable a sufficiently broad changelog mask to
> capture all calls to the system. Then trigger your problematic
> workload, and finally read the changelogs out and look for what the hung
> client(s) were doing around the time the hang occurred. This is
> expensive, and you'll need to make sure you disable your changelogs after
> the fact or you'll drive your MDS out of space in the long term.
>
> Best,
>
> ellis
>
> On 7/15/24 11:01, Michael DiDomenico wrote:
> > that's interesting on two counts, one being that file locks are in play.
> > i've tried with both flock and noflock on the clients, but neither
> > seemed to make a difference (i presumed file locks weren't taking
> > place)
> >
> > is there something we should put in the code to ensure all the RANKs
> > are established at the beginning, or maybe throughout the run (perhaps
> > something odd happens later on)
> >
> > On Sat, Jul 13, 2024 at 3:47 AM Josh Catana <jcat...@gmail.com> wrote:
> >>
> >> I've seen this issue when running distributed and RANK isn't established.
> >> All workers think they are rank 0 and none of them can get a file lock to
> >> write. Eventually it just times out.
> >>
> >> On Fri, Jul 12, 2024, 1:47 PM plegr...@gmail.com <plegr...@gmail.com>
> >> wrote:
> >>>
> >>> I've never seen any difficulties with PyTorch saving checkpoint files to
> >>> Lustre. Is it a special file format or just torch.save()? When the
> >>> processes hang, have you tried using something like py-spy and/or gdb to
> >>> get a stack trace of where in the software stack it's hung?
> >>>
> >>>> Date: Thu, 11 Jul 2024 12:25:05 -0400
> >>>> From: Michael DiDomenico <mdidomeni...@gmail.com>
> >>>> To: Beowulf Mailing List <Beowulf@beowulf.org>
> >>>> Subject: [Beowulf] lustre / pytorch
> >>>> Message-ID:
> >>>> <cabosp2p7l4j8kjqrqxc9u_yj3mljhj68z6fy17o5+e0weey...@mail.gmail.com>
> >>>> Content-Type: text/plain; charset="UTF-8"
> >>>>
> >>>> i have a strange problem, but honestly i'm not sure where the issue
> >>>> is. we have users running LLM models through pytorch. part of the
> >>>> process saves off checkpoints at periodic intervals. when the
> >>>> checkpoint files are being written, we can see in the logs pytorch
> >>>> writing out the save files from each of the processes to lustre.
> >>>>
> >>>> it chugs along for a little bit, but then comes to a grinding halt.
> >>>> no error from pytorch is logged and no errors can be found on the
> >>>> lustre clients or servers.
> >>>> the problem is also not transient; it happens every time the process runs.
> >>>>
> >>>> the weird part is, if we switch the output directory from lustre to
> >>>> nfs (netapp backed), the pytorch run works perfectly fine.
> >>>>
> >>>> has anyone seen anything like this? any suggestions on
> >>>> troubleshooting the issue?
> >>>>
> >>>> given that we have a 10x performance difference between netapp and
> >>>> lustre, i'm pretty keen on getting this fixed
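a footnote on the RANK theory above: a minimal sanity check around the checkpoint step, assuming the job is launched with torchrun (or anything else that exports RANK/WORLD_SIZE) and uses torch.distributed, might look something like the sketch below. the rank-0-only save, the SIGUSR1 stack-dump hook, and all of the names in it are illustrative, not a description of the users' actual training code; py-spy dump --pid <pid> on a hung worker gets a similar stack trace without touching the code at all.

    # sketch only -- names, messages, and structure are illustrative
    import os
    import sys
    import signal
    import faulthandler
    import torch
    import torch.distributed as dist

    # dump a python stack trace of this process to stderr on SIGUSR1,
    # so a hung rank can be inspected with: kill -USR1 <pid>
    faulthandler.register(signal.SIGUSR1, all_threads=True)

    def save_checkpoint(state, path):
        if not dist.is_initialized():
            # if this fires on more than one worker, the "everyone thinks
            # they are rank 0" theory is probably right
            print("dist not initialized, RANK env =", os.environ.get("RANK"),
                  file=sys.stderr, flush=True)
            rank = 0
        else:
            rank = dist.get_rank()
            print(f"rank {rank}/{dist.get_world_size()} entering checkpoint",
                  file=sys.stderr, flush=True)
            dist.barrier()              # everyone arrives before anyone writes

        if rank == 0:
            torch.save(state, path)     # single writer, no lock contention

        if dist.is_initialized():
            dist.barrier()              # nobody moves on until the save is done
            print(f"rank {rank} past checkpoint", file=sys.stderr, flush=True)

if every worker logs a distinct rank and both barriers clear but the torch.save itself is what stalls, that points back at the lustre client rather than the job setup.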