That's interesting on two counts; one is that file locks are in play.
I've tried both the flock and noflock mount options on the clients,
but neither seemed to make a difference (I had presumed file locking
wasn't taking place).
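For reference, Lustre clients distinguish three locking modes at mount time (the MGS address, filesystem name, and mount point below are placeholders, not from this thread). Note that localflock grants locks that are only consistent within a single node, which can make locking look like it "works" while still racing across nodes:

```shell
# Cluster-wide POSIX flock semantics (coherent across all clients):
mount -t lustre -o flock mgs@tcp0:/lustre /mnt/lustre

# Node-local locks only (cheap, but not coherent between nodes):
mount -t lustre -o localflock mgs@tcp0:/lustre /mnt/lustre

# Locking calls disabled entirely:
mount -t lustre -o noflock mgs@tcp0:/lustre /mnt/lustre
```

`mount | grep lustre` on a client shows which of these is actually in effect for the current mount.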

Is there something we should put in the code to ensure all the RANKs
are established at the beginning, or maybe throughout the run (perhaps
something odd happens later on)?
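One quick check is to have each worker log the rank it believes it has, both at startup and again just before each checkpoint save. A minimal sketch, assuming a torchrun-style launcher that exports RANK / WORLD_SIZE / LOCAL_RANK (the variable names are that convention, not something from this thread):

```python
import os
import socket

def report_rank() -> str:
    """Log the distributed identity this worker believes it has.

    RANK / WORLD_SIZE / LOCAL_RANK are the variables a torchrun-style
    launcher exports; if rank comes back UNSET here, the launcher never
    established it for this process.
    """
    rank = os.environ.get("RANK", "UNSET")
    print(f"host={socket.gethostname()} "
          f"rank={rank} "
          f"world_size={os.environ.get('WORLD_SIZE', 'UNSET')} "
          f"local_rank={os.environ.get('LOCAL_RANK', 'UNSET')}",
          flush=True)
    return rank
```

Calling this right before every save means a rank that was never set, or that somehow reverts mid-run, shows up in the logs at the moment the hang starts.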

On Sat, Jul 13, 2024 at 3:47 AM Josh Catana <jcat...@gmail.com> wrote:
>
> I've seen this issue when running distributed and RANK isn't established. All 
> workers think they are rank 0 and none of them can get a file lock to write.  
> Eventually it just times out.
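The all-rank-0 failure mode described above suggests guarding the checkpoint write explicitly and failing loudly when the rank is missing, rather than letting every worker silently assume it is the writer. A sketch, again assuming the launcher exports RANK (in a live job you would prefer `torch.distributed.get_rank()` after `init_process_group()`, plus a barrier so the other ranks wait):

```python
import os

def is_checkpoint_writer() -> bool:
    """True only on global rank 0; raise if rank was never established."""
    rank = os.environ.get("RANK")
    if rank is None:
        # Failing loudly beats every worker silently assuming rank 0
        # and then fighting over the same checkpoint file lock.
        raise RuntimeError("RANK not set; refusing to guess writer role")
    return int(rank) == 0
```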
>
>
> On Fri, Jul 12, 2024, 1:47 PM plegr...@gmail.com <plegr...@gmail.com> wrote:
>>
>> I’ve never seen any difficulties with PyTorch saving checkpoint files to 
>> Lustre. Is it a special file format or just torch.save()? When the processes 
>> hang, have you tried using something like py-spy and/or gdb to get a stack 
>> trace of where in the software stack it’s hung?
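Besides attaching py-spy or gdb from outside, the stdlib faulthandler module can dump Python stack traces from inside the hung process. A sketch (the choice of SIGUSR1 and the 10-minute timeout are arbitrary, not from this thread):

```python
import faulthandler
import signal
import sys

# On SIGUSR1, dump the Python stack of every thread to stderr --
# handy when a checkpoint write hangs and you want to see where each
# worker is stuck without attaching a debugger (send with: kill -USR1 <pid>).
faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)

# Alternatively, dump automatically if the process is still alive
# after 10 minutes, repeating until cancelled:
# faulthandler.dump_traceback_later(600, repeat=True)
```

faulthandler only sees Python frames, so it pairs well with py-spy or gdb for the native frames (e.g. a write blocked inside the Lustre client).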
>>
>> > Date: Thu, 11 Jul 2024 12:25:05 -0400
>> > From: Michael DiDomenico <mdidomeni...@gmail.com>
>> > To: Beowulf Mailing List <Beowulf@beowulf.org>
>> > Subject: [Beowulf] lustre / pytorch
>> > Message-ID:
>> >       <cabosp2p7l4j8kjqrqxc9u_yj3mljhj68z6fy17o5+e0weey...@mail.gmail.com>
>> > Content-Type: text/plain; charset="UTF-8"
>> >
>> > I have a strange problem, but honestly I'm not sure where the issue
>> > is.  We have users running LLM models through PyTorch; part of the
>> > process saves checkpoints at periodic intervals.  When the checkpoint
>> > files are being written, we can see in the logs that PyTorch is
>> > writing the save files from each of the processes to Lustre.
>> >
>> > It chugs along for a little bit, but then comes to a grinding halt.
>> > No error is logged by PyTorch, and no errors can be found on the
>> > Lustre clients or servers.  The problem is also not transient; it
>> > happens every time the process runs.
>> >
>> > The weird part is that if we switch the output directory from Lustre
>> > to NFS (NetApp-backed), the PyTorch run works perfectly fine.
>> >
>> > Has anyone seen anything like this?  Any suggestions on
>> > troubleshooting the issue?
>> >
>> > Given that we have a 10x performance difference between NetApp and
>> > Lustre, I'm pretty keen on getting this fixed.
>>
>> _______________________________________________
>> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit 
>> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
>
