Looks like you cross-posted on the Lustre list, which is a great spot to ask. The things I would usually do here are:

1. If I can manage to reproduce this with a single process from a single client, I run it under strace with numerous flags and see which syscall (or similar) it's stuck in when things come to a halt; a minimal repro sketch is below, after point 2. Alternatively, you can attach to a seemingly hung process and may catch the last syscall it issued and is waiting on (or is issuing and timing out on), though that hasn't always worked for me. If you can only repro this with lots of clients and processes, attaching to a couple and waiting until they time out should still give you a decent idea of what they're blocking on.

2. On Lustre, if you have access to the MDS node, you can register a changelog user and enable a sufficiently broad changelog mask to capture all metadata operations against the filesystem (lctl changelog_register, lctl set_param mdd.<fsname>-MDTxxxx.changelog_mask=..., then lfs changelog to read the records back out). Trigger the problematic workload, then read the changelogs and look at what the hung client(s) were doing around the time the hang occurred. This is expensive, and you'll want to deregister the changelog user (or at least clear consumed records) afterwards, or the accumulated records will eventually run your MDT out of space.
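For point 1, the kind of minimal single-process repro I'd run under strace is sketched below. The output path, tensor size, and iteration count are placeholders; point it at the affected Lustre filesystem and size it to roughly match your real checkpoints:

    # quick checkpoint-writing loop to run under strace, e.g.:
    #   strace -f -tt -T -o /tmp/ckpt.strace python ckpt_test.py
    import os
    import torch

    OUT = "/lustre/scratch/ckpt_test"   # placeholder: a writable dir on the affected fs
    os.makedirs(OUT, exist_ok=True)

    # stand-in for a real model/optimizer state_dict
    state = {"weights": torch.randn(1024, 1024)}

    for i in range(100):
        torch.save(state, os.path.join(OUT, f"ckpt_{i}.pt"))
        print(f"wrote checkpoint {i}", flush=True)

If it only hangs at scale, running the same loop from many clients at once (mpirun/srun) is the next step before attaching to individual processes.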

Best,

ellis

On 7/15/24 11:01, Michael DiDomenico wrote:
that's interesting on two counts, one being that file locks are in play.
i've tried with both flock and noflock mount options on the clients, but
neither seemed to make a difference (i'd presumed file locks weren't
being used)

is there something we should put in the code to ensure all the RANKs
are established at the beginning, or maybe throughout the run (perhaps
something odd happens later on)?

On Sat, Jul 13, 2024 at 3:47 AM Josh Catana <jcat...@gmail.com> wrote:

I've seen this issue when running distributed training and RANK isn't
established. All workers think they are rank 0 and none of them can get
a file lock to write. Eventually it just times out.
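A quick way to confirm that is to have every worker log the rank it thinks it has before saving, and only let rank 0 write. A minimal sketch, assuming torch.distributed is already initialized by your launcher (torchrun, srun, etc.); the names here are illustrative:

    import os
    import torch
    import torch.distributed as dist

    def save_checkpoint(state, path):
        rank = dist.get_rank() if dist.is_initialized() else 0
        world = dist.get_world_size() if dist.is_initialized() else 1
        # if every worker reports rank 0 here, RANK/WORLD_SIZE never reached them
        print(f"[ckpt] rank {rank}/{world}, env RANK={os.environ.get('RANK')}", flush=True)
        if rank == 0:
            torch.save(state, path)
        if dist.is_initialized():
            dist.barrier()   # keep other ranks from racing ahead while rank 0 writes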


On Fri, Jul 12, 2024, 1:47 PM plegr...@gmail.com <plegr...@gmail.com> wrote:

I’ve never seen any difficulties with PyTorch saving checkpoint files to 
Lustre. Is it a special file format or just torch.save()? When the processes 
hang, have you tried using something like py-spy and/or gdb to get a stack 
trace of where in the software stack it’s hung?
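If py-spy is available on the nodes, py-spy dump --pid <pid> against a hung worker is usually the quickest way to get that stack. Failing that, the standard-library faulthandler module can be wired in before the run starts; a minimal sketch (the choice of SIGUSR1 is arbitrary):

    import faulthandler
    import signal

    # dump a Python traceback for every thread when the process receives SIGUSR1,
    # e.g. after `kill -USR1 <pid>` on a hung worker
    faulthandler.register(signal.SIGUSR1, all_threads=True)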

Date: Thu, 11 Jul 2024 12:25:05 -0400
From: Michael DiDomenico <mdidomeni...@gmail.com>
To: Beowulf Mailing List <Beowulf@beowulf.org>
Subject: [Beowulf] lustre / pytorch

i have a strange problem, but honestly i'm not sure where the issue
is.  we have users running LLM models through pytorch.  part of the
process saves off checkpoints at periodic intervals.  when the
checkpoint files are being written, we can see in the logs pytorch
writing out the save files from each of the processes to lustre.

it chugs along for a little bit, but then comes to a grinding halt.
no error from pytorch is logged, and no errors can be found on the
lustre clients or servers.  the problem is also not transient; it
happens every time the process runs.

the weird part is, if we switch the output directory from lustre to
nfs (netapp backed), the pytorch run works perfectly fine

has anyone seen anything like this?  any suggestions on
troubleshooting the issue?

given that we have a 10x performance difference between netapp and
lustre, i'm pretty keen on getting this fixed

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf

