Looks like you cross-posted on the Lustre list, which is a great spot to ask. The things I would usually do here are:

1. If I can manage to reproduce this with a single process from a single client, I run it under strace with numerous flags and see which syscall (or similar) it's stuck in when things come to a halt; a minimal repro sketch is below, after point 2. Alternatively, you can attach to a seemingly hung process and may catch the last syscall it issued and is waiting on (or is issuing and timing out on), though that hasn't always worked for me. If you can only repro this with lots of clients and processes, attaching to a couple and waiting until they time out should still give you a decent idea of what they're blocking on.

2. On Lustre, if you have access to the MDS node, you can register a changelog user and enable a sufficiently broad changelog mask to capture all metadata operations against the filesystem (lctl changelog_register, lctl set_param mdd.<fsname>-MDTxxxx.changelog_mask=..., then lfs changelog to read the records back out). Trigger the problematic workload, then read the changelogs and look at what the hung client(s) were doing around the time the hang occurred. This is expensive, and you'll want to deregister the changelog user (or at least clear consumed records) afterwards, or the accumulated records will eventually run your MDT out of space.
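For point 1, the kind of minimal single-process repro I'd run under strace is sketched below. The output path, tensor size, and iteration count are placeholders; point it at the affected Lustre filesystem and size it to roughly match your real checkpoints:

    # quick checkpoint-writing loop to run under strace, e.g.:
    #   strace -f -tt -T -o /tmp/ckpt.strace python ckpt_test.py
    import os
    import torch

    OUT = "/lustre/scratch/ckpt_test"   # placeholder: a writable dir on the affected fs
    os.makedirs(OUT, exist_ok=True)

    # stand-in for a real model/optimizer state_dict
    state = {"weights": torch.randn(1024, 1024)}

    for i in range(100):
        torch.save(state, os.path.join(OUT, f"ckpt_{i}.pt"))
        print(f"wrote checkpoint {i}", flush=True)

If it only hangs at scale, running the same loop from many clients at once (mpirun/srun) is the next step before attaching to individual processes.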

Best,

ellis

On 7/15/24 11:01, Michael DiDomenico wrote:
that's interesting on two counts, one being that file locks are in play.
i've tried with both flock and noflock mount options on the clients, but
neither seemed to make a difference (i'd presumed file locks weren't
being used)

is there something we should put in the code to ensure all the RANKs
are established at the beginning, or maybe throughout the run (perhaps
something odd happens later on)?

On Sat, Jul 13, 2024 at 3:47 AM Josh Catana <jcat...@gmail.com> wrote:

I've seen this issue when running distributed training and RANK isn't
established. All workers think they are rank 0 and none of them can get
a file lock to write. Eventually it just times out.
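A quick way to confirm that is to have every worker log the rank it thinks it has before saving, and only let rank 0 write. A minimal sketch, assuming torch.distributed is already initialized by your launcher (torchrun, srun, etc.); the names here are illustrative:

    import os
    import torch
    import torch.distributed as dist

    def save_checkpoint(state, path):
        rank = dist.get_rank() if dist.is_initialized() else 0
        world = dist.get_world_size() if dist.is_initialized() else 1
        # if every worker reports rank 0 here, RANK/WORLD_SIZE never reached them
        print(f"[ckpt] rank {rank}/{world}, env RANK={os.environ.get('RANK')}", flush=True)
        if rank == 0:
            torch.save(state, path)
        if dist.is_initialized():
            dist.barrier()   # keep other ranks from racing ahead while rank 0 writes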


On Fri, Jul 12, 2024, 1:47 PM plegr...@gmail.com <plegr...@gmail.com> wrote:

I’ve never seen any difficulties with PyTorch saving checkpoint files to 
Lustre. Is it a special file format or just torch.save()? When the processes 
hang, have you tried using something like py-spy and/or gdb to get a stack 
trace of where in the software stack it’s hung?
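If py-spy is available on the nodes, py-spy dump --pid <pid> against a hung worker is usually the quickest way to get that stack. Failing that, the standard-library faulthandler module can be wired in before the run starts; a minimal sketch (the choice of SIGUSR1 is arbitrary):

    import faulthandler
    import signal

    # dump a Python traceback for every thread when the process receives SIGUSR1,
    # e.g. after `kill -USR1 <pid>` on a hung worker
    faulthandler.register(signal.SIGUSR1, all_threads=True)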

Date: Thu, 11 Jul 2024 12:25:05 -0400
From: Michael DiDomenico <mdidomeni...@gmail.com>
To: Beowulf Mailing List <Beowulf@beowulf.org>
Subject: [Beowulf] lustre / pytorch

i have a strange problem, but honestly i'm not sure where the issue
is.  we have users running LLM models through pytorch.  part of the
process saves off checkpoints at periodic intervals.  when the
checkpoint files are being written, we can see in the logs pytorch
writing out the save files from each of the processes to lustre.

it chugs along for a little bit, but then comes to a grinding halt.
no error from pytorch is logged, and no errors can be found on the
lustre clients or servers.  the problem is also not transient; it
happens every time the process runs.

the weird part is, if we switch the output directory from lustre to
nfs (netapp backed), the pytorch run works perfectly fine

has anyone seen anything like this?  any suggestions on
troubleshooting the issue?

given that we have a 10x performance difference between netapp and
lustre, i'm pretty keen on getting this fixed

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf

