Re: [Beowulf] lustre / pytorch

2024-07-15 Thread Josh Catana
g odd happens later on) > > On Sat, Jul 13, 2024 at 3:47 AM Josh Catana wrote: > > > > I've seen this issue when running distributed and RANK isn't established. All workers think they are rank 0 and none of them can get a file lock to write. Eventu

Re: [Beowulf] lustre / pytorch

2024-07-12 Thread Josh Catana
I've seen this issue when running distributed and RANK isn't established. All workers think they are rank 0 and none of them can get a file lock to write. Eventually it just times out. On Fri, Jul 12, 2024, 1:47 PM plegr...@gmail.com wrote: > I’ve never seen any difficulties with PyTorch saving
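
The failure described in this message (every worker believing it is rank 0 and all of them contending for the same file lock) can be guarded against before any checkpoint is written. The following is a minimal sketch, not the poster's code. It assumes the launcher exports RANK and WORLD_SIZE environment variables, as torchrun does; the helper names resolve_rank and should_write_checkpoint are hypothetical. Note that it only catches the case where WORLD_SIZE is set but RANK is missing; if neither variable is exported, a process cannot tell from the environment alone that it is part of a larger job.

```python
import os


def resolve_rank(env=None):
    """Return (rank, world_size) read from the environment.

    Assumption: the launcher exports RANK and WORLD_SIZE (torchrun does).
    Refuses to silently default to rank 0 when the job is clearly
    multi-process but RANK is missing.
    """
    env = os.environ if env is None else env
    world_size = int(env.get("WORLD_SIZE", "1"))
    raw_rank = env.get("RANK")

    if raw_rank is None:
        if world_size > 1:
            # Defaulting to 0 here would make every worker a "writer".
            raise RuntimeError(
                "WORLD_SIZE is %d but RANK is unset; refusing to assume rank 0"
                % world_size
            )
        # Single-process run: rank 0 is the only sensible value.
        return 0, world_size

    rank = int(raw_rank)
    if not 0 <= rank < world_size:
        raise ValueError("RANK %d is outside [0, %d)" % (rank, world_size))
    return rank, world_size


def should_write_checkpoint(env=None):
    """Only rank 0 writes, so workers never compete for the same file."""
    rank, _ = resolve_rank(env)
    return rank == 0
```

In a training script, the save call would be wrapped in `if should_write_checkpoint():` so that a misconfigured launch fails immediately with a clear error instead of timing out on a lock.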

[Beowulf] Monitoring and Metrics

2017-10-07 Thread Josh Catana
This may have been brought up in the past, but I couldn't find much in my message archive. What are people using for HPC cluster monitoring and metrics lately? I've been short on time to add features to my homegrown solution and am looking at some OTS products. I'm looking for something that can do mo

Re: [Beowulf] LXD containers for cluster services and cgroups?

2017-06-16 Thread Josh Catana
I know they have a canned scheduler hook to run Docker. If you're familiar with Python, modifying their code to run Singularity shouldn't be difficult. I rewrote their hook to operate in my environment pretty easily. On Jun 16, 2017 4:29 AM, "John Hearns" wrote: > Lance, thank you very much for th

Re: [Beowulf] Mellanox ConnectX-3 MT27500 problems

2013-04-27 Thread Josh Catana
I noticed that on systems running the Xen kernel netback driver for virtualization, bandwidth drops to very low rates. On Apr 27, 2013 6:19 PM, "Brice Goglin" wrote: > Hello, > > These cards are QDR and even FDR, you should get 56Gbit/s (we see about 50Gbit/s in benchmarks iirc). That's what I get on sand

Re: [Beowulf] Degree

2012-10-25 Thread Josh Catana
As someone working in HPC as an indefinite-length contractor in the U.S., this topic intrigues me. Even though the company I contract for has had me training their new employees and basically in control of everything in their environment for the last 3 years, they refuse to hire me on directly because I do