[Beowulf] malloc on filesystem

2025-02-05 Thread Michael DiDomenico
this might sound like a bit of an oddity, but does anyone know if there's a library out there that will let me override malloc calls to memory and direct them to a filesystem instead? ie using the filesystem as memory instead of ram for a program. ideally something i can LD_PRELOAD on top of a st
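for anyone finding this in the archive: no stock library is named in the thread, but the usual shape of the answer is either an LD_PRELOAD interposer that backs allocations with an mmap'd file, or just letting the kernel page to disk via a swap file. a rough sketch of both, where /scratch is an assumed scratch filesystem and fsmalloc.c / libfsmalloc.so are hypothetical names for an interposer you'd write yourself:

    # option 1: interpose malloc with a file-backed allocator
    # (fsmalloc.c / libfsmalloc.so are hypothetical -- the interposer
    #  itself would typically route malloc through mmap on a scratch file)
    gcc -shared -fPIC -o libfsmalloc.so fsmalloc.c -ldl
    LD_PRELOAD=$PWD/libfsmalloc.so ./your_program

    # option 2: no interposer at all -- put a swap file on the target
    # filesystem and let the vm subsystem page anonymous memory there
    dd if=/dev/zero of=/scratch/swapfile bs=1M count=16384   # 16 GiB backing file
    chmod 600 /scratch/swapfile
    mkswap /scratch/swapfile
    swapon /scratch/swapfile

option 2 doesn't literally redirect malloc, but it gets the "filesystem as ram" effect without touching the binary.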

Re: [Beowulf] Monitoring communication + processing time

2025-01-17 Thread Michael DiDomenico
sadly most people still use printf's to debug C code. there are some parallel debuggers on the market, like Totalview, but it's pricey depending on how many ranks you want to spin up under the debugger On Thu, Jan 16, 2025 at 7:48 AM Alexandre Ferreira Ramos via Beowulf wrote: > > Hi all, I ho
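a hedged sketch of the zero-cost middle ground between printf's and Totalview: the classic poor-man's parallel debugger, one gdb per rank in its own xterm (assumes an X display you can reach and that ./my_mpi_app is your binary):

    # one xterm+gdb per rank; breakpoints and continue are per-rank
    mpirun -np 4 xterm -e gdb --args ./my_mpi_app

    # or at least make the printf's identifiable: open mpi can write one
    # output file per rank instead of interleaving stdout (flag spelling
    # varies across open mpi versions)
    mpirun -np 4 --output-filename ranklog ./my_mpi_app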

Re: [Beowulf] lustre / pytorch

2024-07-15 Thread Michael DiDomenico
ere doing around the time when the hang occurred. This is > expensive and you'll need to make sure you disable your changelogs after > the fact or you'll drive your MDS out of space in the long-term. > > Best, > > ellis > > On 7/15/24 11:01, Michael DiDomenico wrot
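for reference, the changelog mechanics described above look roughly like this (fsname lfs01 and reader id cl1 are hypothetical; register before the hang window, and clear/deregister afterwards or the MDS retains every record):

    # on the mds: register a changelog consumer; prints a reader id like cl1
    lctl --device lfs01-MDT0000 changelog_register
    # on a client: dump the records to see what files were being touched
    lfs changelog lfs01-MDT0000
    # when finished: clear consumed records and deregister the reader
    lfs changelog_clear lfs01-MDT0000 cl1 0
    lctl --device lfs01-MDT0000 changelog_deregister cl1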

Re: [Beowulf] lustre / pytorch

2024-07-15 Thread Michael DiDomenico
e()? When the processes >> hang, have you tried using something like py-spy and/or gdb to get a stack >> trace of where in the software stack it’s hung? >> >> > Date: Thu, 11 Jul 2024 12:25:05 -0400 >> > From: Michael DiDomenico >> > To: Beowulf

Re: [Beowulf] lustre / pytorch

2024-07-15 Thread Michael DiDomenico
t torch.save()? When the processes > hang, have you tried using something like py-spy and/or gdb to get a stack > trace of where in the software stack it’s hung? > > > Date: Thu, 11 Jul 2024 12:25:05 -0400 > > From: Michael DiDomenico > > To: Beowulf Mailing List >
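the py-spy/gdb suggestion above, spelled out as commands (PID is the stuck python process; both tools attach without restarting the job):

    py-spy dump --pid PID                          # python-level stack of every thread
    gdb -p PID -batch -ex 'thread apply all bt'    # native stacks underneath

comparing the two usually tells you whether the hang is in python, in torch's C++ layer, or down in a filesystem syscall.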

[Beowulf] parts storage?

2024-07-11 Thread Michael DiDomenico
this might be a little out of left field, but for those supporting large, multi-generational machines (ie 3-5k+ nodes), how much storage space do you allocate for on-site spares? my org is in the process of designing a new data center space and i find myself having to fight for every sq ft of s

[Beowulf] lustre / pytorch

2024-07-11 Thread Michael DiDomenico
i have a strange problem, but honestly i'm not sure where the issue is. we have users running LLM models through pytorch. part of the process saves off checkpoints at periodic intervals. when the checkpoint files are being written we can see in the logs pytorch writing out the save files fro
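for anyone hitting the same symptom: a common first check is how the checkpoint directory is striped, since many ranks writing large save files into a directory striped to a single OST serializes badly. a sketch, path hypothetical:

    lfs getstripe /lustre/project/checkpoints        # current layout
    lfs setstripe -c -1 /lustre/project/checkpoints  # stripe new files across all osts

note that setstripe on a directory only affects files created after the change, not existing ones.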

Re: [Beowulf] immersion

2024-03-24 Thread Michael DiDomenico
aybe it's not even a power limit per se, but DLC is pretty complicated with all the piping/manifolds/connectors/CDUs, does there come a point where it's just not worth it unless it's a big custom solution like the HPE stuff On Sun, Mar 24, 2024 at 1:46 PM Scott Atchley wrote

Re: [Beowulf] immersion

2024-03-23 Thread Michael DiDomenico
to answer some of my own questions :) and anyone else interested https://dug.com/dug-cool/ https://dug.com/wp-content/uploads/2024/03/DUG-Cool-spec-sheet_240319.pdf On Sat, Mar 23, 2024 at 10:17 AM Michael DiDomenico wrote: > i caught this on linkedin the other day. i'm not su

[Beowulf] immersion

2024-03-23 Thread Michael DiDomenico
i caught this on linkedin the other day. i'm not sure if Dr Midgely is still on the list or not. If he is, i was wondering if he could share some technical details on the installation, and since it's been a few years since DUG first started with immersion, what his thoughts are now versus then http

Re: [Beowulf] [External] position adverts?

2024-02-23 Thread Michael DiDomenico
Maybe we should come up with some kind of standard/wording/what-have-you to post such. I have some open positions as well. might liven the list up a little too... :) On Thu, Feb 22, 2024 at 7:45 PM Douglas Eadline wrote: > > > I've always thought employment opps were fine, but e-mails trying to >

Re: [Beowulf] Fwd: [EESSI] "Best Practices for CernVM-FS in HPC" online tutorial on Mon 4 Dec 2023 (13:30-17:00 CET)

2023-11-13 Thread Michael DiDomenico
On Mon, 13 Nov 2023 at 15:35, Michael DiDomenico > wrote: > >> unfortunately, it looks like registration is full... :( >> >> >> On Mon, Nov 13, 2023 at 4:34 AM Jörg Saßmannshausen < >> sassy-w...@sassy.formativ.net> wrote: >>> Dear all,

Re: [Beowulf] Fwd: [EESSI] "Best Practices for CernVM-FS in HPC" online tutorial on Mon 4 Dec 2023 (13:30-17:00 CET)

2023-11-13 Thread Michael DiDomenico
unfortunately, it looks like registration is full... :( On Mon, Nov 13, 2023 at 4:34 AM Jörg Saßmannshausen < sassy-w...@sassy.formativ.net> wrote: > Dear all, > > just in case you are interested, there is an EESSI online tutorial coming > up. EESSI is a way to share microarchitecture-specific an

[Beowulf] cisco lacp redhat9

2023-10-26 Thread Michael DiDomenico
does anyone have teaming w/lacp between cisco switches (ios) and redhat9 working? i config'd the switch and set up teaming through network manager. i can see the LACP pkts flowing between the switch and server after the link goes up, but then 45 secs or so later something decides the LACP link coul
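for the archive, the rough shape of the client-side config (interface names eno1/eno2 are hypothetical; the cisco side needs a matching port-channel in active mode). worth knowing that rhel9 deprecates teamd in favor of bonding, so a bond in 802.3ad mode is what redhat actually recommends:

    # teaming via networkmanager with the lacp runner
    nmcli con add type team con-name team0 ifname team0 team.runner lacp
    nmcli con add type team-slave ifname eno1 master team0
    nmcli con add type team-slave ifname eno2 master team0
    nmcli con up team0
    teamdctl team0 state     # watch the lacp runner negotiation

    # the bonding equivalent rhel9 prefers
    nmcli con add type bond con-name bond0 ifname bond0 bond.options "mode=802.3ad,lacp_rate=fast"
    nmcli con add type ethernet ifname eno1 master bond0
    nmcli con add type ethernet ifname eno2 master bond0

if the team drops ~45s after link-up, mismatched lacp rate (fast vs slow pdu timers) between switch and host is a classic suspect.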

Re: [Beowulf] ib neighbor

2023-09-20 Thread Michael DiDomenico
Ryan Novosielski - novos...@rutgers.edu | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus | University of NJ | Office of Advanced Research Computing - MSB A555B, Newark > On Sep 20, 2023, at 14:49, Michael DiDomenico wrote: > > mean

Re: [Beowulf] ib neighbor

2023-09-20 Thread Michael DiDomenico
; can get neighbours. >> A few years since I used it. >> >> On Tue, 19 Sep 2023, 19:03 Michael DiDomenico, >> wrote: >>> >>> does anyone know if there's a simple command to pull the neighbor of >>> an ib port? for instance, this horri

[Beowulf] ib neighbor

2023-09-19 Thread Michael DiDomenico
does anyone know if there's a simple command to pull the neighbor of an ib port? for instance, this horrible shell command line # for x in `ibstat | awk -F \' '/^CA/{print $2}'`; do iblinkinfo -C ${x} -n 1 -l | grep `hostname -s`; done 0x08006900fbcc "SwitchX - Mellanox Technologies" 41
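the same one-liner unrolled with comments, plus a whole-fabric variant (a sketch, untested here; iblinkinfo flags as in the original):

    # list local HCA names, then ask iblinkinfo for each card's links,
    # keeping only the lines for this host
    for ca in $(ibstat -l); do
        iblinkinfo -C "$ca" -n 1 -l | grep "$(hostname -s)"
    done

    # or dump every link in the fabric one per line and grep for this host
    ibnetdiscover -p | grep "$(hostname -s)"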

Re: [Beowulf] NFS alternative for 200 core compute (beowulf) cluster

2023-08-10 Thread Michael DiDomenico
i would definitely look more at tuning nfs/backend disks rather than going down the rabbit hole of gluster/lustre/beegfs. you only have five nodes. nfs is a hog, but you're not likely to bottleneck the nfs protocol with only five nodes. but for anyone here to give you better advice you'd have to
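a sketch of what "tuning nfs" usually means as a first pass (server and export names hypothetical; nconnect needs a reasonably recent client kernel):

    # client: multiple tcp connections plus 1M transfer sizes
    mount -t nfs -o vers=4.2,nconnect=8,rsize=1048576,wsize=1048576 \
        fileserver:/export /mnt/shared

    # server: raise the nfsd thread count, e.g. in /etc/nfs.conf:
    #   [nfsd]
    #   threads=32
    systemctl restart nfs-server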

Re: [Beowulf] interconnect wars... again...

2023-07-26 Thread Michael DiDomenico
> > > On Fri, Jul 21, 2023 at 9:12 AM Michael DiDomenico > wrote: >> >> ugh, as someone who worked the front lines in the 00's i got a front row >> seat to the interconnect mud-slinging... but frankly if they're going >> to come out of the gate with

[Beowulf] interconnect wars... again...

2023-07-21 Thread Michael DiDomenico
ugh, as someone who worked the front lines in the 00's i got a front row seat to the interconnect mud-slinging... but frankly if they're going to come out of the gate with a product named "Ultra Ethernet", i smell a loser... :) (sarcasm...) https://www.nextplatform.com/2023/07/20/ethernet-consortium

Re: [Beowulf] [External] Re: Your thoughts on the latest RHEL drama?

2023-07-11 Thread Michael DiDomenico
not sure i understand suse's move there. they can't run two competing linux ventures. people are going to be pretty apprehensive about investing time in a forked rhel clone, i would think even more so one run by a competing distro. i've been watching this play out in the media and how redhat kee

Re: [Beowulf] old sm/sgi bios

2023-03-23 Thread Michael DiDomenico
ics > https://wisecorp.co.uk, .us & .ru > > On 23/03/2023 16:51, Michael DiDomenico wrote: > > does anyone happen to have an old sgi / supermicro bios for an > > X9DRG-QF+ motherboard squirreled away somewhere? sgi is long gone, > > hpe might have something s

[Beowulf] old sm/sgi bios

2023-03-23 Thread Michael DiDomenico
does anyone happen to have an old sgi / supermicro bios for an X9DRG-QF+ motherboard squirreled away somewhere? sgi is long gone, hpe might have something still but who knows where. i reached out to supermicro, but i suspect they'll say no.

Re: [Beowulf] milan and rhel7

2022-06-29 Thread Michael DiDomenico
no doubts from me. thanks for the info Kilian. unfortunately sometimes purchasing outpaces infrastructure. fortunately nothing's set in stone so we'll see what can be changed On Wed, Jun 29, 2022 at 10:02 AM Joe Landman wrote: > > Egads ... if you are still running a 3 series kernel in product

[Beowulf] milan and rhel7

2022-06-28 Thread Michael DiDomenico
milan cpu's aren't officially supported on anything less than rhel8.3. but there's anecdotal evidence that rhel7 will run on milan cpu's. if the evidence is true, is anyone on the list doing so and can confirm?

Re: [Beowulf] [External] beowulf hall of fame

2022-02-28 Thread Michael DiDomenico
it might be worthwhile to start with a note to the award committee and see if his name was left off intentionally because of some criteria, or if maybe it was just an oversight On Mon, Feb 28, 2022 at 11:34 AM Prentice Bisbal via Beowulf wrote: > > Is this where we start a change.org petition to get

[Beowulf] beowulf hall of fame

2022-02-25 Thread Michael DiDomenico
in case you missed it, apparently beowulf computing is being inducted into the space technologies hall of fame. in other news, apparently there's a space technologies hall of fame... https://www.hpcwire.com/off-the-wire/beowulf-computing-cluster-will-be-inducted-into-the-space-technologies-hall-of