Re: [Beowulf] 512 nodes Myrinet cluster Challenges

2006-05-02 Thread David Kewley
On Tuesday 02 May 2006 14:02, Bill Broadley wrote: > Mark Hahn said: > > moving it, stripped them out as I didn't need them. (I _do_ always > > require net-IPMI on anything newly purchased.) I've added more nodes > > to the cluster > > Net-IPMI on all hardware? Why? Running a second (or 3rd) net
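
For anyone who has not used net-IPMI, the kind of out-of-band control being argued about looks roughly like the following ipmitool session (host name and credentials are placeholders; older BMCs may only speak the IPMI 1.5 "lan" interface rather than "lanplus"):

    # check and power-cycle a wedged node through its BMC, no trip to the machine room
    ipmitool -I lan -H node042-bmc -U admin -P secret chassis power status
    ipmitool -I lan -H node042-bmc -U admin -P secret chassis power cycle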

Re: [Beowulf] 512 nodes Myrinet cluster Challenges

2006-05-02 Thread Robert Latham
On Fri, Apr 28, 2006 at 09:27:19AM +0200, Jaime Perea wrote: > From my point of view, the big problem there is the IO, we installed > on our small cluster the pvfs2 system, it works well, using the > myrinet gm for the passing mechanism, the pvfs2 is only a solution > for parallel IO, since mpi ca
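
The point about MPI being able to use PVFS2 comes down to MPI-IO (ROMIO ships with MPICH and has a PVFS2 driver). A minimal sketch in C, assuming a PVFS2 volume mounted at a made-up path /mnt/pvfs2, where each rank writes its own disjoint block of one shared file:

    #include <mpi.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        MPI_File fh;
        int rank;
        char buf[1024];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        memset(buf, 'a' + (rank % 26), sizeof(buf));

        /* Every rank writes its own 1 KB block of a shared file that
           lives on the PVFS2 mount (path is made up for the example). */
        MPI_File_open(MPI_COMM_WORLD, "/mnt/pvfs2/testfile",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        MPI_File_write_at(fh, (MPI_Offset)rank * sizeof(buf), buf,
                          (int)sizeof(buf), MPI_CHAR, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);

        MPI_Finalize();
        return 0;
    }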

Re: [Beowulf] Kill zombies after a parallel run

2006-05-02 Thread Chris Samuel
On Tuesday 02 May 2006 17:49, mg wrote: > I use MPICH-1.2.5.2 to generate and run an FEM parallel application. > > During a parallel run, one process can crash, leaving the other > processes running, and OS commands have to be used to kill these zombies. > So, does someone have a solution to avoid zom

Re: [Beowulf] 512 nodes Myrinet cluster Challenges

2006-05-02 Thread Dan Stromberg
I think IPMI sounds pretty worthwhile, although I don't have any first-hand experience with it yet. The ability to reboot a hung system or get decent statistics about this, that and the other thing seems worth the cost in many cases, and my management has decided to require it on all of our ne
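
Concretely, both the remote reboot and the statistics map onto stock ipmitool queries against each node's BMC (host name and credentials below are placeholders):

    ipmitool -I lan -H node17-bmc -U admin -P secret sdr                   # fan, temperature and voltage sensors
    ipmitool -I lan -H node17-bmc -U admin -P secret sel list              # hardware event log
    ipmitool -I lan -H node17-bmc -U admin -P secret chassis power reset   # last resort for a hung node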

Re: [Beowulf] 512 nodes Myrinet cluster Challenges

2006-05-02 Thread Jim Lux
At 01:17 PM 5/2/2006, Bill Broadley wrote: > blower, rather than a bunch of 40mm axial/muffin fans. a much larger cluster > I'm working on now (768 nodes) has 14 40mm muffin fans in each node! while > I know I can rely on the vendor (HP) to replace failures promptly and without > complaint,

Re: [Beowulf] 512 nodes Myrinet cluster Challenges

2006-05-02 Thread Patrick Geoffray
Vincent, So, I just got back from vacation today and found this post in my huge mailbox. Reason would tell me not to waste time and to ignore it, but I can't resist such a treat. Diepeveen wrote: With so many nodes i'd go for either infiniband or quadrics, assuming the largest partition also gets

Re: [Beowulf] Kill zombies after a parallel run

2006-05-02 Thread David Mathog
mg <[EMAIL PROTECTED]> wrote: > I use MPICH-1.2.5.2 to generate and run an FEM parallel application. > > During a parallel run, one process can crash, leaving the other > processes running, and OS commands have to be used to kill these zombies. > So, does someone have a solution to avoid zombies aft

Re: [Beowulf] 512 nodes Myrinet cluster Challenges

2006-05-02 Thread Mark Hahn
> > moving it, stripped them out as I didn't need them. (I _do_ always require > > net-IPMI on anything newly purchased.) I've added more nodes to the cluster > > Net-IPMI on all hardware? Why? Running a second (or 3rd) network isn't > a trivial amount of additional complexity, cables, or cost.

RE: [Beowulf] 512 nodes Myrinet cluster Challenges

2006-05-02 Thread Michael Will
IPMI nowadays comes for free on the mainboard, and if you don't want to run a separate infrastructure for the lightweight control traffic, then you don't even need to add ports/cables/switches. In the case of a Scyld Beowulf cluster the compute nodes are on their own private network switch anyway, so
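
A rough sketch of what "no extra ports/cables/switches" means in practice: give the BMC an address on the existing private network from the node itself. The channel number and addresses are assumptions, and whether the BMC can piggyback on the onboard NIC at all is board-specific:

    ipmitool lan set 1 ipsrc static
    ipmitool lan set 1 ipaddr 10.0.1.142       # BMC address on the existing private network
    ipmitool lan set 1 netmask 255.255.255.0
    ipmitool lan set 1 access on
    ipmitool lan print 1                       # verify the settings took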

Re: [Beowulf] 512 nodes Myrinet cluster Challenges

2006-05-02 Thread Mark Hahn
> > in the cluster above, I chose a chassis (AIC) which has a large centrifugal > > Which? I noticed some of their designs redirect all heat from the power supply > into the side of the rack. > > > blower, rather than a bunch of 40mm axial/muffin fans. a much larger > > cluster > > I'm working

Re: [Beowulf] 512 nodes Myrinet cluster Challenges

2006-05-02 Thread Bill Broadley
Mark Hahn said: > moving it, stripped them out as I didn't need them. (I _do_ always require > net-IPMI on anything newly purchased.) I've added more nodes to the cluster Net-IPMI on all hardware? Why? Running a second (or 3rd) network isn't a trivial amount of additional complexity, cables, o

Re: [Beowulf] 512 nodes Myrinet cluster Challenges

2006-05-02 Thread Bill Broadley
> in the cluster above, I chose a chassis (AIC) which has a large centrifugal Which? I noticed some of their designs redirect all heat from the power supply into the side of the rack. > blower, rather than a bunch of 40mm axial/muffin fans. a much larger cluster > I'm working on now (768 nodes)

Re: [Beowulf] Kill zombies after a parallel run

2006-05-02 Thread David Kewley
I don't have a solution for your case, but here's an idea: MPICH-GM (MPICH for the Myrinet GM protocol) has an option to mpirun.ch_gm that would do what you want, if you were running Myrinet/GM: --gm-kill <n>  Kill all processes <n> seconds after the first exits. Other than that, a resource man
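
In other words, something along these lines (--gm-kill itself is from the mpirun.ch_gm help quoted above; the invocation form, timeout and application name are assumptions):

    # kill every remaining rank 5 seconds after the first one exits or dies
    mpirun.ch_gm -np 64 --gm-kill 5 ./fem_solver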

[Beowulf] bwbug: New iGrid Storage Technology at Beowulf Meeting & Web Cast - May 9, 2006

2006-05-02 Thread Michael Fitzmaurice
You are invited to participate in our first web cast discussing Crosswalk Inc.'s innovative approach to storage for grid & HPC environments. Specifically, iGrid, an Intelligent Storage Grid System, which provides a scalable architectural fabric enabling any application server to reach any

Re: [Beowulf] Opteron cooling specifications?

2006-05-02 Thread Dan Stromberg
On Mon, 2006-05-01 at 17:04 -0400, Mark Hahn wrote: > (also, I agree that the clumsiest level is shoulder-ish high...) In which case, one might use a small platform to elevate oneself and one's shoulders.

[Beowulf] Kill zombies after a parallel run

2006-05-02 Thread mg
Hi all, I use MPICH-1.2.5.2 to generate and run an FEM parallel application. During a parallel run, one process can crash, leaving the other processes running, and OS commands have to be used to kill these zombies. So, does someone have a solution to avoid zombies after a failed parallel run: can
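
Until something better is in place, a common workaround is simply to sweep the leftover processes off the compute nodes after a failed run, e.g. with pdsh and pkill (the node range and executable name below are made up; an rsh/ssh loop does the same job):

    # kill any surviving copies of the FEM binary owned by the submitting user
    pdsh -w node[001-016] "pkill -u $USER fem_app"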