[Beowulf] PNNL: large new cluster

2007-09-29 Thread Mark Hahn
http://www.hpcwire.com/hpc/1805360.html http://www.pnl.gov/topstory.asp?id=275 4620 2.2 GHz quad-core Barcelona, presumably 2-socket nodes, 2GB/core with pretty aggressive IO setup. if anyone involved in this cluster is reading the list, it would be most appreciated to see some comments re:

Re: [Beowulf] Problems with a JS21 - Ah, the networking...

2007-09-29 Thread Chris Samuel
On Sat, 29 Sep 2007, Ivan Paganini wrote: > I sniffed the network in the store nodes interface, and I got lots > of TCP lost fragment, previous lost fragments, ack lost fragments > and TCP window size full. Some suggestions would be to check that all network interfaces are negotiating gigabit bac

Re: [Beowulf] Problems with a JS21 - Ah, the networking...

2007-09-29 Thread Mark Hahn
and ran using mpich.ch_mx -v -machinefile list -np 4 ./program This still involves ethernet? I think that would work fine. you can simply run tcpdump on the eth interface of one of the target machines to test, though. my experience is that it's naive to assume a vendor has a clue: _someone_ at t

Re: [Beowulf] Problems with a JS21 - Ah, the networking...

2007-09-29 Thread Ivan Paganini
Hello Mark! 2007/9/29, Mark Hahn <[EMAIL PROTECTED]>: > > I sniffed the network in the store nodes interface, and I got lots of > > TCP lost fragment, previous lost fragments, ack lost fragments and TCP > > window size full. The GPFS is now heavily used. > > so this indicates that you have a seriou

Re: [Beowulf] Problems with a JS21 - Ah, the networking...

2007-09-29 Thread Mark Hahn
I sniffed the network in the store nodes interface, and I got lots of TCP lost fragment, previous lost fragments, ack lost fragments and TCP window size full. The GPFS is now heavily used. so this indicates that you have a serious ethernet problem, no? The myrinet connection was working right,

Re: [Beowulf] Problems with a JS21 - Ah, the networking...

2007-09-29 Thread Ivan Paganini
Thank you, Bruce, I will try as soon as I have access to the cluster. I already contacted Myricom support, John, and they are working to try to solve this, but still no solution to the problem. mx_counters on the two nodes where I am trying the test mpich programs don't show anything unusual: 1 ports

Re: [Beowulf] Problems with a JS21 - Ah, the networking...

2007-09-29 Thread John Hearns
On Fri, 2007-09-28 at 17:43 -0300, Ivan Paganini wrote: > Hello everybody, > > I am beginning to take care of an IBM JS21. The cluster consists of > The myrinet connection was working right, but sometimes a user program > just got stuck - one of the processes was sleeping, and all others > were