Hello Mark!

2007/9/29, Mark Hahn <[EMAIL PROTECTED]>:
> > I sniffed the network in the store nodes interface, and I got lots of
> > TCP lost fragment, previous lost fragments, ack lost fragments and TCP
> > window size full. The GPFS is now heavily used.
>
> so this indicates that you have a serious ethernet problem, no?

I think so too, and it is my leading hypothesis. But IBM does not accept that there is an error in the hardware, and while I argue with them about it, I have been searching for other possible causes of the ethernet problem.

>
> > The myrinet connection was working right, but sometimes a user program
> > just got stuck - one of the processes was sleeping, and all others
> > were running. Then, the program hangs. Investigating this further,
> > this happened with the simple mpich examples like cpi, cpilog, etc. We
> > are using the mx driver version 1.1.6, and mpich-mx 1.2.7..5. mx_info
> > shows all nodes connected when this happens, and the switch did not
> > overheat. mpirun.ch_mx -v shows that all the processes are issued ok
> > to the nodes, but somehow one (or more) process go to sleep or never
> > starts, and all the other processes just hangs. The mx diagnose tools
> > did not show any problem so far, but we still did not have done a
>
> but spawning myrinet jobs normally involves some use of ethernet,
> which has known problems. as I recall, the protocol involves a
> rendezvous ethernet socket managed by the rank0 node. couldn't the
> myrinet-starting problem simply be due to the eth problem, rather than
> anything specific to myrinet?
>
> here's an idea: configure ip-over-myrinet, and use it exclusively
> to start the jobs. if that works, then you know for sure that the
> problem is solely on the eth side (switch, perhaps, or maybe a nic
> that's jabbering or otherwise misbehaving?)

I have configured IP-over-Myrinet, but I'm not sure how to use Myrinet exclusively. I will have to look into this further. My configuration is as follows: I am using mpich-mx v1.2.7..5, and I gave each blade one IP address on the Myrinet interface using ifconfig, like

    ifconfig myri0 192.168.30.101

Then, in a file called list, I put

    192.168.30.101:4

(each blade has 4 cores), and ran the job with

    mpirun.ch_mx -v -machinefile list -np 4 ./program

Does this still involve ethernet?

Thank you very much.
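As a quick sanity check on the machinefile side, here is a minimal Python sketch that reads a machinefile in the `host:slots` form shown above and reports any entry that is not an address on the IP-over-Myrinet network. The subnet 192.168.30.0/24 is an assumption (the netmask was not stated), and the function names are only illustrative. A clean result does not prove that the launcher avoids ethernet entirely; it only confirms that every listed host is a Myrinet-side address rather than a hostname or an ethernet address.

```python
import ipaddress

# Assumption: the IP-over-Myrinet addresses live in 192.168.30.0/24.
MYRI_NET = ipaddress.ip_network("192.168.30.0/24")


def parse_machinefile(text):
    """Return (host, slots) pairs from 'host:slots' lines; slots defaults to 1."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        host, _, slots = line.partition(":")
        entries.append((host, int(slots) if slots else 1))
    return entries


def not_on_myrinet(entries, net=MYRI_NET):
    """Return hosts that are names, or addresses outside the assumed subnet."""
    bad = []
    for host, _ in entries:
        try:
            if ipaddress.ip_address(host) not in net:
                bad.append(host)
        except ValueError:
            # Not a literal IP: a hostname may resolve to the ethernet address.
            bad.append(host)
    return bad


if __name__ == "__main__":
    # "list" is the machinefile name used in the message above.
    with open("list") as fh:
        entries = parse_machinefile(fh.read())
    for host in not_on_myrinet(entries):
        print("not a Myrinet-side address:", host)
```

Running it in the directory that holds the `list` file prints one line per suspicious entry and nothing if all entries fall inside the assumed subnet.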
--
-----------------------------------------------------------
Ivan S. P. Marin
----------------------------------------------------------