Just an update: after trying several times, the strace stops at different points: the one specified in the other email, and here:

_______________________________________________
munmap(0x40176000, 4096) = 0
time([1191243868]) = 1191243868
open("/etc/hosts", O_RDONLY) = 4
fcntl64(4, F_GETFD) = 0
fcntl64(4, F_SETFD, FD_CLOEXEC) = 0
fstat64(4, {st_mode=S_IFREG|0644, st_size=10247, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40176000
read(4, "#\n# hosts This file desc"..., 4096) = 4096
read(4, "yriBlade077\n192.168.30.178 myri"..., 4096) = 4096
read(4, " blade067 blade067.lcca.usp.br\n1"..., 4096) = 2055
read(4, "", 4096) = 0
close(4) = 0
munmap(0x40176000, 4096) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x40046f68) = 31382
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x40046f68) = 31383
brk(0x102ab000) = 0x102ab000
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x40046f68) = 31384
waitpid(-1,
_______________________________________________
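
For context, the trace above ends in an unfinished waitpid(-1, ...) right after three clone() calls, which suggests the traced process is simply blocked waiting for one of its children to exit. The following is a minimal sketch of that fork-and-wait pattern, written in Python purely for illustration; the function name spawn_and_wait and the trivial children are assumptions and are not taken from the traced program.

```python
import os


def spawn_and_wait(n_children=3):
    """Fork n_children processes and reap them with waitpid(-1, 0).

    Hypothetical illustration only; this is not the traced program.
    Returns the sorted exit codes of the reaped children.
    """
    for i in range(n_children):
        pid = os.fork()
        if pid == 0:
            # Child: a real child would do its work here (and could hang).
            os._exit(i)

    exit_codes = []
    for _ in range(n_children):
        # Blocks until any child exits, like the waitpid(-1, ...) in the trace.
        _, status = os.waitpid(-1, 0)
        exit_codes.append(os.WEXITSTATUS(status))
    return sorted(exit_codes)


if __name__ == "__main__":
    print(spawn_and_wait())
```

If a child never exits, the parent stays in waitpid indefinitely, so attaching strace to the child PIDs returned by the clone calls may show more than tracing the parent alone.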
Thanks.

2007/10/1, Ivan Paganini <[EMAIL PROTECTED]>:
> Hello Chris, everybody:
>
> I am not using jumbo frames, and I'm now considering this option, but
> first I wanted to know for sure that there is no other problem,
> just to control the number of variables at hand. But thanks for your
> help.
>
> I did a strace on the hung process, and the output is this:
> ______________________________________________
>
> mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40176000
> read(4, "#\n# hosts This file desc"..., 4096) = 4096
> read(4, "yriBlade077\n192.168.30.178 myri"..., 4096) = 4096
> read(4, " blade067 blade067.lcca.usp.br\n1"..., 4096) = 2055
> read(4, "", 4096) = 0
> close(4) = 0
> munmap(0x40176000, 4096) = 0
> clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x40046f68) = 25994
> clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x40046f68) = 25995
> brk(0x102ab000) = 0x102ab000
> clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x40046f68) = 25996
> waitpid(-1, 0xffffdbc8, 0) = ? ERESTARTSYS (To be restarted)
> --- SIGWINCH (Window changed) @ 0 (0) ---
> waitpid(-1, 0xffffdbc8, 0) = ? ERESTARTSYS (To be restarted)
> --- SIGWINCH (Window changed) @ 0 (0) ---
> waitpid(-1, 0xffffdbc8, 0) = ? ERESTARTSYS (To be restarted)
> --- SIGWINCH (Window changed) @ 0 (0) ---
> waitpid(-1, 0xffffdbc8, 0) = ? ERESTARTSYS (To be restarted)
> --- SIGWINCH (Window changed) @ 0 (0) ---
> waitpid(-1, 0xffffdbc8, 0) = ? ERESTARTSYS (To be restarted)
> --- SIGWINCH (Window changed) @ 0 (0) ---
> waitpid(-1, 0xffffdbc8, 0) = ? ERESTARTSYS (To be restarted)
> --- SIGWINCH (Window changed) @ 0 (0) ---
> waitpid(-1, 0xffffdbc8, 0) = ? ERESTARTSYS (To be restarted)
> --- SIGWINCH (Window changed) @ 0 (0) ---
> waitpid(-1, 0xffffdbc8, 0) = ? ERESTARTSYS (To be restarted)
> --- SIGWINCH (Window changed) @ 0 (0) ---
> waitpid(-1, 0xffffdbc8, 0) = ? ERESTARTSYS (To be restarted)
> --- SIGWINCH (Window changed) @ 0 (0) ---
> waitpid(-1, 0xffffdbc8, 0) = ? ERESTARTSYS (To be restarted)
> --- SIGWINCH (Window changed) @ 0 (0) ---
> waitpid(-1, 0xffffdbc8, 0) = ? ERESTARTSYS (To be restarted)
> --- SIGWINCH (Window changed) @ 0 (0) ---
> waitpid(-1, 0xffffdbc8, 0) = ? ERESTARTSYS (To be restarted)
> --- SIGWINCH (Window changed) @ 0 (0) ---
> waitpid(-1, 0xffffdbc8, 0) = ? ERESTARTSYS (To be restarted)
> --- SIGWINCH (Window changed) @ 0 (0) ---
> waitpid(-1, 0xffffdbc8, 0) = ? ERESTARTSYS (To be restarted)
> --- SIGWINCH (Window changed) @ 0 (0) ---
> waitpid(-1, 0xffffdbc8, 0) = ? ERESTARTSYS (To be restarted)
> --- SIGWINCH (Window changed) @ 0 (0) ---
> waitpid(-1,
>
> ______________________________________________
> and just that. I'm now trying to get a better understanding of what
> is happening.
>
> Thank you.
>
> Ivan
>
>
> 2007/9/29, Chris Samuel <[EMAIL PROTECTED]>:
> > On Sat, 29 Sep 2007, Ivan Paganini wrote:
> >
> > > I sniffed the network in the store nodes interface, and I got lots
> > > of TCP lost fragments, previous lost fragments, ack lost fragments
> > > and TCP window size full.
> >
> > Some suggestions would be to check that all network interfaces are
> > negotiating gigabit back to the switch, and that if you are using
> > jumbo frames then all interfaces are indeed using jumbo frames.
> >
> > A useful check to verify 2-way jumbo frame connectivity is to use
> > the ping command, doing:
> >
> > ping -c 1 -M do -s 8900 $hostname
> >
> > which should tell you whether or not it is working.
> >
> > Best of luck!
> > Chris
> > --
> > Christopher Samuel - (03) 9925 4751 - Systems Manager
> > The Victorian Partnership for Advanced Computing
> > P.O. Box 201, Carlton South, VIC 3053, Australia
> > VPAC is a not-for-profit Registered Research Agency
> > _______________________________________________
> > Beowulf mailing list, Beowulf@beowulf.org
> > To change your subscription (digest mode or unsubscribe) visit
> > http://www.beowulf.org/mailman/listinfo/beowulf
> >
>
>
> --
> -----------------------------------------------------------
> Ivan S. P. Marin
> ----------------------------------------------------------
>

--
-----------------------------------------------------------
Ivan S. P. Marin
----------------------------------------------------------
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf