On 06.02.2008 at 19:21, Robert G. Brown wrote:
On Wed, 6 Feb 2008, Bill Rankin wrote:
Hey Rob,
Could it be a node naming issue where the wireless IP does not
resolve to the same address as that used in the machinefile? I
seem to recall a similar issue back when we ran PVM on machines with
multiple network connections.
pvmd is actually starting up on the target machine -- it works that
far.
The master node IP number is correct, as is the slave IP number (both
visible as arguments to pvmd). The name I'm using is the one associated
with the wireless interface in question, and both machines ping in all
four directions by name with the correct internet address. All my
machines are configured more or less identically, use the same
environment variables, and support transparent ssh command execution
(which obviously works even in PVM as the daemon is being spawned on the
correct target).
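The naming check described above (the name resolves to the address of
the interface actually used to reach the other machine) could be
scripted roughly as follows. This is only a sketch under assumptions:
the helper names are made up for illustration, it looks at IPv4 only,
and it says nothing about what pvmd itself does with the name.

```python
import socket
import sys


def resolve_ipv4(hostname):
    """Return the sorted IPv4 addresses the local resolver gives for hostname."""
    infos = socket.getaddrinfo(hostname, None, socket.AF_INET, socket.SOCK_DGRAM)
    return sorted({info[4][0] for info in infos})


def source_address_for(peer_ip, port=9):
    """Return the local IPv4 address the kernel would use to reach peer_ip.

    connect() on a UDP socket sends no packet; it only selects a route,
    so getsockname() then reports the outgoing interface address.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.connect((peer_ip, port))
        return s.getsockname()[0]


def name_matches_interface(my_name, peer_ip):
    """True if my_name resolves to the source address used toward peer_ip."""
    return source_address_for(peer_ip) in resolve_ipv4(my_name)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python check_names.py <my-wireless-name> <peer-ip>
    my_name, peer_ip = sys.argv[1], sys.argv[2]
    print("name resolves to:", resolve_ipv4(my_name))
    print("source address toward peer:", source_address_for(peer_ip))
    print("match:", name_matches_interface(my_name, peer_ip))
```

Run on both machines with the name used in the machinefile; a mismatch
would point at the multi-homed naming problem rather than at the
handshake.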
The wireless interfaces have the right MTU and look exactly like the
ethernet devices they in fact are to the kernel AFAIK. In every other
aspect I've ever tested, including my own homemade socket code, response
to both tcp and udp daemons, ability to mount NFS, support ssh, and so
on and so forth, they behave like TCP/IP sockets over ethernet devices
as far as system calls go -- they use the same interface, and the whole
point of OSI/ISO is that code should not depend on the hardware layer,
and in general on even a roughly posix compliant machine using standard
devices and e.g. the socket API it doesn't.
Last time I encountered this, I actually cranked up the -d0x0 stuff and
"watched" as the system went through to where it hung in the middle of
doing some part of the post-spawn handshaking.
Just an idea to check: PVM can also be started without rsh/ssh
between the machines. You have to copy and paste some things from
here to there and back, and can start up all daemons this way by hand
(page 30 in the PVM book). Maybe this works - just to narrow down the
cause.
-- Reuti
I suspect a race condition, probably caused by using raw UDP with some
assumption of latency during the handshake. The one way I can think of
that the two connections differ is in their latency -- even the
bandwidth of wireless is every bit as great as the 10Base2 networks I've
run PVM on in years past (on proportionally slower CPUs, of course). If
the master or slave sends out an acknowledgement packet either before
the window where the other can receive it or after it has grown bored
and stopped listening, it might fail to properly bind or something. It
seems like it would be a bug, not a feature, but if I were feeling
infinitely masochistic and were to wander down into Other People's
Source (ouch!) to try to debug this, that's what I'd look for first.
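To make the suspected failure mode concrete, here is a toy UDP
handshake over loopback in which the acknowledgement arrives after the
sender has stopped listening. It is emphatically not PVM's protocol:
the HELLO/ACK messages, the function names, and the timing values are
all invented for illustration. It only shows how a fixed listening
window plus extra latency can break a handshake that works on a faster
link, and how retrying papers over it.

```python
import socket
import threading
import time


def delayed_responder(sock, delay):
    """Answer the first HELLO with an ACK after `delay` seconds (simulated latency)."""
    data, addr = sock.recvfrom(64)
    if data == b"HELLO":
        time.sleep(delay)
        sock.sendto(b"ACK", addr)


def handshake(peer, timeout, attempts):
    """Send HELLO up to `attempts` times, listening `timeout` seconds after each."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        for _ in range(attempts):
            s.sendto(b"HELLO", peer)
            try:
                data, _addr = s.recvfrom(64)
            except socket.timeout:
                continue  # window closed with no ACK; try again if attempts remain
            if data == b"ACK":
                return True
    return False


def run_trial(delay, timeout, attempts):
    """Run one handshake against a responder with the given simulated latency."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as server:
        server.bind(("127.0.0.1", 0))  # let the OS pick a free port
        worker = threading.Thread(target=delayed_responder, args=(server, delay))
        worker.start()
        ok = handshake(server.getsockname(), timeout, attempts)
        worker.join()
    return ok


if __name__ == "__main__":
    # Low latency: the ACK lands inside the single listening window.
    print("fast link, one attempt:  ", run_trial(0.0, 0.05, 1))
    # High latency: the ACK is sent after the sender has given up.
    print("slow link, one attempt:  ", run_trial(0.3, 0.05, 1))
    # Same latency, but the sender keeps retrying long enough to catch it.
    print("slow link, many attempts:", run_trial(0.3, 0.05, 20))
```

If something like this is going on inside pvmd, the debug output should
show one side sending its acknowledgement while the other has already
timed out.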
Any PVM developers still on list? Any comments from them?
rgb
Just a thought,
-bill
On Feb 6, 2008, at 10:40 AM, Robert G. Brown wrote:
Anybody on list have any idea why PVM fails to add hosts over a wireless
link? I've now tried this over multiple distro versions and at least one
PVM update, and it just doesn't work. Works fine over a wire, fails on
wireless, and as far as I know wire and wireless are both "identical"
at the kernel interface layer so that any e.g. socket one might open is
absolutely ecumenical about what the underlying hardware is (good old
ISO/OSI layering, right?).
--
Robert G. Brown Phone(cell): 1-919-280-8443
Duke University Physics Dept, Box 90305
Durham, N.C. 27708-0305
Web: http://www.phy.duke.edu/~rgb
Book of Lilith Website: http://www.phy.duke.edu/~rgb/Lilith/Lilith.php
Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf