I'm having a weird problem that I cannot figure out; any help from the community would be appreciated. Here's the packet flow:
    cn(ib0) -> io(ib0) -> io(eth5) -> pan(*)

    cn  = compute node
    io  = I/O node
    pan = Panasas storage network

We have 12 shelves of Panasas network storage on a separate network, fronted by bridge servers that route IPoIB traffic onto 10G Ethernet. We're using Mellanox ConnectX Ethernet/IB adapters everywhere, running OFED 1.3.1 and the latest IB/Ethernet firmware throughout.

Here's the problem: I can mount the storage on the compute nodes, but if I try to send anything more than 50 MB of data via dd, I seem to lose the ARP entries for the compute nodes on the I/O servers. This happens whether I use the filesystem or run netperf from the compute node to the Panasas storage.

- I can run netperf between the compute node and the I/O node and get full IPoIB line rate with no issues.
- I can run netperf between the I/O node and the Panasas storage and get full 10G Ethernet line rate with no issues.

Looking at the TCP traces, I can clearly see that a big chunk of data is sent between the endpoints and then the transfer stalls. Immediately after the stall there is an ARP request, then another chunk of data, and this scenario repeats over and over.

Any thoughts or questions?

Thanks - Michael
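P.S. In case it helps, here is roughly how I reproduce the stall and watch the neighbor table while it happens. The mount point, hostname, and interface names below are just examples standing in for our actual values, not a recipe specific to this setup:

    # Reproduce the stall with a single large write to the Panasas mount
    # (/panfs/testvol is an example mount point)
    dd if=/dev/zero of=/panfs/testvol/ddtest bs=1M count=100

    # Or drive traffic straight from the compute node to the storage
    # (pan-gw is a placeholder for a Panasas address)
    netperf -H pan-gw -l 60 -t TCP_STREAM

    # Meanwhile, on the I/O (bridge) server, watch the ARP/neighbor
    # entries for the compute node on the IPoIB interface; the entry
    # for the compute node drops out right when the transfer stalls
    watch -n 1 'arp -n -i ib0'
    # or, equivalently:
    ip -s neigh show dev ib0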