I'm having a weird problem that I cannot figure out; any help from the community would be appreciated. Here's the packet flow:
    cn(ib0) -> io(ib0) -> io(eth5) -> pan(*)

    cn  = compute node
    io  = I/O node
    pan = Panasas storage network

We have 12 shelves of Panasas network storage on a separate network, fronted by bridge servers that route IPoIB traffic onto 10G Ethernet. We're using Mellanox ConnectX Ethernet/IB adapters everywhere, running OFED 1.3.1 and the latest IB/Ethernet firmware throughout.

Here's the problem: I can mount the storage on the compute nodes, but if I try to send anything more than 50 MB of data via dd, I seem to lose the ARP entries for the compute nodes on the I/O servers. This happens whether I use the filesystem or run netperf from the compute node to the Panasas storage.

- I can run netperf between the compute node and the I/O node and get full IPoIB line rate with no issues.
- I can run netperf between the I/O node and the Panasas storage and get full 10G Ethernet line rate with no issues.

Looking at the TCP traces, I can clearly see that a big chunk of data is sent between the endpoints and then the transfer stalls. Immediately after the stall there is an ARP request, then another chunk of data, and this scenario repeats over and over.

Any thoughts or questions?

Thanks - Michael
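P.S. In case it helps, here is roughly how I reproduce the stall and watch the neighbor table while it happens. The mount point, hostname, and interface names below are just examples standing in for our actual values, not a recipe specific to this setup:

    # Reproduce the stall with a single large write to the Panasas mount
    # (/panfs/testvol is an example mount point)
    dd if=/dev/zero of=/panfs/testvol/ddtest bs=1M count=100

    # Or drive traffic straight from the compute node to the storage
    # (pan-gw is a placeholder for a Panasas address)
    netperf -H pan-gw -l 60 -t TCP_STREAM

    # Meanwhile, on the I/O (bridge) server, watch the ARP/neighbor
    # entries for the compute node on the IPoIB interface; the entry
    # for the compute node drops out right when the transfer stalls
    watch -n 1 'arp -n -i ib0'
    # or, equivalently:
    ip -s neigh show dev ib0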