> On 22 Nov 2022, at 06:16, Christopher Samuel <ch...@csamuel.org> wrote:
> 
> On 11/21/22 4:39 am, Scott Atchley wrote:
> 
>> We have OpenMPI running on Frontier with libfabric. We are using HPE's CXI 
>> (Cray eXascale Interface) provider instead of RoCE though.
> 
> Yeah I'm curious to know if Matt's issues are about OpenMPI->libfabric or 
> libfabric->RoCE ?
> 
> FWIW we're using Cray's MPICH over libfabric (also over CXI), the ABI 
> portability of MPICH is really useful to us as it allows us to patch 
> containers used via Shifter to replace their MPI libraries with the Cray ones 
> and have their code use the HSN natively.

At the moment I’m stuck trying to pinpoint this myself.

We’re using Intel E810 NICs. Low-level RDMA seems to be working and iperf gives 
the expected performance; for MPI, however, these NICs apparently need PSM3.
For MPI performance I’ve been running the OSU Micro-Benchmarks, in particular 
osu_bw.

I’ve had osu_bw working over TCP at about 1.8 GB/s, so I know it was working.

PSM3 support currently comes from libfabric; OpenMPI itself seems to top out 
at PSM2. So, in the interest of not installing the entire oneAPI stack, I 
thought I would just rebuild OpenMPI to use libfabric, and libfabric to support 
PSM3.
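
For anyone wanting to reproduce this, here is a minimal sketch of the kind of 
mpirun command line involved, forcing OpenMPI onto the OFI MTL and asking 
libfabric for the psm3 provider. The host names, the process count and the 
helper name build_mpirun_cmd are assumptions for illustration only. The MCA 
parameters (pml, mtl, mtl_ofi_provider_include) and the FI_PROVIDER and 
FI_LOG_LEVEL variables are standard OpenMPI and libfabric knobs, but whether 
they are sufficient on a given build is untested here.

```python
import shlex


def build_mpirun_cmd(hosts, benchmark="osu_bw", provider="psm3", log_level="warn"):
    """Return an argv list for mpirun that pins OpenMPI to the OFI MTL.

    hosts, benchmark and log_level are illustrative placeholders.
    """
    return [
        "mpirun",
        "-np", str(len(hosts)),               # one rank per listed host
        "--host", ",".join(hosts),
        "--mca", "pml", "cm",                 # cm PML is what drives an MTL
        "--mca", "mtl", "ofi",                # use the libfabric (OFI) MTL
        "--mca", "mtl_ofi_provider_include", provider,
        "-x", f"FI_PROVIDER={provider}",      # restrict libfabric to one provider
        "-x", f"FI_LOG_LEVEL={log_level}",    # raise to "debug" when nothing is printed
        benchmark,
    ]


if __name__ == "__main__":
    # Hypothetical node names; print the command rather than running it.
    print(shlex.join(build_mpirun_cmd(["node01", "node02"])))
```

Printing the command instead of executing it keeps the sketch safe to run 
anywhere; the output can be pasted into a shell on the head node.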

I used spack to get it done. The initial result after the first build was a 
series of errors from mpirun telling me that the PSM3 module could not open the 
VLAN interface that’s being used for this. While not ideal, this suggested that 
my compilation had worked. I pinged Intel; they believe it should work, but 
asked me to upgrade to the latest ice driver.

I upgraded to the latest ice driver, and now there’s nothing. Every mpirun 
hangs indefinitely. There’s no orted on the remote node, nothing. I’ve left it 
for 30+ minutes: no errors, no timeouts, nothing.
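
Since nothing ever times out on its own, a small wrapper that gives up after a 
fixed interval makes it quicker to iterate on driver and environment changes. 
This is only a sketch: run_with_timeout is a hypothetical helper, the timeout 
value is arbitrary, and killing mpirun this way may leave stray processes 
behind on the nodes.

```python
import subprocess
import sys


def run_with_timeout(argv, timeout_s):
    """Run argv and return (status, output); status is 'ok', 'failed' or 'hung'."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired as exc:
        # On timeout the child is killed; partial output may be bytes or None.
        out = exc.stdout or ""
        if isinstance(out, bytes):
            out = out.decode(errors="replace")
        return "hung", out
    status = "ok" if proc.returncode == 0 else "failed"
    return status, proc.stdout + proc.stderr


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python3 this_script.py mpirun -np 2 --host node01,node02 osu_bw
    status, output = run_with_timeout(sys.argv[1:], timeout_s=60)
    print(status)
    print(output)
```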

Intel gave me some environment variables to set; still nothing. It’s like the 
module is no longer being loaded.

After discussions with Intel, I realised there was a bunch of MLNX libraries 
still floating around and set about purging all of those. Nothing, no change.

I’ve tried stracing it: it sits in a single poll. There’s no runaway loop; it 
just polls something and polls forever. The previous entries are just the usual 
library lookups, and it seems to find what it needs.

I’ve installed the Intel Fabric Suite which comes with its own OpenMPI. Same 
result. 

I’m about to rebuild it and the ice driver. I’m just confused at how it went 
from PSM3 complaining about an interface to nothing at all. ldd shows all the 
correct libraries are being found, and lsmod shows the correct modules in the 
kernel.
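
To make the ldd check repeatable after each rebuild, something like the sketch 
below could scan its output for missing libraries and for paths that look like 
leftover Mellanox installs. The function name suspicious_libs and the marker 
strings are assumptions, not anything Intel supplied, and the sample output in 
the usage section is made up. Note also that ldd only lists link-time 
dependencies; a libfabric provider may be loaded at run time instead, so a 
clean ldd does not by itself prove the PSM3 provider is being picked up.

```python
def suspicious_libs(ldd_output, bad_markers=("mlnx", "mofed")):
    """Parse `ldd` output and return (missing, suspect).

    missing: library names that ldd reports as "not found".
    suspect: (name, path) pairs whose path contains one of bad_markers
             (the markers are illustrative guesses at leftover install paths).
    """
    missing, suspect = [], []
    for line in ldd_output.splitlines():
        line = line.strip()
        if "=>" not in line:
            continue  # skip vdso / loader lines that have no "=>" mapping
        name, _, target = line.partition("=>")
        name, target = name.strip(), target.strip()
        if target.startswith("not found"):
            missing.append(name)
            continue
        path = target.split(" (")[0]  # drop the trailing load address
        if any(marker in path.lower() for marker in bad_markers):
            suspect.append((name, path))
    return missing, suspect


if __name__ == "__main__":
    # Made-up sample; in practice feed in the output of `ldd $(which mpirun)`.
    sample = """
        linux-vdso.so.1 (0x00007ffd00000000)
        libfabric.so.1 => /opt/spack/lib/libfabric.so.1 (0x00007f0000000000)
        libibverbs.so.1 => /opt/mlnx/lib/libibverbs.so.1 (0x00007f0100000000)
    """
    print(suspicious_libs(sample))
```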

Matt.   
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
