More memory in your nodes? I'm not sure what size of queues and such
openmpi allocates, but you could simply be running out of memory if it
allocates large queue depths.
Have you tried an alternate MPI to see if you have the same problem?
Intel MPI, MVAPICH, and MVAPICH2, among others, support OpenIB.
Can you consider moving to a newer version of OpenIB?
--
David N. Lombard
My statements represent my opinions, not those of Intel Corporation
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
On Behalf Of Bill Wichser
Sent: Thursday, June 22, 2006 6:02 AM
To: beowulf@beowulf.org
Subject: [Beowulf] IB troubles - mca_mpool_openib_register
Cluster with dual Xeons and Topspin IB adapters running a RH
2.6.9-34.ELsmp kernel (x86_64) with the RH IB stack installed; each
node has 8 GB of memory.
Updated the firmware in the IB cards as per Mellanox.
Updated /etc/security/limits.conf to set memlock to 8192, both soft
and hard limits, to overcome the initial trouble of pool allocation.
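For reference, the corresponding limits.conf entries would look something like this (a sketch of the settings described above; the wildcard applies them to all users):

```
# /etc/security/limits.conf -- memlock values are in KB
*   soft   memlock   8192
*   hard   memlock   8192
```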
Application is cpi.c.
I can run across the 64 nodes using nodes=64:ppn=1 without trouble,
except for these error messages:

[btl_openib_endpoint.c:889:mca_btl_openib_endpoint_create_qp]
ibv_create_qp: returned 0 byte(s) for max inline data

which I suppose will be fixed in the next release. These I can live
with, perhaps, for now.
The problem is that when I run with nodes=64:ppn=2 and only use -np 64
with my openmpi (v1.0.2, gcc compiled), it still runs fine, but when I
run with -np 65 I get megabytes of error messages and the job never
completes. The errors all look like this:
mca_mpool_openib_register: ibv_reg_mr(0x2a96641000,1060864)
failed with error: Cannot allocate memory
I've submitted this to the openib-general mailing list with no
responses. I'm not sure if this is an openmpi problem, an openib
problem, or some configuration problem with the IB fabric. Other
programs fail with these same errors even when fewer processors are
allocated. Running over TCP, albeit across the GigE network and not
over IB, works fine.
I'm stuck here, not knowing how to proceed. Has anyone encountered
this issue and, more importantly, found a solution? I don't believe it
to be a limits.conf issue, as I can allocate both processors on a node
up to 32 nodes (-np 64) without problems.
Thanks,
Bill
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf