On Wed, 2008-07-16 at 16:00 -0400, Adam C Powell IV wrote:
> On Wed, 2008-07-16 at 13:42 -0400, Adam C Powell IV wrote:
> > Package: blacs-mpi
> > Version: 1.1-28
> > Severity: wishlist
> > 
> > Greetings,
> > 
> > Please add OpenMPI to the existing LAM and MPICH builds for blacs-mpi.
> > As you may know, LAM is deprecated in favor of OpenMPI, so this will be
> > a prominent MPI implementation moving forward.
> > 
> > I would be happy to provide a patch if needed.
> 
> Okay, so, I got impatient, and went ahead and made a patch, which is
> attached.  The "if [ -e ... ]" in the openmpi target is to make sure it
> skips those parts on arches which don't have openmpi.
> 
> The one problem is: the openmpi test package depends on liblam4.
> Perhaps the MPI=openmpi bit doesn't quite work in the TESTING dir?

Indeed.  I've attached an additional patch, to be applied after the
previous one, which actually makes it build using OpenMPI.  To
explain:
      * Bmake.inc needs a new openmpi section, and the Fortran programs
        need -lmpi_f77 as well as -lmpi to link properly
      * OpenMPI needs three mpif*.h files, so the two makefiles
        (SRC/MPI/Makefile and TESTING/Makefile) need to link all of
        them into the working directory
      * In debian/rules, because lam doesn't come first, its build
        requires a clean step before it can start
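
As a rough illustration of the second point, the new mpif.h rule can be
sketched as a shell function.  The function name link_mpif_headers and
its two arguments are made up for this example; only the rm and ln
commands come from the patch.

```shell
#!/bin/sh
# Hypothetical helper mirroring the patched mpif.h makefile rule.
# $1: MPI include directory (MPIINCdir in Bmake.inc), as an absolute path
# $2: working directory that should receive the links
link_mpif_headers() {
    incdir=$1
    workdir=$2
    # Drop stale links left over from a previously built MPI flavor.
    rm -f "$workdir"/mpif*
    # Link every mpif* header, not just mpif.h.
    ln -s "$incdir"/mpif* "$workdir"/
}
```

With the include path from the openmpi section of Bmake.inc, a call
would look like: link_mpif_headers /usr/lib/openmpi/include .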

> The packages install fine, and have about the same contents as the -lam
> and -mpich packages.  I haven't run the tests yet.

It builds and installs fine now, and everything has the right
dependencies.  The OpenMPI Fortran test also runs fine.

However, the OpenMPI C test segfaults early on, in PMPI_Comm_group
called from BI_TransUserComm. :-(
Here's the output (orted is similar to lamboot):

252 workhorse% orted
253 workhorse% mpirun -np 4 ./cblacs_test_shared-openmpi
BLACS WARNING 'No need to set message ID range due to MPI communicator.'
from {-1,-1}, pnum=0, Contxt=-1, on line 18 of file 'blacs_set_.c'.

BLACS WARNING 'No need to set message ID range due to MPI communicator.'
from {-1,-1}, pnum=1, Contxt=-1, on line 18 of file 'blacs_set_.c'.

[workhorse:23590] *** Process received signal ***
[workhorse:23590] Signal: Segmentation fault (11)
[workhorse:23590] Signal code: Address not mapped (1)
[workhorse:23590] Failing at address: 0xb08a0cf8
[workhorse:23590] [ 0] /lib/libc.so.6 [0x7f4daf93ff80]
[workhorse:23589] *** Process received signal ***
[workhorse:23589] Signal: Segmentation fault (11)
[workhorse:23589] Signal code: Address not mapped (1)
[workhorse:23589] Failing at address: 0x1fb95cf8
BLACS WARNING 'No need to set message ID range due to MPI communicator.'
from {-1,-1}, pnum=3, Contxt=-1, on line 18 of file 'blacs_set_.c'.

[workhorse:23589] [ 0] /lib/libc.so.6 [0x7f6f1ec34f80]
[workhorse:23589] [ 1] /usr/lib/libmpi.so.0(PMPI_Comm_group+0x50) [0x7f6f1f954a20]
[workhorse:23589] [ 2] /usr/lib/libblacs-openmpi.so.1(BI_TransUserComm+0x25) [0x7f6f1fbbcc05]
[workhorse:23589] [ 3] /usr/lib/libblacs-openmpi.so.1(Cblacs_gridmap+0x132) [0x7f6f1fbae7b2]
[workhorse:23589] [ 4] /usr/lib/libblacs-openmpi.so.1(Cblacs_gridinit+0x2ea) [0x7f6f1fbb239a]
[workhorse:23589] [ 5] ./cblacs_test_shared-openmpi [0x4036c4]
[workhorse:23589] [ 6] ./cblacs_test_shared-openmpi [0x478dcc]
[workhorse:23589] [ 7] /lib/libc.so.6(__libc_start_main+0xe6) [0x7f6f1ec211a6]
[workhorse:23589] [ 8] ./cblacs_test_shared-openmpi [0x403559]
[workhorse:23589] *** End of error message ***
[workhorse:23592] *** Process received signal ***
[workhorse:23592] Signal: Segmentation fault (11)
[workhorse:23592] Signal code: Address not mapped (1)
[workhorse:23592] Failing at address: 0x9a823cf8
[workhorse:23590] [ 1] /usr/lib/libmpi.so.0(PMPI_Comm_group+0x50) [0x7f4db065fa20]
[workhorse:23590] [ 2] /usr/lib/libblacs-openmpi.so.1(BI_TransUserComm+0x25) [0x7f4db08c7c05]
[workhorse:23590] [ 3] /usr/lib/libblacs-openmpi.so.1(Cblacs_gridmap+0x132) [0x7f4db08b97b2]
[workhorse:23590] [ 4] /usr/lib/libblacs-openmpi.so.1(Cblacs_gridinit+0x2ea) [0x7f4db08bd39a]
BLACS WARNING 'No need to set message ID range due to MPI communicator.'
from {-1,-1}, pnum=2, Contxt=-1, on line 18 of file 'blacs_set_.c'.

[workhorse:23591] *** Process received signal ***
[workhorse:23591] Signal: Segmentation fault (11)
[workhorse:23591] Signal code: Address not mapped (1)
[workhorse:23591] Failing at address: 0xb940ccf8
[workhorse:23591] [ 0] /lib/libc.so.6 [0x7f6bb84abf80]
[workhorse:23591] [ 1] /usr/lib/libmpi.so.0(PMPI_Comm_group+0x50) [0x7f6bb91cba20]
[workhorse:23591] [ 2] /usr/lib/libblacs-openmpi.so.1(BI_TransUserComm+0x25) [0x7f6bb9433c05]
[workhorse:23591] [ 3] /usr/lib/libblacs-openmpi.so.1(Cblacs_gridmap+0x132) [0x7f6bb94257b2]
[workhorse:23591] [ 4] /usr/lib/libblacs-openmpi.so.1(Cblacs_gridinit+0x2ea) [0x7f6bb942939a]
[workhorse:23591] [ 5] ./cblacs_test_shared-openmpi [0x4036c4]
[workhorse:23591] [ 6] ./cblacs_test_shared-openmpi [0x478dcc]
[workhorse:23591] [ 7] /lib/libc.so.6(__libc_start_main+0xe6) [0x7f6bb84981a6]
[workhorse:23591] [ 8] ./cblacs_test_shared-openmpi [0x403559]
[workhorse:23591] *** End of error message ***
[workhorse:23592] [ 0] /lib/libc.so.6 [0x7fe0998c2f80]
[workhorse:23592] [ 1] /usr/lib/libmpi.so.0(PMPI_Comm_group+0x50) [0x7fe09a5e2a20]
[workhorse:23592] [ 2] /usr/lib/libblacs-openmpi.so.1(BI_TransUserComm+0x25) [0x7fe09a84ac05]
[workhorse:23592] [ 3] /usr/lib/libblacs-openmpi.so.1(Cblacs_gridmap+0x132) [0x7fe09a83c7b2]
[workhorse:23592] [ 4] /usr/lib/libblacs-openmpi.so.1(Cblacs_gridinit+0x2ea) [0x7fe09a84039a]
[workhorse:23592] [ 5] ./cblacs_test_shared-openmpi [0x4036c4]
[workhorse:23592] [ 6] ./cblacs_test_shared-openmpi [0x478dcc]
[workhorse:23592] [ 7] /lib/libc.so.6(__libc_start_main+0xe6) [0x7fe0998af1a6]
[workhorse:23592] [ 8] ./cblacs_test_shared-openmpi [0x403559]
[workhorse:23592] *** End of error message ***
[workhorse:23586] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
[workhorse:23586] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1166
[workhorse:23586] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
mpirun noticed that job rank 0 with PID 23589 on node workhorse exited on signal 11 (Segmentation fault).
1 additional process aborted (not shown)
[workhorse:23586] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
[workhorse:23586] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1198
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons for this job. Returned value Timeout instead of ORTE_SUCCESS.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
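
The traces from the four ranks are interleaved in the output above.  A
small shell sketch for pulling out one rank's lines, assuming the
mpirun output was saved to a file (the function name rank_trace and the
file name run.log are only for illustration):

```shell
#!/bin/sh
# Hypothetical helper: print only the lines that one process wrote.
# $1: PID as shown in the "[workhorse:PID]" prefix
# $2: file holding the saved mpirun output
rank_trace() {
    # -F matches the bracketed prefix literally rather than as a regex.
    grep -F "[workhorse:$1]" "$2"
}
```

For example, after "mpirun -np 4 ./cblacs_test_shared-openmpi > run.log 2>&1",
running "rank_trace 23589 run.log" would list just the lines from PID 23589.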

Do you want to send this upstream, or shall I?

Cheers,
-Adam
-- 
GPG fingerprint: D54D 1AEE B11C CE9B A02B  C5DD 526F 01E8 564E E4B6

Engineering consulting with open source tools
http://www.opennovation.com/
--- blacs-mpi-1.1/Bmake.inc~	2008-08-13 22:23:52.000000000 +0000
+++ blacs-mpi-1.1/Bmake.inc	2008-08-13 22:42:47.000000000 +0000
@@ -55,12 +55,20 @@
    MPILIBdir = $(MPIdir)/lib
    MPIINCdir = $(MPIdir)/include
    MPILIB = $(MPILIBdir)/shared/libmpich.so $(MPILIBdir)/shared/libpmpich.so $(MPILIBdir)/libmpich.a
-else
+endif
+ifeq ($(MPI),lam)
 # for compilation with lam:
    MPILIBdir = /usr/lib/lam/lib
    MPIINCdir = /usr/include/lam
    MPILIB = -L/usr/lib/lam/lib -llam
 endif
+ifeq ($(MPI),openmpi)
+# for compilation with openmpi:
+   MPIdir = /usr/lib/openmpi
+   MPILIBdir = $(MPIdir)/lib
+   MPIINCdir = $(MPIdir)/include
+   MPILIB = -L/usr/lib/openmpi/lib -lmpi -lmpi_f77
+endif
 
 
 #  -------------------------------------
--- blacs-mpi-1.1/SRC/MPI/Makefile~	2008-08-13 22:23:52.000000000 +0000
+++ blacs-mpi-1.1/SRC/MPI/Makefile	2008-08-13 22:55:32.000000000 +0000
@@ -194,8 +194,8 @@
 	$(F77) -c $(F77FLAGS) $*.f
 
 mpif.h: $(MPIINCdir)/mpif.h
-	rm -f mpif.h
-	ln -s $< $@
+	rm -f mpif*
+	ln -s $(MPIINCdir)/mpif* .
 
 #  ------------------------------------------------------------------------
 #  We move C .o files to .C so that we can use the portable suffix rule for
--- blacs-mpi-1.1/TESTING/Makefile~	2008-08-13 22:23:52.000000000 +0000
+++ blacs-mpi-1.1/TESTING/Makefile	2008-08-13 23:01:46.000000000 +0000
@@ -59,8 +59,8 @@
 	$(F77) -c $(F77FLAGS) $*.f
 
 mpif.h: $(MPIINCdir)/mpif.h
-	rm -f mpif.h
-	ln -s $< $@
+	rm -f mpif*
+	ln -s $(MPIINCdir)/mpif* .
 
 fpvm3.h : $(PVMINCdir)/fpvm3.h
 	rm -f fpvm3.h
--- blacs-mpi-1.1/debian/rules~	2008-08-13 23:15:42.000000000 +0000
+++ blacs-mpi-1.1/debian/rules	2008-08-13 23:16:19.000000000 +0000
@@ -56,6 +56,9 @@
 build-stamp-lam:
 	dh_testdir
 	[ -d TESTING/EXE ] || mkdir TESTING/EXE
+# next is a clean
+	BASEDIR=$(topdir) make cleanall
+	cd TESTING && BASEDIR=$(topdir) make clean
 # build the static libraries
 	BASEDIR=$(topdir) MPI=lam make mpi
 # the testing binaries
