Dear Jeff,
Concerning 1.: I just ran the simple MPI Fortran program hello.f, which
uses:
call MPI_INIT(ierror)
call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
The program ran with no problem.
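For completeness, hello.f is essentially the following (a minimal sketch; the exact declarations and output line may differ from my file):

c     minimal MPI "hello" in fixed-form Fortran
      program hello
      implicit none
      include 'mpif.h'
      integer ierror, size, rank
c     initialize MPI, then query the communicator size and this rank
      call MPI_INIT(ierror)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
      print *, 'rank ', rank, ' of ', size
      call MPI_FINALIZE(ierror)
      end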
Some more information:
The AZTEC test case I am trying to run runs with no problem on my old PC
cluster (Red Hat Linux) using gcc, g77, and:
LIB_LINUX = /usr/lib/gcc-lib/i386-redhat-linux/2.96/libg2c.a
Concerning 2.: Can you instruct me how to perform the check?
Rachel
On Thu, 2 Sep 2010, Jeff Squyres wrote:
I'm afraid I have no insight into Aztec itself; I don't know anything about it.
Two questions:
1. Can you run simple MPI Fortran programs that call MPI_Comm_size with
MPI_COMM_WORLD?
2. Can you get any more information than the stack trace? I.e., can you gdb a
core file (example commands below) to see exactly where in Aztec it's failing
and confirm that it's not actually a bug in Aztec? I'm not trying to point
fingers, but if something is failing right at the start with a call to
MPI_COMM_SIZE, it's *usually* an application error of some sort (we haven't
even gotten to anything complicated yet like MPI_SEND, etc.). For example:
- The fact that it got through the parameter error checking in
MPI_COMM_SIZE is a good thing, but it doesn't necessarily mean that the
communicator it passed was valid.
- Did they leave off the ierr argument? (unlikely, but always possible)
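To get the backtrace, something like this usually works (assuming bash, and
that your system allows core dumps; the core file name varies by system,
e.g. plain "core" or "core.<pid>"):

  ulimit -c unlimited     # allow core files to be written
  mpirun -np 1 sample     # reproduce the crash; a core file should appear
  gdb ./sample core       # load the executable together with the core file
  (gdb) bt                # print the backtrace

If Aztec was compiled with -g, the backtrace will include file names and
line numbers.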
On Sep 2, 2010, at 8:06 AM, Rachel Gordon wrote:
Dear Jeff,
The cluster has only the openmpi version of MPI and the mpi.h file is installed
in /shared/include/mpi.h
Anyhow, I omitted the COMM size parameter and recompiled/linked the case using:
mpif77 -O -I../lib -c -o az_tutorial_with_MPI.o az_tutorial_with_MPI.f
mpif77 az_tutorial_with_MPI.o -O -L../lib -laztec -o sample
But when I try running 'sample' I get the same:
[cluster:00377] *** Process received signal ***
[cluster:00377] Signal: Segmentation fault (11)
[cluster:00377] Signal code: Address not mapped (1)
[cluster:00377] Failing at address: 0x100000098
[cluster:00377] [ 0] /lib/libpthread.so.0 [0x7f6b55040a80]
[cluster:00377] [ 1] /shared/lib/libmpi.so.0(MPI_Comm_size+0x6e) [0x7f6b564d834e]
[cluster:00377] [ 2] sample(parallel_info+0x24) [0x41d2ba]
[cluster:00377] [ 3] sample(AZ_set_proc_config+0x2d) [0x408417]
[cluster:00377] [ 4] sample(az_set_proc_config_+0xc) [0x407b85]
[cluster:00377] [ 5] sample(MAIN__+0x54) [0x407662]
[cluster:00377] [ 6] sample(main+0x2c) [0x44e8ec]
[cluster:00377] [ 7] /lib/libc.so.6(__libc_start_main+0xe6) [0x7f6b54cfd1a6]
[cluster:00377] [ 8] sample [0x407459]
[cluster:00377] *** End of error message ***
--------------------------------------------------------------------------
Rachel
On Thu, 2 Sep 2010, Jeff Squyres (jsquyres) wrote:
If you're segv'ing in comm size, this usually means you are using the wrong
mpi.h. Ensure you are using ompi's mpi.h so that you get the right values for
all the MPI constants.
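A quick way to see which mpi.h the wrapper compilers pick up is Open MPI's
-showme option:

  mpicc -showme:compile     # shows the -I flags mpicc adds
  mpif77 -showme:compile    # same for the Fortran 77 wrapper

The -I directory printed there should be the one containing Open MPI's mpi.h
(in your case, presumably /shared/include).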
On Sep 2, 2010, at 7:35 AM, Rachel Gordon <rgor...@techunix.technion.ac.il>
wrote:
Dear Manuel,
Sorry, it didn't help.
The cluster I am trying to run on has only the Open MPI version of MPI. So
mpif77 is equivalent to mpif77.openmpi, and mpicc is equivalent to
mpicc.openmpi.
I changed the Makefile, replacing gfortran with mpif77 and gcc with mpicc.
The compilation and linkage stage ran with no problem:
mpif77 -O -I../lib -DMAX_MEM_SIZE=16731136 -DCOMM_BUFF_SIZE=200000 -DMAX_CHUNK_SIZE=200000 -c -o az_tutorial_with_MPI.o az_tutorial_with_MPI.f
mpif77 az_tutorial_with_MPI.o -O -L../lib -laztec -o sample
But again when I try to run 'sample' I get:
mpirun -np 1 sample
[cluster:24989] *** Process received signal ***
[cluster:24989] Signal: Segmentation fault (11)
[cluster:24989] Signal code: Address not mapped (1)
[cluster:24989] Failing at address: 0x100000098
[cluster:24989] [ 0] /lib/libpthread.so.0 [0x7f5058036a80]
[cluster:24989] [ 1] /shared/lib/libmpi.so.0(MPI_Comm_size+0x6e) [0x7f50594ce34e]
[cluster:24989] [ 2] sample(parallel_info+0x24) [0x41d2ba]
[cluster:24989] [ 3] sample(AZ_set_proc_config+0x2d) [0x408417]
[cluster:24989] [ 4] sample(az_set_proc_config_+0xc) [0x407b85]
[cluster:24989] [ 5] sample(MAIN__+0x54) [0x407662]
[cluster:24989] [ 6] sample(main+0x2c) [0x44e8ec]
[cluster:24989] [ 7] /lib/libc.so.6(__libc_start_main+0xe6) [0x7f5057cf31a6]
[cluster:24989] [ 8] sample [0x407459]
[cluster:24989] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 24989 on node cluster exited on
signal 11 (Segmentation fault).
--------------------------------------------------------------------------
Thanks for your help and cooperation,
Sincerely,
Rachel
On Wed, 1 Sep 2010, Manuel Prinz wrote:
Hi Rachel,
I'm not very familiar with Fortran, so I'm most likely not of much help
here. I added Jeff to CC; maybe he can shed some light on this.
On Monday, 09.08.2010, at 12:59 +0300, Rachel Gordon wrote:
package: openmpi
dpkg --search openmpi
gromacs-openmpi: /usr/share/doc/gromacs-openmpi/copyright
gromacs-dev: /usr/lib/libmd_mpi_openmpi.la
gromacs-dev: /usr/lib/libgmx_mpi_d_openmpi.la
gromacs-openmpi: /usr/share/lintian/overrides/gromacs-openmpi
gromacs-openmpi: /usr/lib/libmd_mpi_openmpi.so.5
gromacs-openmpi: /usr/lib/libmd_mpi_d_openmpi.so.5.0.0
gromacs-dev: /usr/lib/libmd_mpi_openmpi.so
gromacs-dev: /usr/lib/libgmx_mpi_d_openmpi.so
gromacs-openmpi: /usr/lib/libmd_mpi_openmpi.so.5.0.0
gromacs-openmpi: /usr/bin/mdrun_mpi_d.openmpi
gromacs-openmpi: /usr/lib/libgmx_mpi_d_openmpi.so.5.0.0
gromacs-openmpi: /usr/share/doc/gromacs-openmpi/README.Debian
gromacs-dev: /usr/lib/libgmx_mpi_d_openmpi.a
gromacs-openmpi: /usr/bin/mdrun_mpi.openmpi
gromacs-openmpi: /usr/share/doc/gromacs-openmpi/changelog.Debian.gz
gromacs-dev: /usr/lib/libmd_mpi_d_openmpi.la
gromacs-openmpi: /usr/share/man/man1/mdrun_mpi_d.openmpi.1.gz
gromacs-dev: /usr/lib/libgmx_mpi_openmpi.a
gromacs-openmpi: /usr/lib/libgmx_mpi_openmpi.so.5.0.0
gromacs-dev: /usr/lib/libmd_mpi_d_openmpi.so
gromacs-openmpi: /usr/lib/libmd_mpi_d_openmpi.so.5
gromacs-dev: /usr/lib/libgmx_mpi_openmpi.la
gromacs-openmpi: /usr/share/man/man1/mdrun_mpi.openmpi.1.gz
gromacs-openmpi: /usr/share/doc/gromacs-openmpi
gromacs-dev: /usr/lib/libmd_mpi_openmpi.a
gromacs-dev: /usr/lib/libgmx_mpi_openmpi.so
gromacs-openmpi: /usr/lib/libgmx_mpi_openmpi.so.5
gromacs-openmpi: /usr/lib/libgmx_mpi_d_openmpi.so.5
gromacs-dev: /usr/lib/libmd_mpi_d_openmpi.a
Dear support,
I am trying to run a test case of the AZTEC library named
az_tutorial_with_MPI.f. The example uses gfortran + MPI. The
compilation and linkage stage goes O.K., generating an executable
'sample'. But when I try to run 'sample' (on 1 or more processors) the
run crashes immediately.
The compilation and linkage stage is done as follows:
gfortran -O -I/shared/include -I/shared/include/openmpi/ompi/mpi/cxx -I../lib -DMAX_MEM_SIZE=16731136 -DCOMM_BUFF_SIZE=200000 -DMAX_CHUNK_SIZE=200000 -c -o az_tutorial_with_MPI.o az_tutorial_with_MPI.f
gfortran az_tutorial_with_MPI.o -O -L../lib -laztec -lm -L/shared/lib -lgfortran -lmpi -lmpi_f77 -o sample
Generally, when compiling programs for use with MPI, you should use the
compiler wrappers, which do all the magic. In Debian's case these are
mpif77.openmpi and mpif90.openmpi for Fortran 77 and Fortran 90,
respectively. Could you give that a try? (See the example invocation
below.)
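For instance, with the flags from your Makefile the two steps would look
roughly like this (the wrappers add the MPI include and library paths
themselves, so -lmpi etc. are not needed):

  mpif77.openmpi -O -I../lib -DMAX_MEM_SIZE=16731136 -DCOMM_BUFF_SIZE=200000 -DMAX_CHUNK_SIZE=200000 -c -o az_tutorial_with_MPI.o az_tutorial_with_MPI.f
  mpif77.openmpi az_tutorial_with_MPI.o -O -L../lib -laztec -o sample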
The run:
/shared/home/gordon/Aztec_lib.dir/app>mpirun -np 1 sample
[cluster:12046] *** Process received signal ***
[cluster:12046] Signal: Segmentation fault (11)
[cluster:12046] Signal code: Address not mapped (1)
[cluster:12046] Failing at address: 0x100000098
[cluster:12046] [ 0] /lib/libc.so.6 [0x7fd4a2fa8f60]
[cluster:12046] [ 1] /shared/lib/libmpi.so.0(MPI_Comm_size+0x6e) [0x7fd4a376c34e]
[cluster:12046] [ 2] sample [0x4178aa]
[cluster:12046] [ 3] sample [0x402a07]
[cluster:12046] [ 4] sample [0x402175]
[cluster:12046] [ 5] sample [0x401c52]
[cluster:12046] [ 6] sample [0x448edc]
[cluster:12046] [ 7] /lib/libc.so.6(__libc_start_main+0xe6) [0x7fd4a2f951a6]
[cluster:12046] [ 8] sample [0x401a49]
[cluster:12046] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 12046 on node cluster exited
on signal 11 (Segmentation fault).
Here is some information about the machine:
uname -a
Linux cluster 2.6.26-2-amd64 #1 SMP Sun Jun 20 20:16:30 UTC 2010 x86_64 GNU/Linux
lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description: Debian GNU/Linux 5.0.5 (lenny)
Release: 5.0.5
Codename: lenny
gcc --version
gcc (Debian 4.3.2-1.1) 4.3.2
gfortran --version
GNU Fortran (Debian 4.3.2-1.1) 4.3.2
ldd sample
linux-vdso.so.1 => (0x00007fffffffe000)
libgfortran.so.3 => /usr/lib/libgfortran.so.3 (0x00007fd29db16000)
libm.so.6 => /lib/libm.so.6 (0x00007fd29d893000)
libmpi.so.0 => /shared/lib/libmpi.so.0 (0x00007fd29d5e7000)
libmpi_f77.so.0 => /shared/lib/libmpi_f77.so.0 (0x00007fd29d3af000)
libgcc_s.so.1 => /lib/libgcc_s.so.1 (0x00007fd29d198000)
libc.so.6 => /lib/libc.so.6 (0x00007fd29ce45000)
libopen-rte.so.0 => /shared/lib/libopen-rte.so.0 (0x00007fd29cbf8000)
libopen-pal.so.0 => /shared/lib/libopen-pal.so.0 (0x00007fd29c9a2000)
libdl.so.2 => /lib/libdl.so.2 (0x00007fd29c79e000)
libnsl.so.1 => /lib/libnsl.so.1 (0x00007fd29c586000)
libutil.so.1 => /lib/libutil.so.1 (0x00007fd29c383000)
libpthread.so.0 => /lib/libpthread.so.0 (0x00007fd29c167000)
/lib64/ld-linux-x86-64.so.2 (0x00007fd29ddf1000)
Let me just mention that the C+MPI test case of the AZTEC library
'az_tutorial.c' runs with no problem.
Also, az_tutorial_with_MPI.f runs O.K. on my 32-bit Linux cluster running
gcc, g77, and MPICH, and on my 16-processor SGI Itanium 64-bit machine.
The IA64 architecture is supported by Open MPI, so this should be OK.
Thank you for your help,
Best regards,
Manuel