I'm afraid I have no insight into Aztec itself; I don't know anything about it. Two questions:
1. Can you run simple MPI Fortran programs that call MPI_Comm_size with
MPI_COMM_WORLD?

2. Can you get any more information than the stack trace? I.e., can you gdb a
corefile to see exactly where in Aztec it's failing and confirm that it's not
actually a bug in Aztec?

I'm not trying to finger point, but if something is failing right at the
beginning with a call to MPI_COMM_SIZE, it's *usually* an application error of
some sort (we haven't even gotten to anything complicated yet like MPI_SEND,
etc.). For example:

- The fact that it got through the parameter error checking in MPI_COMM_SIZE
  is a good thing, but it doesn't necessarily mean that the communicator it
  passed was valid (for example).

- Did they leave off the ierr argument? (unlikely, but always possible)
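For question 1, a minimal sketch of such a standalone test (the file name
commtest.f and the program name are placeholders of mine; it assumes only the
Open MPI mpif77 wrapper already used elsewhere in this thread):

      program commtest
      implicit none
      include 'mpif.h'
      integer ierr, nprocs, rank
      call MPI_INIT(ierr)
C     note the trailing ierr argument on every Fortran MPI call
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      print *, 'rank', rank, 'of', nprocs
      call MPI_FINALIZE(ierr)
      end

Compile and run it the same way as the Aztec example, e.g.:

mpif77 commtest.f -o commtest
mpirun -np 2 commtest

If that prints the expected size and ranks, the Open MPI installation and its
Fortran bindings are probably fine, which points back at how Aztec is calling
MPI.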
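For question 2, one possible recipe for getting a symbolic backtrace instead
of the address-only trace in the quoted output below (assuming the node
permits core dumps and the library and example are rebuilt with -g):

ulimit -c unlimited     # allow core files in this shell
mpirun -np 1 sample     # the crash should now leave a core file (name varies)
gdb ./sample core       # substitute the actual core file name
(gdb) bt                # full call chain through Aztec into libmpi

If Open MPI's signal handler gets in the way of writing a core file, running
the one-process case directly under the debugger (gdb ./sample, then "run" and
"bt" at the crash) usually yields the same information.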
On Sep 2, 2010, at 8:06 AM, Rachel Gordon wrote:

> Dear Jeff,
>
> The cluster has only the openmpi version of MPI, and the mpi.h file is
> installed in /shared/include/mpi.h
>
> Anyhow, I omitted the COMM size parameter and recompiled/linked the case
> using:
>
> mpif77 -O -I../lib -c -o az_tutorial_with_MPI.o az_tutorial_with_MPI.f
> mpif77 az_tutorial_with_MPI.o -O -L../lib -laztec -o sample
>
> But when I try running 'sample' I get the same:
>
> [cluster:00377] *** Process received signal ***
> [cluster:00377] Signal: Segmentation fault (11)
> [cluster:00377] Signal code: Address not mapped (1)
> [cluster:00377] Failing at address: 0x100000098
> [cluster:00377] [ 0] /lib/libpthread.so.0 [0x7f6b55040a80]
> [cluster:00377] [ 1] /shared/lib/libmpi.so.0(MPI_Comm_size+0x6e) [0x7f6b564d834e]
> [cluster:00377] [ 2] sample(parallel_info+0x24) [0x41d2ba]
> [cluster:00377] [ 3] sample(AZ_set_proc_config+0x2d) [0x408417]
> [cluster:00377] [ 4] sample(az_set_proc_config_+0xc) [0x407b85]
> [cluster:00377] [ 5] sample(MAIN__+0x54) [0x407662]
> [cluster:00377] [ 6] sample(main+0x2c) [0x44e8ec]
> [cluster:00377] [ 7] /lib/libc.so.6(__libc_start_main+0xe6) [0x7f6b54cfd1a6]
> [cluster:00377] [ 8] sample [0x407459]
> [cluster:00377] *** End of error message ***
> --------------------------------------------------------------------------
>
> Rachel
>
> On Thu, 2 Sep 2010, Jeff Squyres (jsquyres) wrote:
>
>> If you're segv'ing in comm size, this usually means you are using the wrong
>> mpi.h. Ensure you are using ompi's mpi.h so that you get the right values
>> for all the MPI constants.
>>
>> Sent from my PDA. No type good.
>>
>> On Sep 2, 2010, at 7:35 AM, Rachel Gordon <rgor...@techunix.technion.ac.il>
>> wrote:
>>
>>> Dear Manuel,
>>>
>>> Sorry, it didn't help.
>>>
>>> The cluster I am trying to run on has only the openmpi MPI version. So,
>>> mpif77 is equivalent to mpif77.openmpi and mpicc is equivalent to
>>> mpicc.openmpi
>>>
>>> I changed the Makefile, replacing gfortran by mpif77 and gcc by mpicc.
>>> The compilation and linkage stage ran with no problem:
>>>
>>> mpif77 -O -I../lib -DMAX_MEM_SIZE=16731136 -DCOMM_BUFF_SIZE=200000
>>>   -DMAX_CHUNK_SIZE=200000 -c -o az_tutorial_with_MPI.o az_tutorial_with_MPI.f
>>> mpif77 az_tutorial_with_MPI.o -O -L../lib -laztec -o sample
>>>
>>> But again when I try to run 'sample' I get:
>>>
>>> mpirun -np 1 sample
>>>
>>> [cluster:24989] *** Process received signal ***
>>> [cluster:24989] Signal: Segmentation fault (11)
>>> [cluster:24989] Signal code: Address not mapped (1)
>>> [cluster:24989] Failing at address: 0x100000098
>>> [cluster:24989] [ 0] /lib/libpthread.so.0 [0x7f5058036a80]
>>> [cluster:24989] [ 1] /shared/lib/libmpi.so.0(MPI_Comm_size+0x6e) [0x7f50594ce34e]
>>> [cluster:24989] [ 2] sample(parallel_info+0x24) [0x41d2ba]
>>> [cluster:24989] [ 3] sample(AZ_set_proc_config+0x2d) [0x408417]
>>> [cluster:24989] [ 4] sample(az_set_proc_config_+0xc) [0x407b85]
>>> [cluster:24989] [ 5] sample(MAIN__+0x54) [0x407662]
>>> [cluster:24989] [ 6] sample(main+0x2c) [0x44e8ec]
>>> [cluster:24989] [ 7] /lib/libc.so.6(__libc_start_main+0xe6) [0x7f5057cf31a6]
>>> [cluster:24989] [ 8] sample [0x407459]
>>> [cluster:24989] *** End of error message ***
>>> --------------------------------------------------------------------------
>>> mpirun noticed that process rank 0 with PID 24989 on node cluster exited on
>>> signal 11 (Segmentation fault).
>>> --------------------------------------------------------------------------
>>>
>>> Thanks for your help and cooperation,
>>> Sincerely,
>>> Rachel
>>>
>>> On Wed, 1 Sep 2010, Manuel Prinz wrote:
>>>
>>>> Hi Rachel,
>>>>
>>>> I'm not very familiar with Fortran, so I'm most likely not of much help
>>>> here. I added Jeff to CC; maybe he can shed some light on this.
>>>>
>>>> On Monday, 09.08.2010, at 12:59 +0300, Rachel Gordon wrote:
>>>>> package: openmpi
>>>>>
>>>>> dpkg --search openmpi
>>>>> gromacs-openmpi: /usr/share/doc/gromacs-openmpi/copyright
>>>>> gromacs-dev: /usr/lib/libmd_mpi_openmpi.la
>>>>> gromacs-dev: /usr/lib/libgmx_mpi_d_openmpi.la
>>>>> gromacs-openmpi: /usr/share/lintian/overrides/gromacs-openmpi
>>>>> gromacs-openmpi: /usr/lib/libmd_mpi_openmpi.so.5
>>>>> gromacs-openmpi: /usr/lib/libmd_mpi_d_openmpi.so.5.0.0
>>>>> gromacs-dev: /usr/lib/libmd_mpi_openmpi.so
>>>>> gromacs-dev: /usr/lib/libgmx_mpi_d_openmpi.so
>>>>> gromacs-openmpi: /usr/lib/libmd_mpi_openmpi.so.5.0.0
>>>>> gromacs-openmpi: /usr/bin/mdrun_mpi_d.openmpi
>>>>> gromacs-openmpi: /usr/lib/libgmx_mpi_d_openmpi.so.5.0.0
>>>>> gromacs-openmpi: /usr/share/doc/gromacs-openmpi/README.Debian
>>>>> gromacs-dev: /usr/lib/libgmx_mpi_d_openmpi.a
>>>>> gromacs-openmpi: /usr/bin/mdrun_mpi.openmpi
>>>>> gromacs-openmpi: /usr/share/doc/gromacs-openmpi/changelog.Debian.gz
>>>>> gromacs-dev: /usr/lib/libmd_mpi_d_openmpi.la
>>>>> gromacs-openmpi: /usr/share/man/man1/mdrun_mpi_d.openmpi.1.gz
>>>>> gromacs-dev: /usr/lib/libgmx_mpi_openmpi.a
>>>>> gromacs-openmpi: /usr/lib/libgmx_mpi_openmpi.so.5.0.0
>>>>> gromacs-dev: /usr/lib/libmd_mpi_d_openmpi.so
>>>>> gromacs-openmpi: /usr/lib/libmd_mpi_d_openmpi.so.5
>>>>> gromacs-dev: /usr/lib/libgmx_mpi_openmpi.la
>>>>> gromacs-openmpi: /usr/share/man/man1/mdrun_mpi.openmpi.1.gz
>>>>> gromacs-openmpi: /usr/share/doc/gromacs-openmpi
>>>>> gromacs-dev: /usr/lib/libmd_mpi_openmpi.a
>>>>> gromacs-dev: /usr/lib/libgmx_mpi_openmpi.so
>>>>> gromacs-openmpi: /usr/lib/libgmx_mpi_openmpi.so.5
>>>>> gromacs-openmpi: /usr/lib/libgmx_mpi_d_openmpi.so.5
>>>>> gromacs-dev: /usr/lib/libmd_mpi_d_openmpi.a
>>>>>
>>>>> Dear support,
>>>>>
>>>>> I am trying to run a test case of the AZTEC library named
>>>>> az_tutorial_with_MPI.f. The example uses gfortran + MPI. The
>>>>> compilation and linkage stage goes O.K., generating an executable
>>>>> 'sample'. But when I try to run sample (on 1 or more processors)
>>>>> the run crashes immediately.
>>>>>
>>>>> The compilation and linkage stage is done as follows:
>>>>>
>>>>> gfortran -O -I/shared/include -I/shared/include/openmpi/ompi/mpi/cxx
>>>>>   -I../lib -DMAX_MEM_SIZE=16731136 -DCOMM_BUFF_SIZE=200000
>>>>>   -DMAX_CHUNK_SIZE=200000 -c -o az_tutorial_with_MPI.o az_tutorial_with_MPI.f
>>>>> gfortran az_tutorial_with_MPI.o -O -L../lib -laztec -lm -L/shared/lib
>>>>>   -lgfortran -lmpi -lmpi_f77 -o sample
>>>>
>>>> Generally, when compiling programs for use with MPI, you should use the
>>>> compiler wrappers, which do all the magic. In Debian's case these are
>>>> mpif77.openmpi and mpif90.openmpi, respectively. Could you give that a
>>>> try?
>>>>
>>>>> The run:
>>>>>
>>>>> /shared/home/gordon/Aztec_lib.dir/app> mpirun -np 1 sample
>>>>>
>>>>> [cluster:12046] *** Process received signal ***
>>>>> [cluster:12046] Signal: Segmentation fault (11)
>>>>> [cluster:12046] Signal code: Address not mapped (1)
>>>>> [cluster:12046] Failing at address: 0x100000098
>>>>> [cluster:12046] [ 0] /lib/libc.so.6 [0x7fd4a2fa8f60]
>>>>> [cluster:12046] [ 1] /shared/lib/libmpi.so.0(MPI_Comm_size+0x6e) [0x7fd4a376c34e]
>>>>> [cluster:12046] [ 2] sample [0x4178aa]
>>>>> [cluster:12046] [ 3] sample [0x402a07]
>>>>> [cluster:12046] [ 4] sample [0x402175]
>>>>> [cluster:12046] [ 5] sample [0x401c52]
>>>>> [cluster:12046] [ 6] sample [0x448edc]
>>>>> [cluster:12046] [ 7] /lib/libc.so.6(__libc_start_main+0xe6) [0x7fd4a2f951a6]
>>>>> [cluster:12046] [ 8] sample [0x401a49]
>>>>> [cluster:12046] *** End of error message ***
>>>>> --------------------------------------------------------------------------
>>>>> mpirun noticed that process rank 0 with PID 12046 on node cluster exited
>>>>> on signal 11 (Segmentation fault).
>>>>>
>>>>> Here is some information about the machine:
>>>>>
>>>>> uname -a
>>>>> Linux cluster 2.6.26-2-amd64 #1 SMP Sun Jun 20 20:16:30 UTC 2010 x86_64 GNU/Linux
>>>>>
>>>>> lsb_release -a
>>>>> No LSB modules are available.
>>>>> Distributor ID: Debian
>>>>> Description:    Debian GNU/Linux 5.0.5 (lenny)
>>>>> Release:        5.0.5
>>>>> Codename:       lenny
>>>>>
>>>>> gcc --version
>>>>> gcc (Debian 4.3.2-1.1) 4.3.2
>>>>>
>>>>> gfortran --version
>>>>> GNU Fortran (Debian 4.3.2-1.1) 4.3.2
>>>>>
>>>>> ldd sample
>>>>>     linux-vdso.so.1 => (0x00007fffffffe000)
>>>>>     libgfortran.so.3 => /usr/lib/libgfortran.so.3 (0x00007fd29db16000)
>>>>>     libm.so.6 => /lib/libm.so.6 (0x00007fd29d893000)
>>>>>     libmpi.so.0 => /shared/lib/libmpi.so.0 (0x00007fd29d5e7000)
>>>>>     libmpi_f77.so.0 => /shared/lib/libmpi_f77.so.0 (0x00007fd29d3af000)
>>>>>     libgcc_s.so.1 => /lib/libgcc_s.so.1 (0x00007fd29d198000)
>>>>>     libc.so.6 => /lib/libc.so.6 (0x00007fd29ce45000)
>>>>>     libopen-rte.so.0 => /shared/lib/libopen-rte.so.0 (0x00007fd29cbf8000)
>>>>>     libopen-pal.so.0 => /shared/lib/libopen-pal.so.0 (0x00007fd29c9a2000)
>>>>>     libdl.so.2 => /lib/libdl.so.2 (0x00007fd29c79e000)
>>>>>     libnsl.so.1 => /lib/libnsl.so.1 (0x00007fd29c586000)
>>>>>     libutil.so.1 => /lib/libutil.so.1 (0x00007fd29c383000)
>>>>>     libpthread.so.0 => /lib/libpthread.so.0 (0x00007fd29c167000)
>>>>>     /lib64/ld-linux-x86-64.so.2 (0x00007fd29ddf1000)
>>>>>
>>>>> Let me just mention that the C+MPI test case of the AZTEC library
>>>>> 'az_tutorial.c' runs with no problem.
>>>>> Also, az_tutorial_with_MPI.f runs O.K. on my 32-bit Linux cluster running
>>>>> gcc, g77, and MPICH, and on my 16-processor SGI Itanium 64-bit machine.
>>>>
>>>> The IA64 architecture is supported by Open MPI, so this should be OK.
>>>>
>>>>> Thank you for your help,
>>>>
>>>> Best regards,
>>>> Manuel

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/