I'm afraid I have no insight into Aztec itself; I don't know anything about it.  Two questions:

1. Can you run simple MPI Fortran programs that call MPI_Comm_size with 
MPI_COMM_WORLD?  (A minimal test program is sketched below, after these questions.)

2. Can you get any more information than the stack trace?  I.e., can you gdb a 
corefile to see exactly where in Aztec it's failing and confirm that it's not 
actually a bug in Aztec?  I'm not trying to point fingers, but if something fails 
right at the beginning with a call to MPI_COMM_SIZE, it's *usually* an application 
error of some sort (we haven't even gotten to anything complicated yet, like 
MPI_SEND, etc.).  For example:

- The fact that it got through the parameter error checking in MPI_COMM_SIZE is 
a good sign, but it doesn't necessarily mean that the communicator the 
application passed was valid (for example).
- Did they leave off the ierr argument?  (unlikely, but always possible)
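
For question 1, a minimal sketch along these lines (untested; the program name is 
just a placeholder) should be enough to check that a plain Fortran call to 
MPI_Comm_size on MPI_COMM_WORLD works against that Open MPI install:

      program mpitest
      implicit none
      include 'mpif.h'
      integer ierr, nprocs, myrank
c     Initialize MPI and query the size/rank of MPI_COMM_WORLD.
      call MPI_INIT(ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
      print *, 'rank ', myrank, ' of ', nprocs
      call MPI_FINALIZE(ierr)
      end

If that compiles with mpif77 and runs cleanly under "mpirun -np 1", the MPI 
installation itself is probably fine, and I'd look harder at how Aztec is being 
built against it.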



On Sep 2, 2010, at 8:06 AM, Rachel Gordon wrote:

> Dear Jeff,
> 
> The cluster has only the openmpi version of MPI and the mpi.h file is 
> installed in /shared/include/mpi.h
> 
> Anyhow, I omitted the COMM size parameter and recompiled/linked the case 
> using:
> 
> mpif77 -O   -I../lib  -c -o az_tutorial_with_MPI.o az_tutorial_with_MPI.f
> mpif77 az_tutorial_with_MPI.o -O -L../lib -laztec      -o sample
> 
> But when I try running 'sample' I get the same:
> 
> [cluster:00377] *** Process received signal ***
> [cluster:00377] Signal: Segmentation fault (11)
> [cluster:00377] Signal code: Address not mapped (1)
> [cluster:00377] Failing at address: 0x100000098
> [cluster:00377] [ 0] /lib/libpthread.so.0 [0x7f6b55040a80]
> [cluster:00377] [ 1] /shared/lib/libmpi.so.0(MPI_Comm_size+0x6e) 
> [0x7f6b564d834e]
> [cluster:00377] [ 2] sample(parallel_info+0x24) [0x41d2ba]
> [cluster:00377] [ 3] sample(AZ_set_proc_config+0x2d) [0x408417]
> [cluster:00377] [ 4] sample(az_set_proc_config_+0xc) [0x407b85]
> [cluster:00377] [ 5] sample(MAIN__+0x54) [0x407662]
> [cluster:00377] [ 6] sample(main+0x2c) [0x44e8ec]
> [cluster:00377] [ 7] /lib/libc.so.6(__libc_start_main+0xe6) [0x7f6b54cfd1a6]
> [cluster:00377] [ 8] sample [0x407459]
> [cluster:00377] *** End of error message ***
> --------------------------------------------------------------------------
> 
> Rachel
> 
> 
> 
> On Thu, 2 Sep 2010, Jeff Squyres (jsquyres) wrote:
> 
>> If you're segv'ing in comm size, this usually means you are using the wrong 
>> mpi.h.  Ensure you are using ompi's mpi.h so that you get the right values 
>> for all the MPI constants.
>> 
>> Sent from my PDA. No type good.
>> 
>> On Sep 2, 2010, at 7:35 AM, Rachel Gordon <rgor...@techunix.technion.ac.il> 
>> wrote:
>> 
>>> Dear Manuel,
>>> 
>>> Sorry, it didn't help.
>>> 
>>> The cluster I am trying to run on has only the openmpi MPI version. So, 
>>> mpif77 is equivalent to mpif77.openmpi and mpicc is equivalent to 
>>> mpicc.openmpi
>>> 
>>> I changed the Makefile, replacing gfortran by mpif77 and gcc by mpicc.
>>> The compilation and linkage stage ran with no problem:
>>> 
>>> 
>>> mpif77 -O   -I../lib -DMAX_MEM_SIZE=16731136 -DCOMM_BUFF_SIZE=200000 
>>> -DMAX_CHUNK_SIZE=200000  -c -o az_tutorial_with_MPI.o az_tutorial_with_MPI.f
>>> mpif77 az_tutorial_with_MPI.o -O -L../lib -laztec      -o sample
>>> 
>>> 
>>> But again when I try to run 'sample' I get:
>>> 
>>> mpirun -np 1 sample
>>> 
>>> 
>>> [cluster:24989] *** Process received signal ***
>>> [cluster:24989] Signal: Segmentation fault (11)
>>> [cluster:24989] Signal code: Address not mapped (1)
>>> [cluster:24989] Failing at address: 0x100000098
>>> [cluster:24989] [ 0] /lib/libpthread.so.0 [0x7f5058036a80]
>>> [cluster:24989] [ 1] /shared/lib/libmpi.so.0(MPI_Comm_size+0x6e) 
>>> [0x7f50594ce34e]
>>> [cluster:24989] [ 2] sample(parallel_info+0x24) [0x41d2ba]
>>> [cluster:24989] [ 3] sample(AZ_set_proc_config+0x2d) [0x408417]
>>> [cluster:24989] [ 4] sample(az_set_proc_config_+0xc) [0x407b85]
>>> [cluster:24989] [ 5] sample(MAIN__+0x54) [0x407662]
>>> [cluster:24989] [ 6] sample(main+0x2c) [0x44e8ec]
>>> [cluster:24989] [ 7] /lib/libc.so.6(__libc_start_main+0xe6) [0x7f5057cf31a6]
>>> [cluster:24989] [ 8] sample [0x407459]
>>> [cluster:24989] *** End of error message ***
>>> --------------------------------------------------------------------------
>>> mpirun noticed that process rank 0 with PID 24989 on node cluster exited on 
>>> signal 11 (Segmentation fault).
>>> --------------------------------------------------------------------------
>>> 
>>> Thanks for your help and cooperation,
>>> Sincerely,
>>> Rachel
>>> 
>>> 
>>> 
>>> On Wed, 1 Sep 2010, Manuel Prinz wrote:
>>> 
>>>> Hi Rachel,
>>>> 
>>>> I'm not very familiar with Fortran, so I'm most likely not of much
>>>> help here. I added Jeff to CC; maybe he can shed some light on this.
>>>> 
>>>> Am Montag, den 09.08.2010, 12:59 +0300 schrieb Rachel Gordon:
>>>>> package:  openmpi
>>>>> 
>>>>> dpkg --search openmpi
>>>>> gromacs-openmpi: /usr/share/doc/gromacs-openmpi/copyright
>>>>> gromacs-dev: /usr/lib/libmd_mpi_openmpi.la
>>>>> gromacs-dev: /usr/lib/libgmx_mpi_d_openmpi.la
>>>>> gromacs-openmpi: /usr/share/lintian/overrides/gromacs-openmpi
>>>>> gromacs-openmpi: /usr/lib/libmd_mpi_openmpi.so.5
>>>>> gromacs-openmpi: /usr/lib/libmd_mpi_d_openmpi.so.5.0.0
>>>>> gromacs-dev: /usr/lib/libmd_mpi_openmpi.so
>>>>> gromacs-dev: /usr/lib/libgmx_mpi_d_openmpi.so
>>>>> gromacs-openmpi: /usr/lib/libmd_mpi_openmpi.so.5.0.0
>>>>> gromacs-openmpi: /usr/bin/mdrun_mpi_d.openmpi
>>>>> gromacs-openmpi: /usr/lib/libgmx_mpi_d_openmpi.so.5.0.0
>>>>> gromacs-openmpi: /usr/share/doc/gromacs-openmpi/README.Debian
>>>>> gromacs-dev: /usr/lib/libgmx_mpi_d_openmpi.a
>>>>> gromacs-openmpi: /usr/bin/mdrun_mpi.openmpi
>>>>> gromacs-openmpi: /usr/share/doc/gromacs-openmpi/changelog.Debian.gz
>>>>> gromacs-dev: /usr/lib/libmd_mpi_d_openmpi.la
>>>>> gromacs-openmpi: /usr/share/man/man1/mdrun_mpi_d.openmpi.1.gz
>>>>> gromacs-dev: /usr/lib/libgmx_mpi_openmpi.a
>>>>> gromacs-openmpi: /usr/lib/libgmx_mpi_openmpi.so.5.0.0
>>>>> gromacs-dev: /usr/lib/libmd_mpi_d_openmpi.so
>>>>> gromacs-openmpi: /usr/lib/libmd_mpi_d_openmpi.so.5
>>>>> gromacs-dev: /usr/lib/libgmx_mpi_openmpi.la
>>>>> gromacs-openmpi: /usr/share/man/man1/mdrun_mpi.openmpi.1.gz
>>>>> gromacs-openmpi: /usr/share/doc/gromacs-openmpi
>>>>> gromacs-dev: /usr/lib/libmd_mpi_openmpi.a
>>>>> gromacs-dev: /usr/lib/libgmx_mpi_openmpi.so
>>>>> gromacs-openmpi: /usr/lib/libgmx_mpi_openmpi.so.5
>>>>> gromacs-openmpi: /usr/lib/libgmx_mpi_d_openmpi.so.5
>>>>> gromacs-dev: /usr/lib/libmd_mpi_d_openmpi.a
>>>>> 
>>>>> 
>>>>> Dear support,
>>>>> I am trying to run a test case of the AZTEC library named
>>>>> az_tutorial_with_MPI.f. The example uses gfortran + MPI. The
>>>>> compilation and linkage stage goes O.K., generating an executable
>>>>> 'sample'. But when I try to run sample (on 1 or more
>>>>> processors) the run crashes immediately.
>>>>> 
>>>>> The compilation and linkage stage is done as follows:
>>>>> 
>>>>> gfortran -O  -I/shared/include -I/shared/include/openmpi/ompi/mpi/cxx
>>>>> -I../lib -DMAX_MEM_SIZE=16731136
>>>>> -DCOMM_BUFF_SIZE=200000 -DMAX_CHUNK_SIZE=200000  -c -o
>>>>> az_tutorial_with_MPI.o az_tutorial_with_MPI.f
>>>>> gfortran az_tutorial_with_MPI.o -O -L../lib -laztec  -lm -L/shared/lib
>>>>> -lgfortran -lmpi -lmpi_f77 -o sample
>>>> 
>>>> Generally, when compiling programs for use with MPI, you should use the
>>>> compiler wrappers which do all the magic. In Debian's case this is
>>>> mpif77.openmpi and mpif90.openmpi, respectively. Could you give that a
>>>> try?
>>>> 
>>>>> The run:
>>>>> /shared/home/gordon/Aztec_lib.dir/app>mpirun -np 1 sample
>>>>> 
>>>>> [cluster:12046] *** Process received signal ***
>>>>> [cluster:12046] Signal: Segmentation fault (11)
>>>>> [cluster:12046] Signal code: Address not mapped (1)
>>>>> [cluster:12046] Failing at address: 0x100000098
>>>>> [cluster:12046] [ 0] /lib/libc.so.6 [0x7fd4a2fa8f60]
>>>>> [cluster:12046] [ 1] /shared/lib/libmpi.so.0(MPI_Comm_size+0x6e)
>>>>> [0x7fd4a376c34e]
>>>>> [cluster:12046] [ 2] sample [0x4178aa]
>>>>> [cluster:12046] [ 3] sample [0x402a07]
>>>>> [cluster:12046] [ 4] sample [0x402175]
>>>>> [cluster:12046] [ 5] sample [0x401c52]
>>>>> [cluster:12046] [ 6] sample [0x448edc]
>>>>> [cluster:12046] [ 7] /lib/libc.so.6(__libc_start_main+0xe6)
>>>>> [0x7fd4a2f951a6]
>>>>> [cluster:12046] [ 8] sample [0x401a49]
>>>>> [cluster:12046] *** End of error message ***
>>>>> --------------------------------------------------------------------------
>>>>> mpirun noticed that process rank 0 with PID 12046 on node cluster exited
>>>>> on signal 11 (Segmentation fault).
>>>>> 
>>>>> Here is some information about the machine:
>>>>> 
>>>>> uname -a
>>>>> Linux cluster 2.6.26-2-amd64 #1 SMP Sun Jun 20 20:16:30 UTC 2010 x86_64
>>>>> GNU/Linux
>>>>> 
>>>>> 
>>>>> lsb_release -a
>>>>> No LSB modules are available.
>>>>> Distributor ID: Debian
>>>>> Description:    Debian GNU/Linux 5.0.5 (lenny)
>>>>> Release:        5.0.5
>>>>> Codename:       lenny
>>>>> 
>>>>> gcc --version
>>>>> gcc (Debian 4.3.2-1.1) 4.3.2
>>>>> 
>>>>> gfortran --version
>>>>> GNU Fortran (Debian 4.3.2-1.1) 4.3.2
>>>>> 
>>>>> ldd sample
>>>>>        linux-vdso.so.1 =>  (0x00007fffffffe000)
>>>>>        libgfortran.so.3 => /usr/lib/libgfortran.so.3 (0x00007fd29db16000)
>>>>>        libm.so.6 => /lib/libm.so.6 (0x00007fd29d893000)
>>>>>        libmpi.so.0 => /shared/lib/libmpi.so.0 (0x00007fd29d5e7000)
>>>>>        libmpi_f77.so.0 => /shared/lib/libmpi_f77.so.0
>>>>> (0x00007fd29d3af000)
>>>>>        libgcc_s.so.1 => /lib/libgcc_s.so.1 (0x00007fd29d198000)
>>>>>        libc.so.6 => /lib/libc.so.6 (0x00007fd29ce45000)
>>>>>        libopen-rte.so.0 => /shared/lib/libopen-rte.so.0
>>>>> (0x00007fd29cbf8000)
>>>>>        libopen-pal.so.0 => /shared/lib/libopen-pal.so.0
>>>>> (0x00007fd29c9a2000)
>>>>>        libdl.so.2 => /lib/libdl.so.2 (0x00007fd29c79e000)
>>>>>        libnsl.so.1 => /lib/libnsl.so.1 (0x00007fd29c586000)
>>>>>        libutil.so.1 => /lib/libutil.so.1 (0x00007fd29c383000)
>>>>>        libpthread.so.0 => /lib/libpthread.so.0 (0x00007fd29c167000)
>>>>>        /lib64/ld-linux-x86-64.so.2 (0x00007fd29ddf1000)
>>>>> 
>>>>> 
>>>>> Let me just mention that the C+MPI test case of the AZTEC library
>>>>> 'az_tutorial.c' runs with no problem.
>>>>> Also, az_tutorial_with_MPI.f runs O.K. on my 32-bit Linux cluster running
>>>>> gcc, g77 and MPICH, and on my 16-processor SGI
>>>>> Itanium 64-bit machine.
>>>> 
>>>> The IA64 architecture is supported by Open MPI, so this should be OK.
>>>> 
>>>>> Thank you for your help,
>>>> 
>>>> Best regards,
>>>> Manuel
>>>> 
>>>> 
>>>> 
>> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/



