Hi Carsten

The problem is most likely mpich 1.2.7.
MPICH-1 is old and no longer maintained.
It is based on the P4 lower level libraries, which don't
seem to talk properly to current Linux kernels and/or
to current Ethernet card drivers.

There were several postings on this list, on the ROCKS Clusters list,
on the MPICH list, etc, reporting errors very similar to yours:
a p4 error followed by a segmentation fault.
The MPICH developers recommend upgrading to MPICH2 because of
these problems, not to mention better performance, ease of use, etc.

The easy fix is to use another MPI, say, OpenMPI or MPICH2.
I would guess they are available as packages for Debian.

However, you can build both very easily
from source using just gcc/g++/gfortran.
Get the source code and documentation,
then read the README files, FAQ (OpenMPI),
and Install Guide, User Guide (MPICH2) for details:

OpenMPI:
http://www.open-mpi.org/
http://www.open-mpi.org/software/ompi/v1.4/
http://www.open-mpi.org/faq/
http://www.open-mpi.org/faq/?category=building

MPICH2:
http://www.mcs.anl.gov/research/projects/mpich2/
http://www.mcs.anl.gov/research/projects/mpich2/downloads/index.php?s=downloads
http://www.mcs.anl.gov/research/projects/mpich2/documentation/index.php?s=docs

I compiled and ran HPL here with both OpenMPI and MPICH2
(and MVAPICH2 as well), and it works just fine,
over Ethernet and over Infiniband.

I hope this helps.
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------


Carsten Aulbert wrote:
Hi all,

I wanted to run high performance linpack mostly for fun (and of course to learn more about it and stress test a couple of machines). However, so far I've had very mixed results.

I downloaded the 2.0 version released in September 2008 and managed to compile it with mpich 1.2.7 on Debian Lenny. The resulting xhpl binary is dynamically linked like this:

        linux-vdso.so.1 =>  (0x00007fffca372000)
        libpthread.so.0 => /lib/libpthread.so.0 (0x00007fb47bca8000)
        librt.so.1 => /lib/librt.so.1 (0x00007fb47ba9f000)
        libgfortran.so.3 => /usr/lib/libgfortran.so.3 (0x00007fb47b7c4000)
        libm.so.6 => /lib/libm.so.6 (0x00007fb47b541000)
        libgcc_s.so.1 => /lib/libgcc_s.so.1 (0x00007fb47b32a000)
        libc.so.6 => /lib/libc.so.6 (0x00007fb47afd7000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fb47bec4000)

Then I wanted to run a couple of tests on a single quad-CPU node (with 12 GB of physical RAM). I used

http://www.advancedclustering.com/faq/how-do-i-tune-my-hpldat-file.html

to generate files for a single and a dual core test [1] and [2].
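For reference, the sizing rule that page appears to use is simple enough to sketch (my guess at the formula from its output, not their actual code): size the N x N double-precision matrix to fill a chosen fraction of memory, then round N down to a multiple of NB. The 2 GiB input below is an assumption chosen to reproduce the N it gave me:

```python
import math

def hpl_n(mem_bytes, nb, fraction=0.80):
    # Rule of thumb: the N x N matrix of 8-byte doubles should fill
    # about `fraction` of the memory you are willing to spend;
    # round N down to a multiple of the block size NB.
    n = math.sqrt(fraction * mem_bytes / 8)  # 8 bytes per double
    return int(n // nb) * nb

# Assumed inputs: ~2 GiB for the matrix, NB=128.
print(hpl_n(2 * 1024**3, 128))  # -> 14592
```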

Starting the single core run does not pose any problem:
/usr/bin/mpirun.mpich -np 1  -machinefile machines /nfs/xhpl

where machines is just a simple file containing 4 times the name of this host. So far so good.
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR11C2R4       14592   128     1     1             407.94          5.078e+00
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0087653 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0209927 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0045327 ...... PASSED
============================================================================
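As a sanity check on that result: HPL derives Gflops from the nominal LU operation count, (2/3)*N^3 + (3/2)*N^2 floating-point operations, divided by the wall time (if I remember its formula right). Just the arithmetic in Python:

```python
def hpl_gflops(n, seconds):
    # Nominal LU factorization + solve operation count reported by HPL:
    # 2/3*N^3 + 3/2*N^2 flops, scaled to Gflops.
    flops = (2.0 / 3.0) * n**3 + (3.0 / 2.0) * n**2
    return flops / seconds / 1e9

# The run above: N=14592 in 407.94 s.
print(round(hpl_gflops(14592, 407.94), 3))  # -> 5.078
```

which matches the 5.078e+00 in the table, so the single-core report looks self-consistent.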

When starting the two core run, I receive the following error message after a couple of seconds (after RSS hits the VIRT RAM value in top):

/usr/bin/mpirun.mpich -np 2  -machinefile machines /nfs/xhpl
p0_20535:  p4_error: interrupt SIGSEGV: 11
rm_l_1_20540: (1.804688) net_send: could not write to fd=5, errno = 32

SIGSEGV with p4_error indicates a seg fault within hpl - that's as far as I've gotten with Google, but right now I have no idea how to proceed. I somehow doubt that this venerable program is so buggy that I'd hit a bug on my first day ;)

Any ideas what I might be doing wrong?
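For completeness I also made sure -np covers the P x Q grid, since HPL needs at least P*Q ranks per grid (a throwaway check of my own, not part of HPL):

```python
def grid_fits(n_procs, ps, qs):
    # HPL needs at least P*Q MPI ranks for every process grid
    # listed in HPL.dat; surplus ranks simply stay idle.
    return all(p * q <= n_procs for p, q in zip(ps, qs))

print(grid_fits(2, [1], [2]))  # -> True  (my dual core setup)
print(grid_fits(1, [1], [2]))  # -> False (would be a config error)
```

so at least the grid setup shouldn't be the cause here.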

Cheers

Carsten

[1]
single core test
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
8            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
14592         Ns
1            # of NBs
128           NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
1            Ps
1            Qs
16.0         threshold
1            # of panel fact
2            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
1            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
##### This line (no. 32) is ignored (it serves as a separator). ######
0                               Number of additional problem sizes for PTRANS
1200 10000 30000                values of N
0                               number of additional blocking sizes for PTRANS
40 9 8 13 13 20 16 32 64        values of NB

[2]
dual core setup
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
8            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
14592         Ns
1            # of NBs
128           NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
1            Ps
2            Qs
16.0         threshold
1            # of panel fact
2            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
1            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
##### This line (no. 32) is ignored (it serves as a separator). ######
0                               Number of additional problem sizes for PTRANS
1200 10000 30000                values of N
0                               number of additional blocking sizes for PTRANS
40 9 8 13 13 20 16 32 64        values of NB
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
