Hi Carsten
The problem is most likely mpich 1.2.7.
MPICH-1 is old and no longer maintained.
It is based on the P4 lower-level communication library, which
doesn't seem to talk properly to current Linux kernels and/or
to current Ethernet card drivers.
There were several postings on this list, on the ROCKS Clusters list,
on the MPICH list, etc, reporting errors very similar to yours:
a p4 error followed by a segmentation fault.
The MPICH developers recommend upgrading to MPICH2 because of
these problems, as well as for performance, ease of use, etc.
The easy fix is to use another MPI, say, OpenMPI or MPICH2.
I would guess they are available as packages for Debian.
However, you can build both very easily
from source using just gcc/g++/gfortran.
Get the source code and documentation, then read the README files,
the FAQ (OpenMPI), and the Install Guide and User Guide (MPICH2)
for details:
OpenMPI:
http://www.open-mpi.org/
http://www.open-mpi.org/software/ompi/v1.4/
http://www.open-mpi.org/faq/
http://www.open-mpi.org/faq/?category=building
MPICH2:
http://www.mcs.anl.gov/research/projects/mpich2/
http://www.mcs.anl.gov/research/projects/mpich2/downloads/index.php?s=downloads
http://www.mcs.anl.gov/research/projects/mpich2/documentation/index.php?s=docs
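If you decide to build from source, both follow the usual
configure / make / make install pattern. Here is a rough sketch,
where the version numbers and install prefixes are only examples
(use whatever tarball you actually download):

  # OpenMPI (example version and prefix)
  tar xzf openmpi-1.4.1.tar.gz
  cd openmpi-1.4.1
  ./configure --prefix=/opt/openmpi-1.4.1 CC=gcc CXX=g++ F77=gfortran FC=gfortran
  make -j4 && make install

  # MPICH2 (example version and prefix; compiler selection is
  # described in the Install Guide)
  tar xzf mpich2-1.2.1.tar.gz
  cd mpich2-1.2.1
  ./configure --prefix=/opt/mpich2-1.2.1
  make && make install

Afterwards put the chosen MPI's bin directory first in your PATH
(and its lib directory in LD_LIBRARY_PATH) before rebuilding HPL.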
I compiled and ran HPL here with both OpenMPI and MPICH2
(and MVAPICH2 as well), and it works just fine,
over Ethernet and over Infiniband.
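In case it is useful, this is roughly what I change in the HPL
Make.<arch> file; the paths below are placeholders for my setup,
adjust them to wherever your MPI and BLAS actually live (if you use
mpicc as CC, the MPdir/MPinc/MPlib lines can even be left empty):

  # hpl-2.0/Make.Linux_x86_64, copied from one of the setup/ templates
  ARCH    = Linux_x86_64
  TOPdir  = $(HOME)/hpl-2.0
  MPdir   = /opt/openmpi-1.4.1
  MPinc   = -I$(MPdir)/include
  MPlib   = $(MPdir)/lib/libmpi.so
  LAdir   = /usr/lib
  LAlib   = $(LAdir)/libblas.a
  CC      = $(MPdir)/bin/mpicc
  LINKER  = $(CC)

  $ make arch=Linux_x86_64
  $ /opt/openmpi-1.4.1/bin/mpirun -np 2 -machinefile machines \
        ./bin/Linux_x86_64/xhpl

Also make sure that the mpirun you launch with belongs to the same
MPI you compiled xhpl against; mixing them is a frequent source of
confusing failures.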
I hope this helps.
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------
Carsten Aulbert wrote:
Hi all,
I wanted to run high performance linpack mostly for fun (and of course to
learn more about it and stress test a couple of machines). However, so far
I've had very mixed results.
I downloaded the 2.0 version released in September 2008 and managed to
compile it with mpich 1.2.7 on Debian Lenny. The resulting xhpl binary is
dynamically linked as follows:
linux-vdso.so.1 => (0x00007fffca372000)
libpthread.so.0 => /lib/libpthread.so.0 (0x00007fb47bca8000)
librt.so.1 => /lib/librt.so.1 (0x00007fb47ba9f000)
libgfortran.so.3 => /usr/lib/libgfortran.so.3 (0x00007fb47b7c4000)
libm.so.6 => /lib/libm.so.6 (0x00007fb47b541000)
libgcc_s.so.1 => /lib/libgcc_s.so.1 (0x00007fb47b32a000)
libc.so.6 => /lib/libc.so.6 (0x00007fb47afd7000)
/lib64/ld-linux-x86-64.so.2 (0x00007fb47bec4000)
Then I wanted to run a couple of tests on a single quad-CPU node (with 12 GB
of physical RAM). I used
http://www.advancedclustering.com/faq/how-do-i-tune-my-hpldat-file.html
to generate input files for a single-core and a dual-core test ([1] and [2] below).
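(As a sanity check on the problem size: the N=14592 matrix alone needs
about N^2 * 8 bytes = 14592^2 * 8 ~= 1.7 GB, so it fits comfortably
within the 12 GB of RAM.)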
Starting the single-core run does not pose any problem:
/usr/bin/mpirun.mpich -np 1 -machinefile machines /nfs/xhpl
where machines is simply a file containing the name of this host four times.
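For illustration it looks like this (with a placeholder for the
actual hostname):
  $ cat machines
  node01
  node01
  node01
  node01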
So far so good.
============================================================================
T/V N NB P Q Time Gflops
----------------------------------------------------------------------------
WR11C2R4 14592 128 1 1 407.94 5.078e+00
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0087653 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0209927 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0045327 ...... PASSED
============================================================================
When starting the dual-core run, I receive the following error message after a
couple of seconds (once RSS hits the VIRT value in top):
/usr/bin/mpirun.mpich -np 2 -machinefile machines /nfs/xhpl
p0_20535: p4_error: interrupt SIGSEGV: 11
rm_l_1_20540: (1.804688) net_send: could not write to fd=5, errno = 32
SIGSEGV with p4_error indicates a segfault within xhpl - that's as far as I've
gotten with Google, and right now I have no idea how to proceed. I somehow doubt
that this venerable program is so buggy that I'd hit a crash on my first day ;)
Any ideas where I might be doing something wrong?
Cheers
Carsten
[1]
single core test
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
8 device out (6=stdout,7=stderr,file)
1 # of problems sizes (N)
14592 Ns
1 # of NBs
128 NBs
0 PMAP process mapping (0=Row-,1=Column-major)
1 # of process grids (P x Q)
1 Ps
1 Qs
16.0 threshold
1 # of panel fact
2 PFACTs (0=left, 1=Crout, 2=Right)
1 # of recursive stopping criterium
4 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
1 # of recursive panel fact.
1 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
1 DEPTHs (>=0)
2 SWAP (0=bin-exch,1=long,2=mix)
64 swapping threshold
0 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
1 Equilibration (0=no,1=yes)
8 memory alignment in double (> 0)
##### This line (no. 32) is ignored (it serves as a separator). ######
0 Number of additional problem sizes for PTRANS
1200 10000 30000 values of N
0 number of additional blocking sizes for PTRANS
40 9 8 13 13 20 16 32 64 values of NB
[2]
dual core setup
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
8 device out (6=stdout,7=stderr,file)
1 # of problems sizes (N)
14592 Ns
1 # of NBs
128 NBs
0 PMAP process mapping (0=Row-,1=Column-major)
1 # of process grids (P x Q)
1 Ps
2 Qs
16.0 threshold
1 # of panel fact
2 PFACTs (0=left, 1=Crout, 2=Right)
1 # of recursive stopping criterium
4 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
1 # of recursive panel fact.
1 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
1 DEPTHs (>=0)
2 SWAP (0=bin-exch,1=long,2=mix)
64 swapping threshold
0 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
1 Equilibration (0=no,1=yes)
8 memory alignment in double (> 0)
##### This line (no. 32) is ignored (it serves as a separator). ######
0 Number of additional problem sizes for PTRANS
1200 10000 30000 values of N
0 number of additional blocking sizes for PTRANS
40 9 8 13 13 20 16 32 64 values of NB