Bruce Allen wrote:
I've built three other large clusters in the past, but was never motivated to do a Top500 linpack benchmark for them. This time around, for our new Nemo cluster, I want to have linpack results for the Top500 list. So Kipp Cannon, one of our group's postdocs, has spent a few days setting up and running linpack/xhpl.

We have 640 dual-core 2.2 GHz Opteron 175 nodes with 2 GB of RAM per node and a good GigE network.

We're having problems getting xhpl to run on the entire cluster, and are wondering if someone on this list might have insight into what's going wrong. At the moment, the software combination is gcc + LAM/MPI + ATLAS + HPL. Note that in our normal use the cluster runs standalone executables managed via Condor (trivially parallel code!), so this is our first use of MPI or any MPI code in at least three years.

Use Goto's BLAS library.  It is faster than ATLAS.


Testing on up to 338 nodes (676 cores), the benchmark runs fine and we are getting above 60% of peak floating-point performance. But attempting to use the entire cluster (640 nodes, 1280 cores) seems to trigger the out-of-memory killer on some nodes. The jobs never really seem to start running; they are killed before calling MPI_Init, which matches the error message we see from LAM: "job exited before calling mpi_init()".

The jobs die very quickly, so we have not been able to see how much memory they try to allocate. We are using a spreadsheet given to us by David Cownie at AMD to calculate the problem size from the maximum usable RAM per core, and have found that the spreadsheet works correctly: runs on 20, 196, and 676 cores, with problem sizes chosen from it, show the same predicted RAM usage per core in all cases.
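
For reference, the arithmetic such a sizing spreadsheet typically encodes boils down to fitting HPL's dense N x N double-precision matrix (8*N^2 bytes) into some fraction of total RAM. A minimal sketch in Python, with the 80% headroom fraction and block size NB=160 as illustrative assumptions rather than values from the post:

    import math

    def hpl_problem_size(nodes, ram_per_node_gib, usable_fraction=0.80, nb=160):
        """Largest N whose 8*N^2-byte matrix fits in usable_fraction of
        total cluster RAM, rounded down to a multiple of the block size NB."""
        total_bytes = nodes * ram_per_node_gib * 2**30
        n = int(math.sqrt(usable_fraction * total_bytes / 8))
        return n - n % nb

    n = hpl_problem_size(640, 2)              # full cluster: 640 nodes x 2 GB
    per_core_mib = 8 * n**2 / (640 * 2) / 2**20
    print(n, round(per_core_mib))             # ~370720, ~819 MiB per core

If the per-core figure stays flat at 676 cores but jumps at 1280, the extra memory is coming from something other than the matrix itself, e.g. per-rank MPI buffers.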

Could there be some threshold in xhpl, where above some problem size its RAM usage increases for other reasons?

What about the "PxQ" parameters? For 676 cores we use a square grid with P = Q (26x26), but we have to change this to use all 1280 cores. Does anyone know of problems with running xhpl when P != Q on x86_64?

Have you tried running xhpl on each half of the system separately? That will tell you whether you have hardware problems on one side.

Also, try setting N to a small number, like 10000, for the entire cluster. You can start to isolate the problem that way as well.

Just make sure that P < Q, and keep the grid as square as possible. 32x40 should work well for your system.
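
As a quick sanity check, here is a small sketch, not from the thread itself, that enumerates the possible P x Q grids for 1280 cores with P <= Q and ranks them by how close to square they are; 32x40 is indeed the most square option:

    def grids(ncores):
        """All (P, Q) factorizations with P <= Q, most square first."""
        pairs = [(p, ncores // p) for p in range(1, int(ncores**0.5) + 1)
                 if ncores % p == 0]
        return sorted(pairs, key=lambda pq: pq[1] / pq[0])

    for p, q in grids(1280)[:3]:
        print(f"P={p:3d}  Q={q:3d}  aspect={q/p:.2f}")
    # P= 32  Q= 40  aspect=1.25   <- the grid suggested above
    # P= 20  Q= 64  aspect=3.20
    # P= 16  Q= 80  aspect=5.00

The usual rationale for P <= Q is that HPL's communication pattern favors process grids that are a little flatter than they are tall.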

Craig


Cheers,
    Bruce Allen
_______________________________________________
Beowulf mailing list, [email protected]
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
