Bruce Allen wrote:
I've built three other large clusters in the past, but was never motivated to do a Top500 linpack benchmark for them. This time around, for our new Nemo cluster, I want to have linpack results for the Top500 list. So Kipp Cannon, one of our group's postdocs, has spent a few days setting up and running linpack/xhpl.

We have 640 dual-core 2.2 GHz Opteron 175 nodes with 2 GB of RAM per node and a good GigE network.

We're having problems getting xhpl to run on the entire cluster, and are wondering if someone on this list might have insight into what's going wrong. At the moment, the software combination is gcc + LAM/MPI + ATLAS + HPL. Note that in our normal use the cluster runs standalone executables managed via Condor (trivially parallel code!), so this is our first use of MPI or any MPI code in at least three years.

Use Goto's BLAS library.  It is faster than ATLAS.


Testing on up to 338 nodes (676 cores), the benchmark runs fine and we are getting above 60% of peak floating-point performance. But attempting to use the entire cluster (640 nodes, 1280 cores) seems to trigger the out-of-memory killer on some nodes. The jobs never really seem to start running; they are killed before calling MPI_Init, which matches the error message we see from LAM: "job exited before calling mpi_init()".

The jobs die very quickly, so we have not been able to see how much memory they try to allocate. We are using a spreadsheet given to us by David Cownie at AMD to calculate the problem size from the maximum usable RAM per core, and have found that the spreadsheet works correctly: runs on 20, 196, and 676 cores, with problem sizes chosen from it, show the same predicted RAM usage per core in all cases.
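
For reference, the arithmetic such a sizing spreadsheet typically encodes boils down to fitting HPL's dense N x N double-precision matrix (8*N^2 bytes) into some fraction of total RAM. A minimal sketch in Python, with the 80% headroom fraction and block size NB=160 as illustrative assumptions rather than values from the post:

    import math

    def hpl_problem_size(nodes, ram_per_node_gib, usable_fraction=0.80, nb=160):
        """Largest N whose 8*N^2-byte matrix fits in usable_fraction of
        total cluster RAM, rounded down to a multiple of the block size NB."""
        total_bytes = nodes * ram_per_node_gib * 2**30
        n = int(math.sqrt(usable_fraction * total_bytes / 8))
        return n - n % nb

    n = hpl_problem_size(640, 2)              # full cluster: 640 nodes x 2 GB
    per_core_mib = 8 * n**2 / (640 * 2) / 2**20
    print(n, round(per_core_mib))             # ~370720, ~819 MiB per core

If the per-core figure stays flat at 676 cores but jumps at 1280, the extra memory is coming from something other than the matrix itself, e.g. per-rank MPI buffers.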

Could there be some threshold in xhpl, where above some problem size its RAM usage increases for other reasons?

What about the "PxQ" parameters? For 676 cores we use a square grid with P = Q (26x26), but we have to change this to use all 1280 cores. Does anyone know of problems with running xhpl when P != Q on x86_64?

Have you tried running xhpl on each half of the system separately? That will tell you whether you have hardware problems on one side.

Also, try setting N to a small number, like 10000, for the entire cluster. You can start to isolate the problem that way as well.

Just make sure that P < Q, and keep the grid as square as possible. 32x40 should work well for your system.
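
As a quick sanity check, here is a small sketch, not from the thread itself, that enumerates the possible P x Q grids for 1280 cores with P <= Q and ranks them by how close to square they are; 32x40 is indeed the most square option:

    def grids(ncores):
        """All (P, Q) factorizations with P <= Q, most square first."""
        pairs = [(p, ncores // p) for p in range(1, int(ncores**0.5) + 1)
                 if ncores % p == 0]
        return sorted(pairs, key=lambda pq: pq[1] / pq[0])

    for p, q in grids(1280)[:3]:
        print(f"P={p:3d}  Q={q:3d}  aspect={q/p:.2f}")
    # P= 32  Q= 40  aspect=1.25   <- the grid suggested above
    # P= 20  Q= 64  aspect=3.20
    # P= 16  Q= 80  aspect=5.00

The usual rationale for P <= Q is that HPL's communication pattern favors process grids that are a little flatter than they are tall.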

Craig


Cheers,
    Bruce Allen
_______________________________________________
Beowulf mailing list, [email protected]
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
