----- Original Message ----- From: "Dan Kidger" <[EMAIL PROTECTED]>
To: <beowulf@beowulf.org>
Cc: "Tom Elken" <[EMAIL PROTECTED]>
Sent: Saturday, February 11, 2006 3:13 PM
Subject: Re: [Beowulf] Re: Matrix Multiply


Tom Elken wrote:

[...] mathematician, but I'm trying to understand how the benchmark operates.
I would like to test my system by seeing how many FLOPS are achieved
using only the matrix multiply.

You could probably download the HPC Challenge benchmark (from http://icl.cs.utk.edu/hpcc/software/index.html ) and cut-and-paste some code from its DGEMM (Double-precision GEneral Matrix-Multiply) sub-benchmark as a relatively easy way to get a test program for matrix multiply.

DGEMM is typically ~90% or more of the HPL benchmark's profile.

Indeed, I have been doing that for the last couple of years, ever since hpcc appeared. It is trivial to slightly modify one file of the hpcc source so that you can run just one or two of the seven contained benchmarks by setting a shell variable - for example just dgemm in your case, or say ptrans or gups in my case. It is also very easy to pipe the hpcc output through a couple of lines of perl or sed to get just the summary output lines for the subset of tests that you ran.
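If you would rather not carve up hpcc at all, a minimal standalone timing harness is only a page of C. This is just a sketch, not hpcc's code: it assumes a CBLAS implementation (ATLAS, Goto BLAS, etc.) is installed and linked, and the matrix size n = 2000 is an arbitrary choice.

/* dgemm_test.c - time one large DGEMM call and report GFLOP/s.
 * Assumes a CBLAS library is available; the standard flop count
 * for an N x N x N matrix multiply is 2*N^3. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <cblas.h>

int main(void)
{
    int n = 2000;                        /* arbitrary problem size */
    double *a = malloc((size_t)n * n * sizeof(double));
    double *b = malloc((size_t)n * n * sizeof(double));
    double *c = malloc((size_t)n * n * sizeof(double));
    for (long i = 0; i < (long)n * n; i++) {
        a[i] = 1.0; b[i] = 2.0; c[i] = 0.0;
    }

    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    /* C = 1.0 * A * B + 0.0 * C */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, a, n, b, n, 0.0, c, n);
    gettimeofday(&t1, NULL);

    double secs = (t1.tv_sec - t0.tv_sec) + 1e-6 * (t1.tv_usec - t0.tv_usec);
    double gflops = 2.0 * n * n * n / secs / 1e9;
    printf("N=%d  %.3f s  %.2f GFLOP/s\n", n, secs, gflops);
    free(a); free(b); free(c);
    return 0;
}

Compile with something like "gcc -O2 dgemm_test.c -lcblas" (the exact library name depends on which BLAS you installed).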

As for the flops, the number achieved can depend on the low-order bits of the matrix size - it is common to see dgemm implementations oscillate due to cache-line effects and the like. Rather than speaking in terms of actual flops, it usually makes more sense to quote the percentage of theoretical peak that you achieve. The theoretical peak is well defined - simply the CPU's clock speed multiplied up by various factors like the following (a worked example appears after the list):
- times the number of CPUs per motherboard
- times the number of cores per CPU
- times the number of floating-point instructions issued per cycle (2 for Itanium and Alpha, 1 for Xeon/Opteron)
- times the width of any SIMD unit (2 for the 2x64-bit-wide SSE2)
- times two if your FPUs can do a chained multiply-add (like Itanium)
- times a 0.75 reduction factor if you can't issue floating-point loads fast enough (was this the G5?)
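To make that arithmetic concrete, here is a worked example with hypothetical numbers (the machine is just an illustration, not a measurement): a dual-socket board with 2.2 GHz dual-core Opterons has a theoretical peak of

  2.2 GHz x 1 FP instruction issued/cycle x 2 (SSE2 width) x 2 cores x 2 sockets = 17.6 GFLOP/s

so a dgemm run sustaining, say, 15.8 GFLOP/s on that box would be running at about 90% of peak.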




Some CPU architectures come out better than others, but you should expect to get, say, >85% of peak even on the worst (thank you, Mr. Goto :-) )

- the clock rate of a chip is a factor too
- so are L1 behaviour and the quirks of chip caches in general

The L1 cache is important.

Itanium2 can completely hide the latency in matrix code thanks to its 1-cycle L1.
Opteron can nearly hide the latency completely with its 3-cycle L1 @ 2 ports.
Xeon can't hide the latency at all here, as Prescott has a 4-cycle L1 @ 1 port.

Because of this and other issues, Xeon is effectively quite slow here when you want to achieve high precision by just multiply(-adding) registers, and it is the worst choice for matrix multiply.

Now let's discuss single-precision operations. Suddenly the PCs look a lot better than Itanium hardware, as the SSE unit can be of great help there, issuing several multiplies within one cycle.
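Strictly speaking, the packed single-precision multiply (mulps) arrived with SSE and works on 4 floats per instruction, while SSE2 added the packed-double multiply (mulpd) that handles only 2 doubles. A tiny sketch using the standard intrinsics illustrates the width difference (file name and values are just an illustration):

/* sse_width.c - packed single vs. packed double multiply widths. */
#include <stdio.h>
#include <xmmintrin.h>  /* SSE:  _mm_mul_ps, 4 x float  */
#include <emmintrin.h>  /* SSE2: _mm_mul_pd, 2 x double */

int main(void)
{
    float  fa[4] = {1, 2, 3, 4}, fb[4] = {5, 6, 7, 8}, fc[4];
    double da[2] = {1, 2},       db[2] = {3, 4},       dc[2];

    /* one instruction, 4 single-precision multiplies */
    __m128 vf = _mm_mul_ps(_mm_loadu_ps(fa), _mm_loadu_ps(fb));
    _mm_storeu_ps(fc, vf);

    /* one instruction, 2 double-precision multiplies */
    __m128d vd = _mm_mul_pd(_mm_loadu_pd(da), _mm_loadu_pd(db));
    _mm_storeu_pd(dc, vd);

    printf("floats:  %g %g %g %g\n", fc[0], fc[1], fc[2], fc[3]);
    printf("doubles: %g %g\n", dc[0], dc[1]);
    return 0;
}

Compile with something like "gcc -O2 -msse2 sse_width.c". Same register width, twice the elements per instruction in single precision - which is the whole point below.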

Actually, all the floating-point software we use professionally - which (luckily) is partly cooked into hardware - is single precision.

Single precision is far more interesting than double precision.
SSE2 is great there.

Vincent




Daniel.

--------------------------------------------------------------
Dr. Dan Kidger, Quadrics Ltd.      [EMAIL PROTECTED]
One Bridewell St., Bristol, BS1 2AA, UK         0117 915 5505
----------------------- www.quadrics.com --------------------

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

