----- Original Message ----- From: "Dan Kidger" <[EMAIL PROTECTED]>
To: <beowulf@beowulf.org>
Cc: "Tom Elken" <[EMAIL PROTECTED]>
Sent: Saturday, February 11, 2006 3:13 PM
Subject: Re: [Beowulf] Re: Matrix Multiply


Tom Elken wrote:

[...] mathematician, but I'm trying to understand how the benchmark operates.
I would like to test my system by seeing how many FLOPS are achieved
using only the matrix multiply.

You could probably download the HPC Challenge benchmark (from http://icl.cs.utk.edu/hpcc/software/index.html ) and cut-and-paste some code from its DGEMM (Double-precision GEneral Matrix-Multiply) sub-benchmark as a relatively easy way to get a test program for matrix multiply.

DGEMM is typically ~90% or more of the HPL benchmark's profile.

Indeed, I have been doing that for the last couple of years, ever since hpcc appeared. It is trivial to slightly modify one file of the hpcc source so that you can run just one or two of the seven contained benchmarks by setting a shell variable - for example just dgemm in your case, or say ptrans or gups in my case. It is also very easy to pipe the hpcc output through a couple of lines of perl or sed to get just the summary output lines for the subset of tests that you ran.
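If you would rather not carve up hpcc at all, a minimal standalone timing harness is only a page of C. This is just a sketch, not hpcc's code: it assumes a CBLAS implementation (ATLAS, Goto BLAS, etc.) is installed and linked, and the matrix size n = 2000 is an arbitrary choice.

/* dgemm_test.c - time one large DGEMM call and report GFLOP/s.
 * Assumes a CBLAS library is available; the standard flop count
 * for an N x N x N matrix multiply is 2*N^3. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <cblas.h>

int main(void)
{
    int n = 2000;                        /* arbitrary problem size */
    double *a = malloc((size_t)n * n * sizeof(double));
    double *b = malloc((size_t)n * n * sizeof(double));
    double *c = malloc((size_t)n * n * sizeof(double));
    for (long i = 0; i < (long)n * n; i++) {
        a[i] = 1.0; b[i] = 2.0; c[i] = 0.0;
    }

    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    /* C = 1.0 * A * B + 0.0 * C */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, a, n, b, n, 0.0, c, n);
    gettimeofday(&t1, NULL);

    double secs = (t1.tv_sec - t0.tv_sec) + 1e-6 * (t1.tv_usec - t0.tv_usec);
    double gflops = 2.0 * n * n * n / secs / 1e9;
    printf("N=%d  %.3f s  %.2f GFLOP/s\n", n, secs, gflops);
    free(a); free(b); free(c);
    return 0;
}

Compile with something like "gcc -O2 dgemm_test.c -lcblas" (the exact library name depends on which BLAS you installed).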

As for the flops, the number achieved can depend on the low-order bits of the matrix size - it is common to see dgemm implementations oscillate due to cache-line effects and the like. Rather than speaking in terms of actual flops, it usually makes more sense to quote the percentage of theoretical peak that you achieve. The theoretical peak is well defined - simply the CPU's clock speed multiplied up by various factors like the following (a worked example appears after the list):
- times the number of CPUs per motherboard
- times the number of cores per CPU
- times the number of floating-point instructions issued per cycle (2 for Itanium and Alpha, 1 for Xeon/Opteron)
- times the width of any SIMD unit (2 for the 2x64-bit-wide SSE2)
- times two if your FPUs can do a chained multiply-add (like Itanium)
- times a 0.75 reduction factor if you can't issue floating-point loads fast enough (was this the G5?)
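To make that arithmetic concrete, here is a worked example with hypothetical numbers (the machine is just an illustration, not a measurement): a dual-socket board with 2.2 GHz dual-core Opterons has a theoretical peak of

  2.2 GHz x 1 FP instruction issued/cycle x 2 (SSE2 width) x 2 cores x 2 sockets = 17.6 GFLOP/s

so a dgemm run sustaining, say, 15.8 GFLOP/s on that box would be running at about 90% of peak.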




Some CPU architectures come out better than others, but you should expect to get, say, >85% of peak even on the worst (thank you, Mr. Goto :-) )

- the clock rate of a chip is a factor too
- so are L1 behaviour and the quirks of chip caches in general

The L1 cache is important.

Itanium2 can completely hide the latency in matrix code thanks to its 1-cycle L1.
Opteron can nearly hide the latency completely with its 3-cycle L1 @ 2 ports.
Xeon can't hide the latency at all here, as Prescott has a 4-cycle L1 @ 1 port.

Because of this and other issues, Xeon is effectively quite slow here when you want to achieve high precision by just multiply(-adding) registers, and it is the worst choice for matrix multiply.

Now let's discuss single-precision operations. Suddenly the PCs look a lot better than Itanium hardware, as the SSE unit can be of great help there, issuing several multiplies within one cycle.
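Strictly speaking, the packed single-precision multiply (mulps) arrived with SSE and works on 4 floats per instruction, while SSE2 added the packed-double multiply (mulpd) that handles only 2 doubles. A tiny sketch using the standard intrinsics illustrates the width difference (file name and values are just an illustration):

/* sse_width.c - packed single vs. packed double multiply widths. */
#include <stdio.h>
#include <xmmintrin.h>  /* SSE:  _mm_mul_ps, 4 x float  */
#include <emmintrin.h>  /* SSE2: _mm_mul_pd, 2 x double */

int main(void)
{
    float  fa[4] = {1, 2, 3, 4}, fb[4] = {5, 6, 7, 8}, fc[4];
    double da[2] = {1, 2},       db[2] = {3, 4},       dc[2];

    /* one instruction, 4 single-precision multiplies */
    __m128 vf = _mm_mul_ps(_mm_loadu_ps(fa), _mm_loadu_ps(fb));
    _mm_storeu_ps(fc, vf);

    /* one instruction, 2 double-precision multiplies */
    __m128d vd = _mm_mul_pd(_mm_loadu_pd(da), _mm_loadu_pd(db));
    _mm_storeu_pd(dc, vd);

    printf("floats:  %g %g %g %g\n", fc[0], fc[1], fc[2], fc[3]);
    printf("doubles: %g %g\n", dc[0], dc[1]);
    return 0;
}

Compile with something like "gcc -O2 -msse2 sse_width.c". Same register width, twice the elements per instruction in single precision - which is the whole point below.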

Actually, all the floating-point software we use professionally - which (luckily) is partly cooked into hardware - is single precision.

Single precision is far more interesting than double precision.
SSE2 is great there.

Vincent




Daniel.

--------------------------------------------------------------
Dr. Dan Kidger, Quadrics Ltd.      [EMAIL PROTECTED]
One Bridewell St., Bristol, BS1 2AA, UK         0117 915 5505
----------------------- www.quadrics.com --------------------

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

