----- Original Message -----
From: "Dan Kidger" <[EMAIL PROTECTED]>
To: <beowulf@beowulf.org>
Cc: "Tom Elken" <[EMAIL PROTECTED]>
Sent: Saturday, February 11, 2006 3:13 PM
Subject: Re: [Beowulf] Re: Matrix Multiply
Tom Elken wrote:
mathematician, but I'm trying to understand how the benchmark operates.
I would like to test my system by seeing how many FLOPS are achieved
using only the Matrix Multiply.
You could probably download the HPC Challenge benchmark (from
http://icl.cs.utk.edu/hpcc/software/index.html ) and cut-paste some code
from its DGEMM (Double-precision, GEneral Matrix-Multiply) sub-benchmark
as a relatively easy way to get a test program for matrix-multiply.
DGEMM is typically ~90% or more of the HPL benchmark's profile.
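If you don't want to dig through the hpcc tree at all, a bare-bones
harness around a CBLAS dgemm call gives the same number. A minimal
sketch, assuming a CBLAS header and an optimised BLAS (Goto, ATLAS,
MKL, ACML) to link against; the matrix order n is just a placeholder
to tune:

    /* dgemm_flops.c - time one DGEMM call and report GFLOP/s.
     * Build (example): gcc -O2 -std=c99 dgemm_flops.c -lcblas
     * (link flags vary by BLAS vendor).                        */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <cblas.h>

    int main(void)
    {
        int n = 2000;            /* matrix order - tune to taste */
        double *A = malloc((size_t)n * n * sizeof *A);
        double *B = malloc((size_t)n * n * sizeof *B);
        double *C = malloc((size_t)n * n * sizeof *C);
        if (!A || !B || !C) return 1;

        for (long i = 0; i < (long)n * n; i++) {
            A[i] = rand() / (double)RAND_MAX;
            B[i] = rand() / (double)RAND_MAX;
            C[i] = 0.0;
        }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        /* C = 1.0*A*B + 0.0*C : where HPL/hpcc spend their time */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, A, n, B, n, 0.0, C, n);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec)
                    + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        double flops = 2.0 * n * n * n;  /* 2*n^3 flops in dgemm */
        printf("N=%d  %.3f s  %.2f GFLOP/s\n",
               n, secs, flops / secs / 1e9);
        free(A); free(B); free(C);
        return 0;
    }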
Indeed I have been doing that for the last couple of years, since hpcc
appeared. It is trivial to modify one file of the hpcc source slightly
so that you can run just one or two of the seven contained benchmarks
by setting a shell variable - for example just dgemm in your case, or
say ptrans or gups in my case. It is also very easy to pipe the hpcc
output through a couple of lines of perl or sed so as to get just the
summary output lines for the subset of tests that you ran.
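For the curious, the modification really is nothing more than guarding
each benchmark call with an environment variable test. The snippet
below is only a sketch of the idea, not the actual hpcc dispatch code
- the function names are made up, and in real hpcc you would graft the
getenv() test into whatever file calls the seven benchmarks in turn:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Stubs standing in for the real benchmark entry points. */
    static void run_dgemm(void)  { puts("dgemm");  }
    static void run_ptrans(void) { puts("ptrans"); }
    static void run_gups(void)   { puts("gups");   }

    /* Run a benchmark if HPCC_ONLY is unset (run everything) or
     * names it in a comma-separated list.                       */
    static int want(const char *name)
    {
        const char *only = getenv("HPCC_ONLY");
        return only == NULL || strstr(only, name) != NULL;
    }

    int main(void)
    {
        if (want("dgemm"))  run_dgemm();
        if (want("ptrans")) run_ptrans();
        if (want("gups"))   run_gups();
        /* ...and the other four benchmarks, guarded likewise... */
        return 0;
    }

Then e.g. HPCC_ONLY=dgemm ./hpcc runs just the one test.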
As for the flops, the figure can depend on the lower bits of the
matrix size - it is common to see dgemm implementations oscillate due
to cache line hits and the like. Rather than speak in terms of actual
flops, it usually makes more sense to quote the percentage of
theoretical peak you get. The theoretical peak is well defined -
simply the cpu's clock speed multiplied up by various factors (a
worked example follows the list):
- times number of cpus per Motherboard
- times number of cores per cpu
- times number of floating point instructions issued per cycle (2 for
itanium and alpha, 1 for xeon/opteron)
- times width of any SIMD unit (2 for the 2*64-bit wide SSE2)
- times two if your FPUs can do chained muladd (like itanium)
- times a 75% reduction factor if you can't issue floating point
loads fast enough (was this the G5?)
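To make that concrete, for a made-up dual-socket, dual-core 2.2 GHz
Opteron box the factors above give

    2.2 GHz x 2 cpus x 2 cores x 1 FP issue/cycle x 2-wide SSE2
      = 17.6 GFLOP/s theoretical peak

so a measured 15.0 GFLOP/s in dgemm would be about 85% of peak.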
Some cpu architectures come out better than others, but you should
expect to get, say, >85% of peak even on the worst (thank you Mr Goto
:-) )
- clockrate of a chip is a factor too
- L1 issues and weird habits of chip caches in general
The L1 cache is important here.
Itanium2 can completely hide the load latency in the matrix kernel
because of a 1-cycle L1.
Opteron can nearly completely hide the latency because of a 3-cycle L1
with 2 ports.
Xeon can't hide the latency at all here, as Prescott has a 4-cycle L1
with 1 port.
Because of this and other issues Xeon is effectively very slow here
when you want to achieve high precision by just multiply(-adding)
registers, and it is the worst choice for matrix multiply.
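For anyone wondering what "multiply-adding registers" looks like in
practice: the trick is a register-blocked inner kernel that keeps
several independent accumulators live, so loads from L1 overlap with
the arithmetic. A minimal sketch in plain C (no SIMD, no unrolling,
sizes assumed even and cache-resident - real kernels like Goto's are
far more careful):

    /* 2x2 register-blocked kernel: four accumulators stay in
     * registers, so each A/B element loaded from L1 feeds two
     * multiply-adds and the load latency can hide behind the
     * arithmetic.  Row-major A (m x k), B (k x n), C (m x n). */
    static void mm_2x2(int m, int n, int k,
                       const double *A, const double *B, double *C)
    {
        for (int i = 0; i < m; i += 2)
            for (int j = 0; j < n; j += 2) {
                double c00 = 0, c01 = 0, c10 = 0, c11 = 0;
                for (int p = 0; p < k; p++) {
                    double a0 = A[(i + 0) * k + p];
                    double a1 = A[(i + 1) * k + p];
                    double b0 = B[p * n + j + 0];
                    double b1 = B[p * n + j + 1];
                    c00 += a0 * b0;  c01 += a0 * b1;
                    c10 += a1 * b0;  c11 += a1 * b1;
                }
                C[(i + 0) * n + j + 0] = c00;
                C[(i + 0) * n + j + 1] = c01;
                C[(i + 1) * n + j + 0] = c10;
                C[(i + 1) * n + j + 1] = c11;
            }
    }

Four loads feed four multiply-adds per iteration; on a 3-cycle L1
with 2 ports that roughly balances, while on a 4-cycle 1-port L1 the
loads become the bottleneck.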
Now let's discuss single precision operations. Suddenly the PCs look
a lot better than the Itanium hardware, as SSE2 can be of great help
there, suddenly issuing 2 multiplies within 1 cycle.
Actually, all the floating point software we use professionally
(which, luckily, is partly cooked in hardware) is single precision.
Single precision is far more interesting than double precision.
SSE2 is great there.
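To illustrate (strictly the packed-single instructions are SSE rather
than SSE2, which adds the 2-wide double forms, but the point stands):
one instruction operates on four floats at once. A hedged sketch - the
function name is made up:

    /* y += a*x on four floats per instruction.
     * Compile with e.g. gcc -O2 -std=c99 -msse.
     * n is assumed divisible by 4.                  */
    #include <xmmintrin.h>

    void saxpy4(int n, float a, const float *x, float *y)
    {
        __m128 va = _mm_set1_ps(a);
        for (int i = 0; i < n; i += 4) {
            __m128 vx = _mm_loadu_ps(x + i);
            __m128 vy = _mm_loadu_ps(y + i);
            vy = _mm_add_ps(vy, _mm_mul_ps(va, vx)); /* 4 lanes */
            _mm_storeu_ps(y + i, vy);
        }
    }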
Vincent
Daniel.
--------------------------------------------------------------
Dr. Dan Kidger, Quadrics Ltd. [EMAIL PROTECTED]
One Bridewell St., Bristol, BS1 2AA, UK 0117 915 5505
----------------------- www.quadrics.com --------------------
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf