In message from Mark Hahn <[EMAIL PROTECTED]> (Fri, 12 Oct 2007
16:09:05 -0400 (EDT)):
>> This means that 2 additional FP results per cycle in the
>> microarchitecture give only about a 7% performance increase :-(
> the 4 flops/cycle is really for linpack-like code: it assumes you are
> executing packed double SIMD.
Yes, but AFAIK most modern optimizing F9x compilers for x86 can
generate code w/SSEx instructions (instead of x87). And I assume that
many real-world codes, including some from the SPECfp2006 set, include
work w/floating-point vectors. The vectors need not be very long,
taking into account that an SSE vector of 64-bit doubles has
length=2.
Such things may theoretically give a 2x speedup!
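To make "packed double SIMD" concrete, here is a minimal sketch of a daxpy-style loop written with SSE2 intrinsics. The function name daxpy_sse2 is made up for illustration, and this is only an assumption about what a vectorizing compiler might generate for such a loop, not actual compiler output. Each SSE register holds two doubles, so the main loop does two multiply-adds per iteration, which is where the theoretical 2x comes from.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (x86 / x86-64 only) */
#include <stddef.h>

/* y[i] += a * x[i], two doubles at a time.
 * daxpy_sse2 is a hypothetical name used only for this illustration. */
void daxpy_sse2(size_t n, double a, const double *x, double *y)
{
    __m128d va = _mm_set1_pd(a);   /* broadcast a into both lanes */
    size_t i = 0;

    /* packed part: each iteration handles a vector of length 2 */
    for (; i + 2 <= n; i += 2) {
        __m128d vx = _mm_loadu_pd(x + i);   /* unaligned loads are safe */
        __m128d vy = _mm_loadu_pd(y + i);
        vy = _mm_add_pd(vy, _mm_mul_pd(va, vx));
        _mm_storeu_pd(y + i, vy);
    }

    /* scalar tail for odd n */
    for (; i < n; ++i)
        y[i] += a * x[i];
}
```

Calling daxpy_sse2(5, 2.0, x, y) processes elements 0..3 in two packed iterations and element 4 in the scalar tail. Loops with dependencies between iterations or irregular memory access cannot be rewritten this way so easily.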
> just that not all FP is SIMD-friendly, I think.
Yes, I agree w/"not all". But a 7% speedup would mean, I believe, that
SIMD-friendly FP code is "very rare"?
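A rough way to put a number on this is Amdahl's law. The sketch below assumes the ideal case from this thread (the SIMD-friendly part runs exactly 2x faster, everything else is unchanged); the function names are made up for illustration, and the result is an estimate, not a measurement.

```c
/* Amdahl's law: overall speedup when a fraction f of the run time
 * is accelerated by a factor s and the rest is unchanged. */
double amdahl_speedup(double f, double s)
{
    return 1.0 / ((1.0 - f) + f / s);
}

/* Inverse: the fraction of run time that must have been accelerated
 * by factor s to yield the observed overall speedup. */
double simd_fraction(double overall, double s)
{
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / s);
}
```

With overall = 1.07 and s = 2.0, simd_fraction gives about 0.13. So under these assumptions a 7% gain corresponds to roughly 13% of the original run time being spent in code that actually benefits from the wider FP units.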
Yours
Mikhail
> if your code spends a lot of time in blas/lapack functions, I would
> expect it to see good speedup.
>
> regards, mark hahn.
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org