This means that 2 additional FP results per cycle in the microarchitecture give only about a 7% performance increase :-(
the 4 flops/cycle is really for linpack-like code: it assumes you are executing packed double SIMD.
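A back-of-the-envelope sketch of where a 4 flops/cycle peak can come from. The 128-bit register width and the "one packed add plus one packed multiply per cycle" issue rate are illustrative assumptions, not figures stated in this thread, and `peak_flops_per_cycle` is a made-up helper name:

```python
# Rough peak FP throughput per core, per cycle.
# Assumptions (illustrative only): 128-bit SIMD registers and two
# independent FP units (one add, one multiply) that can each start
# one operation per cycle.

def peak_flops_per_cycle(simd_bits, element_bits, fp_units):
    """Peak results per cycle = lanes per register * independent FP units."""
    lanes = simd_bits // element_bits
    return lanes * fp_units

# Packed double: two 64-bit lanes in a 128-bit register.
packed_double = peak_flops_per_cycle(128, 64, 2)
# Scalar double: only one lane of the register is used.
scalar_double = peak_flops_per_cycle(64, 64, 2)

print("packed double:", packed_double, "flops/cycle")
print("scalar double:", scalar_double, "flops/cycle")
```

Under those assumptions, code that stays scalar only ever sees the lower number, whatever the hardware peak is.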
The question is: should we wait for better results from upcoming versions of optimizing compilers? Or is this the reality, that 2 additional FP results per cycle give (on average) a relatively small performance increase?
just that not all FP is SIMD-friendly, I think. if your code spends a lot of time in blas/lapack functions, I would expect it to see good speedup.
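One way to put rough numbers on that is an Amdahl's-law estimate. This is only a sketch: it assumes the SIMD-friendly part of the runtime speeds up by exactly 2x (going from 2 to 4 results per cycle) and everything else is unchanged, which real codes will not match exactly. The function names are hypothetical.

```python
# Amdahl's-law estimate of how much runtime is SIMD-friendly.
# Assumption: the SIMD-friendly fraction runs simd_speedup times faster,
# and the rest of the runtime does not change at all.

def overall_speedup(simd_fraction, simd_speedup=2.0):
    """Whole-program speedup when only simd_fraction of the time benefits."""
    return 1.0 / ((1.0 - simd_fraction) + simd_fraction / simd_speedup)

def simd_fraction_for(observed_speedup, simd_speedup=2.0):
    """Invert Amdahl's law: which fraction would explain the observed speedup?"""
    return (1.0 - 1.0 / observed_speedup) / (1.0 - 1.0 / simd_speedup)

# The ~7% average gain quoted above.
f = simd_fraction_for(1.07)
print(f"SIMD-friendly fraction implied by a 7% gain: {f:.1%}")

# A hypothetical code spending 90% of its time in blas/lapack-style kernels.
print(f"speedup at 90% SIMD-friendly: {overall_speedup(0.90):.2f}x")
```

Under this model a 7% average gain corresponds to only about 13% of the runtime benefiting from the wider FP units, while a code dominated by blas/lapack-style kernels would land much closer to the full 2x.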
regards, mark hahn.