------- Comment #50 from whaley at cs dot utsa dot edu 2006-08-08 18:36 -------
Guys,

I've been scoping this a little closer on the Athlon64X2. I have found that the patched gcc can achieve as much as 93% of theoretical peak (5218 Mflop on a 2800 MHz Athlon64X2!) for in-cache matmul when the code generator is allowed to go to town. That at least ties the best I've ever seen for an x86 chip, and what it means is that on this architecture, the x87 unit can be coaxed into beating the SSE unit *even when the SSE instructions are fully vectorized* (for double precision only, of course: vector single prec SSE has twice the theoretical peak of x87). This also means that ATLAS should get a real speed boost when the new gcc is released, and other fp packages have the potential to do so as well.

So, with this motivation, I edited the generated assembly, and made the following change by hand in ~30 different places in the kernel assembly:

>#ifdef FMULL
>   fmull   1440(%rcx)
>#else
>   fldl    1440(%rcx)
>   fmulp   %st,%st(1)
>#endif

To my surprise, on this arch, using the fldl/fmulp pair caused a performance drop. So, either my SSE experience does not necessarily translate to x87, or the Opteron (where I did the SSE tuning) is subtly different than the Athlon64X2, or my memory of the tuning is faulty.

Just as a check, Paulo: is this the peephole you would do? Anyway, doing this by hand is too burdensome to make widespread timings feasible, so if you'd like to see that, I'll need a gcc patch to do it automatically . . .

Cheers,
Clint

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827