4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3

whaley at cs dot utsa dot edu Tue, 08 Aug 2006 11:36:42 -0700


------- Comment #50 from whaley at cs dot utsa dot edu  2006-08-08 18:36 -------
Guys,


I've been scoping this a little closer on the Athlon64X2.  I have found that
the patched gcc can achieve as much as 93% of theoretical peak (5218Mflop on a
2800Mhz Athlon64X2!) for in-cache matmul when the code generator is allowed to
go to town.  That at least ties the best I've ever seen for an x86 chip, and
what it means is that on this architecture, the x87 unit can be coaxed into
beating the SSE unit *even when the SSE instructions are fully vectorized* (for
double precision only, of course: vector single prec SSE has twice theoretical
peak of x87).  This also means that ATLAS should get a real speed boost when
the new gcc is released, and other fp packages have the potential to do so as
well.  So, with this motivation, I edited the genned assembly, and made the
following changes by hand in ~30 different places in the kernel assembly:

>#ifdef FMULL
>        fmull   1440(%rcx)
>#else
>        fldl    1440(%rcx)
>        fmulp   %st,%st(1)
>#endif

To my surprise, on this arch, using the fldl/fmulp pair caused a performance
drop.  So, either my SSE experience does not necessarily translate to x87, or
the Opteron (where I did the SSE tuning) is subtly different than the
Athlon64X2, or my memory of the tuning is faulty.  Just as a check, Paulo: is
this the peephole you would do?

Anyway, doing this by hand is too burdensome to make widespread timings
feasable, so if you'd like to see that, I'll need a gcc patch to do it
automatically . . .

Cheers,
Clint


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827

[Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3

Reply via email to