[Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3

whaley at cs dot utsa dot edu Tue, 08 Aug 2006 09:44:05 -0700


------- Comment #49 from whaley at cs dot utsa dot edu  2006-08-08 16:43 -------
Paolo,


>Yes, so far so good and this part has already been committed.  But does
>a *single* load-and-execute instruction execute faster than the two
>instructions in a load+execute sequence?

As I said, in my experience hand-tuning SSE assembly, which is faster depends on
the architecture.  In particular, NetBurst and Core do well with the final
fmul[ls], and other archs do not.  My guess is that NetBurst and Core crack
this single instruction in two during decode, which allows the implicit load
to be advanced, but with a lower instruction count.  I think other
architectures do not split the inst during decode, which means that Tomasulo's
algorithm cannot advance the load due to dependencies, which makes the separate
instructions faster, even in the face of the extra instruction.
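
To make the two alternatives concrete, here is a minimal sketch.  It is not
the test case from this bug; the dot() kernel and the register choices in the
comment are assumptions picked only for illustration, and the code gcc actually
emits will differ in registers and scheduling.

```c
#include <stddef.h>

/* Hypothetical kernel: a plain dot product, the sort of x87 loop at issue. */
double dot(const double *x, const double *y, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}

/*
 * Two candidate x87 loop bodies for sum += x[i] * y[i]
 * (AT&T syntax; sum is in %st(0) on entry; registers are illustrative):
 *
 * (a) load-and-execute, ending in a single fmull with a memory operand:
 *       fldl   (%eax)          # push x[i]
 *       fmull  (%edx)          # st(0) = x[i] * y[i], load is implicit
 *       faddp  %st, %st(1)     # sum += product, pop
 *
 * (b) separate load and multiply, one instruction longer:
 *       fldl   (%eax)          # push x[i]
 *       fldl   (%edx)          # push y[i], an explicit load
 *       fmulp  %st, %st(1)     # st(1) = x[i] * y[i], pop
 *       faddp  %st, %st(1)     # sum += product, pop
 */
```

In (b) the load of y[i] is its own instruction, so the scheduler is free to
start it early; in (a) that is only possible if the hardware cracks the fmull
into a load and a multiply.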

If you can give me a patch that makes gcc call a new peephole opt getting rid
of the final fmul[ls] only when a certain flag is thrown, I will see if I can't
post timings across a variety of architectures using both ways, so we can see
if my SSE experience holds for x87, and how strong the performance benefit is
on various architectures.  This will allow us to evaluate how important
getting this choice right is, what the default state should be, and how we
should vary it according to architecture.  My own theoretical guess is that if
you *have* to pick a behavior, surely separate instructions are better: on
systems with the cracking, the extra inst at worst eats up some mem and a bit
of decode bandwidth, which on most machines is not critical.  On the other
hand, having a non-advanceable load is pretty bad news on systems w/o the
cracking ability.  The proposed timings could demonstrate the accuracy of this
guess.
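
The comparison proposed above could be driven by something like the following
rough sketch.  It is not the timer I actually use; dot() and time_dot() are
hypothetical names, and the idea is to build the same file twice, once with and
once without whatever flag the patch ends up adding, and compare the numbers.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical kernel under test; rebuild once per code-generation variant. */
double dot(const double *x, const double *y, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}

/*
 * Call dot() reps times and return the average CPU seconds per call.
 * The accumulated results are stored in *total so the work has a visible use.
 * Caveat: for real timings, keep dot() in a separate source file so the
 * optimizer cannot inline it or hoist the call out of the loop.
 */
double time_dot(const double *x, const double *y, size_t n,
                int reps, double *total)
{
    volatile double sink = 0.0;
    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        sink += dot(x, y, n);
    clock_t stop = clock();
    *total = sink;
    return (double)(stop - start) / CLOCKS_PER_SEC / reps;
}
```

clock() is coarse, so the vectors need to be long and reps large enough that
each run lasts well beyond the timer resolution.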

As I mentioned, and I *think* Jan echoed, for the case you have already fixed,
the peephole's way should be the default way, even at low optimization: this
peephole adds no extra instruction, it is better everywhere we've timed, and I
see no way in theory for the first sequence to be better.

Thanks,
Clint


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827

[Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3
