------- Comment #49 from whaley at cs dot utsa dot edu 2006-08-08 16:43 -------

Paolo,

>Yes, so far so good and this part has already been committed. But does
>a *single* load-and-execute instruction execute faster than the two
>instructions in a load+execute sequence?

As I said, in my hand-tuned SSE assembly experience, which is faster depends on the architecture. In particular, netburst and Core do well with the final fmul[ls], and other archs do not. My guess is that netburst and Core probably crack this single instruction in two during decode, which allows the implicit load to be advanced, but with less instruction load. I think other architectures do not split the inst during decode, which means that Tomasulo's cannot advance the load due to dependencies, which makes the separate instructions faster, even in the face of the extra instruction.

If you can give me a patch that makes gcc call a new peephole opt getting rid of the final mul[sl] only when a certain flag is thrown, I will see if I can't post timings across a variety of architectures using both ways, so we can see if my SSE experience holds true for x87, and how strong the performance benefit is for various architectures. This will allow us to evaluate how important getting this choice right is, what the default state should be, and how we should vary it according to architecture.

My own theoretical guess is that if you *have* to pick a behavior, surely separate instructions are better: on systems with the cracking, this extra inst at worst eats up some mem and a bit of decode bandwidth, which on most machines is not critical. On the other hand, having a non-advanceable load is pretty bad news on systems w/o the cracking ability. The proposed timings could demonstrate the accuracy of this guess.

As I mentioned, and I *think* Jan echoed, for the case you have already fixed, the peephole's way should be the default way, even at low optimization: there's no extra instruction with this peephole, it is better everywhere we've timed, and I see no way in theory for the first sequence to be better.
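
To make the comparison above concrete, here is a minimal sketch in C of the kind of kernel and timer such an experiment might use. The names `dot` and `time_dot` are hypothetical and not from this thread, and the x87 sequences in the comments only illustrate the two forms under discussion; they are not claimed to be what gcc actually emits for this loop.

```c
#include <stddef.h>
#include <time.h>

/*
 * Hypothetical kernel: a plain dot product. Each iteration needs one
 * multiply whose second operand lives in memory, which x87 code can
 * express in either of the two forms discussed above:
 *
 *   load-and-execute (one instruction):
 *       fmull  (%eax)            # st(0) *= the double at (%eax)
 *
 *   separate load + execute (two instructions):
 *       fldl   (%eax)            # push the double at (%eax)
 *       fmulp  %st, %st(1)       # multiply and pop
 *
 * Which form a given compiler picks here is an assumption to check
 * by reading the generated assembly.
 */
double dot(const double *x, const double *y, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}

/*
 * Time `reps` calls of dot() with clock(). Returns CPU seconds and
 * stores the accumulated result in *result so the calls cannot be
 * optimized away.
 */
double time_dot(const double *x, const double *y, size_t n,
                int reps, double *result)
{
    volatile double sink = 0.0;
    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        sink += dot(x, y, n);
    clock_t stop = clock();
    *result = sink;
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

The idea would be to build this twice, once with and once without the proposed peephole, and run `time_dot` on each architecture of interest. `clock()` is coarse, so `reps` would need to be large enough for the run to take a measurable amount of time.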

Thanks,
Clint

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827