------- Comment #3 from guillaume dot melquiond at ens-lyon dot fr 2006-04-05 08:59 -------
Since the runtime slowdown between the binaries produced by GCC3 and GCC4 was not negligible, I searched a bit more for workarounds. It was quite simple in fact: passing -mno-sse produces assembly code roughly as efficient as the GCC3 output. With -mno-sse, the testcase obviously no longer uses any xmm register. In addition, it no longer uses any callee-saved register, so it needs less stack space and its prologue and epilogue are shorter. As a consequence, the generated code is both faster and smaller with -mno-sse. In fact, the testcase binary is even 30% smaller than if it had been produced with -Os. (The same binary is generated for both -O3 and -Os.)
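
For anyone who wants to check the xmm register usage on their own build, here is a minimal sketch. The helper name count_xmm and the file names are placeholders of mine, not part of the testcase attached to this bug, and the gcc invocations in the comments assume a 32-bit x86 target emitting AT&T-syntax assembly.

```shell
# Count the lines of an assembly listing that mention an SSE register (%xmmN).
# grep -c prints 0 when nothing matches; "|| true" only hides its non-zero
# exit status in that case, so the function always prints a number.
count_xmm() {
    grep -c '%xmm' "$1" || true
}

# Hypothetical usage (testcase.c is a placeholder file name):
#   gcc -O3 -S testcase.c -o with-sse.s
#   gcc -O3 -mno-sse -S testcase.c -o no-sse.s
#   count_xmm with-sse.s
#   count_xmm no-sse.s
```

If the observation above holds for a given build, the second count should be zero. The sizes of the resulting object files can be compared separately with the binutils size tool.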
-- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=26778