------- Additional Comments From guardia at sympatico dot ca 2005-01-29 19:21 -------
Hum, OK, we can do a "movd %mm0, %eax", and that's why it gets combined...
Well, I give up. The V8QI (and whatever) -> V2SI conversion seems to be causing all the trouble here, if we look at the RTL of something like:

    __m64 moo(__v8qi mmx1)
    {
        mmx1 = __builtin_ia32_punpcklbw (mmx1, mmx1);
        return mmx1;
    }

It explicitly asks for a conversion to V2SI (__m64), and the result gets assigned to an xmm register afterwards:

    (insn 15 14 17 1 (set (reg:V8QI 58 [ D.2201 ])
            (reg:V8QI 62)) -1 (nil)
        (nil))

    (insn 17 15 18 1 (set (reg:V2SI 63)
            (subreg:V2SI (reg:V8QI 58 [ D.2201 ]) 0)) -1 (nil)
        (nil))

    (insn 18 17 19 1 (set (mem/i:V2SI (reg/f:SI 60 [ D.2206 ]) [0 <result>+0 S8 A64])
            (reg:V2SI 63)) -1 (nil)
        (nil))

So... the only way to fix this would be either to make the register allocator more intelligent (bug 19161), or to provide intrinsics like the Intel compiler does, with a one-to-one mapping to instructions. Right? That wouldn't be such a bad idea, I think... Instead of using the current __builtins for stuff in *mmintrin.h, we could use a different set of builtins that only supports V2SI and nothing else..? Well, that's going to be for another time ;)

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19530
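For anyone who wants to poke at the V8QI -> V2SI conversion without MMX builtins, here is a rough, portable sketch of the moo() case. It assumes GCC/Clang generic vector extensions. The names v8qi, v2si, unpack_low_bytes, moo_sketch and check_sketch are made up for illustration and are not part of GCC or *mmintrin.h, and unpack_low_bytes is only a plain-C stand-in for what punpcklbw does when both operands are the same. The point is just that the cast is a pure reinterpretation of the same 8 bytes, which is what the subreg:V2SI in insn 17 expresses.

```c
#include <assert.h>
#include <string.h>

/* Generic vector types (GCC/Clang extension), chosen to mirror the
   V8QI and V2SI machine modes. These typedef names are hypothetical. */
typedef signed char v8qi __attribute__((vector_size(8)));
typedef int v2si __attribute__((vector_size(8)));

/* Plain-C stand-in for punpcklbw (x, x): interleave the low four
   bytes of x with themselves. Hypothetical helper, not a builtin. */
static v8qi unpack_low_bytes(v8qi x)
{
    v8qi r = { x[0], x[0], x[1], x[1], x[2], x[2], x[3], x[3] };
    return r;
}

/* Same shape as moo(): compute in V8QI, hand the result back as V2SI.
   The cast changes only the type, not the bits. */
v2si moo_sketch(v8qi x)
{
    v8qi t = unpack_low_bytes(x);
    return (v2si)t;
}

/* Self-check: the V2SI result holds exactly the bytes of the V8QI value. */
void check_sketch(void)
{
    v8qi in = { 1, 2, 3, 4, 5, 6, 7, 8 };
    v8qi expected = unpack_low_bytes(in);
    v2si out = moo_sketch(in);

    assert(sizeof out == sizeof expected);
    assert(memcmp(&out, &expected, sizeof out) == 0);
}
```

Since the cast is free at the source level, any extra moves in the generated code come from how the two modes are allocated to registers, not from the conversion itself.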