https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54174
--- Comment #3 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Richard Biener from comment #1)
> That's more likely a register allocator issue.

Yes, LRA allocates registers from back to front, which means that changing the
source code as below eliminates the redundant mov:

typedef float v4sf __attribute__ ((vector_size (4*4)));
typedef float v8sf __attribute__ ((vector_size (4*8)));

v4sf add(v8sf v)
{
  v4sf b = __builtin_ia32_vextractf128_ps256(v, 1);
  v4sf a = __builtin_ia32_vextractf128_ps256(v, 0);
  return a + b;
}
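
For reference, a minimal sketch that checks the reordering only changes the order of the two extracts, not the result. It assumes the original testcase extracted the low half first (the comment does not show it). extract_half is a hypothetical portable stand-in for __builtin_ia32_vextractf128_ps256, and add_low_first, add_high_first and orderings_agree are illustrative names, not part of the testcase.

```c
typedef float v4sf __attribute__ ((vector_size (4*4)));
typedef float v8sf __attribute__ ((vector_size (4*8)));

/* Hypothetical portable stand-in for __builtin_ia32_vextractf128_ps256:
   copies the low (half == 0) or high (half == 1) four floats of *v.  */
static v4sf
extract_half (const v8sf *v, int half)
{
  v4sf r;
  for (int i = 0; i < 4; i++)
    r[i] = (*v)[half * 4 + i];
  return r;
}

/* Assumed original ordering: low half extracted first.  */
static v4sf
add_low_first (const v8sf *v)
{
  v4sf a = extract_half (v, 0);
  v4sf b = extract_half (v, 1);
  return a + b;
}

/* Ordering from the comment: high half extracted first.  */
static v4sf
add_high_first (const v8sf *v)
{
  v4sf b = extract_half (v, 1);
  v4sf a = extract_half (v, 0);
  return a + b;
}

/* Returns 1 if both orderings give the same four sums for a sample
   input, 0 otherwise.  */
int
orderings_agree (void)
{
  v8sf v = { 1, 2, 3, 4, 10, 20, 30, 40 };
  v4sf x = add_low_first (&v);
  v4sf y = add_high_first (&v);
  for (int i = 0; i < 4; i++)
    if (x[i] != y[i] || x[i] != v[i] + v[i + 4])
      return 0;
  return 1;
}
```

This stand-in says nothing about register allocation. The redundant mov is only visible in the assembly generated for the real builtin, e.g. by compiling both orderings with gcc -O2 -mavx -S and comparing the output.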