http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56253
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|                            |x86_64-*-*, i?86-*-*
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2013-02-08
          Component|tree-optimization           |target
     Ever Confirmed|0                           |1

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> 2013-02-08 12:38:50 UTC ---
Confirmed.  That's because we have

__m256 foo(__m256, __m256, __m256) (__m256 a, __m256 b, __m256 c)
{
  __m256 D.6689;
  __m256 D.6686;
  __m256 _5;
  __m256 _6;

  <bb 2>:
  _5 = __builtin_ia32_mulps256 (a_1(D), b_2(D));
  _6 = __builtin_ia32_addps256 (_5, c_3(D));
  return _6;

instead of

  _5 = a_1(D) * b_2(D);
  _6 = _5 + c_3(D);

not sure why we use builtins for these basic operations...  _mm256_add_ps
could for example be simply

extern __inline __m256 __attribute__((__gnu_inline__, __always_inline__,
__artificial__))
_mm256_add_ps (__m256 __A, __m256 __B)
{
  return (__m256) ((__v8sf)__A + (__v8sf)__B);
}

with the caveat of using a GNU extension.
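
For illustration, a minimal sketch of the same idea outside the real header,
assuming GCC or Clang with the GNU vector extension. The type v8sf and the
v8sf_* helpers are hypothetical names made up for this example, not the
definitions in avxintrin.h. Written this way, the multiply and the add are
ordinary vector arithmetic rather than calls to target builtins.

```c
#include <stddef.h>

/* Hypothetical stand-in for __v8sf: eight floats in a 32-byte GNU vector.
   This relies on the GNU vector extension and is not ISO C. */
typedef float v8sf __attribute__((vector_size(32)));

/* Element-wise multiply and add written with plain operators. */
static inline v8sf v8sf_mul(v8sf a, v8sf b) { return a * b; }
static inline v8sf v8sf_add(v8sf a, v8sf b) { return a + b; }

/* Same shape as foo() in the dump above: a * b + c, lane by lane. */
static inline v8sf v8sf_mul_add(v8sf a, v8sf b, v8sf c)
{
    return v8sf_add(v8sf_mul(a, b), c);
}

/* Returns 1 if every lane equals the scalar a*b + c, 0 otherwise.
   The inputs are chosen so that all results are exactly representable. */
static int v8sf_mul_add_selfcheck(void)
{
    v8sf a = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f};
    v8sf b = {2.0f, 2.0f, 2.0f, 2.0f, 2.0f, 2.0f, 2.0f, 2.0f};
    v8sf c = {0.5f, 0.5f, 0.5f, 0.5f, 0.5f, 0.5f, 0.5f, 0.5f};
    v8sf r = v8sf_mul_add(a, b, c);

    for (size_t i = 0; i < 8; i++) {
        if (r[i] != a[i] * b[i] + c[i])
            return 0;
    }
    return 1;
}
```

The sketch does not need -mavx to compile; without it the compiler lowers the
32-byte vector to narrower operations, and GCC may print a -Wpsabi note about
passing such vectors by value.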