https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87555
--- Comment #5 from Hongtao.liu <crazylht at gmail dot com> ---
With open-coded arithmetic, GCC successfully optimizes

__m128d
f1 (__m128d x, __m128d y, __m128d z)
{
  __m128d tem = _mm_mul_pd (x, y);
  __m128d tem2 = tem + z;
  __m128d tem3 = tem - z;
  return __builtin_shuffle (tem2, tem3, (__m128i) {0, 3});
}

to

f1:
.LFB5481:
        .cfi_startproc
        vfmsubadd132pd  %xmm1, %xmm2, %xmm0
        ret
        .cfi_endproc

But it fails to optimize

__m256d
f2 (__m256d x, __m256d y, __m256d z)
{
  __m256d tem = _mm256_mul_pd (x, y);
  __m256d tem2 = tem + z;
  __m256d tem3 = tem - z;
  return __builtin_shuffle (tem2, tem3, (__m256i) {0, 5, 2, 7});
}

because simplify_rtx does not realize that

Failed to match this instruction:
(set (reg:V4SF 88)
    (vec_merge:V4SF (fma:V4SF (reg/v:V4SF 85 [ x ])
            (reg/v:V4SF 86 [ y ])
            (neg:V4SF (reg/v:V4SF 87 [ z ])))
        (fma:V4SF (reg/v:V4SF 85 [ x ])
            (reg/v:V4SF 86 [ y ])
            (reg/v:V4SF 87 [ z ]))
        (const_int 10 [0xa])))

is equal to

(set (reg:V4SF 88)
    (vec_merge:V4SF (fma:V4SF (reg/v:V4SF 85 [ x ])
            (reg/v:V4SF 86 [ y ])
            (reg/v:V4SF 87 [ z ]))
        (fma:V4SF (reg/v:V4SF 85 [ x ])
            (reg/v:V4SF 86 [ y ])
            (neg:V4SF (reg/v:V4SF 87 [ z ])))
        (const_int 5 [0x5])))

The latter is how our pattern is defined.

So is there any canonical RTX form for vec_merge? (vec_merge A B (const_int 10)) should obviously be equal to (vec_merge B A (const_int 5)).
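To make the claimed equivalence concrete, here is a small scalar model of vec_merge in plain C. It is only an illustrative sketch, not GCC code: the helper names (vec_merge_model, swapped_mask, merge_swap_equivalent) are hypothetical, and it assumes the documented vec_merge semantics, where a set bit i in the mask takes element i from the first operand and a clear bit takes it from the second.

```c
/* Scalar model of RTL vec_merge (illustrative only, not GCC source).
   Assumed semantics: bit i of MASK set   -> element i comes from A,
                      bit i of MASK clear -> element i comes from B.  */
static void
vec_merge_model (const double *a, const double *b,
                 unsigned mask, int nelts, double *out)
{
  for (int i = 0; i < nelts; i++)
    out[i] = ((mask >> i) & 1u) ? a[i] : b[i];
}

/* Mask that selects the same elements once the operands are swapped:
   complement MASK, restricted to the low NELTS bits.  */
static unsigned
swapped_mask (unsigned mask, int nelts)
{
  unsigned all = (nelts >= 32) ? ~0u : ((1u << nelts) - 1u);
  return ~mask & all;
}

/* Return 1 if (vec_merge A B MASK) and (vec_merge B A swapped MASK)
   produce identical results for up to 32 elements, else 0.  */
static int
merge_swap_equivalent (const double *a, const double *b,
                       unsigned mask, int nelts)
{
  double r1[32], r2[32];

  if (nelts < 0 || nelts > 32)
    return 0;
  vec_merge_model (a, b, mask, nelts, r1);
  vec_merge_model (b, a, swapped_mask (mask, nelts), nelts, r2);
  for (int i = 0; i < nelts; i++)
    if (r1[i] != r2[i])
      return 0;
  return 1;
}
```

For a 4-element vector, swapped_mask (10, 4) gives 5, which matches the two masks in the RTL above. A canonicalization along these lines would presumably pick one operand order and rewrite the other into it, but how that should be decided is exactly the open question here.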