https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87555
--- Comment #5 from Hongtao.liu <crazylht at gmail dot com> ---
With open-coding, GCC successfully optimizes
#include <immintrin.h>

__m128d f1(__m128d x, __m128d y, __m128d z){
  __m128d tem = _mm_mul_pd (x,y);
  __m128d tem2 = tem + z;
  __m128d tem3 = tem - z;
  return __builtin_shuffle (tem2, tem3, (__m128i) {0, 3});
}
to
f1:
.LFB5481:
.cfi_startproc
vfmsubadd132pd %xmm1, %xmm2, %xmm0
ret
.cfi_endproc
But it fails to optimize
__m256d f2(__m256d x, __m256d y, __m256d z){
  __m256d tem = _mm256_mul_pd (x,y);
  __m256d tem2 = tem + z;
  __m256d tem3 = tem - z;
  return __builtin_shuffle (tem2, tem3, (__m256i) {0, 5, 2, 7});
}
since simplify_rtx doesn't realize that the following instruction from the dump
Failed to match this instruction:
(set (reg:V4SF 88)
     (vec_merge:V4SF (fma:V4SF (reg/v:V4SF 85 [ x ])
                               (reg/v:V4SF 86 [ y ])
                               (neg:V4SF (reg/v:V4SF 87 [ z ])))
                     (fma:V4SF (reg/v:V4SF 85 [ x ])
                               (reg/v:V4SF 86 [ y ])
                               (reg/v:V4SF 87 [ z ]))
                     (const_int 10 [0xa])))
is equal to
(set (reg:V4SF 88)
     (vec_merge:V4SF (fma:V4SF (reg/v:V4SF 85 [ x ])
                               (reg/v:V4SF 86 [ y ])
                               (reg/v:V4SF 87 [ z ]))
                     (fma:V4SF (reg/v:V4SF 85 [ x ])
                               (reg/v:V4SF 86 [ y ])
                               (neg:V4SF (reg/v:V4SF 87 [ z ])))
                     (const_int 5 [0x5])))
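The latter is how our pattern is defined.
So is there any canonical RTX form for vec_merge?
(vec_merge A B (const_int 10)) should obviously be equal to
(vec_merge B A (const_int 5)): for a 4-element vector, swapping the two
operands is the same as complementing the low 4 bits of the mask (~10 & 0xf == 5).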