https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91340

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hjl.tools at gmail dot com,
                   |                            |jakub at gcc dot gnu.org,
                   |                            |uros at gcc dot gnu.org

--- Comment #1 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
I've tried both:
extern __inline __m256d __attribute__((__gnu_inline__, __always_inline__,
__artificial__))
_mm256_zextpd128_pd256 (__m128d __A)
{
  return _mm256_insertf128_pd (_mm256_castpd128_pd256 (__A), _mm_setzero_pd (),
                               1);
}

extern __inline __m256d __attribute__((__gnu_inline__, __always_inline__,
__artificial__))
_mm256_zextpd128_pd256 (__m128d __A)
{
  return (__m256d) __builtin_shuffle (_mm256_castpd128_pd256 (__A),
                                      _mm256_setzero_pd (),
                                      (__v4di) { 0, 1, 4, 5 });
}

both generate vpxor + vinsert[fi]128.  In both cases, the problem is that the
vec_set_lo_<mode> patterns use register_operand/"v" for the operand that
supplies the upper bits, while in this case we want const0_operand/"C" so that
a plain vmovapd etc. can be emitted.  To be totally instruction-less, we'd
need to analyze whatever instruction generated the operand and verify that it
already clears the upper bits; that might be ok for certain special cases, but
doing it for everything is going to be way too hard, because how exactly we
represent something in the RTL often still doesn't imply how exactly it is
implemented.  It isn't easy to add that alternative to the vec_set_lo_<mode>
patterns either, because the instruction supports masking, and under masking
the vmovaps etc. with narrower operands works differently.
