https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111874

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Hongtao.liu from comment #1)
> For integer, we have _mm512_mask_reduce_add_epi32 defined as
> 
> extern __inline int
> __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> _mm512_mask_reduce_add_epi32 (__mmask16 __U, __m512i __A)
> {
>   __A = _mm512_maskz_mov_epi32 (__U, __A);
>   __MM512_REDUCE_OP (+);
> }
> 
> #undef __MM512_REDUCE_OP
> #define __MM512_REDUCE_OP(op) \
>   __v8si __T1 = (__v8si) _mm512_extracti64x4_epi64 (__A, 1);          \
>   __v8si __T2 = (__v8si) _mm512_extracti64x4_epi64 (__A, 0);          \
>   __m256i __T3 = (__m256i) (__T1 op __T2);                            \
>   __v4si __T4 = (__v4si) _mm256_extracti128_si256 (__T3, 1);          \
>   __v4si __T5 = (__v4si) _mm256_extracti128_si256 (__T3, 0);          \
>   __v4si __T6 = __T4 op __T5;                                         \
>   __v4si __T7 = __builtin_shuffle (__T6, (__v4si) { 2, 3, 0, 1 });    \
>   __v4si __T8 = __T6 op __T7;                                         \
>   return __T8[0] op __T8[1]
> 
> There's a corresponding floating point version, but it doesn't do in-order adds.

It also doesn't handle signed zeros correctly; for FP that would require
merge masking with { -0.0, -0.0, ... } instead of using
_mm512_maskz_mov_epi32.  Of course, since it's not doing in-order
processing anyway, not handling signed zeros correctly might be a minor
thing.
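
Untested, but something like the following (my sketch, helper name made up,
not what the intrinsics header actually provides) shows the merge-masking
variant for FP:

#include <immintrin.h>

/* Sketch: masked FP add reduction that merges masked-out lanes with -0.0
   instead of zeroing them, so adding those lanes leaves the running sum
   (and its sign of zero) unchanged.  Still a tree reduction, i.e. not
   in-order.  */
static inline double
mask_reduce_add_pd_sketch (__mmask8 u, __m512d a)
{
  a = _mm512_mask_mov_pd (_mm512_set1_pd (-0.0), u, a);
  __m256d t1 = _mm512_extractf64x4_pd (a, 1);
  __m256d t2 = _mm512_castpd512_pd256 (a);
  __m256d t3 = _mm256_add_pd (t1, t2);
  __m128d t4 = _mm256_extractf128_pd (t3, 1);
  __m128d t5 = _mm256_castpd256_pd128 (t3);
  __m128d t6 = _mm_add_pd (t4, t5);
  __m128d t7 = _mm_unpackhi_pd (t6, t6);
  return _mm_cvtsd_f64 (_mm_add_sd (t6, t7));
}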

So yes, we're looking for vectorization at -O3 without -ffast-math of a
conditional reduction that's currently not supported (correctly):

double a[1024];
double foo()
{
  double res = 0.0;
  for (int i = 0; i < 1024; ++i)
    {
      if (a[i] < 0.)
         res += a[i];
    }
  return res;
}

should be vectorizable also with -frounding-math (where the trick using
-0.0 for masked elements doesn't work).  Currently we are using 0.0 for
them (but there's a pending patch).
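
A hand-written sketch of that -0.0 trick with per-vector in-order lane adds
(my code, not what the vectorizer emits; exception semantics of the compare
not claimed to be exact):

#include <immintrin.h>

double a[1024];

/* Sketch: masked-out lanes become -0.0, so adding them leaves the running
   sum unchanged under the default rounding mode, and the eight lanes are
   added in order to preserve the scalar evaluation order.  */
double foo_sketch (void)
{
  const __m512d neg_zero = _mm512_set1_pd (-0.0);
  double res = 0.0;
  for (int i = 0; i < 1024; i += 8)
    {
      __m512d v = _mm512_loadu_pd (&a[i]);
      __mmask8 m = _mm512_cmp_pd_mask (v, _mm512_setzero_pd (), _CMP_LT_OQ);
      v = _mm512_mask_mov_pd (neg_zero, m, v);  /* merge, not zero, masking */
      double lane[8];
      _mm512_storeu_pd (lane, v);
      for (int j = 0; j < 8; ++j)
        res += lane[j];
    }
  return res;
}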

Maybe we don't care about -frounding-math and so -0.0 adds are OK.  With
znver4 we get something like the following; it could be that trying to
optimize the case of a sparse mask with vcompress isn't worth it (a rough
sketch of that idea follows the assembly):

.L2:
        vmovapd (%rax), %zmm1
        addq    $64, %rax
        vminpd  %zmm5, %zmm1, %zmm1
        valignq $3, %ymm1, %ymm1, %ymm2
        vunpckhpd       %xmm1, %xmm1, %xmm3
        vaddsd  %xmm1, %xmm0, %xmm0
        vaddsd  %xmm3, %xmm0, %xmm0
        vextractf64x2   $1, %ymm1, %xmm3
        vextractf64x4   $0x1, %zmm1, %ymm1
        vaddsd  %xmm3, %xmm0, %xmm0
        vaddsd  %xmm2, %xmm0, %xmm0
        vunpckhpd       %xmm1, %xmm1, %xmm2
        vaddsd  %xmm1, %xmm0, %xmm0
        vaddsd  %xmm2, %xmm0, %xmm0
        vextractf64x2   $1, %ymm1, %xmm2
        valignq $3, %ymm1, %ymm1, %ymm1
        vaddsd  %xmm2, %xmm0, %xmm0
        vaddsd  %xmm1, %xmm0, %xmm0
        cmpq    $a+8192, %rax
        jne     .L2
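
For reference, the vcompress variant would be something like this per-vector
step (untested sketch, names mine): compress the selected lanes to the front
and add only those, so a sparse mask needs few scalar adds.  Whether the
compress/popcount overhead pays off is exactly the open question.

#include <immintrin.h>

static inline double
cond_add_step (double res, __m512d v, __mmask8 m)
{
  double lane[8];
  _mm512_storeu_pd (lane, _mm512_maskz_compress_pd (m, v));
  int n = __builtin_popcount (m);
  for (int j = 0; j < n; ++j)   /* in-order adds of the selected lanes only */
    res += lane[j];
  return res;
}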
