https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111874
--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Hongtao.liu from comment #1)
> For integer, we have _mm512_mask_reduce_add_epi32 defined as
>
> extern __inline int
> __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> _mm512_mask_reduce_add_epi32 (__mmask16 __U, __m512i __A)
> {
>   __A = _mm512_maskz_mov_epi32 (__U, __A);
>   __MM512_REDUCE_OP (+);
> }
>
> #undef __MM512_REDUCE_OP
> #define __MM512_REDUCE_OP(op) \
>   __v8si __T1 = (__v8si) _mm512_extracti64x4_epi64 (__A, 1);        \
>   __v8si __T2 = (__v8si) _mm512_extracti64x4_epi64 (__A, 0);        \
>   __m256i __T3 = (__m256i) (__T1 op __T2);                          \
>   __v4si __T4 = (__v4si) _mm256_extracti128_si256 (__T3, 1);        \
>   __v4si __T5 = (__v4si) _mm256_extracti128_si256 (__T3, 0);        \
>   __v4si __T6 = __T4 op __T5;                                       \
>   __v4si __T7 = __builtin_shuffle (__T6, (__v4si) { 2, 3, 0, 1 });  \
>   __v4si __T8 = __T6 op __T7;                                       \
>   return __T8[0] op __T8[1]
>
> There's a corresponding floating point version, but it's not in-order adds.

It also doesn't handle signed zeros correctly, which would require not using
_mm512_maskz_mov_epi32 but merge masking with { -0.0, -0.0, ... } for FP
(see the sketches at the end of this comment).  Of course, as it's not doing
in-order processing, not handling signed zeros correctly might be a minor
thing.

So yes, we're looking for -O3 without -ffast-math vectorization of a
conditional reduction that's currently not supported (correctly):

double a[1024];
double foo()
{
  double res = 0.0;
  for (int i = 0; i < 1024; ++i)
    {
      if (a[i] < 0.)
        res += a[i];
    }
  return res;
}

This should also be vectorizable with -frounding-math (where the trick using
-0.0 for masked elements doesn't work).  Currently we are using 0.0 for them
(but there's a pending patch).  Maybe we don't care about -frounding-math and
so -0.0 adds are OK.

We get something like the following with znver4; it could be that trying to
optimize the case of a sparse mask with vcompress isn't worth it:

.L2:
        vmovapd (%rax), %zmm1
        addq    $64, %rax
        vminpd  %zmm5, %zmm1, %zmm1
        valignq $3, %ymm1, %ymm1, %ymm2
        vunpckhpd       %xmm1, %xmm1, %xmm3
        vaddsd  %xmm1, %xmm0, %xmm0
        vaddsd  %xmm3, %xmm0, %xmm0
        vextractf64x2   $1, %ymm1, %xmm3
        vextractf64x4   $0x1, %zmm1, %ymm1
        vaddsd  %xmm3, %xmm0, %xmm0
        vaddsd  %xmm2, %xmm0, %xmm0
        vunpckhpd       %xmm1, %xmm1, %xmm2
        vaddsd  %xmm1, %xmm0, %xmm0
        vaddsd  %xmm2, %xmm0, %xmm0
        vextractf64x2   $1, %ymm1, %xmm2
        valignq $3, %ymm1, %ymm1, %ymm1
        vaddsd  %xmm2, %xmm0, %xmm0
        vaddsd  %xmm1, %xmm0, %xmm0
        cmpq    $a+8192, %rax
        jne     .L2
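
For reference, at the scalar level the -0.0 trick amounts to the rewrite
below; this is only an illustration of the intended semantics, not what the
vectorizer literally emits:

double a[1024];

double foo_masked_equiv (void)
{
  double res = 0.0;
  for (int i = 0; i < 1024; ++i)
    /* -0.0 is the additive identity that preserves the sign of zero:
       x + -0.0 == x for every x in the default rounding mode (which is
       why the trick breaks with -frounding-math), while using +0.0 for
       masked elements turns a -0.0 partial sum into +0.0.  */
    res += a[i] < 0. ? a[i] : -0.0;
  return res;
}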
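
And a minimal sketch of what merge masking with -0.0 could look like for
doubles; the helper name is made up, and _mm512_reduce_add_pd is still not
an in-order add, so this only addresses the signed-zero aspect:

#include <immintrin.h>

static inline double
my_mask_reduce_add_pd (__mmask8 __U, __m512d __A)
{
  /* Merge masking: keep the active lanes of __A and put -0.0 into the
     inactive lanes.  Zero masking (as _mm512_maskz_mov_pd would do)
     injects +0.0 instead, and -0.0 + +0.0 is +0.0, losing the sign.  */
  __A = _mm512_mask_mov_pd (_mm512_set1_pd (-0.0), __U, __A);
  return _mm512_reduce_add_pd (__A);
}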