https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88570
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2023-01-18
                 CC|                            |crazylht at gmail dot com
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
The missed vectorization without AVX is because of the lack of masked loads.
I don't think you can elide the masked load for either n3/d3 or n4/d4 because
they might not be accessed at all and thus fault (when pointer-based).

For test2 we now produce, with -march=skylake-avx512 or -march=znver4:

test2:
.LFB1:
        .cfi_startproc
        vmovupd (%rdi), %ymm2
        vxorpd  %xmm1, %xmm1, %xmm1
        vcmppd  $14, %ymm1, %ymm2, %k1
        vblendmpd       (%rdx), %ymm1, %ymm3{%k1}
        knotb   %k1, %k1
        vblendmpd       (%rcx), %ymm1, %ymm0{%k1}
        vcmppd  $1, %ymm2, %ymm1, %k1
        vmovapd %ymm3, %ymm0{%k1}
        vmovupd %ymm0, (%rsi)
        vzeroupper
        ret

so the negation improved - we still do the compare twice for test1 to produce
the negated mask.

We do use merge-masking, but with memory operands on the blend instructions
and an initially zeroed destination.  It looks like the first blend could be
peepholed to a masked zero-filling move (avoiding the false dependence?), and
the second blend is equivalent to a simple move with merge masking?  I wonder
what the optimal sequence would be here.  Note the vectorizer doesn't know
about merge/zero-masking and just assumes "undefined" for masked elements
loaded.

Note the vectorizer itself generates the second compare to compute the
inverted mask for test1, but the bit-negate in test2, which is because that's
how if-conversion produces it - or rather, that's how we fold the integer
compare with the match.pd rule:

/* We can simplify a logical negation of a comparison to the
   inverted comparison.  As we cannot compute an expression
   operator using invert_tree_comparison we have to simulate
   that with expression code iteration.  */
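
As a scalar illustration of that rule (a made-up example, not from this PR's
testcase): with integer operands the logical negation of a comparison is
folded into the inverted comparison, so no separate bit-not survives:

int
negated_cmp (int a, int b)
{
  /* The logical negation of the comparison ...  */
  return !(a > b);   /* ... is folded to a <= b.  */
}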
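
For reference, the kind of loop where the masked loads cannot be elided looks
roughly like the following.  This is only an illustrative sketch, not the
testcase attached to this PR; the names just mirror the n3/d3 and n4/d4
mentioned above:

void
cond_div (double *r, double *x, double *n3, double *d3,
          double *n4, double *d4, int n)
{
  for (int i = 0; i < n; i++)
    /* Each division is only evaluated on one side of the condition, so a
       vectorized version must not speculatively load the untaken lanes -
       those pointers might not be dereferenceable there and could fault.  */
    r[i] = x[i] > 0.0 ? n3[i] / d3[i] : n4[i] / d4[i];
}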
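
And going back to the blend question above, purely as a hand-written,
untested sketch of what a peepholed sequence could look like (glossing over
whether the NLE vs. LT mask semantics for NaN lanes in the current sequence
can really be collapsed into one mask plus its negation), something like:

        vmovupd   (%rdi), %ymm2
        vxorpd    %xmm1, %xmm1, %xmm1
        vcmppd    $14, %ymm1, %ymm2, %k1      # k1 = x > 0
        knotb     %k1, %k2                    # k2 = !(x > 0)
        vmovupd   (%rcx), %ymm0{%k2}{z}       # zero-masking load instead of the blend
        vmovupd   (%rdx), %ymm0{%k1}          # merge-masking load instead of blend + vmovapd
        vmovupd   %ymm0, (%rsi)
        vzeroupper
        ret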