https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88570
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2023-01-18
                 CC|                            |crazylht at gmail dot com
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
The missed vectorization without AVX is because of the lack of masked loads.
I don't think you can elide the masked load for either n3/d3 or n4/d4 because
they might not be accessed at all and thus fault (when pointer-based).

For test2 we now produce, with -march=skylake-avx512 or -march=znver4:

test2:
.LFB1:
        .cfi_startproc
        vmovupd (%rdi), %ymm2
        vxorpd  %xmm1, %xmm1, %xmm1
        vcmppd  $14, %ymm1, %ymm2, %k1
        vblendmpd       (%rdx), %ymm1, %ymm3{%k1}
        knotb   %k1, %k1
        vblendmpd       (%rcx), %ymm1, %ymm0{%k1}
        vcmppd  $1, %ymm2, %ymm1, %k1
        vmovapd %ymm3, %ymm0{%k1}
        vmovupd %ymm0, (%rsi)
        vzeroupper
        ret

so the negation improved - we still do the compare twice for test1 to produce
the negated mask.

We do use merge-masking, but with memory operands on the blend instructions
and an initially zeroed destination.  It looks like the first blend could be
peepholed to a masked zero-filling move (avoiding the false dependence?), and
the second blend is equivalent to a simple move with merge masking?  I wonder
what the optimal sequence would be here.  Note the vectorizer doesn't know
about merge/zero-masking and just assumes "undefined" for masked elements
loaded.

Note the vectorizer itself generates the second compare to compute the
inverted mask for test1, but the bit-negate in test2, which is because that's
how if-conversion produces it - or rather, that's how we fold the integer
compare with the match.pd rule:

/* We can simplify a logical negation of a comparison to the
   inverted comparison.  As we cannot compute an expression
   operator using invert_tree_comparison we have to simulate
   that with expression code iteration.  */
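
As a scalar illustration of that rule (a made-up example, not from this PR's
testcase): with integer operands the logical negation of a comparison is
folded into the inverted comparison, so no separate bit-not survives:

int
negated_cmp (int a, int b)
{
  /* The logical negation of the comparison ...  */
  return !(a > b);   /* ... is folded to a <= b.  */
}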
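
For reference, the kind of loop where the masked loads cannot be elided looks
roughly like the following.  This is only an illustrative sketch, not the
testcase attached to this PR; the names just mirror the n3/d3 and n4/d4
mentioned above:

void
cond_div (double *r, double *x, double *n3, double *d3,
          double *n4, double *d4, int n)
{
  for (int i = 0; i < n; i++)
    /* Each division is only evaluated on one side of the condition, so a
       vectorized version must not speculatively load the untaken lanes -
       those pointers might not be dereferenceable there and could fault.  */
    r[i] = x[i] > 0.0 ? n3[i] / d3[i] : n4[i] / d4[i];
}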
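
And going back to the blend question above, purely as a hand-written,
untested sketch of what a peepholed sequence could look like (glossing over
whether the NLE vs. LT mask semantics for NaN lanes in the current sequence
can really be collapsed into one mask plus its negation), something like:

        vmovupd   (%rdi), %ymm2
        vxorpd    %xmm1, %xmm1, %xmm1
        vcmppd    $14, %ymm1, %ymm2, %k1      # k1 = x > 0
        knotb     %k1, %k2                    # k2 = !(x > 0)
        vmovupd   (%rcx), %ymm0{%k2}{z}       # zero-masking load instead of the blend
        vmovupd   (%rdx), %ymm0{%k1}          # merge-masking load instead of blend + vmovapd
        vmovupd   %ymm0, (%rsi)
        vzeroupper
        ret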