https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99646

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
             Target|                            |x86_64-*-*
             Blocks|                            |53947
   Last reconfirmed|                            |2021-03-18
          Component|middle-end                  |tree-optimization
     Ever confirmed|0                           |1
           Keywords|                            |missed-optimization

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
We're using quite inefficient vectorization here, and the lack of cross-lane
interleaving permutes is harmful to AVX vectorization since there's no
extract-even / extract-odd available, and we do not factor this in when costing.
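
To illustrate (a hypothetical deinterleave, not the PR's testcase): with SSE,
extracting the even elements of two vectors is a single in-lane shufps, while
with 256-bit AVX the same shuffle operates per 128-bit lane, so two extra
cross-lane vperm2f128 permutes are needed to get the elements in order:

  #include <immintrin.h>

  /* SSE: evens of a={x0..x3}, b={x4..x7} -> {x0,x2,x4,x6}, one shufps.  */
  static __m128
  extract_even_sse (__m128 a, __m128 b)
  {
    return _mm_shuffle_ps (a, b, _MM_SHUFFLE (2, 0, 2, 0));
  }

  /* AVX: shufps only shuffles within each 128-bit lane, so the evens of
     a={x0..x7}, b={x8..x15} need cross-lane permutes first.  */
  static __m256
  extract_even_avx (__m256 a, __m256 b)
  {
    __m256 lo = _mm256_permute2f128_ps (a, b, 0x20); /* {x0..x3,x8..x11}  */
    __m256 hi = _mm256_permute2f128_ps (a, b, 0x31); /* {x4..x7,x12..x15} */
    return _mm256_shuffle_ps (lo, hi, _MM_SHUFFLE (2, 0, 2, 0));
  }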

Surprisingly, not vectorizing is still slower (1.4s vs. 1.35s with AVX for me),
but -funroll-loops without vectorizing is comparable to vectorizing with SSE,
and unrolling on top of SSE vectorizing doesn't help.
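
The numbers above should be reproducible with flag sets along these lines (my
guess at the variants measured, not taken from the PR):

  gcc -O3 -mavx2 t.c                               # AVX vectorized
  gcc -O3 -msse4.2 t.c                             # SSE vectorized
  gcc -O3 -fno-tree-vectorize -funroll-loops t.c   # unrolled scalar only
  gcc -O3 -msse4.2 -funroll-loops t.c              # unrolling on top of SSE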

In the end what we miss (apart from the bad use of interleaving) is the
opportunity to use masked stores (and loads).  That would halve the number
of usable lanes but would likely still provide a speedup over scalar
unrolled code.
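
A minimal sketch of that idea, assuming a stride-2 access pattern (the actual
testcase may differ): with an alternating lane mask, vmaskmovps touches only
every other element, so no interleaving permutes are needed at all, at the
cost of half the lanes doing no useful work:

  #include <immintrin.h>

  /* Hypothetical: scale every even-indexed element of a[] by c.  */
  void
  scale_even (float *a, float c, int n)
  {
    /* Alternating mask: only even lanes are loaded/stored.  */
    const __m256i mask = _mm256_setr_epi32 (-1, 0, -1, 0, -1, 0, -1, 0);
    const __m256 vc = _mm256_set1_ps (c);
    int i;
    for (i = 0; i + 8 <= n; i += 8)
      {
        /* Masked-off lanes read as zero, but they are never stored.  */
        __m256 v = _mm256_maskload_ps (a + i, mask);
        _mm256_maskstore_ps (a + i, mask, _mm256_mul_ps (v, vc));
      }
    for (; i < n; i += 2)  /* scalar tail over the remaining even indices */
      a[i] *= c;
  }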


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
