https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99646
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
             Target|                            |x86_64-*-*
             Blocks|                            |53947
   Last reconfirmed|                            |2021-03-18
          Component|middle-end                  |tree-optimization
     Ever confirmed|0                           |1
           Keywords|                            |missed-optimization

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
We're using quite inefficient vectorization here, and the lack of cross-lane
interleaving permutes is harmful to AVX vectorization since there is no
extract even / extract odd available; we also do not factor this in when
costing.

Surprisingly, not vectorizing is still slower (1.4s vs. 1.35s with AVX for
me), but -funroll-loops without vectorizing is comparable to vectorizing
with SSE, and unrolling on top of SSE vectorizing doesn't help.

In the end, what we miss (apart from the bad use of interleaving) is the
opportunity to use masked stores (and loads), which would halve the number
of usable lanes but likely provide a speedup over scalar unrolled code (an
illustrative sketch follows below).

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
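A minimal sketch of the masked load/store idea using AVX intrinsics, assuming
a stride-2 access pattern (the actual testcase is not shown in this comment;
scale_even and the constant factor are purely illustrative, not from the PR):

#include <immintrin.h>

/* Process only the even-indexed elements of b into a.  With an
   alternating mask, each 8-float AVX vector carries just 4 useful
   lanes -- the "halve the number of usable lanes" tradeoff above --
   but the loads, multiplies, and stores stay fully vectorized.  */
void
scale_even (float *a, const float *b, long n)
{
  /* Lanes whose mask element has the sign bit set are loaded/stored;
     the other lanes are left untouched in memory.  */
  const __m256i mask = _mm256_setr_epi32 (-1, 0, -1, 0, -1, 0, -1, 0);
  const __m256 two = _mm256_set1_ps (2.0f);
  long i;
  for (i = 0; i + 8 <= n; i += 8)
    {
      __m256 vb = _mm256_maskload_ps (b + i, mask);
      _mm256_maskstore_ps (a + i, mask, _mm256_mul_ps (vb, two));
    }
  for (; i < n; i += 2)  /* scalar epilogue for the remainder */
    a[i] = b[i] * 2.0f;
}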