https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86625

Alexander Monakov <amonakov at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org
          Component|rtl-optimization            |tree-optimization

--- Comment #1 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
Please supply testcase(s) as Bugzilla attachments, not external links.

At -O3/-Ofast the main issue is early unrolling ('cunrolli') completely
unrolling all simple 16-iteration inner loops. After that, IMHO, all hope is
lost, and yes, it looks like we then try to vectorize across the other dimension.

With -O3 -fdisable-tree-cunrolli, or with -O2 -ftree-vectorize, we do get the
correct vectorization pattern, but a couple of problems remain: after vect,
tree optimizations cannot hoist/sink memory references out of the outer loop,
leaving 2 loads, 1 load-broadcast and 1 store per fma. Later, RTL PRE cleans
up the redundant vector loads, but the load-broadcasts and stores remain.
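
For readers without the original testcase, here is a minimal sketch of the
kind of loop nest being discussed. It is not the reporter's code: the name
matmul16, the constant N, the float element type and the i/k/j loop order are
all assumptions made for illustration. The only details taken from the comment
are the 16-iteration inner loops and the fma in the loop body.

```c
/* Hypothetical stand-in for the testcase, not the reporter's code. */
#define N 16

/* Accumulate c += a * b for N x N matrices.  Every loop has a constant
   trip count of 16, which is the shape that early complete unrolling
   (cunrolli) picks up at -O3. */
void matmul16(float c[N][N], float a[N][N], float b[N][N])
{
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
            /* Vectorizing over j: a[i][k] is invariant in this loop, so it
               becomes a broadcast, while c[i][j] and b[k][j] are contiguous
               vector accesses. */
            for (int j = 0; j < N; j++)
                c[i][j] += a[i][k] * b[k][j];
}
```

Building such a file with the option sets named above (-O3, then
-O3 -fdisable-tree-cunrolli, then -O2 -ftree-vectorize) and comparing the
generated assembly is one way to look at the difference; what a given GCC
version actually emits for this sketch is not something the comment states.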
