https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86625
Alexander Monakov <amonakov at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org
          Component|rtl-optimization            |tree-optimization

--- Comment #1 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
Please supply testcase(s) as Bugzilla attachments, not external links.

At -O3/-Ofast the main issue is early unrolling ('cunrolli') splatting all
simple 16-iteration inner loops. After that imho all hope is lost, and yeah,
looks like we try to vectorize across the other dimension.

With -O3 -fdisable-tree-cunrolli, or with -O2 -ftree-vectorize we do get the
correct vectorization pattern, but a couple of problems remain: after vect,
tree optimizations cannot hoist/sink memory references out of the outer loop,
leaving 2 loads, 1 load-broadcast and 1 store per each fma. Later, RTL PRE
cleans up redundant vector loads, but load-broadcasts and stores remain.
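
For readers without access to the externally linked testcase, here is a rough sketch of the kind of loop nest the comment appears to describe. It is a hypothetical stand-in, not the reporter's code: the names mm16, mm16_check and N, the 16x16 float shape and the i/k/j loop order are all assumptions chosen to match the "simple 16-iteration inner loops" and the per-fma memory traffic mentioned above.

```c
#include <assert.h>

enum { N = 16 };  /* assumed trip count, from "16-iteration inner loops" */

/* Hypothetical kernel: c += a * b on 16x16 float matrices.
   If the j loop is the one vectorized, each vector fma needs
   a load of b[k][j..], a load of c[i][j..], a broadcast of a[i][k]
   and a store to c[i][j..], i.e. the 2 loads, 1 load-broadcast and
   1 store per fma described in the comment, unless the accesses to
   c[i][..] are hoisted/sunk out of the k loop. */
void mm16(float c[N][N], float a[N][N], float b[N][N])
{
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)   /* simple 16-iteration inner loop */
                c[i][j] += a[i][k] * b[k][j];
}

/* Small self-check so the sketch can be compiled and run at any
   optimization level. Inputs are small integers, so every product and
   partial sum is exactly representable in float and == is safe here.
   Returns the number of elements verified. */
int mm16_check(void)
{
    static float a[N][N], b[N][N], c[N][N];
    int checked = 0;

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            a[i][j] = (float)(i + j);
            b[i][j] = (float)(i - j);
            c[i][j] = 0.0f;
        }

    mm16(c, a, b);

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            float ref = 0.0f;
            for (int k = 0; k < N; k++)
                ref += a[i][k] * b[k][j];
            assert(c[i][j] == ref);
            checked++;
        }
    return checked;
}
```

To compare the behaviours discussed above, one could compile such a file for an FMA-capable target with -O3, with -O3 -fdisable-tree-cunrolli, and with -O2 -ftree-vectorize, and inspect the assembly or add -fopt-info-vec to see which loop gets vectorized. Whether this stand-in reproduces the reported codegen exactly is not verified here.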