https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99646
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
             Target|                            |x86_64-*-*
             Blocks|                            |53947
   Last reconfirmed|                            |2021-03-18
          Component|middle-end                  |tree-optimization
     Ever confirmed|0                           |1
           Keywords|                            |missed-optimization

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
We're using quite inefficient vectorization here, and the lack of cross-lane
interleaving permutes is harmful to AVX vectorization since there is no
extract even / extract odd available; we also do not factor this in when
costing.

Surprisingly, not vectorizing is still slower (1.4s vs. 1.35s with AVX for
me), but -funroll-loops without vectorizing is comparable to vectorizing
with SSE, and unrolling on top of SSE vectorizing doesn't help.

In the end, what we miss (apart from the bad use of interleaving) is the
opportunity to use masked stores (and loads), which would halve the number
of usable lanes but likely provide a speedup over scalar unrolled code (an
illustrative sketch follows below).

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
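A minimal sketch of the masked load/store idea using AVX intrinsics, assuming
a stride-2 access pattern (the actual testcase is not shown in this comment;
scale_even and the constant factor are purely illustrative, not from the PR):

#include <immintrin.h>

/* Process only the even-indexed elements of b into a.  With an
   alternating mask, each 8-float AVX vector carries just 4 useful
   lanes -- the "halve the number of usable lanes" tradeoff above --
   but the loads, multiplies, and stores stay fully vectorized.  */
void
scale_even (float *a, const float *b, long n)
{
  /* Lanes whose mask element has the sign bit set are loaded/stored;
     the other lanes are left untouched in memory.  */
  const __m256i mask = _mm256_setr_epi32 (-1, 0, -1, 0, -1, 0, -1, 0);
  const __m256 two = _mm256_set1_ps (2.0f);
  long i;
  for (i = 0; i + 8 <= n; i += 8)
    {
      __m256 vb = _mm256_maskload_ps (b + i, mask);
      _mm256_maskstore_ps (a + i, mask, _mm256_mul_ps (vb, two));
    }
  for (; i < n; i += 2)  /* scalar epilogue for the remainder */
    a[i] = b[i] * 2.0f;
}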