https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104344
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed     |Added
----------------------------------------------------------------------------
                 CC|            |hubicka at gcc dot gnu.org,
                   |            |rguenth at gcc dot gnu.org,
                   |            |rsandifo at gcc dot gnu.org
           Keywords|            |missed-optimization
             Status|UNCONFIRMED |NEW
   Last reconfirmed|            |2022-02-02
     Ever confirmed|0           |1

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
If you add -fopt-info to the compiler command you can see what happens:

> ./cc1 -quiet t.c -Os -fopt-info
t.c:21:21: optimized: Loop 1 distributed: split to 0 loops and 1 library calls.

We recognized the loop and turned it into a memmove, which is then
optimally expanded.

> ./cc1 -quiet t.c -O3 -fopt-info
t.c:10:10: optimized: basic block part vectorized using 16 byte vectors
t.c:21:21: optimized: Loop 1 distributed: split to 0 loops and 1 library calls.

Here the same thing happens, but in addition the unrolled statements are
vectorized, producing the same effective result.

Starting with GCC 12, vectorization also happens at -O2, but not at -Os.
You can add -ftree-vectorize -fvect-cost-model=very-cheap to mimic at -Os
what we do at -O2.

Note that vectorization can also increase code size, on x86 for example
because the vector instructions, even if there end up being fewer of them,
have a larger encoding. That may also be something to consider for -Oz
vs. -Os, and for the general question of enabling vectorization at -O2
vs. -Os.

The pattern itself could also be recognized by store-merging/bswap
detection (but the interleaving accesses can make the separate analysis of
merging stores and replacing the load difficult).