https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104344

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hubicka at gcc dot gnu.org,
                   |                            |rguenth at gcc dot gnu.org,
                   |                            |rsandifo at gcc dot gnu.org
           Keywords|                            |missed-optimization
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2022-02-02
     Ever confirmed|0                           |1

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
If you add -fopt-info to the compiler command you can see what happens:

> ./cc1 -quiet t.c -Os -fopt-info
t.c:21:21: optimized: Loop 1 distributed: split to 0 loops and 1 library calls.

We recognized the loop and turned it into a memmove, which is then optimally
expanded.

> ./cc1 -quiet t.c -O3 -fopt-info
t.c:10:10: optimized: basic block part vectorized using 16 byte vectors
t.c:21:21: optimized: Loop 1 distributed: split to 0 loops and 1 library calls.

Here the same thing happens, but in addition the unrolled statements are
vectorized, producing the same effective result.

Starting with GCC 12 vectorization will also happen at -O2 but not at -Os.
You can add -ftree-vectorize -fvect-cost-model=very-cheap to mimic what we
do at -O2 with -Os.

Note that vectorization can also increase code size, on x86 for example
because the vector instructions, even if there end up being fewer of them,
have a larger encoding.

Maybe also something to consider for -Oz vs. -Os, and for the general
question of -O2 vs. -Os and vectorization enablement.

The pattern itself could also be recognized by store-merging/bswap detection
(but the interleaved accesses can make the separate analysis of merging stores
and replacing the load difficult).
