https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119681

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|Aarch64                     |Aarch64 x86_64
     Ever confirmed|0                           |1
           Severity|normal                      |enhancement
           Keywords|                            |ra
   Last reconfirmed|                            |2025-04-08
            Summary|extraneous move             |extraneous move
                   |instructions when unrolling |instructions when unrolling
                   |core_list_reverse () with   |core_list_reverse () with
                   |-O3 -funroll-all-loops      |-O3 -funroll-all-loops; not
                   |                            |copying the return block
             Status|UNCONFIRMED                 |NEW

--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
To some extent this is an RA issue, and about live-range splitting. In this case
we have an exit to this location:
```
(insn 23 18 24 7 (set (reg/i:DI 0 x0)
        (reg/v/f:DI 104 [ list ])) "/app/example.c":29:1 70 {*movdi_aarch64}
     (expr_list:REG_DEAD (reg/v/f:DI 104 [ list ])
        (nil)))
(insn 24 23 59 7 (use (reg/i:DI 0 x0)) "/app/example.c":29:1 -1
     (nil))
      ; pc falls through to BB 1
      ;; BB 1 is the exit block
```
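
For reference, the function in question is CoreMark's list reversal. A minimal
sketch of its shape (hand-written here for illustration; the names and types
are assumptions, not the exact benchmark source):
```
struct list_head {
    struct list_head *next;
};

/* Reverse a singly linked list in place and return the new head.
   With -O3 -funroll-all-loops the while loop is unrolled, and every
   unrolled exit funnels into the single return block shown above,
   where the result must be moved into the ABI return register x0. */
struct list_head *core_list_reverse(struct list_head *list)
{
    struct list_head *next = 0;
    while (list) {
        struct list_head *tmp = list->next;
        list->next = next;
        next = list;
        list = tmp;
    }
    return next;
}
```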

If we had copied this block a few times and split/renamed pseudo 104 in the
copies, the RA would do its job. The question is how to detect this and do it
without copying too much; I am not 100% sure copying that BB three times is
always the right thing to do.
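
Schematically, the effect at the source level would be something like the
following (hand-written C to illustrate the live-range splitting, not compiler
output):
```
/* Before: both exits funnel into one shared return block, so a single
   pseudo must hold the result there and the RA inserts a move on each
   incoming edge. */
long shared_return(int cond, long a, long b)
{
    long val;
    if (cond)
        val = a;   /* move a -> val */
    else
        val = b;   /* move b -> val */
    return val;    /* val pinned to x0 here */
}

/* After duplicating the return block into each predecessor, each value
   has its own pseudo and can be coalesced with x0 directly. */
long duplicated_return(int cond, long a, long b)
{
    if (cond)
        return a;  /* a allocated straight into x0 */
    return b;      /* b allocated straight into x0 */
}
```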

And as I mentioned, which way is better depends on the micro-architecture.

You see the same behavior on x86_64 too, but on most x86_64
micro-architectures the move is free (register-to-register moves are typically
eliminated at rename), so there is no need to hide the latency anyway. It
does, though, increase the number of instructions inside the inner loop, which
could cause the loop to span two icache blocks; that, rather than anything
else, might be where the slowdown is.

I am also not so sure this optimization applies in general or just to
CoreMark. Plus, inlining core_list_reverse might allow the moves to go away
entirely, since the result would no longer be constrained to the return
register, which is the biggest constraint here.
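
A hypothetical caller sketch of what inlining buys (the caller shape is an
assumption, not the benchmark source):
```
struct list_head;
struct list_head *core_list_reverse(struct list_head *list);

/* Hypothetical caller: if core_list_reverse() is inlined here (e.g. via
   LTO or a static inline definition), the reversed head becomes just
   another pseudo with no ABI return-register constraint, so the RA is
   free to place it and the moves into x0 at each loop exit can go away. */
void iterate(struct list_head **head)
{
    *head = core_list_reverse(*head);
}
```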
