https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119681
Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|Aarch64                     |Aarch64 x86_64
     Ever confirmed|0                           |1
           Severity|normal                      |enhancement
           Keywords|                            |ra
   Last reconfirmed|                            |2025-04-08
            Summary|extraneous move             |extraneous move
                   |instructions when unrolling |instructions when unrolling
                   |core_list_reverse () with   |core_list_reverse () with
                   |-O3 -funroll-all-loops      |-O3 -funroll-all-loops; not
                   |                            |copying the return block
             Status|UNCONFIRMED                 |NEW

--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
To some extent this is an RA issue, about live-range splitting. In this case
we have an exit that reaches this location:
```
(insn 23 18 24 7 (set (reg/i:DI 0 x0)
        (reg/v/f:DI 104 [ list ])) "/app/example.c":29:1 70 {*movdi_aarch64}
     (expr_list:REG_DEAD (reg/v/f:DI 104 [ list ])
        (nil)))
(insn 24 23 59 7 (use (reg/i:DI 0 x0)) "/app/example.c":29:1 -1
     (nil))
; pc falls through to BB 1
```
;; BB 1 is the exit block

If we had copied this block a few times and split/renamed pseudo 104 in the
copies, the RA would do its job. The question is how to detect this and do it
without copying too much; I am not 100% sure that copying that BB 3 times is
always the right thing to do. And as I mentioned, which way is better depends
on the micro-architecture.

You see the same behavior on x86_64 too, but on most x86_64
micro-architectures the move is free, so there is no need to hide the latency
anyway. Though it does increase the number of instructions inside the inner
loop, which could cause the loop to span two icache blocks, and that might be
where the slowdown is rather than anything else.

I am also not so sure this optimization applies in general or just to
CoreMark. Plus, inlining core_list_reverse might allow the moves to go away
entirely, since the result would no longer be constrained to the return
register, which is the biggest constraint here.
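For reference, a minimal sketch of the kind of loop involved. This only
approximates CoreMark's core_list_reverse() and the PR's example.c; the struct
layout is simplified and variable names may not match the RTL dump above:
```
typedef struct list_head_s {
    struct list_head_s *next;
    struct list_data_s *info;   /* payload, irrelevant to the reversal */
} list_head;

/* Classic in-place singly-linked-list reversal.  On AArch64 the returned
   pointer has to end up in x0, and there is only one return block, so with
   -O3 -funroll-all-loops every unrolled copy of the body feeds the same
   pseudo that the exit block copies into x0 -- which is where the extra
   moves discussed in this PR show up. */
list_head *list_reverse(list_head *list)
{
    list_head *next = NULL;

    while (list) {
        list_head *tmp = list->next;
        list->next = next;
        next = list;
        list = tmp;
    }
    return next;
}
```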
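And, purely to illustrate the inlining point, a hypothetical caller (not from
the PR, reusing the types from the sketch above):
```
/* Hypothetical caller: if list_reverse() is inlined here, the reversed head
   is just another pseudo feeding the walk below, so nothing forces it
   through the return register x0 at the end of the unrolled loop. */
static int list_length_reversed(list_head *list)
{
    int n = 0;

    for (list_head *p = list_reverse(list); p != NULL; p = p->next)
        n++;
    return n;
}
```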