https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114991
--- Comment #9 from Alex Coplan <acoplan at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #8)
> Is this now fixed on trunk?

No, not really.  The codegen at -O2 on trunk is:

f:
        stp     x29, x30, [sp, -144]!
        mov     x29, sp
        add     x0, sp, 80
        bl      g
        ldp     q28, q30, [sp, 80]
        add     x0, sp, 16
        ldp     q29, q31, [sp, 112]
        str     q28, [sp, 16]
        stp     q30, q29, [sp, 32]
        str     q31, [sp, 64]
        bl      h
        ldp     x29, x30, [sp], 144
        ret

Vlad's fix above helped reduce the frame size (thanks!).  Immediately before
that change (r15-7931), we have (again at -O2):

f:
        stp     x29, x30, [sp, -160]!
        mov     x29, sp
        add     x0, sp, 96
        bl      g
        ldp     q28, q30, [sp, 96]
        add     x0, sp, 32
        ldp     q29, q31, [sp, 128]
        str     q28, [sp, 32]
        stp     q30, q29, [x0, 16]
        str     q31, [x0, 48]
        bl      h
        ldp     x29, x30, [sp], 160
        ret

so this is an improvement, but it looks like with these insns:

        str     q28, [sp, 16]
        stp     q30, q29, [sp, 32]
        str     q31, [sp, 64]

we're forming an stp in the middle, so we've still got work to do in
ldp_fusion here.

Also, as Andrew noted in #c5, the only reason we do better now is because of
Wilco's change to turn the scheduler off on AArch64
(r15-6661-gc5db3f50bdf34ea96fd193a2a66d686401053bd2).

Wilco also later re-enabled the scheduler at -O3
(r15-7871-gf870302515d5fcf7355f0108c3ead0038ff326fd), so e.g. taking the
testcase in #c1 at -O3 on trunk, we get:

f:
        stp     x29, x30, [sp, -144]!
        mov     x29, sp
        add     x0, sp, 80
        bl      g
        ldp     q31, q30, [sp, 80]
        add     x0, sp, 16
        ldr     q29, [sp, 112]
        str     q31, [sp, 16]
        ldr     q31, [sp, 128]
        stp     q30, q29, [sp, 32]
        str     q31, [sp, 64]
        bl      h
        ldp     x29, x30, [sp], 144
        ret

i.e. we still have the same pathological interleaving caused by the scheduler
(which ldp_fusion is currently unable to undo without the patch in #c4).  So
there is still work to do in ldp_fusion.

The problem I ran into before is that the fix I proposed in #c4 isn't always
beneficial.  I believe this is because, although we form more pairs with that
change, doing so before RA can scupper the RA's REG_EQUIV optimization and
lead to spills.
It needs more investigation to confirm this, but if so, I think we need some mechanism to allow the RA to either crack or look through paired loads/stores when it is beneficial to do so (e.g. to permit a REG_EQUIV optimization and avoid an additional spill). So there is more to do, but it is not at all straightforward to fix.
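
To make the "stp in the middle" observation concrete, here is a small toy model in Python.  It is not how ldp_fusion works; the function name `pair_from_lowest` and the tuple encoding are made up for illustration.  It only counts how many store insns are needed for the four 16-byte slots at [sp, 16] .. [sp, 64], comparing the pairing seen in the -O2 listing with one that pairs from the lowest offset.

```python
Q_SIZE = 16  # bytes stored by one AArch64 Q register


def pair_from_lowest(offsets, size=Q_SIZE):
    """Pair adjacent accesses, scanning upwards from the lowest offset.

    Hypothetical helper for illustration only.  Returns a list of tuples:
    (a, b) stands for one stp covering offsets a and b, and (a,) stands
    for a single str at offset a.
    """
    offsets = sorted(offsets)
    insns = []
    i = 0
    while i < len(offsets):
        # Two accesses can share an stp if the second starts exactly
        # where the first one ends.
        if i + 1 < len(offsets) and offsets[i + 1] - offsets[i] == size:
            insns.append((offsets[i], offsets[i + 1]))
            i += 2
        else:
            insns.append((offsets[i],))
            i += 1
    return insns


# Pairing taken from the -O2 trunk listing:
#   str q28, [sp, 16]; stp q30, q29, [sp, 32]; str q31, [sp, 64]
observed = [(16,), (32, 48), (64,)]

# Pairing the same four slots from the lowest offset upwards.
ideal = pair_from_lowest([16, 32, 48, 64])

print(len(observed), "store insns observed:", observed)
print(len(ideal), "store insns when paired from the bottom:", ideal)
```

Under this model the same four slots need three store insns as emitted today, but only two if the pairs are (16, 32) and (48, 64).  Whether forming those pairs is actually profitable still depends on the RA interaction described above.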