https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114991

--- Comment #9 from Alex Coplan <acoplan at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #8)
> Is this now fixed on trunk?

No, not really.  The codegen at -O2 on trunk is:

f:
        stp     x29, x30, [sp, -144]!
        mov     x29, sp
        add     x0, sp, 80
        bl      g
        ldp     q28, q30, [sp, 80]
        add     x0, sp, 16
        ldp     q29, q31, [sp, 112]
        str     q28, [sp, 16]
        stp     q30, q29, [sp, 32]
        str     q31, [sp, 64]
        bl      h
        ldp     x29, x30, [sp], 144
        ret

Vlad's fix above helped reduce the frame size (thanks!).  Immediately before
that change (r15-7931), we have (again at -O2):

f:
        stp     x29, x30, [sp, -160]!
        mov     x29, sp
        add     x0, sp, 96
        bl      g
        ldp     q28, q30, [sp, 96]
        add     x0, sp, 32
        ldp     q29, q31, [sp, 128]
        str     q28, [sp, 32]
        stp     q30, q29, [x0, 16]
        str     q31, [x0, 48]
        bl      h
        ldp     x29, x30, [sp], 160
        ret

so this is an improvement, but it looks like with these insns:

        str     q28, [sp, 16]
        stp     q30, q29, [sp, 32]
        str     q31, [sp, 64]

we're forming an stp from the middle two stores, leaving the first and last
unpaired, so we've still got work to do in ldp_fusion here.  Also, as Andrew
noted in #c5, the only reason we do better now is
because of Wilco's change to turn the scheduler off on AArch64
(r15-6661-gc5db3f50bdf34ea96fd193a2a66d686401053bd2).  Wilco also later
re-enabled the scheduler at -O3
(r15-7871-gf870302515d5fcf7355f0108c3ead0038ff326fd), so e.g. taking the
testcase in #c1 at -O3 on trunk, we get:

f:
        stp     x29, x30, [sp, -144]!
        mov     x29, sp
        add     x0, sp, 80
        bl      g
        ldp     q31, q30, [sp, 80]
        add     x0, sp, 16
        ldr     q29, [sp, 112]
        str     q31, [sp, 16]
        ldr     q31, [sp, 128]
        stp     q30, q29, [sp, 32]
        str     q31, [sp, 64]
        bl      h
        ldp     x29, x30, [sp], 144
        ret

i.e. we still have the same pathological interleaving caused by the scheduler
(which ldp_fusion is currently unable to undo without the patch in #c4).
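
To make the missed opportunity in the -O2 store sequence concrete, here is a
small illustrative model.  It is not how ldp_fusion works: the
(offset, size) representation and the helper names are made up for this
sketch, and it ignores register constraints and data dependencies entirely.
It only shows that the four 16-byte stores at sp+16..sp+64 could in principle
be covered by two pairs rather than a str/stp/str triple.

```python
# Hypothetical model: each store is (offset_from_sp, size_in_bytes).
# This is an illustration only, not GCC's ldp_fusion algorithm.

def adjacent(a, b):
    """True if store b starts exactly where store a ends, with equal sizes."""
    return a[1] == b[1] and a[0] + a[1] == b[0]

def pair_in_order(stores):
    """Sort by offset, then greedily pair each store with its neighbour."""
    stores = sorted(stores)
    out, i = [], 0
    while i < len(stores):
        if i + 1 < len(stores) and adjacent(stores[i], stores[i + 1]):
            out.append((stores[i], stores[i + 1]))  # would become one stp
            i += 2
        else:
            out.append((stores[i],))                # stays a single str
            i += 1
    return out

# The four q-register stores from the -O2 listing (16 bytes each).
stores = [(16, 16), (32, 16), (48, 16), (64, 16)]
paired = pair_in_order(stores)

# What trunk emits at -O2: str @16, stp @32 (covering 32 and 48), str @64.
trunk_o2 = [((16, 16),), ((32, 16), (48, 16)), ((64, 16),)]

print("trunk -O2 store insns:", len(trunk_o2))
print("offset-ordered pairing:", len(paired), paired)
```

In this toy model, pairing from the lowest offset upwards covers the same
64 bytes with two instructions instead of three.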

So there is still work to do in ldp_fusion.  The problem I ran into before is
that the fix I proposed in #c4 isn't always beneficial.  I believe this is
because, although we form more pairs with that change, doing so before RA can
scupper the RA's REG_EQUIV optimization and lead to spills.  It needs more
investigation to confirm this, but if so, I think we need some mechanism to
allow the RA to either crack or look through paired loads/stores when it is
beneficial to do so (e.g. to permit a REG_EQUIV optimization and avoid an
additional spill).

So there is more to do, but it is not at all straightforward to fix.
