https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65358
--- Comment #14 from ktkachov at gcc dot gnu.org ---
Right, I think the root cause is the emit_push_insn in expr.c.
It's supposed to push what needs to be pushed from a partial argument onto
the stack and do the moves into the registers. Currently it performs the
pushes and then does the moves, which does the wrong thing if the pushing
destroys stack elements that it wants to load into registers.

Doing the load-to-registers part first and the pushing second fixed this for
me and generated the below:

foo:
        @ args = 16, pretend = 8, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        sub     sp, sp, #8
        mov     r0, r1
        mov     r1, r2
        str     lr, [sp, #-4]!
        ldr     lr, [sp, #16]
        mov     ip, sp
        str     r3, [ip, #8]!
        ldmia   ip, {r2, r3}
        str     lr, [sp, #12]
        ldr     lr, [sp], #4
        add     sp, sp, #8
        b       bar

which still does the tail call optimisation.

I haven't tested it more extensively yet, so I'll take that approach and
prepare and test a patch.
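
To illustrate the ordering hazard described in the comment above, here is a minimal sketch in C. It is a toy model, not GCC's emit_push_insn: the struct, the function names, the word counts and the assumption that the outgoing stack slots overlap the start of the incoming argument are all hypothetical, chosen only to show why pushing before loading can clobber words that still need to reach registers.

```c
#include <string.h>

/* Hypothetical sizes: a 4-word argument, 2 words passed in registers and
   the remaining 2 words passed on the stack. */
enum { ARG_WORDS = 4, REG_WORDS = 2, STACK_WORDS = ARG_WORDS - REG_WORDS };

/* Toy machine: the argument's words in stack memory plus the registers
   that should receive its leading words.  Purely illustrative. */
struct machine {
    int stack[ARG_WORDS];
    int regs[REG_WORDS];
};

/* Place the incoming argument {10, 20, 30, 40} in stack memory. */
static void machine_init(struct machine *m)
{
    for (int i = 0; i < ARG_WORDS; i++)
        m->stack[i] = 10 * (i + 1);
    memset(m->regs, 0, sizeof m->regs);
}

/* "Push": store the stack-resident tail of the argument into the outgoing
   slots.  Assumption of this model: those slots overlap the start of the
   incoming argument, as can happen when a tail call reuses the argument
   area. */
static void push_stack_part(struct machine *m)
{
    memmove(&m->stack[0], &m->stack[REG_WORDS], STACK_WORDS * sizeof(int));
}

/* "Move": load the leading words of the argument into registers. */
static void load_reg_part(struct machine *m)
{
    memcpy(m->regs, &m->stack[0], REG_WORDS * sizeof(int));
}

/* The order the comment describes as the current behaviour. */
static void push_then_load(struct machine *m)
{
    machine_init(m);
    push_stack_part(m);   /* overwrites stack[0..1] */
    load_reg_part(m);     /* reads the overwritten words */
}

/* The order the comment proposes: registers first, then the pushes. */
static void load_then_push(struct machine *m)
{
    machine_init(m);
    load_reg_part(m);     /* reads the original words */
    push_stack_part(m);
}

/* Returns 1 if the registers hold the first REG_WORDS words of the
   original argument, 0 otherwise. */
static int regs_are_correct(const struct machine *m)
{
    return m->regs[0] == 10 && m->regs[1] == 20;
}
```

In this model, push_then_load leaves the registers holding 30 and 40 (the words that were just pushed), while load_then_push leaves them holding 10 and 20 and still stores 30 and 40 into the outgoing slots.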