https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77308
--- Comment #4 from Bernd Edlinger <bernd.edlinger at hotmail dot de> ---
hmm, when I compare aarch64 vs. arm sha512.c.260r.reload with -O3
-fno-schedule-insns I see a big difference:

aarch64 has only a few spill regs

subreg regs:
  Slot 0 regnos (width = 8):     856
  Slot 1 regnos (width = 8):     857
  Slot 2 regnos (width = 8):     858
  Slot 3 regnos (width = 8):     859
  Slot 4 regnos (width = 8):     860
  Slot 5 regnos (width = 8):     861
  Slot 6 regnos (width = 8):     862
  Slot 7 regnos (width = 8):     2117
  Slot 8 regnos (width = 8):     1164
  Slot 9 regnos (width = 8):     1052

but arm has 415 slots (8 bytes each), and the line "subreg regs:" before the
Spill Slots contains ~1500 regs.

And while aarch64 does not have a single subreg in any pass, arm has lots of
subregs before lra eliminates all of them.

Like this, in sha512.c.217r.expand:

(insn 85 84 86 5 (set (subreg:SI (reg:DI 1670) 4)
        (ashift:SI (subreg:SI (reg:DI 1669) 0)
            (const_int 24 [0x18]))) sha512.c:98 -1
     (nil))
(insn 86 85 87 5 (set (subreg:SI (reg:DI 1670) 0)
        (const_int 0 [0])) sha512.c:98 -1
     (nil))

This funny instruction is generated in arm_emit_coreregs_64bit_shift:

          /* Shifts by a constant greater than 31.  */
          rtx adj_amount = GEN_INT (INTVAL (amount) - 32);

          emit_insn (SET (out_down, SHIFT (code, in_up, adj_amount)));
          if (code == ASHIFTRT)
            emit_insn (gen_ashrsi3 (out_up, in_up,
                                    GEN_INT (31)));
          else
            emit_insn (SET (out_up, const0_rtx));

From my past experience, I assume that using a subreg to write one half of
the out register causes more problems than it solves.

So I tried this:

Index: gcc/config/arm/arm.c
===================================================================
--- gcc/config/arm/arm.c        (revision 239624)
+++ gcc/config/arm/arm.c        (working copy)
@@ -29170,12 +29170,11 @@
          /* Shifts by a constant greater than 31.  */
          rtx adj_amount = GEN_INT (INTVAL (amount) - 32);
 
+         emit_insn (SET (out, const0_rtx));
          emit_insn (SET (out_down, SHIFT (code, in_up, adj_amount)));
          if (code == ASHIFTRT)
            emit_insn (gen_ashrsi3 (out_up, in_up,
                                    GEN_INT (31)));
-         else
-           emit_insn (SET (out_up, const0_rtx));
        }
     }
   else

and it reduced the stack usage from 3472 to 2960.
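
For readers less used to RTL, the two insns from sha512.c.217r.expand can be
modelled in plain C. This is only an illustrative sketch, not GCC code: the
di_pair type and all function names are made up here, and it assumes the
output does not alias the input. It covers only the ASHIFT case with a
constant n in 32..63, where the low input word is "in_up" and the high output
word is "out_down". The second function mirrors the ordering tried in the
patch (zero the whole output first, then write one half).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a DImode value as two SImode halves. */
typedef struct { uint32_t lo, hi; } di_pair;

di_pair split64(uint64_t v)
{
    di_pair p = { (uint32_t)v, (uint32_t)(v >> 32) };
    return p;
}

uint64_t join64(di_pair p)
{
    return ((uint64_t)p.hi << 32) | p.lo;
}

/* Left shift by a constant n, 32 <= n <= 63, as two half-word writes
   (the shape of insns 85 and 86). */
di_pair shl64_two_half_writes(di_pair in, unsigned n)
{
    di_pair out;
    out.hi = in.lo << (n - 32);   /* out_down = in_up << (n - 32) */
    out.lo = 0;                   /* out_up   = 0 */
    return out;
}

/* Same shift, but zeroing the full 64-bit output first and then
   writing only the high half (the ordering used in the patch). */
di_pair shl64_zero_first(di_pair in, unsigned n)
{
    di_pair out = { 0, 0 };       /* out = 0 */
    out.hi = in.lo << (n - 32);   /* out_down = in_up << (n - 32) */
    return out;
}

/* Returns 1 if both models agree with a native 64-bit shift. */
int models_agree(uint64_t v, unsigned n)
{
    uint64_t expected = v << n;
    return join64(shl64_two_half_writes(split64(v), n)) == expected
        && join64(shl64_zero_first(split64(v), n)) == expected;
}

void self_check(void)
{
    /* n = 56 corresponds to (const_int 24) in insn 85, since 56 - 32 = 24. */
    assert(models_agree(0x0123456789abcdefULL, 56));
}
```

Both variants compute the same value; the difference discussed in this
comment is only in how the writes are expressed to the register allocator
(two partial writes through subregs versus one full-width write followed by
one partial write).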