https://gcc.gnu.org/bugzilla/show_bug.cgi?id=42575
ktkachov at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
         Resolution|FIXED                       |---

--- Comment #13 from ktkachov at gcc dot gnu.org ---
So I see this regression still, but only for some -mcpu options.
For example, for -mcpu=cortex-a15 we get:

        mul     r3, r0, r3
        strd    r4, [sp, #-8]!
        umull   r4, r5, r0, r2
        mla     r1, r2, r1, r3
        mov     r0, r4
        add     r5, r1, r5
        mov     r1, r5
        ldrd    r4, [sp]
        add     sp, sp, #8

whereas for cortex-a7 we get:

        mul     r3, r0, r3
        mla     r3, r2, r1, r3
        umull   r0, r1, r0, r2
        add     r1, r3, r1

I think the problem here is reload.
If I look at the postreload dump, for the 'bad' RTL I see:

        r0(SI) := r0(SI)
        r3(SI) := r0(SI) * r3(SI)
        r4(DI) := r0(SI) * r2(SI) //with sign extension
        r1(SI) := r2(SI) * r1(SI) + r3(SI)
        r5(SI) := r1(SI) + r5(SI)
        r0(DI) := r4(DI)

whereas for the good one I see:

        r0(SI) := r0(SI)
        r3(SI) := r0(SI) * r3(SI)
        r3(SI) := r2(SI) * r1(SI) + r3(SI)
        r0(DI) := r0(SI) * r2(SI) //with sign extension
        r1(SI) := r3(SI) + r1(SI)
        r0(DI) := r0(DI)

In the good one the final insn is eliminated because it is dead (a move of
r0 to itself), whereas in the bad one the final DImode move is split into
two SImode moves.

Sched1 changed the order of the mult and mult-accumulate, but it's the
register allocator that causes the bad codegen.
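
For readers following the instruction sequences, here is a rough C sketch of the arithmetic that the four-instruction cortex-a7 sequence appears to perform. It assumes the code under test is a 64-bit by 64-bit multiply with the arguments in r0:r1 and r2:r3 (low word first), which is what the instructions suggest but which is not stated in this comment. The function name mul64_by_parts is made up for illustration.

```c
#include <stdint.h>

/* Hypothetical helper, not taken from the bug report: a 64x64 -> 64-bit
   multiply built from 32-bit pieces, one statement per instruction of the
   cortex-a7 sequence. The register names in the comments ASSUME x arrives
   in r0:r1 and y in r2:r3, low word first. */
uint64_t mul64_by_parts(uint64_t x, uint64_t y)
{
    uint32_t x_lo = (uint32_t)x;          /* r0 */
    uint32_t x_hi = (uint32_t)(x >> 32);  /* r1 */
    uint32_t y_lo = (uint32_t)y;          /* r2 */
    uint32_t y_hi = (uint32_t)(y >> 32);  /* r3 */

    /* Cross terms only affect the high word, so 32-bit wraparound is fine. */
    uint32_t cross = x_lo * y_hi;                 /* mul   r3, r0, r3     */
    cross = y_lo * x_hi + cross;                  /* mla   r3, r2, r1, r3 */

    /* Full 64-bit product of the two low words. */
    uint64_t low = (uint64_t)x_lo * y_lo;         /* umull r0, r1, r0, r2 */

    /* Fold the cross terms into the high word of the result. */
    uint32_t hi = cross + (uint32_t)(low >> 32);  /* add   r1, r3, r1     */

    return ((uint64_t)hi << 32) | (uint32_t)low;
}
```

In the good sequence the umull result is written directly to r0:r1, the return registers under this assumption, so no further move is needed. In the cortex-a15 sequence the same product lands in r4:r5, which is why the extra moves and the save/restore of r4 and r5 show up.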