https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82237
Bug ID: 82237
Summary: [AArch64] Destructive operations result in poor register allocation after scheduling
Product: gcc
Version: 8.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: jgreenhalgh at gcc dot gnu.org
Target Milestone: ---

A destructive operation is one in which an input operand is both read and
written. For example, in the vector FMLA instruction in AArch64:

  FMLA v0.4s, v1.4s, v2.4s

The first operand is used for the accumulator value (the operation is
v0 = v0 + v1 * v2) and is both read and written by the instruction.

In RTL terms, this is:

  (define_insn "fma<mode>4"
    [(set (match_operand:VHSDF 0 "register_operand" "=w")
         (fma:VHSDF (match_operand:VHSDF 1 "register_operand" "w")
                    (match_operand:VHSDF 2 "register_operand" "w")
                    (match_operand:VHSDF 3 "register_operand" "0")))]
    "TARGET_SIMD"
    "fmla\\t%0.<Vtype>, %1.<Vtype>, %2.<Vtype>"
    [(set_attr "type" "neon_fp_mla_<stype><q>")]
  )

from config/aarch64/aarch64-simd.md. The "0" constraint on operand 3 ties the
accumulator input to the output register.

We can get suboptimal code where a value is used both by a destructive
operation and by a non-destructive operation, and the destructive operation
is scheduled before the non-destructive operation.

For example, with this auto-vectorizable code (with trunk, -O3
-mcpu=cortex-a57):

  void
  foo (float* __restrict__ in1, float* __restrict__ in2,
       float* __restrict__ out1, float* __restrict__ out2)
  {
    for (int i = 0; i < 1024; i++)
      {
        float t = out1[i];
        out1[i] = t + in1[i] * in2[i];
        out2[i] = t + in1[i];
      }
  }

we generate:

  ldr     q1, [x2, x4]
  ldr     q0, [x0, x4]
  ldr     q2, [x1, x4]
  mov     v3.16b, v1.16b        // <<<<<< 1)
  fmla    v3.4s, v2.4s, v0.4s   // <<<<<< 2)
  fadd    v0.4s, v0.4s, v1.4s   // <<<<<< 3)
  str     q3, [x2, x4]
  str     q0, [x3, x4]

The scheduling of 2) before 3) forces a reload from v1 into v3 at 1): v1 is
still needed by the FADD at 3), so the FMLA cannot overwrite it and must
accumulate into a copy.

With an improved schedule, this could be:

  ldr     q1, [x2, x4]
  ldr     q0, [x0, x4]
  ldr     q2, [x1, x4]
  fadd    v4.4s, v0.4s, v1.4s   // <<<<<< 3)
  fmla    v1.4s, v2.4s, v0.4s   // <<<<<< 2)
  str     q1, [x2, x4]
  str     q4, [x3, x4]

Here the FADD reads v1 before the FMLA overwrites it, so the FMLA can
accumulate in place and no copy is needed.

In larger loops this situation arises more often than we would like, and the
cost of the extra move instructions can be high.
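
To make the difference between the two orderings concrete, here is a rough
portable C model of one vector iteration. It is only a sketch: the vreg type
and the function names are hypothetical, the instruction counts are simply
those of the listings (loads and stores excluded), and the FADD-first variant
assumes the FMLA accumulates in place in v1. It says nothing about how the
scheduler or register allocator would actually reach the better order.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a 128-bit vector register viewed as .4s. */
typedef struct { float lane[4]; } vreg;

/* FMLA vd, vn, vm: destructive, vd is both read and written. */
static void fmla(vreg *vd, const vreg *vn, const vreg *vm)
{
    for (size_t i = 0; i < 4; i++)
        vd->lane[i] += vn->lane[i] * vm->lane[i];
}

/* FADD vd, vn, vm: non-destructive, vd is only written. */
static void fadd(vreg *vd, const vreg *vn, const vreg *vm)
{
    for (size_t i = 0; i < 4; i++)
        vd->lane[i] = vn->lane[i] + vm->lane[i];
}

/* FMLA scheduled first: v1 is still live for the FADD, so it must be
   copied before the FMLA clobbers its accumulator.
   Returns the number of instructions modelled. */
static int schedule_fmla_first(vreg t, vreg in1, vreg in2,
                               vreg *out1, vreg *out2)
{
    vreg v1 = t, v0 = in1, v2 = in2, v3;
    v3 = v1;                /* 1) mov  v3.16b, v1.16b       */
    fmla(&v3, &v2, &v0);    /* 2) fmla v3.4s, v2.4s, v0.4s  */
    fadd(&v0, &v0, &v1);    /* 3) fadd v0.4s, v0.4s, v1.4s  */
    *out1 = v3;
    *out2 = v0;
    return 3;
}

/* FADD scheduled first: v1 is dead after the FADD, so the FMLA can
   accumulate into it directly and no copy is needed. */
static int schedule_fadd_first(vreg t, vreg in1, vreg in2,
                               vreg *out1, vreg *out2)
{
    vreg v1 = t, v0 = in1, v2 = in2, v4;
    fadd(&v4, &v0, &v1);    /* 3) fadd v4.4s, v0.4s, v1.4s  */
    fmla(&v1, &v2, &v0);    /* 2) fmla v1.4s, v2.4s, v0.4s  */
    *out1 = v1;
    *out2 = v4;
    return 2;
}

/* Runs both orderings on the same inputs, checks that the results match,
   and returns how many instructions the FADD-first order saves. */
static int saved_instructions(vreg t, vreg in1, vreg in2)
{
    vreg a1, a2, b1, b2;
    int n_fmla_first = schedule_fmla_first(t, in1, in2, &a1, &a2);
    int n_fadd_first = schedule_fadd_first(t, in1, in2, &b1, &b2);
    for (size_t i = 0; i < 4; i++) {
        assert(a1.lane[i] == b1.lane[i]);
        assert(a2.lane[i] == b2.lane[i]);
    }
    return n_fmla_first - n_fadd_first;
}
```

Calling saved_instructions with any three input vectors shows both orderings
producing the same out1 and out2 values, with the FADD-first order one
instruction shorter per vector iteration in this model.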