https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119046
Bug ID: 119046
Summary: [15 Regression] Performance drop from not forming lane-wise FMLAs with Eigen library
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: ktkachov at gcc dot gnu.org
Target Milestone: ---

Created attachment 60603
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=60603&action=edit
Reproducer for aarch64

Unfortunately I couldn't reduce this to a smaller example, but I'm attaching a small benchmark that builds against the Eigen library to reproduce the issue.
You'll need the template-only Eigen library from https://gitlab.com/libeigen/eigen checked out.

On aarch64 you can build the benchmark with:

    g++ -I../eigen -O3 -mcpu=neoverse-v2 benchmark.cpp

Running the resulting binary should give a GFOPS number (higher is better).
Building the benchmark with GCC 15 gives a number about 20% lower than with GCC 14.

The codegen difference is down to GCC 14 producing this in the critical GEMM loop:

        ldp     q29, q9, [x1]
        ldp     q11, q12, [x0]
        ldr     q13, [x0, 32]
        fmla    v3.4s, v13.4s, v29.s[0]
        fmla    v26.4s, v11.4s, v29.s[0]
        fmla    v27.4s, v11.4s, v29.s[1]
        fmla    v28.4s, v11.4s, v29.s[2]
        fmla    v14.4s, v11.4s, v29.s[3]
        fmla    v15.4s, v12.4s, v29.s[0]
        fmla    v0.4s, v12.4s, v29.s[1]
        fmla    v1.4s, v12.4s, v29.s[2]
        fmla    v2.4s, v12.4s, v29.s[3]
        fmla    v4.4s, v13.4s, v29.s[1]
        fmla    v5.4s, v13.4s, v29.s[2]
        fmla    v7.4s, v13.4s, v29.s[3]
        mov     v29.16b, v10.16b
        fmla    v16.4s, v11.4s, v9.s[0]
        fmla    v17.4s, v12.4s, v9.s[0]
        fmla    v19.4s, v11.4s, v9.s[1]
        fmla    v20.4s, v12.4s, v9.s[1]
        fmla    v22.4s, v11.4s, v9.s[2]
        fmla    v23.4s, v12.4s, v9.s[2]
        fmla    v25.4s, v11.4s, v9.s[3]
        fmla    v18.4s, v13.4s, v9.s[0]
        fmla    v21.4s, v13.4s, v9.s[1]
        fmla    v24.4s, v13.4s, v9.s[2]
        fmla    v29.4s, v12.4s, v9.s[3]
        fmla    v31.4s, v13.4s, v9.s[3]

whereas GCC 15 emits extra lane-dup instructions and feeds their results to full-vector FMLAs:

        ldp     q5, q6, [x1]
        ldp     q2, q3, [x0]
        ldr     q4, [x0, 32]
        dup     v1.4s, v5.s[1]
        fmla    v29.4s, v4.4s, v5.s[0]
        fmla    v30.4s, v2.4s, v5.s[0]
        fmla    v28.4s, v3.4s, v5.s[0]
        fmla    v7.4s, v2.4s, v1.4s
        fmla    v8.4s, v3.4s, v1.4s
        fmla    v9.4s, v4.4s, v1.4s
        dup     v1.4s, v5.s[2]
        dup     v5.4s, v5.s[3]
        fmla    v17.4s, v2.4s, v6.s[0]
        fmla    v14.4s, v2.4s, v5.4s
        fmla    v15.4s, v3.4s, v5.4s
        fmla    v16.4s, v4.4s, v5.4s
        dup     v5.4s, v6.s[1]
        fmla    v18.4s, v3.4s, v6.s[0]
        fmla    v19.4s, v4.4s, v6.s[0]
        fmla    v20.4s, v2.4s, v5.4s
        fmla    v21.4s, v3.4s, v5.4s
        fmla    v22.4s, v4.4s, v5.4s
        dup     v5.4s, v6.s[2]
        dup     v6.4s, v6.s[3]
        fmla    v10.4s, v2.4s, v1.4s
        fmla    v12.4s, v3.4s, v1.4s
        fmla    v13.4s, v4.4s, v1.4s
        fmla    v23.4s, v2.4s, v5.4s
        fmla    v24.4s, v3.4s, v5.4s
        fmla    v25.4s, v4.4s, v5.4s
        fmla    v26.4s, v2.4s, v6.4s
        fmla    v27.4s, v3.4s, v6.4s
        fmla    v31.4s, v4.4s, v6.4s

I've bisected this to the change g:9dbff9c05520a74e6cd337578f27b56c941f64f3, titled:

    Revert "Revert "combine: Don't combine if I2 does not change""

The code inside Eigen that generates the FMLAs is in Eigen/src/Core/arch/NEON/GeneralBlockPanelKernel.h. It references PR89101 as a previous incarnation of this bug, which they had to work around with inline assembly.
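
For readers less familiar with the two instruction forms, here is a minimal portable sketch of what each sequence computes. It is a scalar model only, not the Eigen kernel and not what GCC emits; the names Vec4, fmla_lane, dup_lane, fmla_vec and forms_agree are hypothetical and chosen for illustration. It shows that "dup + vector fmla" gives the same result as the lane-wise fmla, so the extra dup is pure overhead in the loop.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical stand-in for a 4 x float32 vector register (e.g. v5.4s).
using Vec4 = std::array<float, 4>;

// Models `fmla vd.4s, vn.4s, vm.s[lane]`: acc[i] += a[i] * b[lane].
inline Vec4 fmla_lane(Vec4 acc, const Vec4& a, const Vec4& b, std::size_t lane) {
    for (std::size_t i = 0; i < 4; ++i) acc[i] = std::fma(a[i], b[lane], acc[i]);
    return acc;
}

// Models `dup vd.4s, vn.s[lane]`: broadcast one lane to all four lanes.
inline Vec4 dup_lane(const Vec4& b, std::size_t lane) {
    return Vec4{b[lane], b[lane], b[lane], b[lane]};
}

// Models `fmla vd.4s, vn.4s, vm.4s`: acc[i] += a[i] * b[i].
inline Vec4 fmla_vec(Vec4 acc, const Vec4& a, const Vec4& b) {
    for (std::size_t i = 0; i < 4; ++i) acc[i] = std::fma(a[i], b[i], acc[i]);
    return acc;
}

// True if, for every lane, the one-instruction form and the
// two-instruction (dup + fmla) form produce identical results.
inline bool forms_agree(const Vec4& acc, const Vec4& a, const Vec4& b) {
    for (std::size_t lane = 0; lane < 4; ++lane) {
        if (fmla_lane(acc, a, b, lane) != fmla_vec(acc, a, dup_lane(b, lane)))
            return false;
    }
    return true;
}

// Small self-check a reader can call from main().
inline void self_check() {
    assert(forms_agree(Vec4{1.0f, 2.0f, 3.0f, 4.0f},
                       Vec4{0.5f, 1.5f, 2.5f, 3.5f},
                       Vec4{2.0f, 3.0f, 5.0f, 7.0f}));
}
```

In the listings above, GCC 14 uses only the fmla_lane form, while GCC 15 uses it for lane 0 and the dup_lane + fmla_vec form for lanes 1 to 3.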