https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119046

            Bug ID: 119046
           Summary: [15 Regression] Performance drop from not forming
                    lane-wise FMLAs with Eigen library
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ktkachov at gcc dot gnu.org
  Target Milestone: ---

Created attachment 60603
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=60603&action=edit
Reproducer for aarch64

Unfortunately I couldn't reduce this to a smaller example, but I'm attaching a
small benchmark that builds against the Eigen library to reproduce the issue.

You'll need the template-only Eigen library from
https://gitlab.com/libeigen/eigen checked out.

On aarch64 you can build the benchmark with:
g++ -I../eigen -O3 -mcpu=neoverse-v2 benchmark.cpp

Running the resulting binary should print a GFLOPS number (higher is better).
Building the benchmark with GCC 15 gives a number about 20% lower than with
GCC 14.

The codegen difference is down to GCC 14 producing this in the critical GEMM
loop:
        ldp     q29, q9, [x1]
        ldp     q11, q12, [x0]
        ldr     q13, [x0, 32]
        fmla    v3.4s, v13.4s, v29.s[0]
        fmla    v26.4s, v11.4s, v29.s[0]
        fmla    v27.4s, v11.4s, v29.s[1]
        fmla    v28.4s, v11.4s, v29.s[2]
        fmla    v14.4s, v11.4s, v29.s[3]
        fmla    v15.4s, v12.4s, v29.s[0]
        fmla    v0.4s, v12.4s, v29.s[1]
        fmla    v1.4s, v12.4s, v29.s[2]
        fmla    v2.4s, v12.4s, v29.s[3]
        fmla    v4.4s, v13.4s, v29.s[1]
        fmla    v5.4s, v13.4s, v29.s[2]
        fmla    v7.4s, v13.4s, v29.s[3]
        mov     v29.16b, v10.16b
        fmla    v16.4s, v11.4s, v9.s[0]
        fmla    v17.4s, v12.4s, v9.s[0]
        fmla    v19.4s, v11.4s, v9.s[1]
        fmla    v20.4s, v12.4s, v9.s[1]
        fmla    v22.4s, v11.4s, v9.s[2]
        fmla    v23.4s, v12.4s, v9.s[2]
        fmla    v25.4s, v11.4s, v9.s[3]
        fmla    v18.4s, v13.4s, v9.s[0]
        fmla    v21.4s, v13.4s, v9.s[1]
        fmla    v24.4s, v13.4s, v9.s[2]
        fmla    v29.4s, v12.4s, v9.s[3]
        fmla    v31.4s, v13.4s, v9.s[3]

whereas GCC 15 emits extra lane-dup instructions:
        ldp     q5, q6, [x1]
        ldp     q2, q3, [x0]
        ldr     q4, [x0, 32]
        dup     v1.4s, v5.s[1]
        fmla    v29.4s, v4.4s, v5.s[0]
        fmla    v30.4s, v2.4s, v5.s[0]
        fmla    v28.4s, v3.4s, v5.s[0]
        fmla    v7.4s, v2.4s, v1.4s
        fmla    v8.4s, v3.4s, v1.4s
        fmla    v9.4s, v4.4s, v1.4s
        dup     v1.4s, v5.s[2]
        dup     v5.4s, v5.s[3]
        fmla    v17.4s, v2.4s, v6.s[0]
        fmla    v14.4s, v2.4s, v5.4s
        fmla    v15.4s, v3.4s, v5.4s
        fmla    v16.4s, v4.4s, v5.4s
        dup     v5.4s, v6.s[1]
        fmla    v18.4s, v3.4s, v6.s[0]
        fmla    v19.4s, v4.4s, v6.s[0]
        fmla    v20.4s, v2.4s, v5.4s
        fmla    v21.4s, v3.4s, v5.4s
        fmla    v22.4s, v4.4s, v5.4s
        dup     v5.4s, v6.s[2]
        dup     v6.4s, v6.s[3]
        fmla    v10.4s, v2.4s, v1.4s
        fmla    v12.4s, v3.4s, v1.4s
        fmla    v13.4s, v4.4s, v1.4s
        fmla    v23.4s, v2.4s, v5.4s
        fmla    v24.4s, v3.4s, v5.4s
        fmla    v25.4s, v4.4s, v5.4s
        fmla    v26.4s, v2.4s, v6.4s
        fmla    v27.4s, v3.4s, v6.4s
        fmla    v31.4s, v4.4s, v6.4s
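
For readers less familiar with the two forms, here is a minimal sketch of what
differs between the listings. It is not taken from Eigen or from the attached
reproducer; the names Vec4, fma_by_lane, fma_dup_then_vector and
lane_forms_agree are made up for illustration. The first helper uses the
by-element intrinsic (corresponding to "fmla vD.4s, vN.4s, vM.s[i]"), the
second broadcasts the lane first and then does a plain vector FMA (the
"dup" + "fmla" shape). On non-aarch64 hosts a scalar model with std::fma
stands in for the intrinsics.

```cpp
// Illustrative sketch only: hypothetical helper names, not Eigen code.
#include <array>
#include <cassert>
#include <cmath>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

using Vec4 = std::array<float, 4>;

// acc + a * b[Lane], written with the by-element (lane-wise) FMA intrinsic.
template <int Lane>
Vec4 fma_by_lane(const Vec4& acc, const Vec4& a, const Vec4& b) {
    Vec4 out{};
#if defined(__aarch64__)
    float32x4_t r = vfmaq_laneq_f32(vld1q_f32(acc.data()), vld1q_f32(a.data()),
                                    vld1q_f32(b.data()), Lane);
    vst1q_f32(out.data(), r);
#else
    // Scalar model of the same operation for non-aarch64 hosts.
    for (int i = 0; i < 4; ++i) out[i] = std::fma(a[i], b[Lane], acc[i]);
#endif
    return out;
}

// Same arithmetic, but the lane is broadcast to a full vector first and a
// plain vector FMA is used afterwards.
template <int Lane>
Vec4 fma_dup_then_vector(const Vec4& acc, const Vec4& a, const Vec4& b) {
    Vec4 out{};
#if defined(__aarch64__)
    float32x4_t bcast = vdupq_laneq_f32(vld1q_f32(b.data()), Lane);
    float32x4_t r = vfmaq_f32(vld1q_f32(acc.data()), vld1q_f32(a.data()), bcast);
    vst1q_f32(out.data(), r);
#else
    Vec4 bcast{};
    bcast.fill(b[Lane]);
    for (int i = 0; i < 4; ++i) out[i] = std::fma(a[i], bcast[i], acc[i]);
#endif
    return out;
}

// Checks that both forms produce identical results for every lane.
bool lane_forms_agree() {
    const Vec4 acc{1.0f, 2.0f, 3.0f, 4.0f};
    const Vec4 a{0.5f, 1.5f, 2.5f, 3.5f};
    const Vec4 b{10.0f, 20.0f, 30.0f, 40.0f};
    return fma_by_lane<0>(acc, a, b) == fma_dup_then_vector<0>(acc, a, b) &&
           fma_by_lane<1>(acc, a, b) == fma_dup_then_vector<1>(acc, a, b) &&
           fma_by_lane<2>(acc, a, b) == fma_dup_then_vector<2>(acc, a, b) &&
           fma_by_lane<3>(acc, a, b) == fma_dup_then_vector<3>(acc, a, b);
}
```

Both helpers compute the same values, and the compiler is free to emit either
instruction sequence for either of them, so this only illustrates the
semantics. It is not a reproducer for the regression.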

I've bisected this to commit g:9dbff9c05520a74e6cd337578f27b56c941f64f3:
Revert "Revert "combine: Don't combine if I2 does not change""

The Eigen code that generates the FMLAs is in
Eigen/src/Core/arch/NEON/GeneralBlockPanelKernel.h. It references PR89101 as a
previous incarnation of this bug, which they had to work around with inline
assembly.
