https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120943

            Bug ID: 120943
           Summary: [16 Regression] 5% slowdown of 527.cam4_r on Zen{4,5}
                    since r16-1643-gd073bb6cfc219d
           Product: gcc
           Version: 16.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: pheeck at gcc dot gnu.org
                CC: hjl at gcc dot gnu.org
            Blocks: 26163
  Target Milestone: ---
              Host: x86_64-linux
            Target: x86_64-linux

As seen here

https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=1108.497.0
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=1286.497.0

there was a 5% execution-time slowdown of the 527.cam4_r SPEC 2017
benchmark when built with -O2 -march=x86-64-v3 -flto -fprofile-use and run on
an AMD Zen4 or Zen5 machine.
I bisected it to r16-1643-gd073bb6cfc219d.

d073bb6cfc219d4b6c283a0b527ee88b42e640e0 is the first bad commit
commit d073bb6cfc219d4b6c283a0b527ee88b42e640e0
Author: H.J. Lu <hjl.to...@gmail.com>
Date:   Thu Mar 18 18:43:10 2021 -0700

    x86: Update memcpy/memset inline strategies for -mtune=generic

    Update memcpy and memset inline strategies for -mtune=generic:

    1. Don't align memory.
    2. For known sizes, prefer vector loop, unroll loop with 4 moves or
       stores per iteration without aligning the loop, up to 256 bytes.
    3. For unknown sizes, use memcpy/memset.
    4. Since each loop iteration has 4 stores and 8 stores for zeroing with
       unroll loop may be needed, change CLEAR_RATIO to 10 so that zeroing
       up to 72 bytes are fully unrolled with 9 stores without SSE.
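
For illustration only, here is a minimal sketch of the three cases the commit
message describes. It is not taken from 527.cam4_r or from the GCC testsuite.
All type and function names (state72, buf256, clear72, copy256, copy_n) are
hypothetical. The 72-byte figure follows from 9 stores of 8 bytes each, which
assumes 64-bit scalar stores.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 72-byte object: 9 x 8 bytes, the largest size that item 4
   says is zeroed with fully unrolled stores without SSE.  */
struct state72 { uint64_t w[9]; };

/* Hypothetical 256-byte buffer: the upper bound named in item 2.  */
struct buf256 { unsigned char b[256]; };

/* Known size, 72 bytes: a candidate for 9 unrolled stores (item 4).  */
void clear72(struct state72 *s)
{
    memset(s, 0, sizeof *s);
}

/* Known size, 256 bytes: a candidate for the inline vector loop (item 2).  */
void copy256(struct buf256 *dst, const struct buf256 *src)
{
    memcpy(dst, src, sizeof *dst);
}

/* Size unknown at compile time: item 3 says this stays a call to memcpy.  */
void copy_n(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);
}
```

Compiling such a file with -O2 -march=x86-64-v3 -S before and after the commit
and comparing the assembly should show which expansion is chosen for each
function. Whether any of these patterns is what actually slows down
527.cam4_r has not been established here.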


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
[Bug 26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)
