https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120943
Bug ID: 120943
Summary: [16 Regression] 5% slowdown of 527.cam4_r on Zen{4,5} since r16-1643-gd073bb6cfc219d
Product: gcc
Version: 16.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: pheeck at gcc dot gnu.org
CC: hjl at gcc dot gnu.org
Blocks: 26163
Target Milestone: ---
Host: x86_64-linux
Target: x86_64-linux

As seen here

https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=1108.497.0
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=1286.497.0

there was a 5% exec time slowdown of 527.cam4_r SPEC 2017 benchmark when run
with -O2 -march=x86-64-v3 -flto -fprofile-use on an AMD Zen4 or Zen5 machine.

I bisected it to r16-1643-gd073bb6cfc219d.

d073bb6cfc219d4b6c283a0b527ee88b42e640e0 is the first bad commit
commit d073bb6cfc219d4b6c283a0b527ee88b42e640e0
Author: H.J. Lu <hjl.to...@gmail.com>
Date:   Thu Mar 18 18:43:10 2021 -0700

    x86: Update memcpy/memset inline strategies for -mtune=generic

    Update memcpy and memset inline strategies for -mtune=generic:

    1. Don't align memory.
    2. For known sizes, prefer vector loop, unroll loop with 4 moves
    or stores per iteration without aligning the loop, up to 256 bytes.
    3. For unknown sizes, use memcpy/memset.
    4. Since each loop iteration has 4 stores and 8 stores for zeroing
    with unroll loop may be needed, change CLEAR_RATIO to 10 so that
    zeroing up to 72 bytes are fully unrolled with 9 stores without SSE.

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
[Bug 26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)
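For anyone trying to narrow this down, here is a minimal sketch of the two cases the quoted commit message distinguishes: a memset with a size known at compile time and one with a runtime size. The names block72, clear_known, clear_unknown and stores_needed are hypothetical and not taken from GCC or from 527.cam4_r. The 8-byte store width is an assumption inferred from "72 bytes ... with 9 stores without SSE". The report does not say which code in the benchmark is affected, so this may not reflect the actual hot path.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical 72-byte buffer: the largest size the commit message says
   is zeroed fully unrolled when SSE is not used. */
struct block72 { unsigned char bytes[72]; };
static_assert(sizeof(struct block72) == 72, "block72 must be 72 bytes");

/* Known size: the compiler may expand this memset inline. */
void clear_known(struct block72 *b)
{
    memset(b, 0, sizeof *b);
}

/* Unknown size: per item 3 of the commit message, a memset call is
   expected here rather than an inline expansion. */
void clear_unknown(void *p, size_t n)
{
    memset(p, 0, n);
}

/* Number of stores of `width` bytes needed to cover `bytes` bytes
   (rounded up). Assumed 8-byte stores give 9 stores for 72 bytes. */
size_t stores_needed(size_t bytes, size_t width)
{
    return (bytes + width - 1) / width;
}
```

Compiling this file with gcc -O2 -march=x86-64-v3 -S before and after r16-1643-gd073bb6cfc219d and comparing the assembly for clear_known and clear_unknown should show whether the inline expansion differs between the two revisions.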