[Bug target/121993] [16 Regression] 20-30% slowdown of 470.lbm on AMD Zen3 and 5-8% slowdown of 519.lbm_r on Zen2 since r16-3485-gae689f89fb4059

lili.cui at intel dot com via Gcc-bugs Tue, 23 Sep 2025 22:51:04 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121993


--- Comment #5 from cuilili <lili.cui at intel dot com> ---
Filip, thank you very much for the configuration file and command line. I
generated two different binary files.

For 519.lbm_r 5-8% regression: 

On ZNVER3: 
I ran 519 with r16-3485 and r16-3484 (using your configuration file and command
line). Both binaries ran in 1.01s. Since this is a "-i test" run, the runtime is
very short, and even slight fluctuations can have a significant impact on the
result. I also used "-i ref", where both binaries ran in 166s. So I could not
reproduce it on ZNVER3.

On ZNVER2(AWS): 
When using the "-i test" parameter, the execution times are the same. However,
when using the "-i ref" parameter, the rate (sorry, I only recorded the rate for
this machine, not the execution time) shows slight fluctuations. Taking the
median score, the results are almost identical.

base rate         patch rate
5.00                5.06
5.13                5.07
5.08                5.10
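
As a quick check of the "almost identical" claim, the medians of the three rates above can be computed with a short Python sketch. The variable names are illustrative only; the numbers are copied from the table, and the rate is assumed to be higher-is-better as usual for SPEC rate runs.

```python
from statistics import median

# Rates copied from the table above (assumed higher is better).
base_rates = [5.00, 5.13, 5.08]
patch_rates = [5.06, 5.07, 5.10]

base_median = median(base_rates)
patch_median = median(patch_rates)

# Relative change of the patched build versus base, in percent.
delta_pct = (patch_median - base_median) / base_median * 100

print(f"base median  = {base_median:.2f}")
print(f"patch median = {patch_median:.2f}")
print(f"difference   = {delta_pct:+.2f}%")
```

With only three runs per build, a difference this small is well inside the run-to-run spread of the individual rates.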


For 470.lbm 20-30% regression: 

On ZNVER3:
With configuration "-Ofast -march=native -flto".

base runtime      patch runtime   regression
 0m0.714s           0m0.759s       5.9% 
 0m0.701s           0m0.732s       4.2%
 0m0.709s           0m0.753s       5.8%
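
For reference, the regression column above can be recomputed from the two runtime columns. The sketch below is only an arithmetic cross-check with made-up helper names; it shows that the listed percentages match (patch - base) / patch, and that the same runs give slightly larger numbers when expressed relative to the base runtime.

```python
# (base runtime, patch runtime) in seconds, copied from the table above.
runs = [(0.714, 0.759), (0.701, 0.732), (0.709, 0.753)]


def slowdown_vs_patch(base, patch):
    """Slowdown in percent, relative to the patch runtime."""
    return (patch - base) / patch * 100


def slowdown_vs_base(base, patch):
    """Slowdown in percent, relative to the base runtime."""
    return (patch - base) / base * 100


for base, patch in runs:
    print(
        f"base={base:.3f}s patch={patch:.3f}s "
        f"vs patch={slowdown_vs_patch(base, patch):.1f}% "
        f"vs base={slowdown_vs_base(base, patch):.1f}%"
    )
```

Either convention puts the slowdown in the 4-6% range on this machine, well below the reported 20-30%.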

I used perf tools to collect the data for 470, and this time I observed a 13.5%
performance drop. Since the lbm is small, I compared all instructions of the
benchmark. 

The performance degradation is due to side effects, not to a problem with the
IRA register allocation itself. In fact, my patch even reduced register spills.
The most significant degradation comes from a loop that is the same in both
binaries; the other differences are only in register names and ordering.

For example, consider the following sequence. It seems to be caused by an
icache or dcache alignment issue (I tried using "-falign-functions=32/64
-falign-jumps=32/64", but it didn't help).
-----------------------base-------------------------------------------
            │ 8bc:   vmovsd       %xmm5,(%r11)  --> use xmm5
   10727840 │        vmovsd       0x8(%r14),%xmm0
            │        vmovsd       %xmm0,-0x3e70(%r11)
   10528113 │        vmovsd       0x10(%r14),%xmm0
     878447 │        vmovsd       %xmm0,0x3e88(%r11)
   14035345 │        vmovsd       0x18(%r14),%xmm0
    2632457 │        vmovsd       %xmm0,-0x80(%r11)
    9654504 │        vmovsd       0x20(%r14),%xmm0
    5261938 │        vmovsd       %xmm0,0xb8(%r11)
    8770630 │        vmovsd       0x28(%r14),%xmm0
    7888586 │        vmovsd       %xmm0,-0x1869d0(%r11)
   20176271 │        vmovsd       0x30(%r14),%xmm0
            │        vmovsd       %xmm0,0x186a28(%r11)
   10579155 │        vmovsd       0x38(%r14),%xmm0
            │        vmovsd       %xmm0,-0x3ed0(%r11)
    8777138 │        vmovsd       0x40(%r14),%xmm0
    2633230 │        vmovsd       %xmm0,-0x3d98(%r11)
   10525467 │        vmovsd       0x48(%r14),%xmm0
    1754299 │        vmovsd       %xmm0,0x3e20(%r11)
    9648294 │        vmovsd       0x50(%r14),%xmm0
            │        vmovsd       %xmm0,0x3f58(%r11)
    6144210 │        vmovsd       0x58(%r14),%xmm0
    1754360 │        vmovsd       %xmm0,-0x18a810(%r11)
   15790653 │        vmovsd       0x60(%r14),%xmm0
    1752129 │        vmovsd       %xmm0,0x182be8(%r11)
    9653771 │        vmovsd       0x68(%r14),%xmm0
            │        vmovsd       %xmm0,-0x182b20(%r11)
   55271408 │        vmovsd       0x70(%r14),%xmm0
    1756540 │        vmovsd       %xmm0,0x18a8d8(%r11)
  359669767 │        vmovsd       0x78(%r14),%xmm0   ----> hottest instruction
     877530 │        vmovsd       %xmm0,-0x186a10(%r11)
    4385782 │        vmovsd       0x80(%r14),%xmm0
     876959 │        vmovsd       %xmm0,0x1869e8(%r11)
   64032288 │        vmovsd       0x88(%r14),%xmm0
            │        vmovsd       %xmm0,-0x1868e0(%r11)
    8772932 │        vmovsd       0x90(%r14),%xmm0
            │        vmovsd       %xmm0,0x186b18(%r11)
   11401106 │      ↑ jmpq         6f4

-----------------------patch--------------------------------------

            │ 8b3:   mov          %rcx,(%r11) ---> use rcx instead of xmm5      
   14027500 │        vmovsd       0x8(%r14),%xmm0
     876359 │        vmovsd       %xmm0,-0x3e70(%r11)
     877777 │        vmovsd       0x10(%r14),%xmm0
    1804440 │        vmovsd       %xmm0,0x3e88(%r11)
    9646056 │        vmovsd       0x18(%r14),%xmm0
    3507892 │        vmovsd       %xmm0,-0x80(%r11)
    7009163 │        vmovsd       0x20(%r14),%xmm0
     876772 │        vmovsd       %xmm0,0xb8(%r11)
   15779049 │        vmovsd       0x28(%r14),%xmm0
    4380344 │        vmovsd       %xmm0,-0x1869d0(%r11)
   17723773 │        vmovsd       0x30(%r14),%xmm0
    1750264 │        vmovsd       %xmm0,0x186a28(%r11)
    7002104 │        vmovsd       0x38(%r14),%xmm0
            │        vmovsd       %xmm0,-0x3ed0(%r11)
   12378493 │        vmovsd       0x40(%r14),%xmm0
     875679 │        vmovsd       %xmm0,-0x3d98(%r11)
   14918469 │        vmovsd       0x48(%r14),%xmm0
     878525 │        vmovsd       %xmm0,0x3e20(%r11)
    7011765 │        vmovsd       0x50(%r14),%xmm0
    1749572 │        vmovsd       %xmm0,0x3f58(%r11)
    9641725 │        vmovsd       0x58(%r14),%xmm0
            │        vmovsd       %xmm0,-0x18a810(%r11)
   10516598 │        vmovsd       0x60(%r14),%xmm0
            │        vmovsd       %xmm0,0x182be8(%r11)
   18429589 │        vmovsd       0x68(%r14),%xmm0
     877425 │        vmovsd       %xmm0,-0x182b20(%r11)
   15786218 │        vmovsd       0x70(%r14),%xmm0
    2629875 │        vmovsd       %xmm0,0x18a8d8(%r11)
  460820987 │        vmovsd       0x78(%r14),%xmm0   ----> hottest instruction
     876879 │        vmovsd       %xmm0,-0x186a10(%r11)
   13143096 │        vmovsd       0x80(%r14),%xmm0
            │        vmovsd       %xmm0,0x1869e8(%r11)
    7011126 │        vmovsd       0x88(%r14),%xmm0
     873431 │        vmovsd       %xmm0,-0x1868e0(%r11)
    5253489 │        vmovsd       0x90(%r14),%xmm0
    1754722 │        vmovsd       %xmm0,0x186b18(%r11)
   84139915 │      ↑ jmpq         6eb
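
To put the two listings side by side, the sample counts of the two instructions that differ most can be compared directly. This is a rough sketch: it assumes both profiles were collected with the same event and sampling period, so that raw sample counts are comparable, and the dictionary keys are just labels chosen for illustration.

```python
# Sample counts copied from the base and patch perf listings above.
# Assumption: same perf event and sampling period in both runs.
samples = {
    "vmovsd 0x78(%r14),%xmm0 (hottest)": (359669767, 460820987),
    "jmpq": (11401106, 84139915),
}


def ratio(base, patch):
    """Return patch samples divided by base samples."""
    return patch / base


for name, (base, patch) in samples.items():
    print(f"{name}: base={base} patch={patch} ratio={ratio(base, patch):.2f}x")
```

The instruction sequence is otherwise identical apart from the first store, which is consistent with the difference coming from code placement rather than from the instructions themselves.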


Regarding 470, although the regression you measured is more significant, it
should be a similar issue.
