https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121993
--- Comment #5 from cuilili <lili.cui at intel dot com> ---
Filip, thank you very much for the configuration file and command line. I
generated two different binary files.
For 519.lbm_r 5-8% regression:
On ZNVER3:
I ran 519 with r16-3485 and r16-3484 (using your configuration file and command
line). Both binaries executed in 1.01s. Since this is a "-i test" run, the
runtime is very short, and even slight fluctuations can have a significant
impact on the measured performance. I also used "-i ref"; both binaries
executed in 166s. So I could not reproduce the regression on ZNVER3.
On ZNVER2 (AWS):
When using the "-i test" parameter, the execution times are the same. However,
when using the "-i ref" parameter, the rate (sorry, I only recorded the rate for
this machine, not the execution time) shows slight fluctuations. Taking the
median score, the results are almost identical.
base rate    patch rate
5.00         5.06
5.13         5.07
5.08         5.10
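
As a quick sanity check of the median comparison, here is a minimal Python
sketch. The variable names are illustrative only; the numbers are the three
rates per build listed above.

```python
from statistics import median

# SPEC rates from the three "-i ref" runs on ZNVER2 (higher is better).
base_rates = [5.00, 5.13, 5.08]
patch_rates = [5.06, 5.07, 5.10]

base_median = median(base_rates)
patch_median = median(patch_rates)

# Relative change of the median rate, in percent of the base median.
delta_pct = (patch_median - base_median) / base_median * 100

print(f"base median  = {base_median:.2f}")
print(f"patch median = {patch_median:.2f}")
print(f"delta        = {delta_pct:+.2f}%")
```

With only three samples per build this says nothing about statistical
significance; it only shows that the two medians are within a fraction of a
percent of each other.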
For 470.lbm 20-30% regression:
On ZNVER3:
With configuration "-Ofast -march=native -flto".
base runtime    patch runtime    regression
0m0.714s        0m0.759s         5.9%
0m0.701s        0m0.732s         4.2%
0m0.709s        0m0.753s         5.8%
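
The regression column can be recomputed from the runtimes. The sketch below is
only an illustration (the helper names are made up); it suggests that the
listed percentages correspond to (patch - base) / patch, and it also prints
the slowdown relative to the base runtime, which is the more common
convention and comes out slightly higher.

```python
# (base, patch) runtimes in seconds, taken from the table above.
runs = [(0.714, 0.759), (0.701, 0.732), (0.709, 0.753)]


def slowdown_vs_patch(base, patch):
    """Runtime increase as a percentage of the patched runtime."""
    return (patch - base) / patch * 100


def slowdown_vs_base(base, patch):
    """Runtime increase as a percentage of the base runtime."""
    return (patch - base) / base * 100


vs_patch = [round(slowdown_vs_patch(b, p), 1) for b, p in runs]
vs_base = [round(slowdown_vs_base(b, p), 1) for b, p in runs]

for (b, p), a, c in zip(runs, vs_patch, vs_base):
    print(f"{b:.3f}s -> {p:.3f}s  vs patch: {a}%  vs base: {c}%")
```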
I used perf to collect data for 470, and this time I observed a 13.5%
performance drop. Since lbm is small, I compared all instructions of the
benchmark.
The performance degradation is due to side effects, not to a problem with the
IRA register allocation. In fact, my patch even reduced register spills. The
most significant degradation comes from a loop whose code is essentially the
same in both binaries; the other differences are only in register names and
ordering.
For example, take the following sequence. It seems to be caused by an icache or
dcache alignment issue (I tried "-falign-functions=32/64
-falign-jumps=32/64", but it did not help).
-----------------------base-------------------------------------------
│ 8bc: vmovsd %xmm5,(%r11) --> use xmm5
10727840 │ vmovsd 0x8(%r14),%xmm0
│ vmovsd %xmm0,-0x3e70(%r11)
10528113 │ vmovsd 0x10(%r14),%xmm0
878447 │ vmovsd %xmm0,0x3e88(%r11)
14035345 │ vmovsd 0x18(%r14),%xmm0
2632457 │ vmovsd %xmm0,-0x80(%r11)
9654504 │ vmovsd 0x20(%r14),%xmm0
5261938 │ vmovsd %xmm0,0xb8(%r11)
8770630 │ vmovsd 0x28(%r14),%xmm0
7888586 │ vmovsd %xmm0,-0x1869d0(%r11)
20176271 │ vmovsd 0x30(%r14),%xmm0
│ vmovsd %xmm0,0x186a28(%r11)
10579155 │ vmovsd 0x38(%r14),%xmm0
│ vmovsd %xmm0,-0x3ed0(%r11)
8777138 │ vmovsd 0x40(%r14),%xmm0
2633230 │ vmovsd %xmm0,-0x3d98(%r11)
10525467 │ vmovsd 0x48(%r14),%xmm0
1754299 │ vmovsd %xmm0,0x3e20(%r11)
9648294 │ vmovsd 0x50(%r14),%xmm0
│ vmovsd %xmm0,0x3f58(%r11)
6144210 │ vmovsd 0x58(%r14),%xmm0
1754360 │ vmovsd %xmm0,-0x18a810(%r11)
15790653 │ vmovsd 0x60(%r14),%xmm0
1752129 │ vmovsd %xmm0,0x182be8(%r11)
9653771 │ vmovsd 0x68(%r14),%xmm0
│ vmovsd %xmm0,-0x182b20(%r11)
55271408 │ vmovsd 0x70(%r14),%xmm0
1756540 │ vmovsd %xmm0,0x18a8d8(%r11)
359669767 │ vmovsd 0x78(%r14),%xmm0 ----> hottest instruction
877530 │ vmovsd %xmm0,-0x186a10(%r11)
4385782 │ vmovsd 0x80(%r14),%xmm0
876959 │ vmovsd %xmm0,0x1869e8(%r11)
64032288 │ vmovsd 0x88(%r14),%xmm0
│ vmovsd %xmm0,-0x1868e0(%r11)
8772932 │ vmovsd 0x90(%r14),%xmm0
│ vmovsd %xmm0,0x186b18(%r11)
11401106 │ ↑ jmpq 6f4
-----------------------patch--------------------------------------
│ 8b3: mov %rcx,(%r11) ---> use rcx instead of xmm5
14027500 │ vmovsd 0x8(%r14),%xmm0
876359 │ vmovsd %xmm0,-0x3e70(%r11)
877777 │ vmovsd 0x10(%r14),%xmm0
1804440 │ vmovsd %xmm0,0x3e88(%r11)
9646056 │ vmovsd 0x18(%r14),%xmm0
3507892 │ vmovsd %xmm0,-0x80(%r11)
7009163 │ vmovsd 0x20(%r14),%xmm0
876772 │ vmovsd %xmm0,0xb8(%r11)
15779049 │ vmovsd 0x28(%r14),%xmm0
4380344 │ vmovsd %xmm0,-0x1869d0(%r11)
17723773 │ vmovsd 0x30(%r14),%xmm0
1750264 │ vmovsd %xmm0,0x186a28(%r11)
7002104 │ vmovsd 0x38(%r14),%xmm0
│ vmovsd %xmm0,-0x3ed0(%r11)
12378493 │ vmovsd 0x40(%r14),%xmm0
875679 │ vmovsd %xmm0,-0x3d98(%r11)
14918469 │ vmovsd 0x48(%r14),%xmm0
878525 │ vmovsd %xmm0,0x3e20(%r11)
7011765 │ vmovsd 0x50(%r14),%xmm0
1749572 │ vmovsd %xmm0,0x3f58(%r11)
9641725 │ vmovsd 0x58(%r14),%xmm0
│ vmovsd %xmm0,-0x18a810(%r11)
10516598 │ vmovsd 0x60(%r14),%xmm0
│ vmovsd %xmm0,0x182be8(%r11)
18429589 │ vmovsd 0x68(%r14),%xmm0
877425 │ vmovsd %xmm0,-0x182b20(%r11)
15786218 │ vmovsd 0x70(%r14),%xmm0
2629875 │ vmovsd %xmm0,0x18a8d8(%r11)
460820987 │ vmovsd 0x78(%r14),%xmm0 ----> hottest instruction
876879 │ vmovsd %xmm0,-0x186a10(%r11)
13143096 │ vmovsd 0x80(%r14),%xmm0
│ vmovsd %xmm0,0x1869e8(%r11)
7011126 │ vmovsd 0x88(%r14),%xmm0
873431 │ vmovsd %xmm0,-0x1868e0(%r11)
5253489 │ vmovsd 0x90(%r14),%xmm0
1754722 │ vmovsd %xmm0,0x186b18(%r11)
84139915 │ ↑ jmpq 6eb
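
To compare the two listings mechanically, a small Python sketch along these
lines could be used. The parser and its function names are hypothetical
helpers, not part of perf, and the excerpts are a few lines copied from the
listings above. Sample counts from two separate perf runs are only comparable
if both runs used the same sampling setup, which is assumed here.

```python
def parse_annotate(text):
    """Return (samples, instruction) pairs from perf-annotate-style lines."""
    rows = []
    for line in text.splitlines():
        if "│" not in line:
            continue
        count, insn = line.split("│", 1)
        count = count.strip()
        # Lines without a sample count are treated as zero samples.
        samples = int(count) if count.isdigit() else 0
        rows.append((samples, insn.strip()))
    return rows


def hottest(rows):
    """Return the (samples, instruction) pair with the most samples."""
    return max(rows, key=lambda row: row[0])


BASE_EXCERPT = """\
55271408 │ vmovsd 0x70(%r14),%xmm0
1756540 │ vmovsd %xmm0,0x18a8d8(%r11)
359669767 │ vmovsd 0x78(%r14),%xmm0
877530 │ vmovsd %xmm0,-0x186a10(%r11)
"""

PATCH_EXCERPT = """\
15786218 │ vmovsd 0x70(%r14),%xmm0
2629875 │ vmovsd %xmm0,0x18a8d8(%r11)
460820987 │ vmovsd 0x78(%r14),%xmm0
876879 │ vmovsd %xmm0,-0x186a10(%r11)
"""

base_samples, base_insn = hottest(parse_annotate(BASE_EXCERPT))
patch_samples, patch_insn = hottest(parse_annotate(PATCH_EXCERPT))
ratio = patch_samples / base_samples

print(f"{base_insn}: {base_samples} -> {patch_samples} samples ({ratio:.2f}x)")
```

In both excerpts the hottest instruction is the same load from 0x78(%r14),
with a higher sample count in the patched binary.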
Regarding 470, although your performance regression is more significant, it
should be a similar issue.