[Bug ipa/117695] lto got zero score on unixbench dhry2reg on trunk
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117695

Tianyang Zhou changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |tianyang.chou at gmail dot com

--- Comment #9 from Tianyang Zhou ---
(In reply to Andrew Pinski from comment #8)
> so this code is undefined for a few different reasons.
> #1 is you can't call fprintf from an async signal (e.g. SIGALRM).
> https://www.gnu.org/software/libc/manual/html_node/Formatted-Output-Functions.html
> specifies that fprintf is `AS-Unsafe corrupt heap`
> (https://www.gnu.org/software/libc/manual/html_node/POSIX-Safety-Concepts.html
> for the meaning there on AS)
>
>
> #2 is an async signal might be done at different times and the write might
> not be done by the compiler at all, you need to mark the variable as
> volatile. that is Run_Index needs to be marked as volatile to get the
> variable written to by the compiler (that is why it is optimized away with
> LTO).

Hi Andrew,

Instead of marking the variable as volatile, would disabling the
loop-elimination-related passes work in this case?
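For reference, the pattern Andrew describes in #2 can be sketched roughly as below. This is a minimal illustration and not the dhry2reg source: run_index, stop_requested, on_alarm and run_until_signalled are hypothetical names, and the loop raises the signal itself after 1000 iterations only so that the sketch terminates deterministically. A real benchmark would presumably arm a timer with alarm() before entering the loop.

```c
#include <signal.h>

/* Hypothetical stand-in for the benchmark's Run_Index. It is volatile so
   that the compiler has to perform every store, even under LTO. */
static volatile unsigned long run_index;

/* Flag written by the handler. volatile sig_atomic_t is the object type
   the C standard allows a signal handler to write safely. */
static volatile sig_atomic_t stop_requested;

/* The handler only sets a flag. It does not call fprintf, which the glibc
   manual marks as AS-Unsafe. */
static void on_alarm(int signo)
{
    (void)signo;
    stop_requested = 1;
}

/* Run the counting loop until the handler asks it to stop and return the
   number of iterations performed. */
static unsigned long run_until_signalled(int signo)
{
    run_index = 0;
    stop_requested = 0;
    signal(signo, on_alarm);

    while (!stop_requested) {
        run_index = run_index + 1;
        /* Demo only: deliver the signal synchronously after 1000
           iterations instead of waiting for a real timer. */
        if (run_index == 1000)
            raise(signo);
    }
    return run_index;
}
```

Calling run_until_signalled(SIGALRM) returns the iteration count, and any reporting with fprintf can then happen in the normal control flow after the loop has exited, which also avoids point #1.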
[Bug tree-optimization/114932] IVopts inefficient handling of signed IV used for addressing.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932

Tianyang Chou changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |tianyang.chou at gmail dot com

--- Comment #22 from Tianyang Chou ---
(In reply to Tamar Christina from comment #21)
> Thus finally fixed

Hi Tamar,

Are there any other prerequisite patches for this patch: "perform affine fold
to unsigned on non address expressions."?

I applied your patch to the official gcc-13.2.0 release and got a 10%
performance increase on aarch64, but applying the same patch to the official
gcc-12.3.0 release gave no performance improvement.

So I wonder whether there are related patches between gcc-12.3.0 and
gcc-13.2.0 that account for the difference.
[Bug tree-optimization/114932] IVopts inefficient handling of signed IV used for addressing.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932

--- Comment #23 from Tianyang Chou ---
(In reply to Tianyang Chou from comment #22)
> (In reply to Tamar Christina from comment #21)
> > Thus finally fixed
>
> Hi Tamar,
>
> Are there any other prerequisite patches for this patch: "perform affine fold
> to unsigned on non address expressions."?
>
> I applied your patch to the official gcc-13.2.0 release and got a 10%
> performance increase on aarch64, but applying the same patch to the official
> gcc-12.3.0 release gave no performance improvement.
>
> So I wonder whether there are related patches between gcc-12.3.0 and
> gcc-13.2.0 that account for the difference.

With the patch applied, the official gcc-12.3.0 generates the following asm
for the test case:

        ldp     w7, w6, [x0, 216]
        ldp     w5, w4, [x0, 252]
        ldr     w3, [x0, 288]
        ldr     w2, [x0, 292]
        sub     w7, w7, #1
        sub     w6, w6, #1
        sub     w5, w5, #1
        sub     w4, w4, #1
        sub     w3, w3, #1
        stp     w7, w6, [x0, 216]
        stp     w5, w4, [x0, 252]
        add     x0, x0, 324
        sub     w2, w2, #1
        str     w3, [x0, -36]
        str     w2, [x0, -32]

while the official gcc-13.2.0 with the patch generates:

        mvni    v3.2s, 0
        ldr     d2, [x0, 216]
        ldr     d1, [x0, 252]
        ldr     d0, [x0, 288]
        add     v2.2s, v2.2s, v3.2s
        add     x0, x0, 324
        add     v1.2s, v1.2s, v3.2s
        add     v0.2s, v0.2s, v3.2s
        str     d2, [x0, -108]
        str     d1, [x0, -72]
        str     d0, [x0, -36]

It seems the code does not get vectorized with gcc-12.3.0.
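Reading the byte offsets off the two listings, both versions appear to subtract 1 from three pairs of adjacent 32-bit values (at bytes 216, 252 and 288) and then advance x0 by 324 bytes. A rough C sketch of that access pattern follows. It is reconstructed from the asm only and is not the actual test case: the function and constant names are made up, and the enclosing loop is an assumption based on the pointer bump.

```c
#include <stddef.h>

enum {
    ROW_INTS   = 81, /* 324 bytes / 4: pointer bump per iteration (assumed loop) */
    FIRST_PAIR = 54, /* 216 bytes / 4: index of the first pair                   */
    PAIR_STEP  = 9,  /* 36 bytes / 4: distance between consecutive pairs         */
    PAIRS      = 3   /* pairs at bytes 216, 252 and 288                          */
};

/* Subtract 1 from three pairs of adjacent ints in each 81-int row. */
static void decrement_pairs(int *block, size_t rows)
{
    for (size_t r = 0; r < rows; ++r) {
        int *row = block + r * ROW_INTS;
        for (size_t p = 0; p < PAIRS; ++p) {
            size_t i = FIRST_PAIR + p * PAIR_STEP;
            row[i]     -= 1;
            row[i + 1] -= 1;
        }
    }
}
```

In the listings, gcc-12.3.0 handles each pair with scalar loads, sub and stores, while gcc-13.2.0 handles each pair with a single 2-lane vector add of -1 (the mvni v3.2s, 0 constant). Whether this sketch reproduces the difference is untested; compiling it with -O3 -fopt-info-vec-missed on both compilers would be one way to check.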
[Bug target/114978] [14/15 regression] 548.exchange2_r 14%-28% regressions on Loongarch64 after gcc 14 snapshot 20240317
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114978

--- Comment #30 from Tianyang Chou ---
(In reply to Chen Chen from comment #27)
> I am a bit confused with your statement. For AOSC gcc 13.2 I got 8.52 with
> parameters "-g -Ofast -march=la464 -flto", and 8.76 with parameters "-g
> -Ofast -march=la464". These results are similar to yours.
>
> For gcc 14 snapshot 20240317, currently I configure with the following
> parameters:
>
> --enable-shared --enable-threads=posix --with-system-zlib
> --enable-gnu-indirect-function --enable-__cxa_atexit
> --disable-libunwind-exceptions --enable-clocale=gnu --disable-libstdcxx-pch
> --disable-libssp --enable-gnu-unique-object --enable-linker-build-id
> --enable-lto --enable-plugin --enable-install-libiberty --disable-multilib
> --disable-werror --enable-pie --enable-checking=release
> --enable-libstdcxx-dual-abi --with-default-libstdcxx-abi=new
> --enable-default-pie --enable-default-ssp --enable-bootstrap
> --enable-languages=c,c++,fortran,lto --with-abi=lp64d
> --with-arch=loongarch64 --with-tune=la464 --build=loongarch64-aosc-linux-gnu
> --program-suffix=-14.0.1
>
> And still get a score of 12.8.
>
>
> (In reply to Tianyang Chou from comment #24)
> > (In reply to Chen Chen from comment #0)
> > > We tested Loongarch64 CPU Loongson 3A6000 with "LA664" architecture in
> > > Linux operating system AOSC OS 11.4.0 (default gcc version is 13.2.0).
> > > And we found the 548.exchange2_r benchmark from SPEC 2017 INTrate suite
> > > suffered significant regressions from 14% to 28% with various compiling
> > > options.
> > >
> > > The rate-1 results are following:
> > >
> > > after snapshot 20240317 score 14.3-19.3% lower with parameters "-g -Ofast
> > > -march=native":
> > > 13.2.0:   11.7 (223s) [gcc 13.2.0, system default]
> > > 20240317: 11.0 (237s) [gcc 14 snapshot 20240317]
> > > 20240324: 8.88 (295s) [gcc 14 snapshot 20240324]
> > > 20240430: 9.03 (290s) [gcc 14 snapshot 20240430, 14.1.0-RC]
> > > 14.1.0:   9.43 (278s) [gcc 14.1.0 release]
> > >
> > > after snapshot 20240317 score 16.5-20.8% lower with parameters "-g -Ofast
> > > -march=native -flto":
> > > 13.2.0:   12.0 (218s)
> > > 20240317: 10.6 (248s)
> > > 20240324: 8.40 (312s)
> > > 20240430: 8.48 (309s)
> > > 14.1.0:   8.85 (296s)
> > >
> > >
> > > after snapshot 20240317 score 18-23.1% lower with parameters "-g -Ofast
> > > -march=la664":
> > > 13.2.0:   "-march=la664" flag is not supported
> > > 20240317: 11.5 (227s)
> > > 20240324: 8.84 (296s)
> > > 20240430: 9.43 (278s)
> > > 14.1.0:   9.42 (278s)
> > >
> > >
> > > after snapshot 20240317 score 20.3-21.2% lower with parameters "-g -Ofast
> > > -march=la664 -flto":
> > > 13.2.0:   "-march=la664" flag is not supported
> > > 20240317: 11.1 (236s)
> > > 20240324: 8.75 (299s)
> > > 20240430: 8.85 (296s)
> > > 14.1.0:   8.85 (296s)
> > >
> > >
> > > after snapshot 20240317 score 26.3-26.6% lower with parameters "-g -Ofast
> > > -march=la464":
> > > 13.2.0:   8.76 (299s)
> > > 20240317: 12.8 (205s)
> > > 20240324: 9.39 (279s)
> > > 20240430: 9.43 (278s)
> > > 14.1.0:   9.43 (278s)
> > >
> > >
> > > after snapshot 20240317 score 26.6-28% lower with parameters "-g -Ofast
> > > -march=la464 -flto":
> > > 13.2.0:   8.52 (307s)
> > > 20240317: 12.8 (204s)
> > > 20240324: 9.22 (284s)
> > > 20240430: 9.37 (280s)
> > > 14.1.0:   9.40 (279s)
> > >
> > >
> > > The gcc 14 snapshots and gcc 14.1.0 are compiled with the following
> > > parameters:
> > >
> > > --enable-shared --enable-threads=posix --with-system-zlib
> > > --enable-gnu-indirect-function --enable-__cxa_atexit
> > > --disable-libunwind-exceptions --enable-clocale=gnu
> > > --disable-libstdcxx-pch
> > > --disable-libssp --enable-gnu-unique-object --enable-linker-build-id
> > > --enable-lto --enable-plugin --enable-install-libiberty --disable-multilib
> > > --disable-werror --enable-pie --enable-checking=release
> > > --enable-libstdcxx-dual-abi --with-default-libstdcxx-abi=new
> > > --enable-default-pie --enable-default-ssp --enable-bootstrap
> > > --enable-languages=c,c++,fortran,lto --with-abi=lp64d
> > > --with-arch=loongarch64 --with-tune=la664
> > > --build=loongarch64-aosc-linux-gnu
> > >
> > >
> > > The regression may be found on other types of CPUs as well. We did a quick
> > > test on AMD Zen4 CPU R9 7940HS and found similar but smaller regression:
> > >
> > > The rate-1 results on x86_64 (AMD R9 7940HS) with operating system Debian
> > > 12:
> > >
> > > after snapshot 20240317 score 8.6-9.6% lower with parameters "-m64 -g
> > > -Ofast -march=znver3":
> > > 12.2.0:   30.1 (87.0s) [gcc 12.2.0, system default]
> > > 13.2.0:   30.6 (85.7s) [gcc 13.2 release]
> > > 20240317: 31.4 (83.3s) [gcc 14 snapshot]
> > > 20240324: 28.7 (91.2s) [gcc 14 snapshot]
> > > 20240430: 28.4 (92.2s) [gcc 14 snapshot, 14.1.0-RC]
> > >
> > > after snapshot 20240317 score 10% lower with parameters "-m64 -g -Ofast
> > > -march=znver3 -flto":
> >
[Bug tree-optimization/114932] IVopts inefficient handling of signed IV used for addressing.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932

--- Comment #26 from Tianyang Chou ---
(In reply to Tamar Christina from comment #0)

Hi Tamar,

After reading the whole discussion, I am still confused about how the
immediate offset addressing mode gets generated. Could you help me understand
the logic chain of the optimization?

What I understand is: before the optimization, GCC generates a register
offset mode. Your patch allows the CHREC multiply to be folded in the IVopts
pass, which simplifies the address calculation. But what is the relation
between this simplification and the generated immediate offset mode? How does
this CHREC multiply folding lead, step by step, to the generation of immediate
offset ldr instructions?

I hope you can give me the basic train of thought from your optimization to
the generation of immediate offset load/store instructions.

Many thanks!
[Bug tree-optimization/114932] IVopts inefficient handling of signed IV used for addressing.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932

--- Comment #28 from Tianyang Chou ---
Very helpful, thanks for your time.

Best regards,
Tianyang.Chou

---- Replied Message ----
From: tnfchris at gcc dot gnu.org
Date: 4/20/2025 19:46
Subject: [Bug tree-optimization/114932] IVopts inefficient handling of signed IV used for addressing.

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932

--- Comment #27 from Tamar Christina ---
(In reply to Tianyang Chou from comment #26)
> (In reply to Tamar Christina from comment #0)
>
> Hi Tamar,
>
> After reading the whole discussion, I am still confused about how the
> immediate offset addressing mode gets generated. Could you help me understand
> the logic chain of the optimization?
>
> What I understand is: before the optimization, GCC generates a register
> offset mode. Your patch allows the CHREC multiply to be folded in the IVopts
> pass, which simplifies the address calculation. But what is the relation
> between this simplification and the generated immediate offset mode? How does
> this CHREC multiply folding lead, step by step, to the generation of immediate
> offset ldr instructions?
>
> I hope you can give me the basic train of thought from your optimization to
> the generation of immediate offset load/store instructions.
>
> Many thanks!

Hi Tianyang,

Sorry I forgot to respond here.

The basic gist of it is that with the original IV

  Base: (integer(kind=4) *) &block + ((sizetype) ((integer(kind=8)) l0_19(D) * 81) + 9) * 4

The more complicated expression makes it hard for IV opts to compare IVs.
Let's say you have another IV

  Base: (integer(kind=4) *) &block + ((sizetype) ((integer(kind=8)) l0_19(D) * 81) + 9) * 4
  Base: (integer(kind=4) *) &block + ((sizetype) ((integer(kind=8)) l0_19(D) * 81) + 10) * 4

This leaves IV opts with only the choice that the common base can be

  ptr = &block + ((sizetype) ((integer(kind=8)) l0_19(D) * 81)

and so the two IVs become (ptr + 9) * 4 and (ptr + 10) * 4, so you'll need a
more complicated addressing mode.

By being able to fold the expressions to something simpler they become:

  Base: (integer(kind=4) *) &block + ((sizetype) ((unsigned long) l0_19(D) * 324) + 36)
  Base: (integer(kind=4) *) &block + ((sizetype) ((unsigned long) l0_19(D) * 324) + 40)

By comparing the operations structurally IV opts realizes the base expression
this time can be

  ptr = &block + ((sizetype) ((unsigned long) l0_19(D) * 324) + 36)

and so the two IVs become ptr and ptr + 4, hence the immediate offset
addressing.

Basically the outer multiply prevents the IVs from being expressible as a
simple offset from each other.
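The folding Tamar describes is plain arithmetic on the byte offsets, and it can be checked with a small C sketch. The function names and the use of size_t are assumptions made for illustration. The constants 81 and 4 come from the Base expressions above (4 being the size of integer(kind=4)), and the sketch uses unsigned arithmetic, where distributing the multiply over the addition is always valid.

```c
#include <stddef.h>

/* Byte offset in the original form of the IV base: ((l0 * 81) + k) * 4. */
static size_t offset_unfolded(size_t l0, size_t k)
{
    return (l0 * 81 + k) * 4;
}

/* Byte offset after folding the outer multiply: l0 * 324 + 4 * k. */
static size_t offset_folded(size_t l0, size_t k)
{
    return l0 * 324 + k * 4;
}

/* Distance in bytes between the IVs with k = 10 and k = 9. In the folded
   form this is a constant that does not depend on l0. */
static size_t neighbour_distance(size_t l0)
{
    return offset_folded(l0, 10) - offset_folded(l0, 9);
}
```

With k = 9 and k = 10 the folded offsets are l0 * 324 + 36 and l0 * 324 + 40, so the second access is always the first plus a constant 4 bytes. That constant is what can be encoded as the immediate in the load/store.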
[Bug target/114978] [14/15 regression] 548.exchange2_r 14%-28% regressions on Loongarch64 after gcc 14 snapshot 20240317
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114978

Tianyang Zhou changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |tianyang.chou at gmail dot com

--- Comment #24 from Tianyang Zhou ---
(In reply to Chen Chen from comment #0)
> We tested Loongarch64 CPU Loongson 3A6000 with "LA664" architecture in Linux
> operating system AOSC OS 11.4.0 (default gcc version is 13.2.0). And we
> found the 548.exchange2_r benchmark from SPEC 2017 INTrate suite suffered
> significant regressions from 14% to 28% with various compiling options.
>
> The rate-1 results are following:
>
> after snapshot 20240317 score 14.3-19.3% lower with parameters "-g -Ofast
> -march=native":
> 13.2.0:   11.7 (223s) [gcc 13.2.0, system default]
> 20240317: 11.0 (237s) [gcc 14 snapshot 20240317]
> 20240324: 8.88 (295s) [gcc 14 snapshot 20240324]
> 20240430: 9.03 (290s) [gcc 14 snapshot 20240430, 14.1.0-RC]
> 14.1.0:   9.43 (278s) [gcc 14.1.0 release]
>
> after snapshot 20240317 score 16.5-20.8% lower with parameters "-g -Ofast
> -march=native -flto":
> 13.2.0:   12.0 (218s)
> 20240317: 10.6 (248s)
> 20240324: 8.40 (312s)
> 20240430: 8.48 (309s)
> 14.1.0:   8.85 (296s)
>
>
> after snapshot 20240317 score 18-23.1% lower with parameters "-g -Ofast
> -march=la664":
> 13.2.0:   "-march=la664" flag is not supported
> 20240317: 11.5 (227s)
> 20240324: 8.84 (296s)
> 20240430: 9.43 (278s)
> 14.1.0:   9.42 (278s)
>
>
> after snapshot 20240317 score 20.3-21.2% lower with parameters "-g -Ofast
> -march=la664 -flto":
> 13.2.0:   "-march=la664" flag is not supported
> 20240317: 11.1 (236s)
> 20240324: 8.75 (299s)
> 20240430: 8.85 (296s)
> 14.1.0:   8.85 (296s)
>
>
> after snapshot 20240317 score 26.3-26.6% lower with parameters "-g -Ofast
> -march=la464":
> 13.2.0:   8.76 (299s)
> 20240317: 12.8 (205s)
> 20240324: 9.39 (279s)
> 20240430: 9.43 (278s)
> 14.1.0:   9.43 (278s)
>
>
> after snapshot 20240317 score 26.6-28% lower with parameters "-g -Ofast
> -march=la464 -flto":
> 13.2.0:   8.52 (307s)
> 20240317: 12.8 (204s)
> 20240324: 9.22 (284s)
> 20240430: 9.37 (280s)
> 14.1.0:   9.40 (279s)
>
>
> The gcc 14 snapshots and gcc 14.1.0 are compiled with the following
> parameters:
>
> --enable-shared --enable-threads=posix --with-system-zlib
> --enable-gnu-indirect-function --enable-__cxa_atexit
> --disable-libunwind-exceptions --enable-clocale=gnu --disable-libstdcxx-pch
> --disable-libssp --enable-gnu-unique-object --enable-linker-build-id
> --enable-lto --enable-plugin --enable-install-libiberty --disable-multilib
> --disable-werror --enable-pie --enable-checking=release
> --enable-libstdcxx-dual-abi --with-default-libstdcxx-abi=new
> --enable-default-pie --enable-default-ssp --enable-bootstrap
> --enable-languages=c,c++,fortran,lto --with-abi=lp64d
> --with-arch=loongarch64 --with-tune=la664 --build=loongarch64-aosc-linux-gnu
>
>
> The regression may be found on other types of CPUs as well. We did a quick
> test on AMD Zen4 CPU R9 7940HS and found similar but smaller regression:
>
> The rate-1 results on x86_64 (AMD R9 7940HS) with operating system Debian 12:
>
> after snapshot 20240317 score 8.6-9.6% lower with parameters "-m64 -g -Ofast
> -march=znver3":
> 12.2.0:   30.1 (87.0s) [gcc 12.2.0, system default]
> 13.2.0:   30.6 (85.7s) [gcc 13.2 release]
> 20240317: 31.4 (83.3s) [gcc 14 snapshot]
> 20240324: 28.7 (91.2s) [gcc 14 snapshot]
> 20240430: 28.4 (92.2s) [gcc 14 snapshot, 14.1.0-RC]
>
> after snapshot 20240317 score 10% lower with parameters "-m64 -g -Ofast
> -march=znver3 -flto":
> 12.2.0:   29.0 (90.3s)
> 13.2.0:   30.9 (84.9s)
> 20240317: 32.0 (81.8s)
> 20240324: 28.8 (90.9s)
> 20240430: 28.8 (91.1s)
>
> gcc13 and gcc14 are compiled with the following parameters:
>
> --enable-shared --enable-threads=posix --with-system-zlib
> --enable-gnu-indirect-function --enable-__cxa_atexit
> --disable-libunwind-exceptions --enable-clocale=gnu --disable-libstdcxx-pch
> --disable-libssp --enable-gnu-unique-object --enable-linker-build-id
> --enable-lto --enable-plugin --enable-install-libiberty --disable-multilib
> --disable-werror --enable-pie --enable-checking=release
> --enable-libstdcxx-dual-abi --with-default-libstdcxx-abi=new
> --enable-default-pie --enable-default-ssp --enable-bootstrap
> --enable-languages=c,c++,fortran,lto --build=x86_64-linux-gnu
> --host=x86_64-linux-gnu --target=x86_64-linux-gnu

Sorry to bring up something unrelated to this bug.

I tried running 548 on a Loongson 3A6000 CPU with the same compiler version
and compiler options as you, but the score is only 8.5. Could you please tell
me what I am missing? I just can't reproduce your performance result.

The GCC source code was downloaded from the GitHub repo AOSC-Tracking/gcc
(13.2.0) and configured with the parameters:

"--enable-shared --enable-threads=posi
[Bug target/114978] [14/15 regression] 548.exchange2_r 14%-28% regressions on Loongarch64 after gcc 14 snapshot 20240317
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114978

--- Comment #25 from Tianyang Zhou ---
Btw, I am using the latest AOSC OS, and its system default GCC is 14.2.0
20240801.