[Bug ipa/117695] lto got zero score on unixbench dhry2reg on trunk

2024-12-30 Thread tianyang.chou at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117695

Tianyang Zhou  changed:

   What|Removed |Added

 CC||tianyang.chou at gmail dot com

--- Comment #9 from Tianyang Zhou  ---
(In reply to Andrew Pinski from comment #8)
> so this code is undefined for a few different reasons.
> #1 is you can't call fprintf from an async signal (e.g. SIGALRM).
> https://www.gnu.org/software/libc/manual/html_node/Formatted-Output-Functions.html
> specifies that fprintf is `AS-Unsafe corrupt heap`
> (https://www.gnu.org/software/libc/manual/html_node/POSIX-Safety-Concepts.html
> for the meaning there on AS)
> 
> 
> #2 is an async signal might be done at different times and the write might
> not be done by the compiler at all, you need to mark the variable as
> volatile. that is Run_Index needs to be marked as volatile to get the
> variable written to by the compiler (that is why it is optimized away with
> LTO).
Hi Andrew,

Instead of marking the variable as volatile, would disabling the passes
related to loop elimination work in this case?
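
For context on the two points quoted above, here is a minimal C sketch of a timer-driven loop that avoids both problems. It is not the UnixBench source: the names `time_up`, `on_alarm` and `run_until_alarm` are hypothetical, and the sketch only assumes a POSIX system. The handler restricts itself to async-signal-safe operations (`write` instead of `fprintf`), and the variable shared with the handler is `volatile sig_atomic_t`, so the compiler has to reload it on every iteration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Shared between the handler and the main loop. volatile forces a real
   load on every loop iteration; sig_atomic_t makes the access atomic
   with respect to signal delivery. */
static volatile sig_atomic_t time_up = 0;

/* Hypothetical handler: only async-signal-safe work (a flag store and
   write()), no stdio. */
static void on_alarm(int signo)
{
    static const char msg[] = "time up\n";
    (void)signo;
    time_up = 1;
    (void)write(STDERR_FILENO, msg, sizeof msg - 1);
}

/* Hypothetical stand-in for the benchmark loop: counts iterations until
   SIGALRM arrives after `seconds` seconds, then returns the count. */
unsigned long run_until_alarm(unsigned int seconds)
{
    unsigned long iterations = 0;
    struct sigaction sa;

    assert(seconds > 0);
    time_up = 0;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGALRM, &sa, NULL);
    alarm(seconds);

    /* The loop exits only because time_up is re-read each time. */
    while (!time_up)
        iterations++;

    return iterations;
}
```

In this arrangement the result is reported by the main flow after the loop ends, so the handler never needs to read the counter. Whether turning off individual optimization passes would give the same guarantee is not something this sketch shows.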

[Bug tree-optimization/114932] IVopts inefficient handling of signed IV used for addressing.

2025-03-18 Thread tianyang.chou at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932

Tianyang Chou  changed:

   What|Removed |Added

 CC||tianyang.chou at gmail dot com

--- Comment #22 from Tianyang Chou  ---
(In reply to Tamar Christina from comment #21)
> Thus finally fixed

Hi Tamar,

Are there any other prerequisite patches for this patch: "perform affine fold to
unsigned on non address expressions."?

I applied your patch to the official gcc-13.2.0 release and got a 10% performance
increase on aarch64, but applying it to the official gcc-12.3.0 release gave no
performance improvement.

So I wonder whether some related patches between gcc-12.3.0 and gcc-13.2.0
account for the difference.

[Bug tree-optimization/114932] IVopts inefficient handling of signed IV used for addressing.

2025-03-18 Thread tianyang.chou at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932

--- Comment #23 from Tianyang Chou  ---
(In reply to Tianyang Chou from comment #22)
> (In reply to Tamar Christina from comment #21)
> > Thus finally fixed
> 
> Hi Tamar,
> 
> Is there any other prerequisite patches for this patch: "perform affine fold
> to unsigned on non address expressions." ?
> 
> I apply your patch to gcc-13.2.0 official and got 10% performance increase
> on aarch64, but applying the patch to gcc-12.3.0 official got 0 performance
> boost.
> 
> So I wonder if there are some related patches between gcc-12.3.0 and
> gcc-13.2.0 which bring the difference.

With the patch applied, the official GCC-12.3.0 release generates the following
asm for the test case:

ldp w7, w6, [x0, 216]
ldp w5, w4, [x0, 252]
ldr w3, [x0, 288]
ldr w2, [x0, 292]
sub w7, w7, #1
sub w6, w6, #1
sub w5, w5, #1
sub w4, w4, #1
sub w3, w3, #1
stp w7, w6, [x0, 216]
stp w5, w4, [x0, 252]
add x0, x0, 324
sub w2, w2, #1
str w3, [x0, -36]
str w2, [x0, -32]

and the official gcc-13.2.0 release with the patch generates asm like this:

mvni v3.2s, 0
ldr d2, [x0, 216]
ldr d1, [x0, 252]
ldr d0, [x0, 288]
add v2.2s, v2.2s, v3.2s
add x0, x0, 324
add v1.2s, v1.2s, v3.2s
add v0.2s, v0.2s, v3.2s
str d2, [x0, -108]
str d1, [x0, -72]
str d0, [x0, -36]

It seems the code did not get vectorized with GCC-12.3.0.
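
To make the two listings easier to compare, here is a hypothetical C analogue of the access pattern they imply. It is reconstructed only from the byte offsets in the asm above (216/220, 252/256, 288/292, with a 324-byte step) and assumes 4-byte `int`; it is not the actual test case from the bug, and the names `block_t` and `decrement_rows` are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed layout: each outer element is a 9x9 block of 32-bit ints,
   i.e. 81 * 4 = 324 bytes, matching the "add x0, x0, 324" step. */
typedef int block_t[9][9];

/* Decrement two adjacent ints in rows 6, 7 and 8 of every block.
   Byte offsets inside one block: 216/220, 252/256, 288/292. */
void decrement_rows(block_t *blocks, size_t n)
{
    assert(blocks != NULL || n == 0);

    for (size_t i = 0; i < n; i++)
        for (int row = 6; row < 9; row++)
            for (int col = 0; col < 2; col++)
                blocks[i][row][col] -= 1;
}
```

Each pair of adjacent ints corresponds to one `ldp`/`stp` (or `ldr`/`str` pair) in the scalar listing and to one 64-bit `d` register in the vectorized listing, where adding the all-ones vector from `mvni` subtracts 1 from both lanes.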

[Bug target/114978] [14/15 regression] 548.exchange2_r 14%-28% regressions on Loongarch64 after gcc 14 snapshot 20240317

2025-03-14 Thread tianyang.chou at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114978

--- Comment #30 from Tianyang Chou  ---
(In reply to Chen Chen from comment #27)
> I am a bit confused with your statement. For AOSC gcc 13.2 I got 8.52 with
> parameters "-g -Ofast -march=la464 -flto", and 8.76 with parameters "-g
> -Ofast -march=la464". These results are similar to yours.
> 
> For gcc 14 snapshot 20240317, currently I configure with the following
> parameters:
> 
> -enable-shared --enable-threads=posix --with-system-zlib
> --enable-gnu-indirect-function --enable-__cxa_atexit
> --disable-libunwind-exceptions --enable-clocale=gnu --disable-libstdcxx-pch
> --disable-libssp --enable-gnu-unique-object --enable-linker-build-id
> -enable-lto --enable-plugin --enable-install-libiberty --disable-multilib
> --disable-werror --enable-pie --enable-checking=release
> --enable-libstdcxx-dual-abi --with-default-libstdcxx-abi=new
> --enable-default-pie --enable-default-ssp --enable-bootstrap
> --enable-languages=c,c++,fortran,lto --with-abi=lp64d
> --with-arch=loongarch64 --with-tune=la464 --build=loongarch64-aosc-linux-gnu
> --program-suffix=-14.0.1
> 
> And still get a score of 12.8.
> 
> 
> (In reply to Tianyang Chou from comment #24)
> > (In reply to Chen Chen from comment #0)
> > > We tested Loongarch64 CPU Loongson 3A6000 with "LA664" architecture in Linux
> > > operating system AOSC OS 11.4.0 (default gcc version is 13.2.0). And we
> > > found the 548.exchange2_r benchmark from SPEC 2017 INTrate suite suffered
> > > significant regressions from 14% to 28% with various compiling options.
> > > 
> > > The rate-1 results are following:
> > > 
> > > after snapshot 20240317 score 14.3-19.3% lower with parameters "-g -Ofast
> > > -march=native":
> > > 13.2.0:    11.7 (223s) [gcc 13.2.0, system default]
> > > 20240317:  11.0 (237s) [gcc 14 snapshot 20240317]
> > > 20240324:  8.88 (295s) [gcc 14 snapshot 20240324]
> > > 20240430:  9.03 (290s) [gcc 14 snapshot 20240430, 14.1.0-RC]
> > > 14.1.0:    9.43 (278s) [gcc 14.1.0 release]
> > > 
> > > after snapshot 20240317 score 16.5-20.8% lower with parameters "-g -Ofast
> > > -march=native -flto": 
> > > 13.2.0:    12.0 (218s)
> > > 20240317:  10.6 (248s)
> > > 20240324:  8.40 (312s)
> > > 20240430:  8.48 (309s)
> > > 14.1.0:    8.85 (296s)
> > > 
> > > 
> > > after snapshot 20240317 score 18-23.1% lower with parameters "-g -Ofast
> > > -march=la664":   
> > > 13.2.0:    "-march=la664" flag is not supported
> > > 20240317:  11.5 (227s)
> > > 20240324:  8.84 (296s)
> > > 20240430:  9.43 (278s)
> > > 14.1.0:    9.42 (278s)
> > > 
> > > 
> > > after snapshot 20240317 score 20.3-21.2% lower with parameters "-g -Ofast
> > > -march=la664 -flto": 
> > > 13.2.0:    "-march=la664" flag is not supported
> > > 20240317:  11.1 (236s)
> > > 20240324:  8.75 (299s)
> > > 20240430:  8.85 (296s)
> > > 14.1.0:    8.85 (296s)
> > > 
> > > 
> > > after snapshot 20240317 score 26.3-26.6% lower with parameters "-g -Ofast
> > > -march=la464":   
> > > 13.2.0:    8.76 (299s)
> > > 20240317:  12.8 (205s)
> > > 20240324:  9.39 (279s)
> > > 20240430:  9.43 (278s)
> > > 14.1.0:    9.43 (278s)
> > > 
> > > 
> > > after snapshot 20240317 score 26.6-28% lower with parameters "-g -Ofast
> > > -march=la464 -flto": 
> > > 13.2.0:    8.52 (307s)
> > > 20240317:  12.8 (204s)
> > > 20240324:  9.22 (284s)
> > > 20240430:  9.37 (280s)
> > > 14.1.0:    9.40 (279s)
> > > 
> > > 
> > > The gcc 14 snapshots and gcc 14.1.0 are compiled with the following
> > > parameters: 
> > > 
> > > --enable-shared --enable-threads=posix --with-system-zlib
> > > --enable-gnu-indirect-function --enable-__cxa_atexit
> > > --disable-libunwind-exceptions --enable-clocale=gnu --disable-libstdcxx-pch
> > > --disable-libssp --enable-gnu-unique-object --enable-linker-build-id
> > > --enable-lto --enable-plugin --enable-install-libiberty --disable-multilib
> > > --disable-werror --enable-pie --enable-checking=release
> > > --enable-libstdcxx-dual-abi --with-default-libstdcxx-abi=new
> > > --enable-default-pie --enable-default-ssp --enable-bootstrap
> > > --enable-languages=c,c++,fortran,lto --with-abi=lp64d
> > > --with-arch=loongarch64 --with-tune=la664 --build=loongarch64-aosc-linux-gnu
> > > 
> > > 
> > > The regression may be found on other types of CPUs as well. We did a quick
> > > test on AMD Zen4 CPU R9 7940HS and found similar but smaller regression:
> > > 
> > > The rate-1 results on x86_64 (AMD R9 7940HS) with operating system Debian 12:
> > > 
> > > after snapshot 20240317 score 8.6-9.6% lower with parameters "-m64 -g -Ofast
> > > -march=znver3":
> > > 12.2.0:    30.1 (87.0s) [gcc 12.2.0, system default]
> > > 13.2.0:    30.6 (85.7s) [gcc 13.2 release]
> > > 20240317:  31.4 (83.3s) [gcc 14 snapshot]
> > > 20240324:  28.7 (91.2s) [gcc 14 snapshot]
> > > 20240430:  28.4 (92.2s) [gcc 14 snapshot, 14.1.0-RC]
> > > 
> > > after snapshot 20240317 score 10% lower with parameters "-m64 -g -Ofast
> > > -march=znver3 -flto":
> > 

[Bug tree-optimization/114932] IVopts inefficient handling of signed IV used for addressing.

2025-04-09 Thread tianyang.chou at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932

--- Comment #25 from Tianyang Chou  ---
(In reply to Tamar Christina from comment #0)

Hi Tamar,
After reading the whole discussion, I am still confused about how the
immediate offset mode gets generated. Can you help me understand the logic
chain of the optimization?
What I understand is: before the optimization, GCC generates a register offset
mode. Your patch allows the CHREC multiply to be folded in the IVopts pass,
which simplifies the address calculation. But what is the relation between
this simplification and the generated immediate offset mode? How does this
CHREC multiply folding lead, step by step, to the generation of immediate
offset ldr instructions?
I hope you can give me the basic train of thought from your optimization
to the generation of immediate offset load/store instructions. Many thanks!

[Bug tree-optimization/114932] IVopts inefficient handling of signed IV used for addressing.

2025-04-22 Thread tianyang.chou at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932

--- Comment #28 from Tianyang Chou  ---
   Very helpful, thanks for your time.
   Best regards, Tianyang.Chou

   ---- Replied Message ----
   From:    tnfchris at gcc dot gnu.org
   Date:    4/20/2025 19:46
   To:      tianyang.c...@gmail.com
   Subject: [Bug tree-optimization/114932] IVopts inefficient handling of signed IV used for addressing.

   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114932

   --- Comment #27 from Tamar Christina  ---
   (In reply to Tianyang Chou from comment #26)

   > (In reply to Tamar Christina from comment #0)
   >
   > Hi Tamar,
   > After reading the whole discussion, I still confused about how does the
   > immediate offset mode generated, can you help me understanding the logic
   > chain of the optimization?
   > What am I understand is: before optimized, gcc generate an register
   > offset mode, your patch allows CHREC multiply to be folded in IVOPT pass,
   > that means the addressing calculation process get simplified, but what's
   > the relation between this simplification and generated immediate offset
   > mode? How does this CHREC multiply folding optimization causes the
   > generation of immediate offset ldr step by step?
   > Hope you can provide me the basic train of thought from your
   > optimization to the generation of immediate offset load/store
   > instructions. Many thanks!


   Hi Tianyang,

   Sorry I forgot to respond here.

   The basic gist of it is that with the original IV

   Base: (integer(kind=4) *) &block + ((sizetype) ((integer(kind=8)) l0_19(D) * 81) + 9) * 4

   The more complicated expression makes it hard for IV opts to compare IVs.
   Let's say you have another IV

   Base: (integer(kind=4) *) &block + ((sizetype) ((integer(kind=8)) l0_19(D) * 81) + 9) * 4
   Base: (integer(kind=4) *) &block + ((sizetype) ((integer(kind=8)) l0_19(D) * 81) + 10) * 4

   This leaves IV opts with only the choice that the common base can be

   ptr = &block + ((sizetype) ((integer(kind=8)) l0_19(D) * 81)

   and so the two IVs become (ptr + 9) * 4 and (ptr + 10) * 4 so you'll need a
   more complicated addressing mode.
   By being able to fold the expressions to something simpler they become:

   Base: (integer(kind=4) *) &block + ((sizetype) ((unsigned long) l0_19(D) * 324) + 36)
   Base: (integer(kind=4) *) &block + ((sizetype) ((unsigned long) l0_19(D) * 324) + 40)

   By comparing the operations structurally IV opts realizes the base expression
   this time can be

   ptr = &block + ((sizetype) ((unsigned long) l0_19(D) * 324) + 36)

   and so the two IVs become ptr and ptr + 4, hence the immediate offset
   addressing.

   Basically the outer multiply prevents the IVs from being expressible as a
   simple offset from each other.

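
The arithmetic behind the explanation in comment #27 can be checked with a small C sketch. The function names are hypothetical, and `size_t` stands in for `sizetype`; the sketch only shows that the unfolded form `((l0 * 81) + k) * 4` and the folded form `l0 * 324 + 4 * k` give the same byte offset, and that two folded IVs differ by a constant that does not depend on `l0`. It does not model what the IVopts pass itself does.

```c
#include <assert.h>
#include <stddef.h>

/* Unfolded form from the thread: ((l0 * 81) + k) * 4, as a byte offset. */
size_t offset_unfolded(size_t l0, size_t k)
{
    return (l0 * 81 + k) * 4;
}

/* Folded form: l0 * 324 + 4 * k (36 for k = 9, 40 for k = 10). */
size_t offset_folded(size_t l0, size_t k)
{
    return l0 * 324 + 4 * k;
}

/* Byte distance between the k = 10 and k = 9 IVs for a given l0.
   With the folded form this is a plain constant. */
size_t iv_distance(size_t l0)
{
    /* Both forms must agree, since unsigned arithmetic distributes. */
    assert(offset_unfolded(l0, 9) == offset_folded(l0, 9));
    assert(offset_unfolded(l0, 10) == offset_folded(l0, 10));
    return offset_folded(l0, 10) - offset_folded(l0, 9);
}
```

For any `l0`, `iv_distance` returns 4, which is the "ptr and ptr + 4" relation that lets the second access use an immediate offset from the first.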

[Bug target/114978] [14/15 regression] 548.exchange2_r 14%-28% regressions on Loongarch64 after gcc 14 snapshot 20240317

2025-03-06 Thread tianyang.chou at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114978

Tianyang Zhou  changed:

   What|Removed |Added

 CC||tianyang.chou at gmail dot com

--- Comment #24 from Tianyang Zhou  ---
(In reply to Chen Chen from comment #0)
> We tested Loongarch64 CPU Loongson 3A6000 with "LA664" architecture in Linux
> operating system AOSC OS 11.4.0 (default gcc version is 13.2.0). And we
> found the 548.exchange2_r benchmark from SPEC 2017 INTrate suite suffered
> significant regressions from 14% to 28% with various compiling options.
> 
> The rate-1 results are following:
> 
> after snapshot 20240317 score 14.3-19.3% lower with parameters "-g -Ofast
> -march=native":
> 13.2.0:    11.7 (223s) [gcc 13.2.0, system default]
> 20240317:  11.0 (237s) [gcc 14 snapshot 20240317]
> 20240324:  8.88 (295s) [gcc 14 snapshot 20240324]
> 20240430:  9.03 (290s) [gcc 14 snapshot 20240430, 14.1.0-RC]
> 14.1.0:    9.43 (278s) [gcc 14.1.0 release]
> 
> after snapshot 20240317 score 16.5-20.8% lower with parameters "-g -Ofast
> -march=native -flto": 
> 13.2.0:    12.0 (218s)
> 20240317:  10.6 (248s)
> 20240324:  8.40 (312s)
> 20240430:  8.48 (309s)
> 14.1.0:    8.85 (296s)
> 
> 
> after snapshot 20240317 score 18-23.1% lower with parameters "-g -Ofast
> -march=la664":   
> 13.2.0:    "-march=la664" flag is not supported
> 20240317:  11.5 (227s)
> 20240324:  8.84 (296s)
> 20240430:  9.43 (278s)
> 14.1.0:    9.42 (278s)
> 
> 
> after snapshot 20240317 score 20.3-21.2% lower with parameters "-g -Ofast
> -march=la664 -flto": 
> 13.2.0:    "-march=la664" flag is not supported
> 20240317:  11.1 (236s)
> 20240324:  8.75 (299s)
> 20240430:  8.85 (296s)
> 14.1.0:    8.85 (296s)
> 
> 
> after snapshot 20240317 score 26.3-26.6% lower with parameters "-g -Ofast
> -march=la464":   
> 13.2.0:    8.76 (299s)
> 20240317:  12.8 (205s)
> 20240324:  9.39 (279s)
> 20240430:  9.43 (278s)
> 14.1.0:    9.43 (278s)
> 
> 
> after snapshot 20240317 score 26.6-28% lower with parameters "-g -Ofast
> -march=la464 -flto": 
> 13.2.0:    8.52 (307s)
> 20240317:  12.8 (204s)
> 20240324:  9.22 (284s)
> 20240430:  9.37 (280s)
> 14.1.0:    9.40 (279s)
> 
> 
> The gcc 14 snapshots and gcc 14.1.0 are compiled with the following
> parameters: 
> 
> --enable-shared --enable-threads=posix --with-system-zlib
> --enable-gnu-indirect-function --enable-__cxa_atexit
> --disable-libunwind-exceptions --enable-clocale=gnu --disable-libstdcxx-pch
> --disable-libssp --enable-gnu-unique-object --enable-linker-build-id
> --enable-lto --enable-plugin --enable-install-libiberty --disable-multilib
> --disable-werror --enable-pie --enable-checking=release
> --enable-libstdcxx-dual-abi --with-default-libstdcxx-abi=new
> --enable-default-pie --enable-default-ssp --enable-bootstrap
> --enable-languages=c,c++,fortran,lto --with-abi=lp64d
> --with-arch=loongarch64 --with-tune=la664 --build=loongarch64-aosc-linux-gnu
> 
> 
> The regression may be found on other types of CPUs as well. We did a quick
> test on AMD Zen4 CPU R9 7940HS and found similar but smaller regression:
> 
> The rate-1 results on x86_64 (AMD R9 7940HS) with operating system Debian 12:
> 
> after snapshot 20240317 score 8.6-9.6% lower with parameters "-m64 -g -Ofast
> -march=znver3":
> 12.2.0:    30.1 (87.0s) [gcc 12.2.0, system default]
> 13.2.0:    30.6 (85.7s) [gcc 13.2 release]
> 20240317:  31.4 (83.3s) [gcc 14 snapshot]
> 20240324:  28.7 (91.2s) [gcc 14 snapshot]
> 20240430:  28.4 (92.2s) [gcc 14 snapshot, 14.1.0-RC]
> 
> after snapshot 20240317 score 10% lower with parameters "-m64 -g -Ofast
> -march=znver3 -flto":
> 12.2.0:    29.0 (90.3s)
> 13.2.0:    30.9 (84.9s)
> 20240317:  32.0 (81.8s)
> 20240324:  28.8 (90.9s)
> 20240430:  28.8 (91.1s)
> 
> gcc13 and gcc14 are compiled with the following parameters:
> 
> --enable-shared --enable-threads=posix --with-system-zlib
> --enable-gnu-indirect-function --enable-__cxa_atexit
> --disable-libunwind-exceptions --enable-clocale=gnu --disable-libstdcxx-pch
> --disable-libssp --enable-gnu-unique-object --enable-linker-build-id
> --enable-lto --enable-plugin --enable-install-libiberty --disable-multilib
> --disable-werror --enable-pie --enable-checking=release
> --enable-libstdcxx-dual-abi --with-default-libstdcxx-abi=new
> --enable-default-pie --enable-default-ssp --enable-bootstrap
> --enable-languages=c,c++,fortran,lto  --build=x86_64-linux-gnu
> --host=x86_64-linux-gnu --target=x86_64-linux-gnu

Sorry to bring up something unrelated to this bug. I tried running
548.exchange2_r on a Loongson 3A6000 CPU with the same compiler version and
compiler options as you, but the score is only 8.5. Could you please tell me
what I am missing? I just can't reproduce your performance result.

I downloaded the GCC source code from the GitHub repo AOSC-Tracking/gcc
(13.2.0) and configured it with these parameters:

"--enable-shared --enable-threads=posi

[Bug target/114978] [14/15 regression] 548.exchange2_r 14%-28% regressions on Loongarch64 after gcc 14 snapshot 20240317

2025-03-06 Thread tianyang.chou at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114978

--- Comment #25 from Tianyang Zhou  ---
By the way, I am using the latest AOSC OS, where the system default GCC is
14.2.0 20240801.