[Bug target/95525] New: Bitmask conflict between PTA_AVX512VP2INTERSECT and PTA_WAITPKG
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95525

            Bug ID: 95525
           Summary: Bitmask conflict between PTA_AVX512VP2INTERSECT and
                    PTA_WAITPKG
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: lili.cui at intel dot com
  Target Milestone: ---

In GCC trunk, PTA_AVX512VP2INTERSECT and PTA_WAITPKG are assigned the same bit
in gcc/config/i386/i386.h:

const wide_int_bitmask PTA_AVX512VP2INTERSECT (0, HOST_WIDE_INT_1U << 9);
const wide_int_bitmask PTA_WAITPKG (0, HOST_WIDE_INT_1U << 9);
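To illustrate why two feature masks on the same bit are a problem, here is a minimal sketch in plain C. It uses uint64_t instead of GCC's wide_int_bitmask, and the FEATURE_* names, the helper functions, and the replacement bit 10 are all hypothetical; the bit chosen by the actual fix is not stated in this report.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the PTA_* masks from the report.  */
static const uint64_t FEATURE_AVX512VP2INTERSECT = UINT64_C(1) << 9;
static const uint64_t FEATURE_WAITPKG = UINT64_C(1) << 9;        /* same bit: the conflict */
static const uint64_t FEATURE_WAITPKG_FIXED = UINT64_C(1) << 10; /* assumed free bit, for illustration only */

/* Return nonzero when the two masks share at least one bit.  */
static int masks_conflict(uint64_t a, uint64_t b)
{
    return (a & b) != 0;
}

/* Return nonzero when cpu_flags has the given feature bit set.  */
static int has_feature(uint64_t cpu_flags, uint64_t feature)
{
    assert(feature != 0);
    return (cpu_flags & feature) != 0;
}
```

With the conflicting definitions, has_feature(FEATURE_WAITPKG, FEATURE_AVX512VP2INTERSECT) is nonzero, so a CPU that only enables WAITPKG would also appear to support AVX512VP2INTERSECT.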
[Bug target/95525] Bitmask conflict between PTA_AVX512VP2INTERSECT and PTA_WAITPKG
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95525

cuilili changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|NEW                         |RESOLVED

--- Comment #4 from cuilili ---
Fixed for GCC 11 and GCC 10.
[Bug target/95621] New: Add CET(PTA_SHSTK) to march=tigerlake
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95621

            Bug ID: 95621
           Summary: Add CET(PTA_SHSTK) to march=tigerlake
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: lili.cui at intel dot com
  Target Milestone: ---

Intel Tiger Lake needs CET support; add PTA_SHSTK to -march=tigerlake.
[Bug target/101908] [12 regression] cray regression with -O2 -ftree-slp-vectorize compared to -O2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

cuilili changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |lili.cui at intel dot com

--- Comment #23 from cuilili ---
(In reply to Richard Biener from comment #17)
> I do wonder though how CLX is fine with such access pattern ;) (did you test
> with just -O2?)

Actually CLX also has STLF issues: there is a 13.7% regression when comparing
"gcc trunk + -O2" with and without "-fno-tree-vectorize".
[Bug target/101908] [12 regression] cray regression with -O2 -ftree-slp-vectorize compared to -O2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

--- Comment #24 from cuilili ---
(In reply to cuilili from comment #23)
> (In reply to Richard Biener from comment #17)
> > I do wonder though how CLX is fine with such access pattern ;) (did you
> > test with just -O2?)

Sorry, I had the order of "with" and "without" reversed. Actually CLX also has
STLF issues: there is a 13.7% regression when comparing "gcc trunk + -O2"
without and with "-fno-tree-vectorize".
[Bug target/101908] [12 regression] cray regression with -O2 -ftree-slp-vectorize compared to -O2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

--- Comment #28 from cuilili ---
(In reply to H.J. Lu from comment #25)
> Can this be mitigated by removing redundant load and store?

Yes, inlining say_sphere removes the redundant loads and stores. O3 does this
inlining, but O2 is more sensitive to code size, so the function is not
inlined there.
[Bug target/104723] [12 regression] Redundant usage of stack
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104723

--- Comment #3 from cuilili ---
(In reply to Hongtao.liu from comment #1)
> STF issue here?

Yes. Since "YMMWORD PTR [rsp-72]" crosses a cache line, there is an STLF issue
here.

  vmovdqu64 YMMWORD PTR [rsp-72], ymm31  --> stores 32 bytes at [rsp-72], crossing a cache line
  vmovdqu64 YMMWORD PTR [rsp-55], ymm31  --> overwrites part of YMMWORD PTR [rsp-72]
  vmovdqu64 ymm31, YMMWORD PTR [rsp-72]  --> needs STLF from the first instruction and incurs a penalty
[Bug target/104723] [12 regression] Redundant usage of stack
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104723

--- Comment #9 from cuilili ---
(In reply to cuilili from comment #3)
> (In reply to Hongtao.liu from comment #1)
> > STF issue here?

Correction to comment #3: I used perf to collect the "ld_blocks.store_forward"
event for the two test cases below. stlf_64_55_64.S has an STLF issue because
the two stores overlap; it is not related to crossing a cache line.

In this case there is an STLF issue:

$ cat stlf_64_55_64.S
...
.LFB0:
        .cfi_startproc
        vmovdqu %ymm0, -64(%rsp)
        vmovdqu %ymm1, -55(%rsp)
        vmovdqu -64(%rsp), %ymm0
        ret
        .cfi_endproc
...

$ perf stat -e ld_blocks.store_forward ./stlf_64_55_64.out
runtime= : 128883744

 Performance counter stats for './stlf_64_55_64.out':

        10,000,507      ld_blocks.store_forward:u

In this case STLF works:

$ cat stlf_64_128_64.S
...
.LFB0:
        .cfi_startproc
        vmovdqu %ymm0, -64(%rsp)
        vmovdqu %ymm1, -128(%rsp)
        vmovdqu -64(%rsp), %ymm0
        ret
        .cfi_endproc
...

$ perf stat -e ld_blocks.store_forward ./stlf_64_128_64.out
runtime= : 56477424

 Performance counter stats for './stlf_64_128_64.out':

                 2      ld_blocks.store_forward:u

       0.022103902 seconds time elapsed
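The two test cases can be told apart with a very simplified model of store-to-load forwarding, sketched below. The model assumes that a load forwards only when the youngest store overlapping it covers the load completely; real hardware rules are more detailed and differ between microarchitectures. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* A memory access described by its stack offset and size in bytes.  */
struct access { long offset; long size; };

/* Nonzero when the byte ranges of a and b intersect.  */
static int overlaps(struct access a, struct access b)
{
    return a.offset < b.offset + b.size && b.offset < a.offset + a.size;
}

/* Nonzero when inner lies entirely inside outer.  */
static int contains(struct access outer, struct access inner)
{
    return outer.offset <= inner.offset
           && inner.offset + inner.size <= outer.offset + outer.size;
}

/* stores[] is in program order, oldest first.
   Returns 1 if the load can forward from the youngest overlapping store,
   0 if forwarding is blocked, -1 if no pending store overlaps the load.  */
static int can_forward(const struct access *stores, size_t n, struct access load)
{
    assert(load.size > 0);
    for (size_t i = n; i-- > 0; )
        if (overlaps(stores[i], load))
            return contains(stores[i], load);
    return -1;
}
```

Under this model, the stores at -64 and -55 followed by a 32-byte load at -64 give 0 (blocked), because the youngest overlapping store covers only part of the load. The stores at -64 and -128 give 1, because the second store does not touch the loaded range.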
[Bug target/104271] [12 Regression] 538.imagick_r run-time at -Ofast -march=native regressed by 26% on Intel Cascade Lake server CPU
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104271

--- Comment #6 from cuilili ---
I created a patch to fix this regression. The patch is under performance
testing. I will send it out later.
[Bug target/104271] [12 Regression] 538.imagick_r run-time at -Ofast -march=native regressed by 26% on Intel Cascade Lake server CPU
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104271

--- Comment #7 from cuilili ---
Created attachment 52706
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52706&action=edit
Add a heuristic for eliminate redundant load and store in inline pass.

Hi Richard,

Could you help take a look? This is my first time adding code in the mid-end;
I hope you can give me some advice. Thank you!

I added an INLINE_HINT_eliminate_load_and_store hint to the inline pass. When
the callee's memory access is to the caller's local memory passed as a
parameter, and the access size is greater than the target threshold, the hint
is enabled. With the hint, the inlining_insns_auto bound is enlarged. The
target hook is only enabled for x86 for now.

With the patch applied:

Icelake server: 538.imagick_r gets a 15.18% improvement for multicopy and a
40.78% improvement for single copy, with no measurable changes for other
benchmarks.

Cascade Lake: 538.imagick_r gets a 12.4% improvement for multicopy, with code
size increased by 0.4% and no measurable changes for other benchmarks.

Znver3 server: 538.imagick_r gets a 9.6% improvement for multicopy, with code
size increased by 0.5% and no measurable changes for other benchmarks.
[Bug target/104271] [12 Regression] 538.imagick_r run-time at -Ofast -march=native regressed by 26% on Intel Cascade Lake server CPU
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104271

--- Comment #9 from cuilili ---
I really appreciate your reply. I debugged the SRA pass with the small testcase
and found that SRA does not handle this situation. SRA cannot split the
callee's first parameter because of this rule: "Do not decompose non-BLKmode
parameters in a way that would create a BLKmode parameter. Especially for
pass-by-reference (hence, pointer type parameters), it's not worth it."

Before inline:

  For caller
    store-1: 128-bit store of struct "a" (this is an implicit store during the
             IPA pass; the store can only be found after a certain pass.)

  For callee
    load-1:  128-bit load of struct "a" for operation "c->a=(*a)"
    store-2: 128-bit store of struct "c->a" for operation "c->a=(*a)"
    load-2:  4 * 32-bit loads of c->a.f1, c->a.f2, c->a.f3 and c->a.f4
             (because store-2 uses a vector register for the store, we cannot
             use the register directly here.)

After inline:

  For caller
    None.

  For callee
    store-2: 128-bit store of struct c->a for operation "c->a=(*a)"

int callee (struct A *a, struct C *c)
{
  c->a=(*a);
  if ((c->b + 7) & 17)
    {
      c->a.f1 = c->a.f2 + c->a.f3;
      c->a.f2 = c->a.f2 - c->a.f3;
      c->a.f3 = c->a.f2 + c->a.f3;
      c->a.f4 = c->a.f2 - c->a.f3;
      c->b = c->a.f2 + c->a.f4;
      return 0;
    }
  return 1;
}

int caller (int d, struct C *c)
{
  struct A a;
  a.f1 = 1 + d;
  a.f2 = 2;
  a.f3 = 12 + d;
  a.f4 = 68 + d;
  if (d > 0)
    return callee (&a, c);
  else
    return 1;
}

In 538.imagick_r (c_ray also has similar code), if we inline the hot function,
the redundantly stored and loaded structure is 256 bits (4 elements of 64 bits
each), so inlining eliminates one 256-bit store, one 256-bit load, and four
64-bit loads.

Can I do it like this? Compute the total size of all callee arguments for
which redundant loads and stores can be eliminated.

Thanks!
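To make the effect of inlining concrete, the sketch below completes the testcase with assumed struct definitions (four int fields in struct A, and struct C holding a struct A plus an int b; neither layout is given in the comment) and adds a hand-inlined variant, caller_inlined, which is hypothetical and only illustrates what the optimizer could produce. In the hand-inlined form the temporary "a" disappears, so the store of "a" and the reload in the callee are no longer needed.

```c
#include <assert.h>

/* Assumed layouts: the comment implies 4 * 32-bit fields, nothing more.  */
struct A { int f1, f2, f3, f4; };
struct C { struct A a; int b; };

static int callee(struct A *a, struct C *c)
{
    c->a = *a;                      /* load-1 + store-2 in the comment */
    if ((c->b + 7) & 17) {
        c->a.f1 = c->a.f2 + c->a.f3;
        c->a.f2 = c->a.f2 - c->a.f3;
        c->a.f3 = c->a.f2 + c->a.f3;
        c->a.f4 = c->a.f2 - c->a.f3;
        c->b = c->a.f2 + c->a.f4;
        return 0;
    }
    return 1;
}

static int caller(int d, struct C *c)
{
    struct A a;                     /* store-1: the local struct is written to the stack */
    a.f1 = 1 + d;
    a.f2 = 2;
    a.f3 = 12 + d;
    a.f4 = 68 + d;
    return d > 0 ? callee(&a, c) : 1;
}

/* Hypothetical hand-inlined form: values go straight into c->a.  */
static int caller_inlined(int d, struct C *c)
{
    if (d <= 0)
        return 1;
    c->a.f1 = 1 + d;
    c->a.f2 = 2;
    c->a.f3 = 12 + d;
    c->a.f4 = 68 + d;
    if ((c->b + 7) & 17) {
        c->a.f1 = c->a.f2 + c->a.f3;
        c->a.f2 = c->a.f2 - c->a.f3;
        c->a.f3 = c->a.f2 + c->a.f3;
        c->a.f4 = c->a.f2 - c->a.f3;
        c->b = c->a.f2 + c->a.f4;
        return 0;
    }
    return 1;
}
```

Both variants are meant to leave the same values in *c and return the same result for any d; only the intermediate memory traffic differs.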
[Bug target/104723] [12 regression] Redundant usage of stack
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104723

--- Comment #11 from cuilili ---
(In reply to Jakub Jelinek from comment #10)
> And for the backend, the question is how big the penalty for the overlapping
> store is compared to doing multiple non-overlapping stores. Say for those
> 49 bytes one could do one OI, one TI/V1TI and one QI load/store as opposed to
> one aligned and one misaligned OI load/store.
>
> For say:
> void
> foo (void *p, void *q)
> {
>   __builtin_memcpy (p, q, 49);
> }
> we emit the 2 overlapping loads/stores for -mavx512f and 4 non-overlapping
> loads/stores with say -mavx2.

I executed both code sequences 10 times on ICX and znver3 machines.

For ICX: 2 overlapping loads/stores are 3.5x faster than 4 non-overlapping
loads/stores.
For Znver3: 2 overlapping loads/stores are 1.39x faster than 4 non-overlapping
loads/stores.

2 overlapping loads/stores:

        vmovdqu ymm0, YMMWORD PTR [rsi]
        vmovdqu YMMWORD PTR [rdi], ymm0
        vmovdqu ymm1, YMMWORD PTR [rsi+17]
        vmovdqu YMMWORD PTR [rdi+17], ymm1

4 non-overlapping loads/stores:

        vmovdqu xmm0, XMMWORD PTR [rsi]
        vmovdqu XMMWORD PTR [rdi], xmm0
        vmovdqu xmm1, XMMWORD PTR [rsi+16]
        vmovdqu XMMWORD PTR [rdi+16], xmm1
        vmovdqu xmm2, XMMWORD PTR [rsi+32]
        vmovdqu XMMWORD PTR [rdi+32], xmm2
        movzx   eax, BYTE PTR [rsi+48]
        mov     BYTE PTR [rdi+48], al
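The two copy strategies can be written out in portable C as a rough sketch. The function names are hypothetical, and fixed-size memcpy calls stand in for the vector moves; whether a compiler turns each call into a single ymm or xmm move depends on the target flags, so this shows the access pattern rather than the exact code generation.

```c
#include <assert.h>
#include <string.h>

enum { COPY_LEN = 49 };

/* Two 32-byte chunks: bytes [0,32) and [17,49). Bytes 17..31 are copied twice.  */
static void copy49_overlapping(void *dst, const void *src)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    memcpy(d, s, 32);
    memcpy(d + (COPY_LEN - 32), s + (COPY_LEN - 32), 32);
}

/* Four non-overlapping chunks: 16 + 16 + 16 + 1 bytes.  */
static void copy49_split(void *dst, const void *src)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    memcpy(d, s, 16);
    memcpy(d + 16, s + 16, 16);
    memcpy(d + 32, s + 32, 16);
    d[48] = s[48];
}
```

Both functions copy exactly 49 bytes and write nothing past dst[48]; as with memcpy itself, the source and destination buffers must not overlap each other.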
[Bug target/105493] [12/13 Regression] x86_64 538.imagick_r 6% regressions and 2% 525.x264_r regressions on Alder Lake after r12-7319-g90d693bdc9d718
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105493

cuilili changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|NEW                         |RESOLVED

--- Comment #9 from cuilili ---
This regression was fixed by commit
r13-1021-g269edf4e5e6ab489730038f7e3495550623179fe, so I am closing this
ticket.
[Bug middle-end/26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
Bug 26163 depends on bug 105493, which changed state.

Bug 105493 Summary: [12/13 Regression] x86_64 538.imagick_r 6% regressions and 2% 525.x264_r regressions on Alder Lake after r12-7319-g90d693bdc9d718
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105493

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED
[Bug target/104271] [12/13 Regression] 538.imagick_r run-time at -Ofast -march=native regressed by 26% on Intel Cascade Lake server CPU
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104271

--- Comment #12 from cuilili ---
This regression was caused by a store forwarding issue. Inlining eliminates
the two redundant pairs of loads and stores that have the store forwarding
issue. The regression has been fixed by
https://gcc.gnu.org/g:1b9a5cc9ec08e9f239dd2096edcc447b7a72f64a
[Bug target/105493] New: [12/13 Regression] x86_64 538.imagick_r 6% regressions and 2% 525.x264_r regressions on Alder Lake after r12-7319-g90d693bdc9d718
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105493

            Bug ID: 105493
           Summary: [12/13 Regression] x86_64 538.imagick_r 6% regressions
                    and 2% 525.x264_r regressions on Alder Lake after
                    r12-7319-g90d693bdc9d718
           Product: gcc
           Version: 13.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: lili.cui at intel dot com
  Target Milestone: ---

This is similar to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104762; they
are all caused by the same commit 90d693bdc9d71841f51d68826ffa5bd685d7f0bc.

options: -march=native -Ofast -lto

Alder Lake single copy, after vs. before this commit:

  525.x264_r       -9.09%
  538.imagick_r   -25.00%

Alder Lake multicopy, after vs. before this commit:

  525.x264_r       -2.00%
  538.imagick_r    -6.7%
[Bug target/105493] [12/13 Regression] x86_64 538.imagick_r 6% regressions and 2% 525.x264_r regressions on Alder Lake after r12-7319-g90d693bdc9d718
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105493

--- Comment #2 from cuilili ---
(In reply to Richard Biener from comment #1)
> Martin is currently re-benchmarking GCC 12 on AMD, so let's see if there's
> anything left on those.

AMD may not have this issue. Richard fixed the AMD regression with commit
r12-7612-g69619acd8d9b5856f5af6e5323d9c7c4ec9ad08f, but Intel was not fixed
because the two targets use different costs.
[Bug middle-end/110148] [14 Regression] TSVC s242 regression between g:c0df96b3cda5738afbba3a65bb054183c5cd5530 and g:e4c986fde56a6248f8fbe6cf0704e1da34b055d8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110148 --- Comment #7 from cuilili --- (In reply to Martin Jambor from comment #6) > I believe this has been fixed? Yes.
[Bug tree-optimization/110038] [14 Regression] ICE: in rewrite_expr_tree_parallel, at tree-ssa-reassoc.cc:5522 with --param=tree-reassoc-width=2147483647
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110038

--- Comment #2 from cuilili ---
(In reply to Richard Biener from comment #1)
> Probably best to limit the values to reassoc-width by adding the
> appropriate IntegerRange attribute in params.opt
>
> IntegerRange(0, 256)
>
> maybe?

"rewrite_expr_tree_parallel" got a wrong width from
"get_reassociation_width": the number of ops is 4, but width is 2147483647.

get_reassociation_width:
  ...
  width_min = 1;
  while (width > width_min)
    {
      int width_mid = (width + width_min) / 2;   --> (width + 1) out of bounds
  ...

Richard suggested that limiting tree-reassoc-width to IntegerRange(0, 256)
would solve the ICE. I also added a width constraint in
rewrite_expr_tree_parallel. Here is the patch:
https://gcc.gnu.org/pipermail/gcc-patches/2023-May/620154.html

1. Limit the value of tree-reassoc-width to IntegerRange(0, 256).
2. Add a width limit in rewrite_expr_tree_parallel.
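The overflow comes from adding two ints before halving. As a side illustration only, and not the fix in the linked patch (which limits the parameter range), a midpoint can be computed without the intermediate sum. The function name below is hypothetical.

```c
#include <assert.h>
#include <limits.h>

/* Midpoint of [lo, hi] for non-negative lo <= hi.
   hi - lo cannot overflow here, unlike lo + hi when hi is near INT_MAX.  */
static int midpoint_no_overflow(int lo, int hi)
{
    assert(0 <= lo && lo <= hi);
    return lo + (hi - lo) / 2;
}
```

For width_min = 1 and width = INT_MAX this stays in range, whereas (width + width_min) / 2 overflows in signed arithmetic, which is undefined behavior in C and C++.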
[Bug tree-optimization/110038] [14 Regression] ICE: in rewrite_expr_tree_parallel, at tree-ssa-reassoc.cc:5522 with --param=tree-reassoc-width=2147483647
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110038 --- Comment #5 from cuilili --- (In reply to Martin Jambor from comment #4) > So is this now fixed? Yes, the attachment case has been fixed.
[Bug target/104271] [12 Regression] 538.imagick_r run-time at -Ofast -march=native regressed by 26% on Intel Cascade Lake server CPU
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104271 --- Comment #14 from cuilili --- This regression has been fixed with the commit below and we can close this ticket. https://gcc.gnu.org/g:1b9a5cc9ec08e9f239dd2096edcc447b7a72f64a
[Bug middle-end/110148] [14 Regression] TSVC s242 regression between g:c0df96b3cda5738afbba3a65bb054183c5cd5530 and g:e4c986fde56a6248f8fbe6cf0704e1da34b055d8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110148

cuilili changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |lili.cui at intel dot com

--- Comment #2 from cuilili ---
The commit changed the function that breaks dependency chains, in order to
generate more FMAs. S242 has a chain that needs to be broken. The chain is in
a small loop and involves the loop-carried value a[i-1].

Src code:

for (int i = 1; i < LEN_1D; ++i) {
    a[i] = a[i - 1] + s1 + s2 + b[i] + c[i] + d[i];
}

--
Base version:

SSA tree
  ssa1 = (s1+s2) + b[i];
  ssa2 = c[i] + d[i];
  ssa3 = ssa1+ssa2;
  ssa4 = ssa3 + a[i-1]

a[i-1] uses xmm1; there are 2 instructions using xmm1 that have dependencies
across iterations.

Assembler
Loop1:
  vmovsd 0x60c400(%rax),%xmm0
  vaddsd 0x60b000(%rax),%xmm3,%xmm2
  add    $0x8,%rax
  vaddsd 0x60b9f8(%rax),%xmm0,%xmm0
  vaddsd %xmm2,%xmm0,%xmm0
  vaddsd %xmm0,%xmm1,%xmm1        ---> 1
  vmovsd %xmm1,0x60cdf8(%rax)     ---> 2
  cmp    $0xa00,%rdx
  jne    Loop1

--
Base + commit g:e5405f065bace0685cb3b8878d1dfc7a6e7ef409 version:

a[i-1] uses xmm0; there are 4 instructions using xmm0 that have dependencies
across iterations.

SSA tree
  ssa1 = (s1+s2) + b[i];
  ssa2 = c[i] + d[i];
  ssa3 = ssa1 + a[i-1]
  ssa4 = ssa2 + ssa3;

Assembler
Loop1:
  vaddsdq 0x60b000(%rax), %xmm0, %xmm0    ---> 1
  vmovsdq 0x60c400(%rax), %xmm1
  add     $0x8, %rax
  vaddsdq 0x60b9f8(%rax), %xmm1, %xmm1
  vaddsd  %xmm2, %xmm0, %xmm0             ---> 2
  vaddsd  %xmm1, %xmm0, %xmm0             ---> 3
  vmovsdq %xmm0, 0x60cdf8(%rax)           ---> 4
  cmp     $0xa00,%rdx
  jne     Loop1
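The two SSA orderings can be written as explicit C loops to show where a[i-1] enters the expression. This is a sketch following the SSA trees quoted in the comment, not the generated code; the function names and the parameter list are hypothetical. In the first loop only the final add depends on the previous iteration; in the second, a[i-1] is consumed one step earlier, so two dependent adds sit on the loop-carried chain.

```c
#include <assert.h>
#include <stddef.h>

/* Base ordering: everything independent of a[i-1] is summed first.  */
static void s242_base(double *a, const double *b, const double *c,
                      const double *d, double s1, double s2, size_t n)
{
    for (size_t i = 1; i < n; ++i) {
        double ssa1 = (s1 + s2) + b[i];
        double ssa2 = c[i] + d[i];
        double ssa3 = ssa1 + ssa2;
        a[i] = ssa3 + a[i - 1];     /* one add on the loop-carried chain */
    }
}

/* Ordering after the commit: a[i-1] is added before ssa2.  */
static void s242_reassociated(double *a, const double *b, const double *c,
                              const double *d, double s1, double s2, size_t n)
{
    for (size_t i = 1; i < n; ++i) {
        double ssa1 = (s1 + s2) + b[i];
        double ssa2 = c[i] + d[i];
        double ssa3 = ssa1 + a[i - 1];  /* chain starts here ...  */
        a[i] = ssa2 + ssa3;             /* ... and continues here */
    }
}
```

For inputs whose sums are exactly representable, both loops produce the same values; in general floating-point reassociation can change rounding, which is why the compiler only reorders such chains under options like -Ofast.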
[Bug middle-end/110148] [14 Regression] TSVC s242 regression between g:c0df96b3cda5738afbba3a65bb054183c5cd5530 and g:e4c986fde56a6248f8fbe6cf0704e1da34b055d8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110148

--- Comment #3 from cuilili ---
I reproduced the S1244 regression on znver3.

Src code:

for (int i = 0; i < LEN_1D-1; i++) {
    a[i] = b[i] + c[i] * c[i] + b[i] * b[i] + c[i];
    d[i] = a[i] + a[i+1];
}

Base version:                            Base + commit version:

Assembler                                Assembler
Loop1:                                   Loop1:
vmovsd 0x60c400(%rax),%xmm2              vmovsd 0x60ba00(%rax),%xmm2
vmovsd 0x60ba00(%rax),%xmm1              vmovsd 0x60c400(%rax),%xmm1
add    $0x8,%rax                         add    $0x8,%rax
vaddsd %xmm1,%xmm2,%xmm0                 vmovsd %xmm2,%xmm2,%xmm0
vmulsd %xmm2,%xmm2,%xmm2                 vfmadd132sd %xmm2,%xmm1,%xmm0
vfmadd132sd %xmm1,%xmm2,%xmm1            vfmadd132sd %xmm1,%xmm2,%xmm1
vaddsd %xmm1,%xmm0,%xmm0                 vaddsd %xmm1,%xmm0,%xmm0
vmovsd %xmm0,0x60cdf8(%rax)              vmovsd %xmm0,0x60cdf8(%rax)
vaddsd 0x60ce00(%rax),%xmm0,%xmm0        vaddsd 0x60ce00(%rax),%xmm0,%xmm0
vmovsd %xmm0,0x60aff8(%rax)              vmovsd %xmm0,0x60aff8(%rax)
cmp    $0x9f8,%rax                       cmp    $0x9f8,%rax
jne    Loop1                             jne    Loop1

For the Base version, mult and FMA have dependencies, which increases the
latency of the critical dependency chain. I have not found out why znver3 has
a regression. The same binary running on ICX shows an 11% gain (with #define
iterations 1).
[Bug target/117192] [15 Regression] wrong code at -O3 with "-fno-unswitch-loops" on x86_64-linux-gnu since r15-4397-g70f59d2a1c51bd
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117192 --- Comment #14 from cuilili --- (In reply to Uroš Bizjak from comment #12) > Created attachment 59373 [details] > Proposed patch > > Patch in testing. Sorry, I made a mistake here, thanks!
[Bug middle-end/117838] New: IRA issues: The higher cost variable a is spilled for the lower cost variable conflict_a in improve_allocatuion()
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117838

            Bug ID: 117838
           Summary: IRA issues: The higher cost variable a is spilled for
                    the lower cost variable conflict_a in
                    improve_allocatuion()
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: lili.cui at intel dot com
  Target Milestone: ---

Created attachment 59740
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=59740&action=edit
testcase-dump-patch

testcase: sqlite3-ok.i (in attachment; it is extracted from SQLite)
gcc version: Fri Oct 18 2024 (aaa855fac0c7003d823b48fe4cc4b9ded9331a2b)
compile options: -std=c18 -m64 -S -o sqlite3.s -Ofast -march=x86-64-v3 sqlite3-ok.i -da

Function: sqlite3VdbeExec
variable a          : pOp    -> a5(r806, l0)     cost=30495
variable conflict_a : nDepth -> a2737(r1214,l0)  cost 1102

      Pushing a5(r806,l0)(potential spill: pri=26, cost=30495)
      Pushing a2737(r1214,l0: a2656(r1214,l2: a1251(r1214,l17)))(cost 1102)
      Popping a5(r806,l0)  --  assign reg 40
      Popping a2737(r1214,l0: a2656(r1214,l2: a1251(r1214,l17)))  --  assign reg 38
      Spilling a5r806 for a2737r1214
      Assigning 40 to a2737r1214
      a5(r806,l0)  --  assign memory

It was introduced by the following commit:

commit 037cc0b4a6646cc86549247a3590215ebd5c4c43
Author: Richard Sandiford
Date:   Mon Jan 10 14:47:09 2022 +

    ira: Handle "soft" conflicts between cap and non-cap allocnos

I created a patch (in attachment) to drop the problematic code; there is no
regression on Graviton 3 and SPR with SPEC CPU 2017.