[Bug testsuite/94036] [9 regression] gcc.target/powerpc/pr72804.c fails
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94036

luoxhu at gcc dot gnu.org changed:
           What            |Removed                       |Added
           Status          |UNCONFIRMED                   |ASSIGNED
           Last reconfirmed|                              |2020-03-06
           Assignee        |unassigned at gcc dot gnu.org |luoxhu at gcc dot gnu.org
           Ever confirmed  |0                             |1

--- Comment #1 from luoxhu at gcc dot gnu.org ---
Patch posted: https://gcc.gnu.org/ml/gcc-patches/2020-03/msg00284.html
[Bug testsuite/94036] [9 regression] gcc.target/powerpc/pr72804.c fails
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94036

luoxhu at gcc dot gnu.org changed:
           What      |Removed |Added
           Resolution|---     |FIXED
           Status    |ASSIGNED|RESOLVED

--- Comment #2 from luoxhu at gcc dot gnu.org ---
Committed in r9-8357 (85c08558c66dd8e2000a4ad282ca03368028fce3).
[Bug target/91518] [9/10 Regression] segfault when run CPU2006 465.tonto since r263875
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91518

luoxhu at gcc dot gnu.org changed:
           What|Removed |Added
           CC  |        |luoxhu at gcc dot gnu.org

--- Comment #8 from luoxhu at gcc dot gnu.org ---
Patch sent to: https://gcc.gnu.org/pipermail/gcc-patches/2020-March/542693.html
[Bug target/61837] missed loop invariant expression optimization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61837

luoxhu at gcc dot gnu.org changed:
           What|Removed |Added
           CC  |        |luoxhu at gcc dot gnu.org

--- Comment #3 from luoxhu at gcc dot gnu.org ---
"addi 8,4,-1" and "subf 9,8,5" cannot be hoisted out as they depend on
"lbzu 9,1(8)": r8 needs to be re-initialized to p2-1 in each iteration of the
outer loop. Only the result of "subf 9,8,5" is loop-invariant, namely
(p2+s-1)-(p2-1). But the latest GCC code could still be optimized, as
instructions A, B, and C below are loop-invariant.

foo:
.LFB0:
        .cfi_startproc
        cmpwi 7,5,0
        li 6,0
        rldicl 5,5,0,32
        li 7,0
        .p2align 4,,15
.L2:
        ble 7,.L7
        addi 8,5,-1      // A
        addi 10,4,-1
        rldicl 8,8,0,32  // B
        mr 9,3
        addi 8,8,1       // C
        mtctr 8
        .p2align 5
.L4:
        lbzu 8,1(10)
        cmpw 0,8,7
        bne 0,.L3
        stw 6,0(9)
.L3:
        addi 9,9,4
        bdnz .L4
.L7:
        addi 6,6,88
        addi 7,7,1
        cmpwi 0,6,
        extsw 7,7
        extsw 6,6
        bne 0,.L2
        blr
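A minimal, hypothetical example (not the PR testcase, which is elided above) of the pattern being discussed: the trip count's `-1` / zero-extend / `+1` computation is invariant in the outer loop, yet it is recomputed on every outer iteration because it only feeds the inner loop's counter setup.

```c
#include <assert.h>

/* Sketch only: out[j] is written when a byte of p[] matches the outer
   index.  The inner-loop trip count ((unsigned long)(n - 1) + 1) is
   invariant in i, so its -1/zero-extend/+1 sequence should be hoisted
   out of the outer loop rather than recomputed per iteration.  */
void scan (int *out, const char *p, unsigned int n)
{
  for (int i = 0; i < 16; i++)
    for (unsigned long j = 0; j < n; j++)
      if (p[j] == (char) i)
        out[j] = i;
}
```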
[Bug target/61837] missed loop invariant expression optimization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61837

--- Comment #5 from luoxhu at gcc dot gnu.org ---
"-O2 -funswitch-loops" generates the expected code for s<=0; unswitch-loops
is enabled by -O3, so is this issue reduced to a duplicate of PR67288?

foo:
.LFB0:
        .cfi_startproc
        cmpwi 0,5,0
        blelr 0
        rldicl 5,5,0,32
        addi 4,4,-1
        li 6,0
        li 7,0
        .p2align 4,,15
.L2:
        rldicl 8,5,0,32
        mr 10,4
        mtctr 8
        mr 9,3
        .p2align 5
.L5:
        lbzu 8,1(10)
        cmpw 0,8,7
        bne 0,.L4
        stw 6,0(9)
.L4:
        addi 9,9,4
        bdnz .L5
        addi 6,6,88
        addi 7,7,1
        cmpwi 0,6,
        extsw 7,7
        extsw 6,6
        bne 0,.L2
        blr
        .long 0
        .byte 0,0,0,0,0,0,0,0
        .cfi_endproc
.LFE0:
[Bug target/61837] missed loop invariant expression optimization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61837

--- Comment #7 from luoxhu at gcc dot gnu.org ---
(In reply to Segher Boessenkool from comment #6)
> But -funswitch-loops is much stronger than we want here, and the wrong
> thing to use at -O2 (it often generates *slower* code!)

I am not sure what you mean here: -funswitch-loops is what generates the
"blelr 0" you pointed out in
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61837#c4; it does not optimize
the "-1, zero_ext, +1" sequence, which is about moving a loop invariant out.
And if "-1, zero_ext, +1" could be simplified to just "zero_ext" for non-zero
values, this is actually a special case of PR67288.
[Bug target/61837] missed loop invariant expression optimization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61837

--- Comment #9 from luoxhu at gcc dot gnu.org ---
(In reply to Segher Boessenkool from comment #8)
> -funswitch-loops changes things like
>
>   for (...) {
>     if (...)
>       ...1;
>     else
>       ...2;
>   }
>
> into
>
>   if (...) {
>     for (...)
>       ...1;
>   } else {
>     for (...)
>       ...2;
>   }
>
> which often is not a good idea. This is why this is not done at -O2:
> -O2 is only for optimisations that almost never hurt performance.

Yes, for this case it performs better with unswitch-loops, and I see many
uses of -O2 with -funswitch-loops in the testsuite. I thought you meant doing
this at -O2 without -funswitch-loops...
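Segher's schematic transformation can be written out as two compilable functions: a sketch of what `-funswitch-loops` does, under the assumption that the condition is loop-invariant.

```c
#include <assert.h>

/* Before unswitching: the invariant test on `flag` is evaluated on
   every iteration.  */
void fill_switched (int *a, int n, int flag)
{
  for (int i = 0; i < n; i++)
    if (flag)
      a[i] = 1;
    else
      a[i] = 2;
}

/* After unswitching: the test is hoisted and the loop body is
   duplicated, trading code size for a branch-free inner loop.  */
void fill_unswitched (int *a, int n, int flag)
{
  if (flag)
    for (int i = 0; i < n; i++)
      a[i] = 1;
  else
    for (int i = 0; i < n; i++)
      a[i] = 2;
}
```

Both functions compute the same result; the duplicated-body form is why this is kept out of -O2 by default.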
[Bug tree-optimization/83403] Missed register promotion opportunities in loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83403

luoxhu at gcc dot gnu.org changed:
           What|Removed |Added
           CC  |        |luoxhu at gcc dot gnu.org

--- Comment #7 from luoxhu at gcc dot gnu.org ---
The int version passes, but the unsigned version fails to recognize the refs
as independent. I drafted a patch to use the range info when checking the
CONVERT expression on PLUS/MINUS/MULT for wrapping overflow (unsigned):
https://gcc.gnu.org/pipermail/gcc-patches/2020-April/544684.html

(gdb) p debug_aff(&off1)
{
  type = sizetype
  offset = 8
  elements = {
    [0] = (long unsigned int) n_93 * 80,
    [1] = &C * 1
  }
}
$571 = void
(gdb) p debug_aff(&off2)
{
  type = sizetype
  offset = 0
  elements = {
    [0] = (long unsigned int) n_93 * 80,
    [1] = &C * 1
  }
}

Is this a reasonable solution?
[Bug middle-end/26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163

Bug 26163 depends on bug 91518, which changed state.

Bug 91518 Summary: [9 Regression] segfault when run CPU2006 465.tonto since r263875
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91518

           What      |Removed |Added
           Status    |NEW     |RESOLVED
           Resolution|---     |FIXED
[Bug target/91518] [9 Regression] segfault when run CPU2006 465.tonto since r263875
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91518

luoxhu at gcc dot gnu.org changed:
           What      |Removed |Added
           Resolution|---     |FIXED
           Status    |NEW     |RESOLVED

--- Comment #12 from luoxhu at gcc dot gnu.org ---
Also fixed on gcc-9.
[Bug tree-optimization/83403] Missed register promotion opportunities in loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83403

luoxhu at gcc dot gnu.org changed:
           What      |Removed |Added
           Resolution|---     |FIXED
           Status    |ASSIGNED|RESOLVED

--- Comment #9 from luoxhu at gcc dot gnu.org ---
Fixed on master.
[Bug tree-optimization/53947] [meta-bug] vectorizer missed-optimizations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947

Bug 53947 depends on bug 83403, which changed state.

Bug 83403 Summary: Missed register promotion opportunities in loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83403

           What      |Removed |Added
           Status    |ASSIGNED|RESOLVED
           Resolution|---     |FIXED
[Bug tree-optimization/88842] missing optimization CSE, reassociation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88842

luoxhu at gcc dot gnu.org changed:
           What|Removed |Added
           CC  |        |luoxhu at gcc dot gnu.org

--- Comment #3 from luoxhu at gcc dot gnu.org ---
One more case of missed optimization CSE, reassoc:

void foo(unsigned int a, unsigned int b, unsigned int c, unsigned int d,
         int *res1, int *res2, int *res3)
{
  *res1 = a + b + c + d;
  *res2 = b + c;
  *res3 = a + d;
}

cat foo.s
        .file   "foo.c"
        .machine power8
        .abiversion 2
        .section ".text"
        .align 2
        .p2align 4,,15
        .globl foo
        .type  foo, @function
foo:
.LFB0:
        .cfi_startproc
        add 10,5,6
        add 10,10,4
        add 4,4,5
        add 10,10,3
        add 3,3,6
        stw 10,0(7)
        stw 4,0(8)
        stw 3,0(9)
        blr
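For reference, the reassociated form the report is asking for can be written by hand as follows (a sketch, not compiler output): the partial sums b+c and a+d are computed once each and reused, needing three adds instead of five.

```c
#include <assert.h>

/* Hand-reassociated form of foo: bc and ad are the common
   subexpressions shared between res1 and res2/res3.  */
void foo_reassoc (unsigned int a, unsigned int b, unsigned int c,
                  unsigned int d, int *res1, int *res2, int *res3)
{
  unsigned int bc = b + c;   /* reused for *res2 */
  unsigned int ad = a + d;   /* reused for *res3 */
  *res1 = (int) (bc + ad);
  *res2 = (int) bc;
  *res3 = (int) ad;
}
```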
[Bug rtl-optimization/37451] Extra addition for doloop in some cases
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=37451

--- Comment #11 from luoxhu at gcc dot gnu.org ---
Fixed on master.
[Bug rtl-optimization/37451] Extra addition for doloop in some cases
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=37451

luoxhu at gcc dot gnu.org changed:
           What      |Removed |Added
           Status    |NEW     |RESOLVED
           Resolution|---     |FIXED

--- Comment #12 from luoxhu at gcc dot gnu.org ---
Close this.
[Bug target/70053] Returning a struct of _Decimal128 values generates extraneous stores and loads
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70053

luoxhu at gcc dot gnu.org changed:
           What|Removed |Added
           CC  |        |bergner at gcc dot gnu.org,
               |        |luoxhu at gcc dot gnu.org

--- Comment #4 from luoxhu at gcc dot gnu.org ---
Is this just a difference between -O3 and -O2? Since -O3 is OK, maybe this
bug is no longer relevant?

$ /opt/at10.0/bin/gcc -O3 -S pr70053.c
$ cat pr70053.s
        .file   "pr70053.c"
        .abiversion 2
        .section ".text"
        .align 2
        .p2align 4,,15
        .globl D256_add_finite
        .type  D256_add_finite, @function
D256_add_finite:
        dcmpuq 7,4,6
        beq 7,.L3
        fmr 7,3
        fmr 6,2
        fmr 3,7
        fmr 2,6
        blr
        .p2align 4,,15
.L3:
        fmr 5,7
        fmr 4,6
        fmr 3,7
        fmr 2,6
        blr
        .long 0
        .byte 0,0,0,0,0,0,0,0
        .size  D256_add_finite,.-D256_add_finite
        .ident "GCC: (GNU) 6.4.1 20170720 (Advance-Toolchain-at10.0) IBM AT 10 branch, based on subversion id 250395."
        .section .note.GNU-stack,"",@progbits
[Bug target/30271] -mstrict-align can an store extra for struct agrument passing
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=30271

luoxhu at gcc dot gnu.org changed:
           What|Removed |Added
           CC  |        |luoxhu at gcc dot gnu.org

--- Comment #12 from luoxhu at gcc dot gnu.org ---
Fixed at least since GCC 4.9.4?

$ /opt/at8.0/bin/gcc -O3 -c -S pr30271.c -mstrict-align
$ cat pr30271.s
        .file   "pr30271.c"
        .abiversion 2
        .section ".toc","aw"
        .section ".text"
        .align 2
        .p2align 4,,15
        .globl f
        .type  f, @function
f:
        extsh 9,3
        srawi 3,3,16
        add 3,9,3
        extsw 3,3
        blr
        .long 0
        .byte 0,0,0,0,0,0,0,0
        .size  f,.-f
        .ident "GCC: (GNU) 4.9.4 20150824 (Advance-Toolchain-at8.0) [ibm/gcc-4_9-branch, revision: 227153 merged from gcc-4_9-branch, revision 227151]"
        .section .note.GNU-stack,"",@progbits
[Bug target/69493] Poor code generation for return of struct containing vectors on PPC64LE
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69493

luoxhu at gcc dot gnu.org changed:
           What|Removed |Added
           CC  |        |luoxhu at gcc dot gnu.org

--- Comment #9 from luoxhu at gcc dot gnu.org ---
No load/store on Power9.

cat pr69493.s
        .file   "pr69493.c"
        .abiversion 2
        .section ".text"
        .align 2
        .p2align 4,,15
        .globl test_big_double
        .type  test_big_double, @function
test_big_double:
.LFB0:
        .cfi_startproc
        mfvsrd 7,1
        mfvsrd 10,2
        mfvsrd 8,3
        mfvsrd 9,4
        mtvsrdd 34,10,7
        mtvsrdd 35,9,8
        blr
        .long 0
        .byte 0,0,0,0,0,0,0,0
        .cfi_endproc
.LFE0:
        .size  test_big_double,.-test_big_double
        .ident "GCC: (GNU) 9.2.1 20191023 (Advance-Toolchain 13.0-1) [aba1f4e8b6ac]"
        .gnu_attribute 4, 5
        .section .note.GNU-stack,"",@progbits
[Bug target/70053] Returning a struct of _Decimal128 values generates extraneous stores and loads
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70053

--- Comment #6 from luoxhu at gcc dot gnu.org ---
"-O2 -ftree-slp-vectorize" could also generate the expected simple fmrs. The
reason is that pass_cselim transforms conditional stores into unconditional
ones with PHI instructions when vectorization and if-conversion are enabled
(gcc/tree-ssa-phiopt.c:2482).

pr70053.c.108t.cdce:

D256_add_finite (_Decimal128 a, _Decimal128 b, _Decimal128 c)
{
  struct TDx2_t D.2914;

  [local count: 1073741824]:
  if (b_4(D) == c_5(D))
    goto ; [34.00%]
  else
    goto ; [66.00%]

  [local count: 365072224]:
  D.2914.td0 = c_5(D);
  D.2914.td1 = c_5(D);
  goto ; [100.00%]

  [local count: 708669601]:
  D.2914.td0 = a_3(D);
  D.2914.td1 = b_4(D);

  [local count: 1073741824]:
  return D.2914;
}

=> pr70053.c.109t.cselim:

D256_add_finite (_Decimal128 a, _Decimal128 b, _Decimal128 c)
{
  struct TDx2_t D.2914;
  _Decimal128 cstore_10;
  _Decimal128 cstore_11;

  [local count: 1073741824]:
  if (b_4(D) == c_5(D))
    goto ; [34.00%]
  else
    goto ; [66.00%]

  [local count: 708669601]:

  [local count: 1073741824]:
  # cstore_10 = PHI
  # cstore_11 = PHI
  D.2914.td1 = cstore_11;
  D.2914.td0 = cstore_10;
  return D.2914;
}

Then at the expand pass, the PHI instruction "cstore_10 = PHI " will be
expanded to a move for "-O2 -ftree-slp-vectorize". If no such PHI is
generated, bb3 and bb4 in pr70053.c.108t.cdce will be expanded to STORE/LOAD
with TD->DI conversion, finally causing a lot of st/ld conversions.
[Bug target/70053] Returning a struct of _Decimal128 values generates extraneous stores and loads
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70053

luoxhu at gcc dot gnu.org changed:
           What|Removed |Added
           CC  |        |segher at gcc dot gnu.org

--- Comment #7 from luoxhu at gcc dot gnu.org ---
When expanding "D.2914.td0 = c_5(D);" in
expand_assignment (to=, from=, nontemporal=false) at
../../gcc-master/gcc/expr.c:5058

1) expr.c:5158: to_rtx = expand_expr (tem, NULL_RTX, VOIDmode, EXPAND_WRITE);
(gdb) pr to_rtx
(mem/c:BLK (reg/f:DI 112 virtual-stack-vars) [2 D.2914+0 S32 A128])
...
2) expr.c:5167: to_rtx = adjust_address (to_rtx, mode1, 0);
(gdb) p mode1
$86 = E_TDmode
(gdb) pr to_rtx
(mem/c:TD (reg/f:DI 112 virtual-stack-vars) [2 D.2914+0 S16 A128])

to_rtx is generated with an address conversion from DImode to TDmode here.
...
3) expr.c:5374: result = store_field (to_rtx, bitsize, bitpos,
   bitregion_start, bitregion_end, mode1, from, get_alias_set (to),
   nontemporal, reversep);

Then the assignment instruction is generated as below:

(insn 11 10 12 4 (set (mem/c:TD (reg/f:DI 112 virtual-stack-vars)
                [1 D.2914.td0+0 S16 A128])
        (reg/v:TD 121 [ c ])) "pr70053.c":20:14 -1
     (nil))

So if we want to remove the redundant store/load at expand, the conversion
from DImode to TDmode should be avoided for this case when using
virtual-stack-vars registers. (For PR65421, there is a similar DImode to
DFmode conversion.)

pr70053.c.236r.expand with -O2:

    1: NOTE_INSN_DELETED
    6: NOTE_INSN_BASIC_BLOCK 2
    2: r119:TD=%2:TD
    3: r120:TD=%4:TD
    4: r121:TD=%6:TD
    5: NOTE_INSN_FUNCTION_BEG
    8: r122:CCFP=cmp(r120:TD,r121:TD)
    9: pc={(r122:CCFP!=0)?L16:pc}
      REG_BR_PROB 708669604
   10: NOTE_INSN_BASIC_BLOCK 4
   11: [r112:DI]=r121:TD
   12: r123:DI=r112:DI+0x10
   13: [r123:DI]=r121:TD
   14: pc=L21
   15: barrier
   16: L16:
   17: NOTE_INSN_BASIC_BLOCK 5
   18: [r112:DI]=r119:TD
   19: r124:DI=r112:DI+0x10
   20: [r124:DI]=r120:TD
   21: L21:
   22: NOTE_INSN_BASIC_BLOCK 6
   23: r125:TD=[r112:DI]
   24: r127:DI=r112:DI+0x10
   25: r126:TD=[r127:DI]
   26: r117:TD=r125:TD
   27: r118:TD=r126:TD
   31: %2:TD=r117:TD
   32: %4:TD=r118:TD
   33: use %2:TD
   34: use %4:TD
[Bug target/69493] Poor code generation for return of struct containing vectors on PPC64LE
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69493

--- Comment #10 from luoxhu at gcc dot gnu.org ---
In expand, Power8 will emit two register permute instructions to byte-swap
the contents via rs6000_emit_le_vsx_move.

P9:

    5: NOTE_INSN_BASIC_BLOCK 2
    2: r129:TF=%1:TF
    3: r130:TF=%3:TF
    4: NOTE_INSN_FUNCTION_BEG
    7: r117:DF=unspec[r129:TF,0] 70
    8: r131:V2DF=r121:V2DF
    9: r133:DF=vec_select(r131:V2DF,parallel)
   10: r131:V2DF=vec_concat(r117:DF,r133:DF)
   11: r122:V2DF=r131:V2DF
   12: r118:DF=unspec[r129:TF,0x1] 70
   13: r119:DF=unspec[r130:TF,0] 70
   14: r134:V2DF=r124:V2DF
   15: r136:DF=vec_select(r134:V2DF,parallel)
   16: r134:V2DF=vec_concat(r119:DF,r136:DF)
   17: r125:V2DF=r134:V2DF
   18: r120:DF=unspec[r130:TF,0x1] 70
   19: r137:V2DF=r122:V2DF
   20: r139:DF=vec_select(r137:V2DF,parallel)
   21: r137:V2DF=vec_concat(r139:DF,r118:DF)
   22: [r112:DI]=r137:V2DF
   23: r140:V2DF=r125:V2DF
   24: r142:DF=vec_select(r140:V2DF,parallel)
   25: r140:V2DF=vec_concat(r142:DF,r120:DF)
   26: [r112:DI+0x10]=r140:V2DF
   27: r143:V4SI=[r112:DI]
   28: r144:V4SI=[r112:DI+0x10]
   29: r127:V4SI=r143:V4SI
   30: r128:V4SI=r144:V4SI
   34: %2:V4SI=r127:V4SI
   35: %3:V4SI=r128:V4SI
   36: use %2:V4SI
   37: use %3:V4SI

P8:

    5: NOTE_INSN_BASIC_BLOCK 2
    2: r129:TF=%1:TF
    3: r130:TF=%3:TF
    4: NOTE_INSN_FUNCTION_BEG
    7: r117:DF=unspec[r129:TF,0] 70
    8: r131:V2DF=r121:V2DF
    9: r133:DF=vec_select(r131:V2DF,parallel)
   10: r131:V2DF=vec_concat(r117:DF,r133:DF)
   11: r122:V2DF=r131:V2DF
   12: r118:DF=unspec[r129:TF,0x1] 70
   13: r119:DF=unspec[r130:TF,0] 70
   14: r134:V2DF=r124:V2DF
   15: r136:DF=vec_select(r134:V2DF,parallel)
   16: r134:V2DF=vec_concat(r119:DF,r136:DF)
   17: r125:V2DF=r134:V2DF
   18: r120:DF=unspec[r130:TF,0x1] 70
   19: r137:V2DF=r122:V2DF
   20: r139:DF=vec_select(r137:V2DF,parallel)
   21: r137:V2DF=vec_concat(r139:DF,r118:DF)
   22: r140:V2DF=vec_select(r137:V2DF,parallel)
   23: [r112:DI]=vec_select(r140:V2DF,parallel)
   24: r141:V2DF=r125:V2DF
   25: r143:DF=vec_select(r141:V2DF,parallel)
   26: r141:V2DF=vec_concat(r143:DF,r120:DF)
   27: r144:V2DF=vec_select(r141:V2DF,parallel)
   28: [r112:DI+0x10]=vec_select(r144:V2DF,parallel)
   29: r146:V4SI=vec_select([r112:DI],parallel)
   30: r145:V4SI=vec_select(r146:V4SI,parallel)
   31: r148:V4SI=vec_select([r112:DI+0x10],parallel)
   32: r147:V4SI=vec_select(r148:V4SI,parallel)
   33: r127:V4SI=r145:V4SI
   34: r128:V4SI=r147:V4SI
   38: %2:V4SI=r127:V4SI
   39: %3:V4SI=r128:V4SI
   40: use %2:V4SI
   41: use %3:V4SI

The difference starts from insn #22: Power8 will emit two vec_select
instructions for the stack store/load operations, but Power9 needs only one.
[Bug target/70053] Returning a struct of _Decimal128 values generates extraneous stores and loads
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70053

--- Comment #9 from luoxhu at gcc dot gnu.org ---
(In reply to Segher Boessenkool from comment #8)
> I see no conversion there?
>
> But, why does it store to memory at all?

Yes, no conversion for this case, only adjust_address to TImode. mem/c:TD
means a MEM that cannot trap.

The reason for the store to memory: D.2914 is a local struct variable here.
It seems we need some optimization to sink D.2914.td0 and D.2914.td1 from
BB3 & BB4 to BB5 to avoid the store/load on the stack? Or does some pass in
GIMPLE already do this? Or should this be optimized after the expander by
some new pass like store sinking?

O2/pr70053.c.234t.optimized:

D256_add_finite (_Decimal128 a, _Decimal128 b, _Decimal128 c)
{
  struct TDx2_t D.2914;

  [local count: 1073741824]:
  if (b_4(D) == c_5(D))
    goto ; [34.00%]
  else
    goto ; [66.00%]

  [local count: 365072224]:
  D.2914.td0 = c_5(D);
  D.2914.td1 = c_5(D);
  goto ; [100.00%]

  [local count: 708669601]:
  D.2914.td0 = a_3(D);
  D.2914.td1 = b_4(D);

  [local count: 1073741824]:
  return D.2914;
}
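At the C level, the sinking being asked about amounts to selecting the values first and writing each field exactly once. A sketch using plain doubles in place of _Decimal128 (which needs hardware DFP support); both functions are illustrative, not GCC output.

```c
#include <assert.h>

struct pair { double d0, d1; };

/* Original shape: both branches store to the local struct, which
   tends to force it onto the stack.  */
struct pair add_branchy (double a, double b, double c)
{
  struct pair r;
  if (b == c) { r.d0 = c; r.d1 = c; }
  else        { r.d0 = a; r.d1 = b; }
  return r;
}

/* Sunk form: the selects happen in registers and each field is
   stored once, which is what cselim's PHIs achieve at the GIMPLE
   level.  */
struct pair add_sunk (double a, double b, double c)
{
  struct pair r;
  r.d0 = (b == c) ? c : a;
  r.d1 = (b == c) ? c : b;
  return r;
}
```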
[Bug rtl-optimization/89310] Poor code generation returning float field from a struct
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89310

luoxhu at gcc dot gnu.org changed:
           What|Removed |Added
           CC  |        |luoxhu at gcc dot gnu.org

--- Comment #3 from luoxhu at gcc dot gnu.org ---
rs6000.md:

(define_insn_and_split "movsf_from_si"
  ...
  "&& reload_completed
   && vsx_reg_sfsubreg_ok (operands[0], SFmode)
   && int_reg_operand_not_pseudo (operands[1], SImode)"
  [(const_int 0)
  ...
  /* Move SF value to upper 32-bits for xscvspdpn.  */
  emit_insn (gen_ashldi3 (op2, op1_di, GEN_INT (32)));
  emit_insn (gen_p8_mtvsrd_sf (op0, op2));
  emit_insn (gen_vsx_xscvspdpn_directmove (op0, op0));
  DONE

The split seems inevitable since reload_completed is true here; can this
lshrdi3+ashldi3 sequence be optimized by a peephole? r9 is DImode; is there
any benefit to using mtvsrw[az] instead of mtvsrd? Or could we replace the 3
instructions with a better sequence? Thanks.
[Bug rtl-optimization/89310] Poor code generation returning float field from a struct
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89310

--- Comment #5 from luoxhu at gcc dot gnu.org ---
Thanks. I copied the code from movsf_from_si to make a define_insn_and_split
for "movsf_from_si2", but we don't have a define_insn for rldicr, so I used
gen_anddi3 instead; any comments?

foo:
.LFB0:
        .cfi_startproc
        rldicr 3,3,0,31
        mtvsrd 1,3
        xscvspdpn 1,1
        blr

diff --git a/gcc/config/rs6000/rs6000.md b/gcc/config/rs6000/rs6000.md
index 4fcd6a94022..92c237edfad 100644
--- a/gcc/config/rs6000/rs6000.md
+++ b/gcc/config/rs6000/rs6000.md
@@ -7593,6 +7593,48 @@ (define_insn_and_split "movsf_from_si"
    "*, *, p9v, p8v, *, *,
     p8v, p8v, p8v, *")])

+(define_insn_and_split "movsf_from_si2"
+  [(set (match_operand:SF 0 "nonimmediate_operand"
+           "=!r, f, v, wa, m, Z,
+            Z, wa, ?r, !r")
+       (unspec:SF [
+         (subreg:SI (ashiftrt:DI
+                      (match_operand:DI 1 "input_operand"
+                         "m, m, wY, Z, r, f,
+                          wa, r, wa, r")
+                      (const_int 32)) 0)]
+        UNSPEC_SF_FROM_SI))
+   (clobber (match_scratch:DI 2
+           "=X, X, X, X, X, X,
+            X, r, X, X"))]
+  "TARGET_NO_SF_SUBREG
+   && (register_operand (operands[0], SFmode)
+       || register_operand (operands[1], SImode))"
+  "#"
+  "&& !reload_completed
+   && vsx_reg_sfsubreg_ok (operands[0], SFmode)"
+  [(const_int 0)]
+{
+  rtx op0 = operands[0];
+  rtx op1 = operands[1];
+  rtx tmp = gen_reg_rtx (DImode);
+
+  emit_insn (gen_anddi3 (tmp, op1, GEN_INT(0xULL)));
+  emit_insn (gen_p8_mtvsrd_sf (op0, tmp));
+  emit_insn (gen_vsx_xscvspdpn_directmove (op0, op0));
+  DONE;
+})
[Bug rtl-optimization/89310] Poor code generation returning float field from a struct
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89310

luoxhu at gcc dot gnu.org changed:
           What      |Removed |Added
           Status    |ASSIGNED|RESOLVED
           Resolution|---     |FIXED

--- Comment #10 from luoxhu at gcc dot gnu.org ---
Fixed on upstream.
[Bug lto/96343] LTO ICE on PPC64le
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96343

--- Comment #4 from luoxhu at gcc dot gnu.org ---
I tried to build both ADIOS2 and WarpX (with INTERPROCEDURAL_OPTIMIZATION) on
a Power8 machine with GCC 9.3.0 & 9.2.1; no LTO error seen.

/usr/bin/cmake ../ -DCMAKE_C_COMPILER=/opt/at12.0/bin/gcc
  -DCMAKE_CXX_COMPILER=/opt/at12.0/bin/g++ -DADIOS2_USE_Fortran=OFF
  -DADIOS2_USE_ZeroMQ=OFF -DBUILD_SHARED_LIBS=ON -DCMAKE_BUILD_TYPE=Release
  -DCMAKE_POSITION_INDEPENDENT_CODE=ON -DADIOS2_USE_SST=OFF
  -DCMAKE_CXX_FLAGS="-flto -fno-fat-lto-objects ${CMAKE_CXX_FLAGS}"
make -j50

Is there any difference from your configuration? Anyway, it would be much
better if you could try a newer GCC or reduce a smaller test case. BTW, I see
that someone mentioned it may be related to conda and python:
https://github.com/ornladios/ADIOS2/issues/1524#issue-458229988
[Bug rtl-optimization/71309] Copying fields within a struct followed by use results in load hit store
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71309

luoxhu at gcc dot gnu.org changed:
           What      |Removed |Added
           CC        |        |luoxhu at gcc dot gnu.org
           Resolution|---     |FIXED
           Status    |NEW     |RESOLVED

--- Comment #5 from luoxhu at gcc dot gnu.org ---
Fixed on master.
[Bug testsuite/92398] [10 regression] error in update of gcc.target/powerpc/pr72804.c in r277872
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92398

--- Comment #10 from luoxhu at gcc dot gnu.org ---
Author: luoxhu
Revision: 278890
Modified property: svn:log

Modified: svn:log at Wed Dec 4 08:50:33 2019
------------------------------------------------------------------------------
--- svn:log (original)
+++ svn:log Wed Dec 4 08:50:33 2019
@@ -10,7 +10,7 @@
 2019-12-02  Luo Xiong Hu

-       testsuite/pr92398
+       PR testsuite/92398
        * gcc.target/powerpc/pr72804.c: Split the store function to...
        * gcc.target/powerpc/pr92398.h: ... this one.  New.
        * gcc.target/powerpc/pr92398.p9+.c: New.
[Bug middle-end/93189] [10 regression] Many test case failures starting with r279942
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93189

--- Comment #3 from luoxhu at gcc dot gnu.org ---
Author: luoxhu
Revision: 279986
Modified property: svn:log

Modified: svn:log at Wed Jan 8 01:32:45 2020
------------------------------------------------------------------------------
--- svn:log (original)
+++ svn:log Wed Jan 8 01:32:45 2020
@@ -7,5 +7,6 @@
 2020-01-08  Luo Xiong Hu

+       PR middle-end/93189
        * ipa-inline.c (caller_growth_limits): Restore the AND.
[Bug ipa/69678] Missed function specialization + partial devirtualization opportunity
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69678

luoxhu at gcc dot gnu.org changed:
           What      |Removed |Added
           Status    |NEW     |RESOLVED
           Resolution|---     |FIXED

--- Comment #4 from luoxhu at gcc dot gnu.org ---
Fixed.
[Bug middle-end/71509] Bitfield causes load hit store with larger store than load
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71509

luoxhu at gcc dot gnu.org changed:
           What|Removed |Added
           CC  |        |luoxhu at gcc dot gnu.org

--- Comment #11 from luoxhu at gcc dot gnu.org ---
(In reply to Anton Blanchard from comment #4)
> Created attachment 39683 [details]
> Another bitop LHS test case
>
> Here's another issue found in the Linux kernel. Seems like this should be a
> single lwz/stw since the union of counter and the bitops completely overlap.
>
> The half word store followed by word load is going to prevent it from store
> forwarding.
>
> :
>    0: 00 00 03 81  lwz r8,0(r3)
>    4: 20 00 89 78  clrldi r9,r4,32
>    8: c2 0f 2a 79  rldicl r10,r9,33,31
>    c: 00 f8 48 51  rlwimi r8,r10,31,0,0
>   10: 5e 00 2a 55  rlwinm r10,r9,0,1,15
>   14: 00 00 03 91  stw r8,0(r3)
>   18: 00 00 83 b0  sth r4,0(r3)
>   1c: 00 00 42 60  ori r2,r2,0
>   20: 00 00 23 81  lwz r9,0(r3)
>   24: 00 04 29 55  rlwinm r9,r9,0,16,0
>   28: 78 53 29 7d  or r9,r9,r10
>   2c: 00 00 23 91  stw r9,0(r3)
>   30: 20 00 80 4e  blr

This case is already fixed on the latest GCC 10 (the issues in case
__skb_decr_checksum_unnecessary from Anton Blanchard and test2 from Nicholas
Piggin still exist).

gcc version 10.0.1 20200210

objdump -d set_page_slub_counters.o

set_page_slub_counters.o:     file format elf64-powerpcle

Disassembly of section .text:

:
   0: 22 84 89 78  rldicl r9,r4,48,48
   4: 00 00 83 b0  sth r4,0(r3)
   8: 02 00 23 b1  sth r9,2(r3)
   c: 20 00 80 4e  blr
[Bug lto/92599] [8/9 regression] ICE in speculative_call_info, at cgraph.c:1142
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92599

luoxhu at gcc dot gnu.org changed:
           What|Removed |Added
           CC  |        |luoxhu at gcc dot gnu.org

--- Comment #9 from luoxhu at gcc dot gnu.org ---
Fixed on master; could this be closed?
[Bug middle-end/71509] Bitfield causes load hit store with larger store than load
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71509

luoxhu at gcc dot gnu.org changed:
           What|Removed |Added
           CC  |        |linkw at gcc dot gnu.org

--- Comment #13 from luoxhu at gcc dot gnu.org ---
(In reply to Segher Boessenkool from comment #12)
> But it could do just
>
>   stw r4,0(r3)
>
> (on LE; and with a rotate first, on BE).

Thanks for catching this; this optimization is not related to load-hit-store.
I will investigate why the store-merging pass failed to merge the 2 half-word
stores.
[Bug middle-end/93582] [10 Regression] -Warray-bounds gives error: array subscript 0 is outside array bounds of struct E[1]
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93582

luoxhu at gcc dot gnu.org changed:
           What|Removed |Added
           CC  |        |luoxhu at gcc dot gnu.org

--- Comment #30 from luoxhu at gcc dot gnu.org ---
Hi Jakub, thanks for the information; it seems this test case from the Linux
kernel could be added to your future fix patch.

struct page {
  union {
    unsigned counters;
    struct {
      union {
        struct {
          unsigned inuse : 16;
          unsigned objects : 15;
          unsigned frozen : 1;
        };
      };
    };
  };
};

void foo1 (struct page *page, unsigned long counters_new)
{
  struct page tmp;
  tmp.counters = counters_new;
  page->inuse = tmp.inuse;
  page->objects = tmp.objects;
  page->frozen = tmp.frozen;
}

Tried gcc (r10-6717) with -O3 on powerpcle; the asm is:

Disassembly of section .text:

:
   0: 3e 84 89 54  rlwinm r9,r4,16,16,31
   4: 00 00 83 b0  sth r4,0(r3)
   8: 02 00 23 b1  sth r9,2(r3)
   c: 20 00 80 4e  blr
...
  1c: 00 00 42 60  ori r2,r2,0

It is expected to emit only one stw store instruction (two half-word store
instructions are also emitted on x86 platforms!). I am not sure whether the
fre pass could check consecutive stores and do the merge similar to the
store-merging pass, as the input parameter counters_new is not a constant. :)
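Since the three bitfields fully cover the 32-bit counters word, the merged form being asked for can be written by hand as a single word assignment. A sketch using a simplified version of the union layout above (the extra anonymous nesting levels are dropped, which does not change the layout):

```c
#include <assert.h>
#include <string.h>

struct page {
  union {
    unsigned counters;
    struct {
      unsigned inuse : 16;
      unsigned objects : 15;
      unsigned frozen : 1;
    };
  };
};

/* Field-by-field copy, as in foo1 above.  */
void foo1_fields (struct page *page, unsigned long counters_new)
{
  struct page tmp;
  tmp.counters = (unsigned) counters_new;
  page->inuse = tmp.inuse;
  page->objects = tmp.objects;
  page->frozen = tmp.frozen;
}

/* Merged form: one 32-bit store suffices because the bitfields cover
   the whole counters word.  */
void foo1_merged (struct page *page, unsigned long counters_new)
{
  page->counters = (unsigned) counters_new;
}
```

The equivalence of the two functions is exactly what would justify the store merge; the bitfield values themselves are endian-dependent, so only the whole-word result is compared below.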
[Bug lto/91287] LTO disables linking with scalar MASS library (Fortran only)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91287

--- Comment #38 from luoxhu at gcc dot gnu.org ---
Author: luoxhu
Date: Wed Aug 14 02:18:33 2019
New Revision: 274411

URL: https://gcc.gnu.org/viewcvs?rev=274411&root=gcc&view=rev
Log:
Enable math functions linking with static library for LTO

In LTO mode, if static library and dynamic library contains same function
and both libraries are passed as arguments, linker will link the function in
dynamic library no matter the sequence. This patch will output LTO symbol
node as UNDEF if BUILT_IN_NORMAL function FNDECL is a math function, then
the function in static library will be linked first if its sequence is ahead
of the dynamic library.

gcc/ChangeLog

2019-08-14  Xiong Hu Luo

        PR lto/91287
        * builtins.c (builtin_with_linkage_p): New function.
        * builtins.h (builtin_with_linkage_p): New function.
        * symtab.c (write_symbol): Remove redundant assert.
        * lto-streamer-out.c (symtab_node::output_to_lto_symbol_table_p):
        Remove FIXME and use builtin_with_linkage_p.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/builtins.c
    trunk/gcc/builtins.h
    trunk/gcc/lto-streamer-out.c
    trunk/gcc/symtab.c
[Bug lto/91287] LTO disables linking with scalar MASS library (Fortran only)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91287

--- Comment #39 from luoxhu at gcc dot gnu.org ---
Author: luoxhu
Date: Mon Aug 26 08:53:27 2019
New Revision: 274921

URL: https://gcc.gnu.org/viewcvs?rev=274921&root=gcc&view=rev
Log:
Backport r274411 from trunk to gcc-9-branch

Backport r274411 of "Enable math functions linking with static library for
LTO" from mainline to gcc-9-branch. Bootstrapped/Regression-tested on Linux
POWER8 LE.

gcc/ChangeLog

2019-08-26  Xiong Hu Luo

        Backport r274411 from trunk to gcc-9-branch.
        2019-08-14  Xiong Hu Luo

        PR lto/91287
        * builtins.c (builtin_with_linkage_p): New function.
        * builtins.h (builtin_with_linkage_p): New function.
        * symtab.c (write_symbol): Remove redundant assert.
        * lto-streamer-out.c (symtab_node::output_to_lto_symbol_table_p):
        Remove FIXME and use builtin_with_linkage_p.

Modified:
    branches/gcc-9-branch/gcc/ChangeLog
    branches/gcc-9-branch/gcc/builtins.c
    branches/gcc-9-branch/gcc/builtins.h
    branches/gcc-9-branch/gcc/lto-streamer-out.c
    branches/gcc-9-branch/gcc/symtab.c
[Bug target/98914] [11 Regression] ICE in rs6000_expand_vector_set, at config/rs6000/rs6000.c:7198
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98914

luoxhu at gcc dot gnu.org changed:
           What      |Removed     |Added
           Status    |UNCONFIRMED |RESOLVED
           Resolution|---         |FIXED

--- Comment #6 from luoxhu at gcc dot gnu.org ---
Fixed on master.
[Bug target/99718] [11 regression] ICE in new test case gcc.target/powerpc/pr98914.c for 32 bits
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99718

luoxhu at gcc dot gnu.org changed:
           What|Removed |Added
           CC  |        |dje.gcc at gmail dot com,
               |        |segher at gcc dot gnu.org

--- Comment #1 from luoxhu at gcc dot gnu.org ---
Confirmed. Another case where I forgot to test m32 again :(

David mentioned there is no need to support variable vec_insert for the m32
build, so I think we should avoid generating IFN VEC_SET in
gimple-isel.c:gimple_expand_vec_set_expr, but it seems not possible to check
"TARGET_P8_VECTOR && TARGET_DIRECT_MOVE_64BIT" in the common file or through
can_vec_set_var_idx_p. Any suggestions?

https://gcc.gnu.org/pipermail/gcc-patches/2021-January/564403.html
[Bug target/97329] POWER9 default cache and line sizes appear to be wrong
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97329

luoxhu at gcc dot gnu.org changed:
           What|Removed |Added
           CC  |        |luoxhu at gcc dot gnu.org

--- Comment #9 from luoxhu at gcc dot gnu.org ---
Yes, it seems to be a copy-paste error for Power8 from Power7. Is this
supposed to be fixed in GCC 12 stage 1? And is any performance evaluation
required?

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 616dae35bae..34c4edae20e 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -1055,7 +1055,7 @@ struct processor_costs power8_cost = {
   COSTS_N_INSNS (17),   /* ddiv */
   128,                  /* cache line size */
   32,                   /* l1 cache */
-  256,                  /* l2 cache */
+  512,                  /* l2 cache */
   12,                   /* prefetch streams */
   COSTS_N_INSNS (3),    /* SF->DF convert */
 };
[Bug target/99718] [11 regression] ICE in new test case gcc.target/powerpc/pr98914.c for 32 bits
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99718

--- Comment #4 from luoxhu at gcc dot gnu.org ---
Thanks, Jakub. It passes testing on both m32 and m64; is this a reasonable
fix? @segher, I will make it a patch if so.

git diff
diff --git a/gcc/config/rs6000/predicates.md b/gcc/config/rs6000/predicates.md
index 859af75..0a5cae2 100644
--- a/gcc/config/rs6000/predicates.md
+++ b/gcc/config/rs6000/predicates.md
@@ -1920,6 +1920,12 @@
    return address_is_prefixed (XEXP (op, 0), mode, NON_PREFIXED_DEFAULT);
 })

+;; Return true if m64 on p8v and above for vec_set with variable index.
+(define_predicate "vec_set_index_operand"
+ (if_then_else (match_test "TARGET_P8_VECTOR && TARGET_DIRECT_MOVE_64BIT")
+  (match_operand 0 "reg_or_cint_operand")
+  (match_operand 0 "const_int_operand")))
+
 ;; Return true if the operand is a valid memory operand with a D-form
 ;; address that could be merged with the load of a PC-relative external address
 ;; with the PCREL_OPT optimization.  We don't check here whether or not the
diff --git a/gcc/config/rs6000/vector.md b/gcc/config/rs6000/vector.md
index e5191bd..3446b03 100644
--- a/gcc/config/rs6000/vector.md
+++ b/gcc/config/rs6000/vector.md
@@ -1227,7 +1227,7 @@
 (define_expand "vec_set"
   [(match_operand:VEC_E 0 "vlogical_operand")
    (match_operand: 1 "register_operand")
-   (match_operand 2 "reg_or_cint_operand")]
+   (match_operand 2 "vec_set_index_operand")]
   "VECTOR_MEM_ALTIVEC_OR_VSX_P (mode)"
 {
   rs6000_expand_vector_set (operands[0], operands[1], operands[2]);
[Bug target/97329] POWER9 default cache and line sizes appear to be wrong
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97329 luoxhu at gcc dot gnu.org changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #11 from luoxhu at gcc dot gnu.org --- Fixed with r11-7821-g08103e4d6ada9b57366f2df2a2b745babfab914c.
[Bug target/99718] [11 regression] ICE in new test case gcc.target/powerpc/pr98914.c for 32 bits
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99718 --- Comment #11 from luoxhu at gcc dot gnu.org --- Created attachment 50474 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50474&action=edit 32bit variable vec_insert

LLVM also generates a store-hit-load sequence:

	addi 3, 1, -16
	rlwinm 4, 5, 2, 28, 29
	stvx 2, 0, 3
	stwx 6, 3, 4
	lvx 2, 0, 3
	blr
	.long 0
	.quad 0

I didn't use "can't" in my reply, sorry that this caused confusion. We thought it was inefficient to move SF to SI in 32-bit mode, but it turns out there is also a huge performance gain (46.704s -> 4.369s). Attached the patch that also supports variable vec_insert for 32-bit, testing on P8BE/P8LE/P9LE; could you please verify it on AIX? I will refine it and send it to the mailing list to fix this P1 issue fundamentally.
[Bug target/99718] [11 regression] ICE in new test case gcc.target/powerpc/pr98914.c for 32 bits
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99718 --- Comment #12 from luoxhu at gcc dot gnu.org --- Not sure whether TARGET_DIRECT_MOVE_64BIT is the right macro to correctly differentiate m32 from m64?
[Bug target/99718] [11 regression] ICE in new test case gcc.target/powerpc/pr98914.c for 32 bits
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99718 --- Comment #13 from luoxhu at gcc dot gnu.org --- The performance data in #c11 is for int variable vec_insert in 32-bit mode; the float variable vec_insert for 32-bit is a bit slower but still much better than the original (the extra stfs+lwz of insns 17 and 18 in expand move the SF register to an SI register via its bit pattern): 46.677s -> 8.723s

test.c:

#include <altivec.h>
#define TYPE float
vector TYPE
test (vector TYPE u, TYPE i, signed int n)
{
  return vec_insert (i, u, n);
}

Expand:

    1: NOTE_INSN_DELETED
    6: NOTE_INSN_BASIC_BLOCK 2
    2: r122:V4SF=%2:V4SF
    3: r123:SF=%1:SF
    4: r124:SI=%3:SI
    5: NOTE_INSN_FUNCTION_BEG
    8: r120:V4SF=r122:V4SF
    9: r125:SI=r124:SI&0x3
   10: r126:V4SF=r120:V4SF
   11: r128:SI=r125:SI<<0x2
   12: {r128:SI=0x14-r128:SI;clobber ca:SI;}
   13: r132:SI=high(`*.LC0')
   14: r131:SI=r132:SI+low(`*.LC0')
      REG_EQUAL `*.LC0'
   15: r130:V2DI=[r131:SI]
      REG_EQUAL const_vector
   16: r129:V16QI=r130:V2DI#0
   17: [r112:SI]=r123:SF
   18: r133:SI=[r112:SI]
   19: r136:DI#4=r133:SI
   22: {r137:SI=r133:SI>>0x1f;clobber ca:SI;}
   23: r136:DI#0=r137:SI
   24: r138:DI=0
   25: r135:V2DI=vec_concat(r136:DI,r138:DI)
   26: r134:V16QI=r135:V2DI#0
   27: r139:V16QI=unspec[r128:SI] 151
   28: r140:V16QI=unspec[r134:V16QI,r134:V16QI,r139:V16QI] 236
   29: r141:V16QI=unspec[r129:V16QI,r129:V16QI,r139:V16QI] 236
   30: r126:V4SF#0={(r141:V16QI!=const_vector)?r140:V16QI:r126:V4SF#0}
   31: r119:V4SF=r126:V4SF
   32: r120:V4SF=r119:V4SF

ASM:

.LFB0:
	.cfi_startproc
	stwu 1,-16(1)
	.cfi_def_cfa_offset 16
	lis 9,.LC0@ha
	rlwinm 3,3,2,28,29
	xxlxor 0,0,0
	la 9,.LC0@l(9)
	subfic 3,3,20
	lxvd2x 33,0,9
	lvsl 13,0,3
	stfs 1,8(1)
	vperm 1,1,1,13
	ori 2,2,0
	lwz 9,8(1)
	addi 1,1,16
	.cfi_def_cfa_offset 0
	srawi 10,9,31
	mtvsrwz 13,9
	mtvsrwz 12,10
	fmrgow 11,12,13
	xxpermdi 32,11,0,0
	vperm 0,0,0,13
	xxsel 34,34,32,33
	blr
[Bug target/99718] [11 regression] ICE in new test case gcc.target/powerpc/pr98914.c for 32 bits
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99718 --- Comment #15 from luoxhu at gcc dot gnu.org --- (In reply to Jakub Jelinek from comment #14) > You still have: > if (VECTOR_MEM_VSX_P (mode)) > { > if (!CONST_INT_P (elt_rtx)) > { > if ((TARGET_P9_VECTOR && TARGET_POWERPC64) || width == 8) > return ..._p9 (...); > else if (TARGET_P8_VECTOR) > return ..._p8 (...); > } > > if (mode == V2DFmode) > insn = gen_vsx_set_v2df (target, target, val, elt_rtx); > > else if (mode == V2DImode) > insn = gen_vsx_set_v2di (target, target, val, elt_rtx); > > else if (TARGET_P9_VECTOR && TARGET_POWERPC64) > { > ... > } > if (insn) > return; > } > > gcc_assert (CONST_INT_P (elt_rtx)); > > while the vector.md condition is VECTOR_MEM_ALTIVEC_OR_VSX_P (mode), > i.e. true for TARGET_ALTIVEC for many modes already (V4SI, V8HI, V16QI, V4SF > and > for TARGET_VSX also V2DF and V2DI, right). > I somehow don't see how this can work properly. > Looking at vsx_set_v2df and vsx_set_v2di, neither of them will handle > non-constant elt_rtx (it ICEs on anything but const0_rtx and const1_rtx). > > So, questions: > 1) does the rs6000_expand_vector_set_var_p9 routine for width == 8 (i.e. > V2DImode or V2DFmode?) > handle everything, even when TARGET_P9_VECTOR or TARGET_POWERPC64 is not > true, plain old VSX? Yes. V2DI/V2DF for P8 {BE,LE} {m32,m64} will call rs6000_expand_vector_set_var_p9 instead of xxx_p8. Do you mean Power7 for the plain old VSX? I verified the pr98914.c on Power7, it exactly ICEs on "gcc_assert (CONST_INT_P (elt_rtx));" for both m64 and m32. This is still not fixed by the patch in #c11 yet. For builtin call in rs6000-c.c:altivec_build_resolved_builtin, it is guarded by TARGET_P8_VECTOR, so Power7 doesn't generate IFN VEC_INSERT before. 
This ICE also comes from the internal optimization in gimple-isel.c:gimple_expand_vec_set_expr: can_vec_set_var_idx_p doesn't return false because VECTOR_MEM_ALTIVEC_OR_VSX_P is true for Power7 VSX. Changing "if (VECTOR_MEM_VSX_P (mode))" to "if (VECTOR_MEM_ALTIVEC_OR_VSX_P (mode))" in rs6000.c:rs6000_expand_vector_set and removing TARGET_P8_VECTOR from the else branch fixes the ICE on P7 {m32,m64}, so even P7 VSX could benefit from this optimization, which is different from what was discussed before. > 2) what happens if TARGET_P8_VECTOR is false and TARGET_VSX is true and mode > is other than V2DI/V2DF? If I read the code right, it will fall through to > gcc_assert (CONST_INT_P (elt_rtx)); Same as 1)? > 3) what happens if !TARGET_VSX (more specifically, when VECTOR_MEM_VSX_P > (mode) is false. > I see there just the assertion that would fail right away. > Perhaps I'm missing something obvious and those cases are impossible, but if > that is the case, it would still be better to add further assertion at least > to the if (...) else if (...) as else gcc_assert ... Thanks for pointing this out; the "gcc_assert (CONST_INT_P (elt_rtx));" should be moved into the "if (!CONST_INT_P (elt_rtx))" condition as you said. gen_vsx_set_v2df and gen_vsx_set_v2di are supposed to handle only constant elt_rtx.
[Bug target/99718] [11 regression] ICE in new test case gcc.target/powerpc/pr98914.c for 32 bits
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99718 --- Comment #19 from luoxhu at gcc dot gnu.org --- https://gcc.gnu.org/pipermail/gcc-patches/2021-March/567395.html This patch extends variable vec_insert to all VSX targets, including 32-bit: Power7{BE}{32,64}, Power8{BE}{32,64}, Power8{LE}{64}, Power9{LE}{64}. All PowerPC testcases pass, though AIX is not tested yet. @Segher, please review this one instead of the previous patch that disables 32-bit variable vec_insert, thanks. For Altivec targets like power5/6/G4/G5, take the previous "vector store/scalar store/vector load" code path.

-mcpu=power6 -O2 -maltivec -c -S

f2:
.LFB0:
	.cfi_startproc
	addi 10,1,-16
	sldi 5,5,2
	li 9,32
	addi 8,1,-48
	stvx 2,8,9
	stwx 6,10,5
	lvx 2,8,9
	blr
[Bug target/99718] [11 regression] ICE in new test case gcc.target/powerpc/pr98914.c for 32 bits
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99718 luoxhu at gcc dot gnu.org changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #21 from luoxhu at gcc dot gnu.org --- Fixed on master.
[Bug middle-end/90323] powerpc should convert equivalent sequences to vec_sel()
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90323 luoxhu at gcc dot gnu.org changed: What|Removed |Added CC||luoxhu at gcc dot gnu.org --- Comment #8 from luoxhu at gcc dot gnu.org --- Two minor updates for the case mentioned in #c2; for VEC_SEL (ARG1, ARG2, ARG3): returns a vector containing the value of either ARG1 or ARG2 depending on the value of ARG3.

#include <altivec.h>
#include <stdio.h>

volatile vector unsigned orig = {0xebebebeb, 0x34343434, 0x76767676, 0x12121212};
volatile vector unsigned mask = {0x, 0, 0x, 0};
volatile vector unsigned fill = {0xfefefefe, 0x, 0x, 0x};
volatile vector unsigned expected = {0xfefefefe, 0x34343434, 0x, 0x12121212};

__attribute__ ((noinline)) vector unsigned
without_sel (vector unsigned l, vector unsigned r, vector unsigned mask)
{
-  l = l & ~r;
+  l = l & ~mask;
   l |= mask & r;
   return l;
}

__attribute__ ((noinline)) vector unsigned
with_sel (vector unsigned l, vector unsigned r, vector unsigned mask)
{
-  return vec_sel (l, mask, r);
+  return vec_sel (l, r, mask);
}

int
main ()
{
  vector unsigned res1 = without_sel (orig, fill, mask);
  vector unsigned res2 = with_sel (orig, fill, mask);
  if (!vec_all_eq (res1, expected))
    printf ("error1\n");
  if (!vec_all_eq (res2, expected))
    printf ("error2\n");
  return 0;
}

And the ASM would be:

without_sel:
	xxlxor 35,34,35
	xxland 35,35,36
	xxlxor 34,34,35
	blr
	.long 0
	.byte 0,0,0,0,0,0,0,0

with_sel:
	xxsel 34,34,35,36
	blr
	.long 0
	.byte 0,0,0,0,0,0,0,0
[Bug middle-end/90323] powerpc should convert equivalent sequences to vec_sel()
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90323 --- Comment #9 from luoxhu at gcc dot gnu.org --- Then we could optimize it in match.pd:

diff --git a/gcc/match.pd b/gcc/match.pd
index 036f92fa959..8944312c153 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -3711,6 +3711,17 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
  (if (integer_all_onesp (@1) && integer_zerop (@2))
   @0
+#if GIMPLE
+(simplify
+ (bit_xor @0 (bit_and @2 (bit_xor @0 @1)))
+ (if (optimize_vectors_before_lowering_p () && types_match (@0, @1)
+      && types_match (@0, @2) && VECTOR_TYPE_P (TREE_TYPE (@0))
+      && VECTOR_TYPE_P (TREE_TYPE (@1)) && VECTOR_TYPE_P (TREE_TYPE (@2)))
+  (with { tree itype = truth_type_for (type); }
+   (vec_cond (convert:itype @2) @1 @0
+#endif

in pr90323.c.033t.forwprop1, it will be optimized to:

  _1 = ~mask_3(D);
  l_5 = _1 & l_4(D);
  _2 = mask_3(D) & r_6(D);
  _8 = l_4(D) ^ r_6(D);
  _10 = mask_3(D) & _8;
  _11 = (vector(4) ) mask_3(D);
  l_7 = VEC_COND_EXPR <_11, r_6(D), l_4(D)>;
  return l_7;

Then in pr90323.c.243t.isel:

  [local count: 1073741824]:
  _6 = (vector(4) ) mask_1(D);
  l_4 = .VCOND_MASK (_6, r_3(D), l_2(D));
  return l_4;

final ASM:

without_sel:
.LFB11:
	.cfi_startproc
	xxsel 34,34,35,36
	blr
	.long 0
	.byte 0,0,0,0,0,0,0,0
	.cfi_endproc
.LFE11:
	.size without_sel,.-without_sel
	.align 2
	.p2align 4,,15
	.globl with_sel
	.type with_sel, @function
with_sel:
.LFB12:
	.cfi_startproc
	xxsel 34,34,35,36
	blr

@segher, is this a reasonable fix?
[Bug middle-end/90323] powerpc should convert equivalent sequences to vec_sel()
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90323 --- Comment #11 from luoxhu at gcc dot gnu.org --- I noticed that you added the below optimization with commit a62436c0a505155fc8becac07a8c0abe2c265bfe, but it doesn't even handle this case: the cse1 pass calls simplify_binary_operation_1 with both op0 and op1 being REGs instead of AND operators. Do you have a test case that covers that piece of code?

__attribute__ ((noinline)) long
without_sel3 (long l, long r)
{
  long tmp = {0x0ff00fff};
  l = ((l ^ r) & tmp) ^ l;
  return l;
}

without_sel3:
	xor 4,3,4
	rlwinm 4,4,0,20,11
	rldicl 4,4,0,36
	xor 3,4,3
	blr
	.long 0
	.byte 0,0,0,0,0,0,0,0

+2016-11-09  Segher Boessenkool  
+
+	* simplify-rtx.c (simplify_binary_operation_1): Simplify
+	(xor (and (xor A B) C) B) to (ior (and A C) (and B ~C)) and
+	(xor (and (xor A B) C) A) to (ior (and A ~C) (and B C)) if C
+	is a const_int.

diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c
index 5c3dea1a349..11a2e0267c7 100644
--- a/gcc/simplify-rtx.c
+++ b/gcc/simplify-rtx.c
@@ -2886,6 +2886,37 @@ simplify_binary_operation_1 (enum rtx_code code, machine_mode mode,
	    }
	}

+      /* If we have (xor (and (xor A B) C) A) with C a constant we can instead
+	 do (ior (and A ~C) (and B C)) which is a machine instruction on some
+	 machines, and also has shorter instruction path length.  */
+      if (GET_CODE (op0) == AND
+	  && GET_CODE (XEXP (op0, 0)) == XOR
+	  && CONST_INT_P (XEXP (op0, 1))
+	  && rtx_equal_p (XEXP (XEXP (op0, 0), 0), trueop1))
+	{
+	  rtx a = trueop1;
+	  rtx b = XEXP (XEXP (op0, 0), 1);
+	  rtx c = XEXP (op0, 1);
+	  rtx nc = simplify_gen_unary (NOT, mode, c, mode);
+	  rtx a_nc = simplify_gen_binary (AND, mode, a, nc);
+	  rtx bc = simplify_gen_binary (AND, mode, b, c);
+	  return simplify_gen_binary (IOR, mode, a_nc, bc);
+	}
+      /* Similarly, (xor (and (xor A B) C) B) as (ior (and A C) (and B ~C)).  */
+      else if (GET_CODE (op0) == AND
+	  && GET_CODE (XEXP (op0, 0)) == XOR
+	  && CONST_INT_P (XEXP (op0, 1))
+	  && rtx_equal_p (XEXP (XEXP (op0, 0), 1), trueop1))
+	{
+	  rtx a = XEXP (XEXP (op0, 0), 0);
+	  rtx b = trueop1;
+	  rtx c = XEXP (op0, 1);
+	  rtx nc = simplify_gen_unary (NOT, mode, c, mode);
+	  rtx b_nc = simplify_gen_binary (AND, mode, b, nc);
+	  rtx ac = simplify_gen_binary (AND, mode, a, c);
+	  return simplify_gen_binary (IOR, mode, ac, b_nc);
+	}
[Bug middle-end/90323] powerpc should convert equivalent sequences to vec_sel()
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90323 --- Comment #12 from luoxhu at gcc dot gnu.org --- That code was called by the combine pass but fails to match.

pr newpat
(set (reg:DI 125 [ l ])
    (xor:DI (and:DI (xor:DI (reg/v:DI 120 [ l ])
                (reg:DI 127))
            (const_int 267390975 [0xff00fff]))
        (reg/v:DI 120 [ l ])))

Trying 8, 10 -> 11:
    8: r123:DI=r120:DI^r127:DI
      REG_DEAD r127:DI
   10: r118:DI=r123:DI&0xff00fff
      REG_DEAD r123:DI
   11: r125:DI=r118:DI^r120:DI
      REG_DEAD r120:DI
      REG_DEAD r118:DI
Failed to match this instruction:
(set (reg:DI 125 [ l ])
    (ior:DI (and:DI (reg/v:DI 120 [ l ])
            (const_int -267390976 [0xf00ff000]))
        (and:DI (reg:DI 127)
            (const_int 267390975 [0xff00fff]))))
Successfully matched this instruction:
(set (reg:DI 118 [ _2 ])
    (and:DI (reg:DI 127)
        (const_int 267390975 [0xff00fff])))
Failed to match this instruction:
(set (reg:DI 125 [ l ])
    (ior:DI (and:DI (reg/v:DI 120 [ l ])
            (const_int -267390976 [0xf00ff000]))
        (reg:DI 118 [ _2 ])))
[Bug middle-end/90323] powerpc should convert equivalent sequences to vec_sel()
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90323 --- Comment #15 from luoxhu at gcc dot gnu.org --- (In reply to Segher Boessenkool from comment #14) > (In reply to luoxhu from comment #12) > > That code was called by combine pass but fail to match. > > > > > pr newpat > > (set (reg:DI 125 [ l ]) > > (xor:DI (and:DI (xor:DI (reg/v:DI 120 [ l ]) > > (reg:DI 127)) > > (const_int 267390975 [0xff00fff])) > > (reg/v:DI 120 [ l ]))) > > Note this is 0x0ff00fff, and this is not a valid mask for rlwimi. OK, it also fails to combine for 0x0100.

	.cfi_startproc
	xor 4,3,4
	rlwinm 4,4,0,7,7
	xor 3,4,3
	blr
[Bug target/97142] __builtin_fmod not optimized on POWER
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97142 --- Comment #10 from luoxhu at gcc dot gnu.org --- If not built with fast-math, gimple_has_side_effects returns true and causes expand_call_stmt to fail to expand "_1 = fmod (x_2(D), y_3(D));" to an internal function. x86 also produces a call to fmod for an O3 build. xlF expands the fmod to the ASM below, no FMA generated?

1900:
	1900: 8c 03 01 10	vspltisw v0,1
	1904: 00 00 24 c8	lfd f1,0(r4)
	1908: 00 00 03 c8	lfd f0,0(r3)
	190c: e2 03 40 f0	xvcvsxwdp vs2,vs32
	1910: c0 09 62 f0	xsdivdp vs3,vs2,vs1
	1914: 80 19 80 f0	xsmuldp vs4,vs0,vs3
	1918: 64 21 a0 f0	xsrdpiz vs5,vs4
	191c: 88 2d 01 f0	xsnmsubadp vs0,vs1,vs5
	1920: 18 00 20 fc	frsp f1,f0
	1924: 20 00 80 4e	blr
[Bug tree-optimization/22326] promotions (from float to double) are not removed when they should be able to
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=22326 luoxhu at gcc dot gnu.org changed: What|Removed |Added CC||luoxhu at gcc dot gnu.org, ||pinskia at gcc dot gnu.org, ||segher at kernel dot crashing.org --- Comment #4 from luoxhu at gcc dot gnu.org ---

float foo(float f, float x, float y) {
  return (fabs(f)*x+y);
}

The input of fabs is float type, so fabsf is enough here. I drafted a patch to avoid double promotion when generating gimple: fabs can be replaced by fabsf when argument[0] is float type.

diff --git a/gcc/c/c-parser.c b/gcc/c/c-parser.c
index ecc3d2119fa..1a2d7e624cc 100644
--- a/gcc/c/c-parser.c
+++ b/gcc/c/c-parser.c
@@ -10470,6 +10470,20 @@ c_parser_postfix_expression_after_primary (c_parser *parser,
	      && fndecl_built_in_p (expr.value, BUILT_IN_NORMAL)
	      && vec_safe_length (exprlist) == 1)
	    warn_for_abs (expr_loc, expr.value, (*exprlist)[0]);
+
+	  if (fndecl_built_in_p (expr.value, BUILT_IN_NORMAL)
+	      && DECL_FUNCTION_CODE (expr.value) == BUILT_IN_FABS)
+	    {
+	      tree arg0 = (*exprlist)[0];
+	      if (TYPE_PRECISION (TREE_TYPE (TREE_TYPE (expr.value)))
+		    > TYPE_PRECISION (TREE_TYPE (arg0))
+		  && TYPE_MODE (TREE_TYPE (arg0)) == E_SFmode)
+		{
+		  tree abs_fun = get_identifier ("fabsf");
+		  expr.value = build_external_ref (expr_loc, abs_fun, true,
+						   &expr.original_type);
+		}
+	    }
	}

       start = expr.get_start ();

.006t.gimple:

__attribute__((noinline))
foo (float f, float x, float y)
{
  float D.4347;

  _1 = ABS_EXPR ;
  _2 = x * _1;
  D.4347 = y + _2;
  return D.4347;
}

foo:
.LFB0:
	.cfi_startproc
	fabs 1,1
	fmadds 1,1,2,3
	blr
[Bug tree-optimization/22326] promotions (from float to double) are not removed when they should be able to
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=22326 --- Comment #5 from luoxhu at gcc dot gnu.org --- With the above hack, changing argument x from float to double still generates correct code, with a conversion of the fabsf result:

float foo(float f, double x, float y) {
  return (fabs(f)*x+y);
}

006t.gimple:

__attribute__((noinline))
foo (float f, double x, float y)
{
  float D.4347;

  _1 = ABS_EXPR ;
  _2 = (double) _1;
  _3 = x * _2;
  _4 = (double) y;
  _5 = _3 + _4;
  D.4347 = (float) _5;
  return D.4347;
}
[Bug tree-optimization/22326] promotions (from float to double) are not removed when they should be able to
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=22326 --- Comment #9 from luoxhu at gcc dot gnu.org --- (In reply to Andrew Pinski from comment #6) > (In reply to luoxhu from comment #4) > > float foo(float f, float x, float y) { > > return (fabs(f)*x+y); > > } > > > > the input of fabs is float type, so use fabsf is enough here, drafted a > > patch to avoid double promotion when generating gimple if fabs could be > > replaced by fabsf as argument[0] is float type. > > what about adding something to match.pd for: > ABS<(float_convert)f> into (float_convert)ABS > This is only valid prompting and not reducing the precision. Thanks, this is already implemented in fold-const.c, though not using match.pd or fabsf really. fabs will always convert its argument to double type first in the front end. And there are 3 kinds of cases for this issue:

1) "return fabs(x);"

tree
fold_unary_loc (location_t loc, enum tree_code code, tree type, tree op0)
{
...
    case ABS_EXPR:
      /* Convert fabs((double)float) into (double)fabsf(float).  */
      if (TREE_CODE (arg0) == NOP_EXPR
	  && TREE_CODE (type) == REAL_TYPE)
	{
	  tree targ0 = strip_float_extensions (arg0);
	  if (targ0 != arg0)
	    return fold_convert_loc (loc, type,
				     fold_build1_loc (loc, ABS_EXPR,
						      TREE_TYPE (targ0),
						      targ0));
	}
      return NULL_TREE;
...
}

This piece of code converts "(float)fabs((double)x)" to "(float)(double)(float)fabs(x)", then match.pd removes the useless conversions.

2) "return fabs(x)*y;" The front end first generates the "(float) (fabs((double) x) * (double) y)" expression, then fold-const.c:fold_unary_loc converts fabs((double)float) into (double)fabsf(float) and gets "(float)((double)fabs(x) * (double)y)"; finally, match.pd converts (outertype)((innertype0)a+(innertype1)b) into ((newtype)a+(newtype)b) to remove the double conversion.

3) "return fabs(x)*y + z;" The front end produces: (float) ((fabs((double) x) * (double) y) + (double) z)

So what we need here is to match the MUL&ADD in match.pd as follows, any comments?

+(simplify
+ (convert (plus (mult (convert@3 (abs @0)) (convert@4 @1)) (convert@5 @2)))
+ (if ((flag_unsafe_math_optimizations
+       && types_match (type, float_type_node)
+       && types_match (TREE_TYPE (@0), float_type_node)
+       && types_match (TREE_TYPE (@1), float_type_node)
+       && types_match (TREE_TYPE (@2), float_type_node)
+       && element_precision (TREE_TYPE (@3)) > element_precision (TREE_TYPE (@0))
+       && element_precision (TREE_TYPE (@4)) > element_precision (TREE_TYPE (@1))
+       && element_precision (TREE_TYPE (@5)) > element_precision (TREE_TYPE (@2))
+       && ! HONOR_NANS (type)
+       && ! HONOR_INFINITIES (type)))
+  (plus (mult (abs @0) @1) @2)))
+

1) and 2) won't generate double conversion; only 3) has frsp in fast-math mode, and it could be removed by the above pattern. PS: convert_to_real_1 seems to me not quite related here? It converts (float)sqrt((double)x), where x is float, into sqrtf(x), but with a recursive call to convert_to_real_1 and build_call_expr with a new mathfn_built_in; I suppose it is a bit complicated to move them to match.pd? The optimization should be under fast-math mode; is flag_unsafe_math_optimizations enough to guard it?
[Bug tree-optimization/22326] promotions (from float to double) are not removed when they should be able to
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=22326 --- Comment #10 from luoxhu at gcc dot gnu.org --- Even if we could optimize fabs to fabsf, it doesn't help here, as y and z are already promoted to double; we would still need a large pattern to match the MUL&PLUS expression in match.pd, so fabs-to-fabsf seems not a reasonable direction...
[Bug tree-optimization/22326] promotions (from float to double) are not removed when they should be able to
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=22326 --- Comment #13 from luoxhu at gcc dot gnu.org --- Tried implementation with backprop, found that this model seems not suitable for double promotion remove with BACK propagation. i.e: 1) mad1.c float foo (float x, float y, float z) { return ( y * fabs (x) + z ); } mad1.c.098t.cunrolli: foo (float x, float y, float z) { double _1; float _2; double _3; double _4; double _5; double _6; float _10; [local count: 1073741824]: _1 = (double) y_7(D); _2 = ABS_EXPR ; _3 = (double) _2; _4 = _1 * _3; _5 = (double) z_9(D); _6 = _4 + _5; _10 = (float) _6; return _10; } mad1.c.099t.backprop: [USE] _10 in return _10; [USE] _6 in _10 = (float) _6; _6: convert from float to double not important [DEF] Recording new information for _6 = _4 + _5; _6: convert from float to double not important [USE] _5 in _6 = _4 + _5; _5: convert from float to double not important [DEF] Recording new information for _5 = (double) z_9(D); _5: convert from float to double not important [USE] _4 in _6 = _4 + _5; _4: convert from float to double not important [DEF] Recording new information for _4 = _1 * _3; _4: convert from float to double not important [USE] _3 in _4 = _1 * _3; _3: convert from float to double not important [DEF] Recording new information for _3 = (double) _2; _3: convert from float to double not important [USE] _2 in _3 = (double) _2; _2: convert from float to double not important [DEF] Recording new information for _2 = ABS_EXPR ; _2: convert from float to double not important [USE] _1 in _4 = _1 * _3; _1: convert from float to double not important [DEF] Recording new information for _1 = (double) y_7(D); _1: convert from float to double not important gimple_simplified to _10 = _13; Deleting _6 = z_9(D) + _12; Deleting _5 = (double) z_9(D); Deleting _4 = _2 * y_7(D); Deleting _3 = (double) _2; Deleting _1 = (double) y_7(D); __attribute__((noinline)) foo (float x, float y, float z) { float _2; float _10; float _12; float _13; [local 
count: 1073741824]: _2 = ABS_EXPR ; _12 = _2 * y_7(D); _13 = z_9(D) + _12; _10 = _13; return _10; } All convert and promotions could be removed. But if change float x to double x, it doesn't work now: 2) mad2.c float foo (double x, float y, float z) { return ( y * fabs (x) + z ); } mad2.c.098t.cunrolli: foo (double x, float y, float z) { double _1; double _2; double _3; double _4; double _5; float _9; [local count: 1073741824]: _1 = (double) y_6(D); _2 = ABS_EXPR ; _3 = _1 * _2; _4 = (double) z_8(D); _5 = _3 + _4; _9 = (float) _5; return _9; } mad2.c.099t.backprop: [USE] _9 in return _9; [USE] _5 in _9 = (float) _5; _5: convert from float to double not important [DEF] Recording new information for _5 = _3 + _4; _5: convert from float to double not important [USE] _4 in _5 = _3 + _4; _4: convert from float to double not important [DEF] Recording new information for _4 = (double) z_8(D); _4: convert from float to double not important [USE] _3 in _5 = _3 + _4; _3: convert from float to double not important [DEF] Recording new information for _3 = _1 * _2; _3: convert from float to double not important [USE] _2 in _3 = _1 * _2; _2: convert from float to double not important [DEF] Recording new information for _2 = ABS_EXPR ; _2: convert from float to double not important [USE] _1 in _3 = _1 * _2; _1: convert from float to double not important [DEF] Recording new information for _1 = (double) y_6(D); _1: convert from float to double not important Deleting _4 = (double) z_8(D); Deleting _1 = (double) y_6(D); EMERGENCY DUMP: __attribute__((noinline)) foo (double x, float y, float z) { double _2; double _3; double _5; float _9; [local count: 1073741824]: _2 = ABS_EXPR ; _3 = _2 * y_6(D); _5 = _3 + z_8(D); _9 = (float) _5; return _9; }
[Bug tree-optimization/22326] promotions (from float to double) are not removed when they should be able to
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=22326 --- Comment #14 from luoxhu at gcc dot gnu.org --- (In reply to luoxhu from comment #13) > > 2) mad2.c > > float foo (double x, float y, float z) > { >return ( y * fabs (x) + z ); > } > > > mad2.c.098t.cunrolli: > > foo (double x, float y, float z) > { > double _1; > double _2; > double _3; > double _4; > double _5; > float _9; > >[local count: 1073741824]: > _1 = (double) y_6(D); > _2 = ABS_EXPR ; > _3 = _1 * _2; > _4 = (double) z_8(D); > _5 = _3 + _4; > _9 = (float) _5; > return _9; > > } > Maybe we should use forward propagation here: save [_1, _2, _3 ... _9] to m_vars and set an ignore_convert status in usage_info if the rhs of the expression could have its double conversion removed; for a stmt with two rhs operands, intersect the status with an AND of rhs1's and rhs2's ignore_convert, and clear the ignore_convert status if either of them is false. Not sure whether this works; it is also a bit more complicated than expected...
[Bug tree-optimization/22326] promotions (from float to double) are not removed when they should be able to
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=22326 --- Comment #17 from luoxhu at gcc dot gnu.org --- (In reply to rsand...@gcc.gnu.org from comment #16) > > 2) mad2.c > > > > float foo (double x, float y, float z) > > { > >return ( y * fabs (x) + z ); > > } > > > > > > mad2.c.098t.cunrolli: > > > > foo (double x, float y, float z) > > { > > double _1; > > double _2; > > double _3; > > double _4; > > double _5; > > float _9; > > > >[local count: 1073741824]: > > _1 = (double) y_6(D); > > _2 = ABS_EXPR ; > > _3 = _1 * _2; > > _4 = (double) z_8(D); > > _5 = _3 + _4; > > _9 = (float) _5; > > return _9; > > > > } > > > > mad2.c.099t.backprop: > > > > [USE] _9 in return _9; > > [USE] _5 in _9 = (float) _5; > > _5: convert from float to double not important > > [DEF] Recording new information for _5 = _3 + _4; > > _5: convert from float to double not important > > [USE] _4 in _5 = _3 + _4; > > _4: convert from float to double not important > > [DEF] Recording new information for _4 = (double) z_8(D); > > _4: convert from float to double not important > > [USE] _3 in _5 = _3 + _4; > > _3: convert from float to double not important > > [DEF] Recording new information for _3 = _1 * _2; > > _3: convert from float to double not important > > [USE] _2 in _3 = _1 * _2; > > _2: convert from float to double not important > > [DEF] Recording new information for _2 = ABS_EXPR ; > > _2: convert from float to double not important > > [USE] _1 in _3 = _1 * _2; > > _1: convert from float to double not important > > [DEF] Recording new information for _1 = (double) y_6(D); > > _1: convert from float to double not important > > > > Deleting _4 = (double) z_8(D); > > Deleting _1 = (double) y_6(D); > > > > > > EMERGENCY DUMP: > > > > __attribute__((noinline)) > > foo (double x, float y, float z) > > { > > double _2; > > double _3; > > double _5; > > float _9; > > > >[local count: 1073741824]: > > _2 = ABS_EXPR ; > > _3 = _2 * y_6(D); > > _5 = _3 + z_8(D); > > _9 = (float) _5; > > return 
_9; > > > > } > Maybe I'm misunderstanding the point, but isn't this > just an issue with the way that the results of the > analysis are applied to the IL, rather than a problem > in the analysis itself? Yes, the optimize operations on Gimple is a bit uncertain. Do you mean add convert from double to float at proper place like below to avoid ICE caused by type mismatch ICE in verify_ssa? Which one will be better, and whether it is correct for all kind of math operations like pow/exp, etc under fast-math? If so, no cancelling is needed again as Richi mentioned? 1) convert before ABS_EXPR: foo (double x, float y, float z) { float _9; float _11; float _12; float _13; float _14; [local count: 1073741824]: _11 = (float) x_6(D); _12 = ABS_EXPR <_11>; _13 = y_7(D) * _12; _14 = z_8(D) + _13; _9 = _14; return _9; } foo: .LFB0: .cfi_startproc frsp 0,1 fabs 0,0 fmadds 1,2,0,3 blr 2) OR convert after ABS_EXPR: foo (double x, float y, float z) { double _1; float _9; float _11; float _12; float _13; [local count: 1073741824]: _1 = ABS_EXPR ; _11 = (float) _1; _12 = y_7(D) * _11; _13 = z_8(D) + _12; _9 = _13; return _9; } foo: .LFB0: .cfi_startproc fabs 0,1 frsp 0,0 fmadds 1,2,0,3 blr
[Bug tree-optimization/98066] [11 Regression] ICE: Segmentation fault (in gsi_next)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98066 --- Comment #8 from luoxhu at gcc dot gnu.org --- Thanks for the quick fix!
[Bug target/98093] ICE in gen_vsx_set_v2df, at config/rs6000/vsx.md:3276
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98093 luoxhu at gcc dot gnu.org changed: What|Removed |Added CC||luoxhu at gcc dot gnu.org --- Comment #1 from luoxhu at gcc dot gnu.org --- Confirmed, I will fix it. Actually I have a pending patch not committed yet: [PATCH 2/4], which generates VIEW_CONVERT_EXPR, is not committed, but V2DF VIEW_CONVERT_EXPR will be converted to IFN VEC_SET in gimple-isel now, which caused the ICE: VIEW_CONVERT_EXPR(t)[i_12] = x_6(D); (https://gcc.gnu.org/pipermail/gcc-patches/2020-October/555906.html) IFN VEC_SET is not expanded on Power8 yet; [PATCH 3/4] could fix this. Needs Segher's approval.
[Bug target/98093] ICE in gen_vsx_set_v2df, at config/rs6000/vsx.md:3276
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98093 --- Comment #2 from luoxhu at gcc dot gnu.org --- https://gcc.gnu.org/pipermail/gcc-patches/2020-October/555907.html [PATCH 3/4] rs6000: Enable vec_insert for P8 with rs6000_expand_vector_set_var_p8
[Bug tree-optimization/22326] promotions (from float to double) are not removed when they should be able to
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=22326 --- Comment #22 from luoxhu at gcc dot gnu.org --- https://gcc.gnu.org/pipermail/gcc/2020-December/234474.html So this issue seems invalid, since "fabs(x)*y+z" or "fabs(x)+y+z" (x, y, z are float) could result in +-Inf sometimes, while it won't overflow float range under double computation. Float value range info is required here. Quoting Richard's reply: I still think that covering all "good" cases in match.pd will require excessive matching and that it is better done in a pass (this would include removing the frontend handling for math functions). Note that for example (float)((double)x + (double)y) with float x and y is also eligible to demotion to float, likewise may some compositions like (float)(sin((double)x)*cos ((double)y)) for float x and y since we can constrain ranges here. Likewise (float)((double)x + fabs ((double)y)) for float x and y. The propagation would need to stop when the range needed increases in unknown ways.
[Bug target/79251] PowerPC vec_insert generates store-hit-load if the element number is variable
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79251 luoxhu at gcc dot gnu.org changed: What|Removed |Added CC||luoxhu at gcc dot gnu.org --- Comment #5 from luoxhu at gcc dot gnu.org --- This patchset fixes this issue for both P8 and P9: [PATCH 0/4] rs6000: Enable variable vec_insert with IFN VEC_SET https://gcc.gnu.org/pipermail/gcc-patches/2020-October/555905.html https://gcc.gnu.org/pipermail/gcc-patches/2020-October/555906.html https://gcc.gnu.org/pipermail/gcc-patches/2020-October/555907.html
[Bug target/98065] [11 Regression] ICE in rs6000_expand_vector_set, at config/rs6000/rs6000.c:7024 since r11-5457
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98065 --- Comment #4 from luoxhu at gcc dot gnu.org --- Sorry, my patch https://gcc.gnu.org/pipermail/gcc-patches/2020-October/555906.html could fix this, but the two below are still pending approval; I have pinged them 5 times since last Oct. @Segher :) https://gcc.gnu.org/pipermail/gcc-patches/2020-October/555907.html https://gcc.gnu.org/pipermail/gcc-patches/2020-October/555908.html
[Bug target/98799] [10 Regression] vector_set_var ICE
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98799 luoxhu at gcc dot gnu.org changed: What|Removed |Added CC||luoxhu at gcc dot gnu.org --- Comment #5 from luoxhu at gcc dot gnu.org --- (In reply to David Edelsohn from comment #4) > Created attachment 50043 [details] > patch > > Updated patch, but the entire rs6000_expand_set_var() logic seems to be > incomplete and missing some scenarios, i.e., P9 and P8 that assume PPC64 are > not sufficient. The ICE is caused by UNSPEC_SI_FROM_SF not being supported when TARGET_DIRECT_MOVE_64BIT is false. Thanks for the patch, but the change below is also needed to fix the ICE in gcc.target/powerpc/fold-vec-insert-float-p8.c when built with -m32, to avoid generating IFN VEC_SET for P8 BE 32-bit. I am not sure what "P9 and P8 that assume PPC64 are not sufficient" means?
diff --git a/gcc/config/rs6000/rs6000-c.c b/gcc/config/rs6000/rs6000-c.c
index f6ee1e6..656cdb3 100644
--- a/gcc/config/rs6000/rs6000-c.c
+++ b/gcc/config/rs6000/rs6000-c.c
@@ -1600,7 +1600,7 @@ altivec_resolve_overloaded_builtin (location_t loc, tree fndecl,
 	  stmt = build1 (COMPOUND_LITERAL_EXPR, arg1_type, stmt);
 	}
-      if (TARGET_P8_VECTOR)
+      if (TARGET_P8_VECTOR && TARGET_DIRECT_MOVE_64BIT)
 	{
 	  stmt = build_array_ref (loc, stmt, arg2);
 	  stmt = fold_build2 (MODIFY_EXPR, TREE_TYPE (arg0), stmt,
[Bug target/98827] [11 regression] gcc.target/powerpc/vsx-builtin-7.c assembler counts off after r11-6857
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98827 --- Comment #1 from luoxhu at gcc dot gnu.org --- Strange that I see only xxpermdi fail, should be 4 instead of 12. rldic passes for m64, what's your configuration please? === gcc tests === Schedule of variations: unix/-m32 unix/-m64 Running target unix/-m32 Running /home/luoxhu/workspace/gcc/gcc/testsuite/gcc.target/powerpc/powerpc.exp ... PASS: gcc.target/powerpc/vsx-builtin-7.c (test for excess errors) PASS: gcc.target/powerpc/vsx-builtin-7.c scan-assembler-times \\mrldic\\M 0 PASS: gcc.target/powerpc/vsx-builtin-7.c scan-assembler-times vspltisb 2 PASS: gcc.target/powerpc/vsx-builtin-7.c scan-assembler-times vspltish 2 PASS: gcc.target/powerpc/vsx-builtin-7.c scan-assembler-times vspltisw 2 PASS: gcc.target/powerpc/vsx-builtin-7.c scan-assembler-times xxpermdi 4 === gcc Summary for unix/-m32 === # of expected passes6 Running target unix/-m64 Running /home/luoxhu/workspace/gcc/gcc/testsuite/gcc.target/powerpc/powerpc.exp ... PASS: gcc.target/powerpc/vsx-builtin-7.c (test for excess errors) PASS: gcc.target/powerpc/vsx-builtin-7.c scan-assembler-times \\mrldic\\M 64 PASS: gcc.target/powerpc/vsx-builtin-7.c scan-assembler-times vspltisb 2 PASS: gcc.target/powerpc/vsx-builtin-7.c scan-assembler-times vspltish 2 PASS: gcc.target/powerpc/vsx-builtin-7.c scan-assembler-times vspltisw 2 PASS: gcc.target/powerpc/vsx-builtin-7.c scan-assembler-times xxpermdi 4 === gcc Summary for unix/-m64 === # of expected passes6 === gcc Summary === # of expected passes12 /home/luoxhu/workspace/build/gcc/xgcc version 11.0.0 20210125 (experimental) (GCC) luoxhu@bns:~/workspace/build$ gcc/xgcc -v Using built-in specs. 
COLLECT_GCC=gcc/xgcc Target: powerpc64-unknown-linux-gnu Configured with: ../gcc/configure --enable-languages=c,c++,fortran --prefix=/home/luoxhu/local/gcc/ --disable-bootstrap --with-cpu=power7 --disable-libsanitizer : (reconfigured) ../gcc/configure --prefix=/home/luoxhu/local/gcc/ --disable-bootstrap --with-cpu=power7 --disable-libsanitizer CC=/opt/gcc81/bin/gcc CXX=/opt/gcc81/bin/g++ --enable-languages=c,c++,fortran,lto --no-create --no-recursion Thread model: posix Supported LTO compression algorithms: zlib gcc version 11.0.0 20210125 (experimental) (GCC)
[Bug target/98827] [11 regression] gcc.target/powerpc/vsx-builtin-7.c assembler counts off after r11-6857
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98827 --- Comment #3 from luoxhu at gcc dot gnu.org --- I understand it now: r11-6858 changed the P8 code generation, so the latest failures changed as well. https://gcc.gnu.org/pipermail/gcc-testresults/2021-January/651154.html The current failures are:
FAIL: gcc.dg/vect/vect-outer-call-1.c scan-tree-dump vect "OUTER LOOP VECTORIZED"
FAIL: gcc.dg/vect/vect-strided-a-u8-i2-gap.c -flto -ffat-lto-objects scan-tree-dump-times vect "vectorized 1 loops" 1
FAIL: gcc.dg/vect/vect-strided-a-u8-i2-gap.c scan-tree-dump-times vect "vectorized 1 loops" 1
FAIL: gcc.target/powerpc/20050603-3.c scan-assembler-not mrldic
FAIL: gcc.target/powerpc/rlwimi-2.c scan-assembler-times (?n)^s+[a-z] 20217
FAIL: gcc.target/powerpc/vsx-builtin-7.c scan-assembler-times xxpermdi 11
XPASS: gcc.target/powerpc/ppc-fortran/ieee128-math.f90 -O (test for excess errors)
Only the xxpermdi count needs to be updated to 4.
[Bug target/98799] [11 Regression] vector_set_var ICE
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98799 luoxhu at gcc dot gnu.org changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #7 from luoxhu at gcc dot gnu.org --- Fixed.
[Bug target/98065] [11 Regression] ICE in rs6000_expand_vector_set, at config/rs6000/rs6000.c:7024 since r11-5457
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98065 Bug 98065 depends on bug 98799, which changed state. Bug 98799 Summary: [11 Regression] vector_set_var ICE https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98799 What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED
[Bug target/79251] PowerPC vec_insert generates store-hit-load if the element number is variable
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79251 luoxhu at gcc dot gnu.org changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #7 from luoxhu at gcc dot gnu.org --- Fixed on master.
[Bug target/98093] ICE in gen_vsx_set_v2df, at config/rs6000/vsx.md:3276
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98093 luoxhu at gcc dot gnu.org changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #7 from luoxhu at gcc dot gnu.org --- Fixed.
[Bug target/98914] [11 Regression] ICE in rs6000_expand_vector_set, at config/rs6000/rs6000.c:7198
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98914 --- Comment #1 from luoxhu at gcc dot gnu.org --- The type of k in the test case should be "long" to reproduce the issue; the ICE happens at rs6000_expand_vector_set: gcc_assert (GET_MODE (idx) == E_SImode); The reason is that the vector index variable needs to be "signed int" for all vec_insert prototypes. ELFv2 ABI:
vector signed char vec_insert (signed char, vector signed char, signed int);
vector unsigned char vec_insert (unsigned char, vector unsigned char, signed int);
vector signed int vec_insert (signed int, vector signed int, signed int);
vector unsigned int vec_insert (unsigned int, vector unsigned int, signed int);
vector signed long long vec_insert (signed long long, vector signed long long, signed int);
I am not sure whether other targets like x86/AArch64 have similar requirements, nor whether the fix below is reasonable: do not generate IFN VEC_SET for a stmt like VIEW_CONVERT_EXPR(v)[k_7] = 170; ?
diff --git a/gcc/gimple-isel.cc b/gcc/gimple-isel.cc
index 2c78a08d3f1..dbbae270a36 100644
--- a/gcc/gimple-isel.cc
+++ b/gcc/gimple-isel.cc
@@ -77,6 +77,7 @@ gimple_expand_vec_set_expr (gimple_stmt_iterator *gsi)
       tree view_op0 = TREE_OPERAND (op0, 0);
       machine_mode outermode = TYPE_MODE (TREE_TYPE (view_op0));
       if (auto_var_in_fn_p (view_op0, cfun->decl)
+	  && TYPE_MODE (TREE_TYPE (pos)) == E_SImode
 	  && !TREE_ADDRESSABLE (view_op0) && can_vec_set_var_idx_p (outermode))
 	{
 	  location_t loc = gimple_location (stmt);
[Bug target/98958] ICE in rs6000_expand_vector_set_var_p8, at config/rs6000/rs6000.c:7050
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98958 luoxhu at gcc dot gnu.org changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |DUPLICATE --- Comment #1 from luoxhu at gcc dot gnu.org --- dup *** This bug has been marked as a duplicate of bug 98914 ***
[Bug target/98914] [11 Regression] ICE in rs6000_expand_vector_set, at config/rs6000/rs6000.c:7198
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98914 --- Comment #4 from luoxhu at gcc dot gnu.org --- *** Bug 98958 has been marked as a duplicate of this bug. ***
[Bug middle-end/90323] powerpc should convert equivalent sequences to vec_sel()
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90323 --- Comment #16 from luoxhu at gcc dot gnu.org --- > +2016-11-09 Segher Boessenkool > + > + * simplify-rtx.c (simplify_binary_operation_1): Simplify > + (xor (and (xor A B) C) B) to (ior (and A C) (and B ~C)) and > + (xor (and (xor A B) C) A) to (ior (and A ~C) (and B C)) if C > + is a const_int. Is it a MUST that C be a const here? For this case in PR90323, C is actually not a const:
l = l & ~mask;
l |= mask & r;
Trying 8, 9 -> 10:
8: r127:V4SI=r124:V4SI^r131:V4SI
      REG_DEAD r131:V4SI
9: r122:V4SI=r127:V4SI&r130:V4SI
      REG_DEAD r130:V4SI
      REG_DEAD r127:V4SI
10: r128:V4SI=r124:V4SI^r122:V4SI
      REG_DEAD r124:V4SI
      REG_DEAD r122:V4SI
[Bug middle-end/90323] powerpc should convert equivalent sequences to vec_sel()
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90323 --- Comment #17 from luoxhu at gcc dot gnu.org --- If the constant limitation is removed, it can be combined successfully together with my new patch for PR94613: https://gcc.gnu.org/pipermail/gcc-patches/2021-April/569255.html And what do you mean by "This is not canonical form on RTL, and it's not a useful form either" in c#7, please? I do not understand the point...
Trying 11 -> 16:
11: r124:V4SI=r127:V4SI&r129:V4SI|~r129:V4SI&r128:V4SI
      REG_DEAD r128:V4SI
      REG_DEAD r129:V4SI
      REG_DEAD r127:V4SI
16: %v2:V4SI=r124:V4SI
      REG_DEAD r124:V4SI
Successfully matched this instruction:
(set (reg/i:V4SI 66 %v2)
    (ior:V4SI (and:V4SI (reg:V4SI 127) (reg:V4SI 129))
        (and:V4SI (not:V4SI (reg:V4SI 129)) (reg:V4SI 128
allowing combination of insns 11 and 16
original costs 4 + 4 = 8
replacement cost 4
deferring deletion of insn with uid = 11.
modifying insn i316: %v2:V4SI=r127:V4SI&r129:V4SI|~r129:V4SI&r128:V4SI
      REG_DEAD r127:V4SI
      REG_DEAD r129:V4SI
      REG_DEAD r128:V4SI
deferring rescan insn with uid = 16.
diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c
index 571e2337e27..701f37eb03e 100644
--- a/gcc/simplify-rtx.c
+++ b/gcc/simplify-rtx.c
@@ -3405,7 +3405,6 @@ simplify_context::simplify_binary_operation_1 (rtx_code code,
 	 machines, and also has shorter instruction path length.  */
       if (GET_CODE (op0) == AND
 	  && GET_CODE (XEXP (op0, 0)) == XOR
-	  && CONST_INT_P (XEXP (op0, 1))
 	  && rtx_equal_p (XEXP (XEXP (op0, 0), 0), trueop1))
 	{
 	  rtx a = trueop1;
@@ -3419,7 +3418,6 @@ simplify_context::simplify_binary_operation_1 (rtx_code code,
       /* Similarly, (xor (and (xor A B) C) B) as (ior (and A C) (and B ~C))  */
       else if (GET_CODE (op0) == AND
 	  && GET_CODE (XEXP (op0, 0)) == XOR
-	  && CONST_INT_P (XEXP (op0, 1))
 	  && rtx_equal_p (XEXP (XEXP (op0, 0), 1), trueop1))
 	{
 	  rtx a = XEXP (XEXP (op0, 0), 0);
[Bug target/100085] Bad code for union transfer from __float128 to vector types
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100085 luoxhu at gcc dot gnu.org changed: What|Removed |Added CC||luoxhu at gcc dot gnu.org --- Comment #7 from luoxhu at gcc dot gnu.org --- (In reply to Segher Boessenkool from comment #3) > The rotates in 6 and 7 are not merged, and neither are the vec_selects in > 8 and 9. Both should be pretty easy to do, there is no unspec in sight, > etc. Should this be done in the bswap pass, in combine, or by a peephole2? :)
[Bug target/97142] __builtin_fmod not optimized on POWER
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97142 --- Comment #12 from luoxhu at gcc dot gnu.org --- Patch submitted: https://gcc.gnu.org/pipermail/gcc-patches/2021-April/568143.html
[Bug target/94613] S/390, powerpc: Wrong code generated for vec_sel builtin
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94613 luoxhu at gcc dot gnu.org changed: What|Removed |Added CC||luoxhu at gcc dot gnu.org --- Comment #14 from luoxhu at gcc dot gnu.org --- Patch submitted: https://gcc.gnu.org/pipermail/gcc-patches/2021-April/569255.html
[Bug target/100085] Bad code for union transfer from __float128 to vector types
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100085 --- Comment #9 from luoxhu at gcc dot gnu.org --- Patch sent; it could fix the __float128 to vector __int128 issue: https://gcc.gnu.org/pipermail/gcc-patches/2021-June/571689.html But for the __float128 to __int128 case mentioned in #c4, we need to hack rs6000_modes_tieable_p to remove the stack operation in dse1. I am not sure this is *LEGAL*, since TImode is allocated to GPRs; it seems wrong to access TImode from an ALTIVEC or VSX register without copying?
diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index ad11b67b125..ee69463ac46 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -1974,6 +1974,9 @@ rs6000_modes_tieable_p (machine_mode mode1, machine_mode mode2)
       || mode2 == PTImode || mode2 == OOmode || mode2 == XOmode)
     return mode1 == mode2;
 
+  if (mode1 == TImode && ALTIVEC_OR_VSX_VECTOR_MODE (mode2))
+    return true;
+
xxpermdi %vs0,%vs34,%vs34,3
mfvsrd %r4,%vs34
mfvsrd %r3,%vs0
[Bug target/100085] Bad code for union transfer from __float128 to vector types
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100085 --- Comment #10 from luoxhu at gcc dot gnu.org --- float128 to vector __int128 is fixed by: https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=f700e4b0ee3ef53b48975cf89be26b9177e3a3f3
[Bug testsuite/101020] [12 regression] Several test case failures after r12-1316
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101020 luoxhu at gcc dot gnu.org changed: What|Removed |Added CC||segher at gcc dot gnu.org, ||segher at kernel dot crashing.org --- Comment #1 from luoxhu at gcc dot gnu.org --- Confirmed. The BE-m32 test is a nightmare to me... :( For float128-call.c, we need to check whether the target is BE or LE. And for pr100085.c, vector __int128 is not supported with -m32, so just skip it. OK for trunk?
[PATCH] rs6000: Fix test case failures by PR100085 [PR101020]
gcc/testsuite/ChangeLog:
	PR target/101020
	* gcc.target/powerpc/float128-call.c: Adjust.
	* gcc.target/powerpc/pr100085.c: Likewise.
---
 gcc/testsuite/gcc.target/powerpc/float128-call.c | 6 --
 gcc/testsuite/gcc.target/powerpc/pr100085.c      | 2 +-
 2 files changed, 5 insertions(+), 3 deletions(-)
diff --git a/gcc/testsuite/gcc.target/powerpc/float128-call.c b/gcc/testsuite/gcc.target/powerpc/float128-call.c
index a1f09df..b64ffc6 100644
--- a/gcc/testsuite/gcc.target/powerpc/float128-call.c
+++ b/gcc/testsuite/gcc.target/powerpc/float128-call.c
@@ -21,5 +21,7 @@ TYPE one (void) { return ONE; }
 void store (TYPE a, TYPE *p) { *p = a; }
-/* { dg-final { scan-assembler "lvx 2" } } */
-/* { dg-final { scan-assembler "stvx 2" } } */
+/* { dg-final { scan-assembler {\mlxvd2x 34\M} {target be} } } */
+/* { dg-final { scan-assembler {\mstxvd2x 34\M} {target be} } } */
+/* { dg-final { scan-assembler {\mlvx 2\M} {target le} } } */
+/* { dg-final { scan-assembler {\mstvx 2\M} {target le} } } */
diff --git a/gcc/testsuite/gcc.target/powerpc/pr100085.c b/gcc/testsuite/gcc.target/powerpc/pr100085.c
index 7d8b147..b6738ea 100644
--- a/gcc/testsuite/gcc.target/powerpc/pr100085.c
+++ b/gcc/testsuite/gcc.target/powerpc/pr100085.c
@@ -1,4 +1,4 @@
-/* { dg-do compile } */
+/* { dg-do compile {target lp64} } */
 /* { dg-options "-O2 -mdejagnu-cpu=power8" } */
[Bug target/100866] PPC: Inefficient code for vec_revb(vector unsigned short) < P9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100866 luoxhu at gcc dot gnu.org changed: What|Removed |Added CC||luoxhu at gcc dot gnu.org --- Comment #2 from luoxhu at gcc dot gnu.org --- But this only works for V8HImode; there is no equally good code generation for other modes like V4SI/V2DI/V1TI, i.e. doing the byte swap with only two instructions (vspltish+vrlh)?
unsigned int swap1[16] = {15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0};
unsigned int swap2[16] = {7,6,5,4,3,2,1,0,15,14,13,12,11,10,9,8};
unsigned int swap4[16] = {3,2,1,0,7,6,5,4,11,10,9,8,15,14,13,12};
unsigned int swap8[16] = {1,0,3,2,5,4,7,6,9,8,11,10,13,12,15,14};
For V4SI, for example, we need to swap shorts first and then swap words; that seems less straightforward than vperm?
[Bug target/100866] PPC: Inefficient code for vec_revb(vector unsigned short) < P9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100866 --- Comment #3 from luoxhu at gcc dot gnu.org ---
diff --git a/gcc/config/rs6000/altivec.md b/gcc/config/rs6000/altivec.md
index 097a127be07..35b3f1a0e1a 100644
--- a/gcc/config/rs6000/altivec.md
+++ b/gcc/config/rs6000/altivec.md
@@ -1932,7 +1932,7 @@ (define_insn "altivec_vpkuum_direct"
 }
   [(set_attr "type" "vecperm")])
 
-(define_insn "*altivec_vrl"
+(define_insn "altivec_vrl"
   [(set (match_operand:VI2 0 "register_operand" "=v")
         (rotate:VI2 (match_operand:VI2 1 "register_operand" "v")
                     (match_operand:VI2 2 "register_operand" "v")))]
diff --git a/gcc/config/rs6000/vsx.md b/gcc/config/rs6000/vsx.md
index 8c5865b8c34..88b34a2285a 100644
--- a/gcc/config/rs6000/vsx.md
+++ b/gcc/config/rs6000/vsx.md
@@ -5849,9 +5849,18 @@ (define_expand "revb_"
       /* Want to have the elements in reverse order relative
          to the endian mode in use, i.e. in LE mode, put elements
          in BE order.  */
-      rtx sel = swap_endian_selector_for_mode(mode);
-      emit_insn (gen_altivec_vperm_ (operands[0], operands[1],
-                                     operands[1], sel));
+      if (mode == V8HImode)
+        {
+          rtx splt = gen_reg_rtx (V8HImode);
+          emit_insn (gen_altivec_vspltish (splt, GEN_INT (8)));
+          emit_insn (gen_altivec_vrlh (operands[0], operands[1], splt));
+        }
+      else
+        {
+          rtx sel = swap_endian_selector_for_mode (mode);
+          emit_insn (gen_altivec_vperm_ (operands[0], operands[1],
+                                         operands[1], sel));
+        }
 }
With the above change, it generates the expected code:
revb:
.LFB0:
	.cfi_startproc
	vspltisw 0,8
	vrlw 2,2,0
	blr
[Bug testsuite/101020] [12 regression] Several test case failures after r12-1316
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101020 luoxhu at gcc dot gnu.org changed: What|Removed |Added Resolution|--- |FIXED Status|NEW |RESOLVED --- Comment #4 from luoxhu at gcc dot gnu.org --- Fixed.
[Bug target/100866] PPC: Inefficient code for vec_revb(vector unsigned short) < P9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100866 --- Comment #5 from luoxhu at gcc dot gnu.org --- (In reply to Segher Boessenkool from comment #4) > This PR is specifically about the vec_revb builtin. But yes, we should > look at what is generated for all other code (having only the builtin > generate good code is suboptimal for a generic thing like this), and for > other sizes as well. Sorry, I don't quite understand what you mean. IMO vec_revb is expanded through CODE_FOR_revb_v8hi via the revb_ pattern, so that is where we should change things to generate better code... For V8HI, it is natural to use vspltish 8+vrlh to turn {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15} into {1,0,3,2,5,4,7,6,9,8,11,10,13,12,15,14}. But for V4SI, we need to use vspltish+vrlh to turn it into {1,0,3,2,5,4,7,6,9,8,11,10,13,12,15,14} first, and then a "vrlw 16" to turn it into {3,2,1,0,7,6,5,4,11,10,9,8,15,14,13,12}. I am not sure whether this is better than lvx+xxlnor+vperm, especially for V2DI & V1TI with an additional "vrld 32" or "vrld 32"+"vrlq 64"? (Those are all operations on registers, without a load from memory like lvx.)
bt 5
#0 gen_revb_v8hi (operand0=0x74d4ce40, operand1=0x74d4cf60) at ../../gcc/gcc/config/rs6000/vsx.md:5858
#1 0x10b05360 in insn_gen_fn::operator() (this=0x130ab188 ) at ../../gcc/gcc/recog.h:407
#2 0x11aa1e30 in rs6000_expand_unop_builtin (icode=CODE_FOR_revb_v8hi, exp=, target=0x74d4ce40) at ../../gcc/gcc/config/rs6000/rs6000-call.c:9451
#3 0x11ab27a4 in rs6000_expand_builtin (exp=, target=0x74d4ce40, subtarget=0x0, mode=E_V8HImode, ignore=0) at ../../gcc/gcc/config/rs6000/rs6000-call.c:13157
#4 0x10815268 in expand_builtin (exp=, target=0x74d4ce40, subtarget=0x0, mode=E_V8HImode, ignore=0) at ../../gcc/gcc/builtins.c:9559
[Bug target/93571] PPC: fmr gets used instead of faster xxlor
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93571 luoxhu at gcc dot gnu.org changed: What|Removed |Added CC||luoxhu at gcc dot gnu.org --- Comment #2 from luoxhu at gcc dot gnu.org --- It is generated by "*mov_hardfloat64" (i.e. {*movdf_hardfloat64}); switching the order of the fmr and xxlor constraints generates the expected code. Is that correct?
[Bug target/93571] PPC: fmr gets used instead of faster xxlor
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93571 --- Comment #3 from luoxhu at gcc dot gnu.org --- BTW, I didn't see a performance difference between fmr and xxlor in a small benchmark.
Max Ops Per Cycle / Latency (Min) / Latency (Max)
fmr - - ALU FPR 4 2 2 1 R - - - - Floating Move Register
xxlor - - ALU VSR 2 2 2 1 V - 1 S - - VSX Vector Logical OR
[Bug target/100866] PPC: Inefficient code for vec_revb(vector unsigned short) < P9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100866 --- Comment #6 from luoxhu at gcc dot gnu.org --- For V4SI, it is also better to use vector splat and vector rotate operations:
revb:
.LFB0:
	.cfi_startproc
	vspltish %v1,8
	vspltisw %v0,-16
	vrlh %v2,%v2,%v1
	vrlw %v2,%v2,%v0
	blr
Performance improved from 7.322s to 2.445s in a small benchmark, because the load instruction is replaced. But for V2DI we don't have a "vspltisd" to splat {32,32} into a vector register before Power9, so lvx is still required?
vector unsigned long long revb_pwr7_l(vector unsigned long long a)
{
  return vec_rl(a, vec_splats((unsigned long long)32));
}
generates:
revb_pwr7_l:
.LFB1:
	.cfi_startproc
.LCF1:
0:	addis 2,12,.TOC.-.LCF1@ha
	addi 2,2,.TOC.-.LCF1@l
	.localentry revb_pwr7_l,.-revb_pwr7_l
	addis %r9,%r2,.LC0@toc@ha
	addi %r9,%r9,.LC0@toc@l
	lvx %v0,0,%r9
	vrld %v2,%v2,%v0
	blr
.LC0:
	.quad 32
	.quad 32
	.align 4
[Bug target/100866] PPC: Inefficient code for vec_revb(vector unsigned short) < P9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100866 --- Comment #8 from luoxhu at gcc dot gnu.org --- (In reply to Jens Seifert from comment #7) > Regarding vec_revb for vector unsigned int. I agree that > revb: > .LFB0: > .cfi_startproc > vspltish %v1,8 > vspltisw %v0,-16 > vrlh %v2,%v2,%v1 > vrlw %v2,%v2,%v0 > blr > > works. But in this case, I would prefer the vperm approach assuming that the > loaded constant for the permute vector can be re-used multiple times. > But please get rid of the xxlnor 32,32,32. That does not make sense after > loading a constant. Change the constant that need to be loaded. The xxlnor is an LE-specific requirement (it is not present when building with -mbig): we need to turn the indices {0,1,2,3,...} into {31,30,29,28,...} for vperm's use. It is required; without it vperm produces an incorrect result:
6|0x1630 <+16>: lvx v0,0,r9
7+> 0x1634 <+20>: xxlnor vs32,vs32,vs32
8|0x1638 <+24>: vperm v2,v2,v2,v0
9|0x163c <+28>: blr
(gdb)
0x1634 in revb ()
2: /x $vs34.uint128 = 0x42345678323456782234567812345678
5: /x $vs32.uint128 = 0xc0d0e0f08090a0b0405060700010203
(gdb) si
0x1638 in revb ()
2: /x $vs34.uint128 = 0x42345678323456782234567812345678
5: /x $vs32.uint128 = 0xf3f2f1f0f7f6f5f4fbfaf9f8fffefdfc
(gdb) si
0x163c in revb ()
2: /x $vs34.uint128 = 0x78563442785634327856342278563412
5: /x $vs32.uint128 = 0xf3f2f1f0f7f6f5f4fbfaf9f8fffefdfc
Quoted from the ISA:
vperm VRT,VRA,VRB,VRC
vsrc.qword[0] ← VSR[VRA+32]
vsrc.qword[1] ← VSR[VRB+32]
do i = 0 to 15
   index ← VSR[VRC+32].byte[i].bit[3:7]
   VSR[VRT+32].byte[i] ← src.byte[index]
end
Let the source vector be the concatenation of the contents of VSR[VRA+32] followed by the contents of VSR[VRB+32]. For each integer value i from 0 to 15, do the following. Let index be the value specified by bits 3:7 of byte element i of VSR[VRC+32]. The contents of byte element index of src are placed into byte element i of VSR[VRT+32].
[Bug target/100866] PPC: Inefficient code for vec_revb(vector unsigned short) < P9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100866 --- Comment #13 from luoxhu at gcc dot gnu.org --- It is not visible in combine, because the constant data is in *.LC0 and behind an UNSPEC_VPERM. I will shelve this and switch to other high-priority issues.
pr100866.c.277r.combine:
(note 4 0 20 2 [bb 2] NOTE_INSN_BASIC_BLOCK)
(insn 20 4 2 2 (set (reg:V8HI 126) (reg:V8HI 66 %v2 [ a ])) "pr100866.c":18:1 1132 {vsx_movv8hi_64bit} (expr_list:REG_DEAD (reg:V8HI 66 %v2 [ a ]) (nil)))
(note 2 20 3 2 NOTE_INSN_DELETED)
(note 3 2 6 2 NOTE_INSN_FUNCTION_BEG)
(insn 6 3 18 2 (set (reg/f:DI 122) (unspec:DI [ (symbol_ref/u:DI ("*.LC0") [flags 0x82]) (reg:DI 2 %r2) ] UNSPEC_TOCREL)) "pr100866.c":19:13 719 {*tocrefdi} (expr_list:REG_EQUAL (symbol_ref/u:DI ("*.LC0") [flags 0x82]) (nil)))
(insn 18 6 9 2 (set (reg:V16QI 123) (mem/u/c:V16QI (and:DI (reg/f:DI 122) (const_int -16 [0xfff0])) [0 S16 A128])) "pr100866.c":19:13 1131 {vsx_movv16qi_64bit} (expr_list:REG_DEAD (reg/f:DI 122) (nil)))
(insn 9 18 10 2 (set (reg:V16QI 124) (not:V16QI (reg:V16QI 123))) "pr100866.c":19:13 508 {one_cmplv16qi2} (expr_list:REG_DEAD (reg:V16QI 123) (nil)))
(note 10 9 15 2 NOTE_INSN_DELETED)
(insn 15 10 16 2 (set (reg/i:V8HI 66 %v2) (unspec:V8HI [ (reg:V8HI 126) repeated x2 (reg:V16QI 124) ] UNSPEC_VPERM)) "pr100866.c":20:1 1830 {altivec_vperm_v8hi_direct} (expr_list:REG_DEAD (reg:V16QI 124) (expr_list:REG_DEAD (reg:V8HI 126) (nil))))
(insn 16 15 0 2 (use (reg/i:V8HI 66 %v2)) "pr100866.c":20:1 -1 (nil))
;; Combiner totals: 12 attempts, 12 substitutions (2 requiring new space),
[Bug middle-end/101250] New: adjust_iv_update_pos update the iv statement unexpectedly cause memory address offset mismatch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101250 Bug ID: 101250 Summary: adjust_iv_update_pos update the iv statement unexpectedly cause memory address offset mismatch Product: gcc Version: 12.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: luoxhu at gcc dot gnu.org Target Milestone: --- Test case: unsigned int foo (unsigned char *ip, unsigned char *ref, unsigned int maxlen) { unsigned int len = 2; do { len++; }while(len < maxlen && ip[len] == ref[len]); return len; } ivopts: [local count: 1014686026]: _3 = MEM[(unsigned char *)ip_10(D) + ivtmp.16_15 * 1]; ivtmp.16_16 = ivtmp.16_15 + 1; _19 = ref_12(D) + 18446744073709551615; _6 = MEM[(unsigned char *)_19 + ivtmp.16_16 * 1]; if (_3 == _6) goto ; [94.50%] else goto ; [5.50%] Disable adjust_iv_update_pos will produce: [local count: 1014686026]: _3 = MEM[(unsigned char *)ip_10(D) + ivtmp.16_15 * 1]; _6 = MEM[(unsigned char *)ref_12(D) + ivtmp.16_15 * 1]; ivtmp.16_16 = ivtmp.16_15 + 1; if (_3 == _6) goto ; [94.50%] else goto ; [5.50%] discussions: https://gcc.gnu.org/pipermail/gcc-patches/2021-June/573709.html
[Bug lto/105133] New: lto/gold: lto failed to link --start-lib/--end-lib in gold
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105133 Bug ID: 105133 Summary: lto/gold: lto failed to link --start-lib/--end-lib in gold Product: gcc Version: 12.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: lto Assignee: unassigned at gcc dot gnu.org Reporter: luoxhu at gcc dot gnu.org CC: marxin at gcc dot gnu.org Target Milestone: --- Hi, the gold linker supports --start-lib and --end-lib, which "mimics the semantics of static libraries, but without needing to actually create the archive file" (https://reviews.llvm.org/D66848). A large application may sometimes pull in multiple libraries from different repositories with the same source code, all finally linked into one binary. I recently hit a link error with gold as the linker and reduced it to the example below:
cat hello.c
extern int hello(int a);
int main(void) { return 0; /* hello(10); */ }
cat ./B/libhello.c
#include <stdio.h>
int hello(int a) { puts("Hello"); return 0; }
cat ./C/libhello.c
#include <stdio.h>
int hello(int a) { puts("Hello"); return 0; }
(1) A non-LTO link with gold is OK:
gcc -O2 -o ./B/libhello.c.o -c ./B/libhello.c
gcc-ar qc ./B/libhello.a ./B/libhello.c.o
gcc-ranlib ./B/libhello.a
gcc -O2 -o ./C/libhello.c.o -c ./C/libhello.c
gcc-ar qc ./C/libhello.a ./C/libhello.c.o
gcc-ranlib ./C/libhello.a
gcc hello.c -o hello.o -c -O2
gcc -o hellow hello.o -Wl,--start-lib ./B/libhello.c.o -Wl,--end-lib -Wl,--start-lib ./C/libhello.c.o -Wl,--end-lib -O2 -fuse-ld=gold
(2) An LTO link with gold fails with a redefinition error:
gcc -O2 -flto -o ./B/libhello.c.o -c ./B/libhello.c
gcc-ar qc ./B/libhello.a ./B/libhello.c.o
gcc-ranlib ./B/libhello.a
gcc -O2 -flto -o ./C/libhello.c.o -c ./C/libhello.c
gcc-ar qc ./C/libhello.a ./C/libhello.c.o
gcc-ranlib ./C/libhello.a
gcc hello.c -o hello.o -c -O2 -flto
gcc -o hellow hello.o -Wl,--start-lib ./B/libhello.c.o -Wl,--end-lib -Wl,--start-lib ./C/libhello.c.o -Wl,--end-lib -O2 -flto -fuse-ld=gold
./B/libhello.c:5:5: error: 'hello' has already been defined
    5 | int hello(int a)
      |     ^
./B/libhello.c:5:5: note: previously defined here
lto1: fatal error: errors during merging of translation units
compilation terminated.
lto-wrapper: fatal error: gcc returned 1 exit status
compilation terminated.
/usr/bin/ld.gold: fatal error: lto-wrapper failed
collect2: error: ld returned 1 exit status
This error happens in gcc/lto/lto-symtab.c:lto_symtab_resolve_symbols; simply removing the error_at line makes the link succeed, but that may not be a reasonable fix.
/* Find the single non-replaceable prevailing symbol and diagnose ODR violations. */
for (e = first; e; e = e->next_sharing_asm_name)
  {
    if (!lto_symtab_resolve_can_prevail_p (e))
      continue;
    /* If we have a non-replaceable definition it prevails. */
    if (!lto_symtab_resolve_replaceable_p (e))
      {
        if (prevailing)
          {
            error_at (DECL_SOURCE_LOCATION (e->decl),
                      "%qD has already been defined", e->decl);
            inform (DECL_SOURCE_LOCATION (prevailing->decl),
                    "previously defined here");
          }
        prevailing = e;
      }
  }
cat hellow.res
3
hello.o 2
192 ccb9165e03755470 PREVAILING_DEF main
197 ccb9165e03755470 PREVAILING_DEF_IRONLY s
./B/libhello.c.o 1
205 68e0b97e93a52d7a PREEMPTED_REG hello
./C/libhello.c.o 1
205 18fe2d3482bfb511 PREEMPTED_REG hello
Secondly, if hello(10) is actually called in hello.c, NO error is reported. The difference is that the resolution type changes from PREEMPTED_REG to RESOLVED_IR/PREVAILING_DEF_IRONLY:
3
hello.o 3
192 19ef867d12f62129 PREVAILING_DEF main
197 19ef867d12f62129 PREVAILING_DEF_IRONLY s
201 19ef867d12f62129 RESOLVED_IR hello
./B/libhello.c.o 1
205 23c5c855935478ce PREVAILING_DEF_IRONLY hello
./C/libhello.c.o 1
205 abbf050f5c23b448 PREEMPTED_REG hello
Is this a valid bug? Thanks.
[Bug lto/105133] lto/gold: lto failed to link --start-lib/--end-lib in gold for duplicate libraries
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105133 --- Comment #2 from luoxhu at gcc dot gnu.org --- (In reply to Richard Biener from comment #1) > (In reply to luoxhu from comment #0) > > > > cat hellow.res > > 3 > > hello.o 2 > > 192 ccb9165e03755470 PREVAILING_DEF main > > 197 ccb9165e03755470 PREVAILING_DEF_IRONLY s > > ./B/libhello.c.o 1 > > 205 68e0b97e93a52d7a PREEMPTED_REG hello > > ./C/libhello.c.o 1 > > 205 18fe2d3482bfb511 PREEMPTED_REG hello > > This looks like a gold bug - we have 'hello' pre-empted twice but no > prevailing > symbol in the IR - are you ending up with fat LTO objects? They are not fat LTO objects, since I didn't add -ffat-lto-objects when generating the libs:
nm libhello.a
libhello.c.o:
nm: libhello.c.o: plugin needed to handle lto object
0001 C __gnu_lto_slim
> > OTOH PREEMPTED_REG seems then handled wrongly by LTO as well - it should > throw away both copies since the linker told us it found a preempting > definition in a non-IR object file. So I'd expect a unresolved reference > to 'hello' rather than LTO complaining about multiple definitions ... Will you fix it? :) > > Note gold is really unmaintained, so you should probably avoid using it. Thanks. I will try lld instead.
[Bug tree-optimization/106293] [13 Regression] 456.hmmer at -Ofast -march=native regressed by 19% on zen2 and zen3 in July 2022
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106293

--- Comment #4 from luoxhu at gcc dot gnu.org ---
(In reply to Richard Biener from comment #2)
> I can reproduce a regression with -Ofast -march=znver2 running on Haswell as
> well.  -fopt-info doesn't reveal anything interesting besides
>
> -fast_algorithms.c:133:19: optimized: loop with 2 iterations completely
> unrolled (header execution count 32987933)
> +fast_algorithms.c:133:19: optimized: loop with 2 iterations completely
> unrolled (header execution count 129072791)
>
> obviously the slowdown is in P7Viterbi.  There's only minimal changes on the
> GIMPLE side, one notable:
>
>   niters_vector_mult_vf.205_2406 = niters.203_442 & 429496729  |   _2041 = niters.203_438 & 3;
>   _2408 = (int) niters_vector_mult_vf.205_2406;                |   if (_2041 == 0)
>   tmp.206_2407 = k_384 + _2408;                                |     goto <bb 66>; [25.00%]
>   _2300 = niters.203_442 & 3;                                  <
>   if (_2300 == 0)                                              <
>     goto ; [25.00%]                                            <
>   else                                                             else
>     goto ; [75.00%]                                                  goto <bb 36>; [75.00%]
>   [local count: 41646173]:                                     |   [local count: 177683003]:
>   # k_2403 = PHI                                               |   niters_vector_mult_vf.205_2409 = niters.203_438 & 429496729
>   # DEBUG k => k_2403                                          |   _2411 = (int) niters_vector_mult_vf.205_2409;
>                                                                >   tmp.206_2410 = k_382 + _2411;
>                                                                >   [local count: 162950122]:
>                                                                >   # k_2406 = PHI
>
> the sink pass now does the transform where it did not do so before.
>
> That's apparently because of
>
>   /* If BEST_BB is at the same nesting level, then require it to have
>      significantly lower execution frequency to avoid gratuitous movement.  */
>   if (bb_loop_depth (best_bb) == bb_loop_depth (early_bb)
>       /* If result of comparsion is unknown, prefer EARLY_BB.
> 	 Thus use !(...>=..) rather than (...<...)  */
>       && !(best_bb->count * 100 >= early_bb->count * threshold))
>     return best_bb;
>
>   /* No better block found, so return EARLY_BB, which happens to be the
>      statement's original block.  */
>   return early_bb;
>
> where the SRC count is 96726596 before, 236910671 after and the
> destination count is 72544947 before, 177683003 at the destination after.
> The edge probabilities are 75% vs 25% and param_sink_frequency_threshold
> is exactly 75 as well.  Since 236910671*0.75 is rounded down it passes the
> test while the previous state has an exact match defeating it.
>
> It's a little bit of an arbitrary choice,
>
> diff --git a/gcc/tree-ssa-sink.cc b/gcc/tree-ssa-sink.cc
> index 2e744d6ae50..9b368e13463 100644
> --- a/gcc/tree-ssa-sink.cc
> +++ b/gcc/tree-ssa-sink.cc
> @@ -230,7 +230,7 @@ select_best_block (basic_block early_bb,
>    if (bb_loop_depth (best_bb) == bb_loop_depth (early_bb)
>        /* If result of comparsion is unknown, prefer EARLY_BB.
> 	 Thus use !(...>=..) rather than (...<...)  */
> -      && !(best_bb->count * 100 >= early_bb->count * threshold))
> +      && !(best_bb->count * 100 > early_bb->count * threshold))
>        return best_bb;
>
>    /* No better block found, so return EARLY_BB, which happens to be the
>       statement's original block.  */
>    return early_bb;
>
> fixes the missed sinking but not the regression :/
>
> The count differences start to appear when LC PHI blocks are added
> only for virtuals, and then pre-existing 'Invalid sum of incoming counts'
> eventually lead to mismatches.  The 'Invalid sum of incoming counts'
> start with the loop splitting pass.
>
> fast_algorithms.c:145:10: optimized: loop split
>
> Xionghu Luo did profile count updates there, not sure if that made things
> worse in this case.
>
> At least with broken BB counts splitting/unsplitting an edge can propagate
> bogus counts elsewhere it seems.

:( Could you please try reverting cd5ae148c47c6dee05adb19acd6a523f7187be7f and see whether the performance comes back?
[Bug tree-optimization/106293] [13 Regression] 456.hmmer at -Ofast -march=native regressed by 19% on zen2 and zen3 in July 2022
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106293

--- Comment #5 from luoxhu at gcc dot gnu.org ---
r12-6086
[Bug target/106069] [12/13 Regression] wrong code with -O -fno-tree-forwprop -maltivec on ppc64le
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106069 --- Comment #12 from luoxhu at gcc dot gnu.org --- Created attachment 53352 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53352&action=edit combine
[Bug target/106069] [12/13 Regression] wrong code with -O -fno-tree-forwprop -maltivec on ppc64le
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106069 --- Comment #13 from luoxhu at gcc dot gnu.org --- Created attachment 53353 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53353&action=edit after combine