[Bug testsuite/94036] [9 regression] gcc.target/powerpc/pr72804.c fails

2020-03-05 Thread luoxhu at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94036

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 Status|UNCONFIRMED |ASSIGNED
   Last reconfirmed||2020-03-06
   Assignee|unassigned at gcc dot gnu.org  |luoxhu at gcc dot gnu.org
 Ever confirmed|0   |1

--- Comment #1 from luoxhu at gcc dot gnu.org ---
patch posted:

https://gcc.gnu.org/ml/gcc-patches/2020-03/msg00284.html

[Bug testsuite/94036] [9 regression] gcc.target/powerpc/pr72804.c fails

2020-03-10 Thread luoxhu at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94036

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #2 from luoxhu at gcc dot gnu.org ---
Committed in r9-8357(85c08558c66dd8e2000a4ad282ca03368028fce3).

[Bug target/91518] [9/10 Regression] segfault when run CPU2006 465.tonto since r263875

2020-03-26 Thread luoxhu at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91518

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 CC||luoxhu at gcc dot gnu.org

--- Comment #8 from luoxhu at gcc dot gnu.org ---
patch sent to: https://gcc.gnu.org/pipermail/gcc-patches/2020-March/542693.html

[Bug target/61837] missed loop invariant expression optimization

2020-04-07 Thread luoxhu at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61837

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 CC||luoxhu at gcc dot gnu.org

--- Comment #3 from luoxhu at gcc dot gnu.org ---
"addi 8,4,-1" and "subf 9,8,5" cannot be hoisted out because they depend on
"lbzu 9,1(8)": r8 needs to be reinitialized to p2-1 in each iteration of the
outer loop. Only the result of subf 9,8,5 is loop invariant: (p2+s-1)-(p2-1).

But the code from the latest GCC could still be optimized, as A, B and C are
loop invariant.

foo:
.LFB0:
.cfi_startproc
cmpwi 7,5,0
li 6,0
rldicl 5,5,0,32
li 7,0
.p2align 4,,15
.L2:
ble 7,.L7
addi 8,5,-1   // A
addi 10,4,-1
rldicl 8,8,0,32   // B
mr 9,3
addi 8,8,1// C
mtctr 8
.p2align 5
.L4:
lbzu 8,1(10)
cmpw 0,8,7
bne 0,.L3
stw 6,0(9)
.L3:
addi 9,9,4
bdnz .L4
.L7:
addi 6,6,88
addi 7,7,1
cmpwi 0,6,
extsw 7,7
extsw 6,6
bne 0,.L2
blr
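
A minimal C sketch of what hoisting A, B and C would mean at the source level. This is a hypothetical reconstruction of the loop nest from the listing above, not the test case from the bug: the function names, the use of unsigned char for p2 and the outer bound of 100 are all assumptions (the listing truncates the bound). The inner trip count zero_ext(s - 1) + 1 depends only on s, so it can be computed once before the outer loop.

```c
#include <assert.h>
#include <stdint.h>

/* Outer trip count: an assumption, the listing truncates the real bound. */
enum { OUTER_ITERS = 100 };

/* Inner trip count as insns A, B, C compute it: zero_ext(s - 1) + 1. */
static uint64_t inner_trip_count(int s)
{
    assert(s > 0);
    return (uint64_t)(uint32_t)(s - 1) + 1;
}

/* Trip count recomputed on every outer iteration, as in the listing. */
void fill_recompute(int *p1, const unsigned char *p2, int s)
{
    int v = 0;
    for (int n = 0; n < OUTER_ITERS; n++) {
        if (s > 0) {
            uint64_t ctr = inner_trip_count(s);
            for (uint64_t i = 0; i < ctr; i++)
                if (p2[i] == n)
                    p1[i] = v;
        }
        v += 88;
    }
}

/* Same loop nest with the invariant trip count hoisted by hand. */
void fill_hoisted(int *p1, const unsigned char *p2, int s)
{
    if (s <= 0)
        return;
    const uint64_t ctr = inner_trip_count(s);
    int v = 0;
    for (int n = 0; n < OUTER_ITERS; n++) {
        for (uint64_t i = 0; i < ctr; i++)
            if (p2[i] == n)
                p1[i] = v;
        v += 88;
    }
}
```

Both functions store the same values; the second only moves the s > 0 test and the trip-count computation out of the outer loop.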

[Bug target/61837] missed loop invariant expression optimization

2020-04-13 Thread luoxhu at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61837

--- Comment #5 from luoxhu at gcc dot gnu.org ---
"-O2 -funswitch-loops" generates the expected code for s<=0. Since
unswitch-loops is enabled by -O3, is this issue reduced to a duplicate of
PR67288?

foo:
.LFB0:
.cfi_startproc
cmpwi 0,5,0
blelr 0
rldicl 5,5,0,32
addi 4,4,-1
li 6,0
li 7,0
.p2align 4,,15
.L2:
rldicl 8,5,0,32
mr 10,4
mtctr 8
mr 9,3
.p2align 5
.L5:
lbzu 8,1(10)
cmpw 0,8,7
bne 0,.L4
stw 6,0(9)
.L4:
addi 9,9,4
bdnz .L5
addi 6,6,88
addi 7,7,1
cmpwi 0,6,
extsw 7,7
extsw 6,6
bne 0,.L2
blr
.long 0
.byte 0,0,0,0,0,0,0,0
.cfi_endproc
.LFE0:

[Bug target/61837] missed loop invariant expression optimization

2020-04-14 Thread luoxhu at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61837

--- Comment #7 from luoxhu at gcc dot gnu.org ---
(In reply to Segher Boessenkool from comment #6)
> But -funswitch-loops is much stronger than we want here, and the wrong
> thing to use at -O2 (it often generates *slower* code!)

I am not sure what you mean here. -funswitch-loops is what generates the
"blelr 0" you pointed out in
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61837#c4); it does not optimize
"-1, zero_ext, +1", which is a matter of moving the loop invariant out. And if
"-1, zero_ext, +1" could be simplified to "zero_ext" for a non-zero value, this
is actually a special case of PR67288.
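
A small C sketch of why "-1, zero_ext, +1" only collapses to "zero_ext" for a non-zero value. The function names are made up for illustration and the subtraction is done in 32-bit unsigned arithmetic to keep the C well defined. For s == 0 the subtraction wraps to 0xFFFFFFFF before the zero extension, so the two forms differ by 2^32; for every other 32-bit value they agree.

```c
#include <assert.h>
#include <stdint.h>

/* "-1, zero_ext, +1": subtract in 32 bits, zero-extend, add in 64 bits. */
uint64_t count_sub_zext_add(int32_t s)
{
    return (uint64_t)((uint32_t)s - 1u) + 1u;
}

/* Plain "zero_ext". */
uint64_t count_zext(int32_t s)
{
    return (uint64_t)(uint32_t)s;
}

/* The simplified form is only a valid replacement when s is non-zero. */
uint64_t count_simplified(int32_t s)
{
    assert(s != 0);
    return count_zext(s);
}
```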

[Bug target/61837] missed loop invariant expression optimization

2020-04-14 Thread luoxhu at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61837

--- Comment #9 from luoxhu at gcc dot gnu.org ---
(In reply to Segher Boessenkool from comment #8)
> -funswitch-loops changes things like
> 
>   for (...) {
> if (...)
>   ...1;
> else
>   ...2;
>   }
> 
> into
> 
>   if (...) {
> for (...)
>   ...1;
>   } else {
> for (...)
>   ...2;
>   }
> 
> which often is not a good idea.  This is why this is not done at -O2:
> -O2 is only for optimisations that almost never hurt performance.

Yes, for this case it performs better with unswitch-loops, and I see many uses
of -O2 with unswitch-loops in the testsuite.  I thought you meant doing this at
-O2 without -funswitch-loops...
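
For reference, the transformation quoted above written out by hand in C. This is only an illustration with made-up function names, not code from the bug. The condition does not change inside the loop, so it can be tested once, at the price of duplicating the loop body.

```c
#include <assert.h>
#include <stddef.h>

/* Loop-invariant condition tested on every iteration. */
long sum_switched(const int *a, size_t n, int negate)
{
    assert(a != NULL || n == 0);
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (negate)
            sum -= a[i];
        else
            sum += a[i];
    }
    return sum;
}

/* Hand-unswitched form: one test, two copies of the loop. */
long sum_unswitched(const int *a, size_t n, int negate)
{
    assert(a != NULL || n == 0);
    long sum = 0;
    if (negate) {
        for (size_t i = 0; i < n; i++)
            sum -= a[i];
    } else {
        for (size_t i = 0; i < n; i++)
            sum += a[i];
    }
    return sum;
}
```

The duplicated loop is the code-size cost that makes this a poor default at -O2.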

[Bug tree-optimization/83403] Missed register promotion opportunities in loop

2020-04-27 Thread luoxhu at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83403

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 CC||luoxhu at gcc dot gnu.org

--- Comment #7 from luoxhu at gcc dot gnu.org ---
With int the refs are recognized as independent, but with unsigned this fails.
I drafted a patch that uses the range info when checking the CONVERT expression
on PLUS/MINUS/MULT for wrapping overflow (unsigned).

https://gcc.gnu.org/pipermail/gcc-patches/2020-April/544684.html


(gdb) p debug_aff(&off1)
{
  type = sizetype
  offset = 8
  elements = {
[0] = (long unsigned int) n_93 * 80,
[1] = &C * 1
  }
}
$571 = void
(gdb) p debug_aff(&off2)
{
  type = sizetype
  offset = 0
  elements = {
[0] = (long unsigned int) n_93 * 80,
[1] = &C * 1
  }
}

Is this a reasonable solution, please?
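
A small C sketch of the general reason unsigned is harder here; it is not the code from the patch, and all names are hypothetical. An unsigned 32-bit add may wrap before it is widened to sizetype, so the conversion cannot be distributed over the addition unless the value range rules the wrap out.

```c
#include <assert.h>
#include <stdint.h>

/* Add in 32-bit unsigned arithmetic (may wrap), then widen. */
uint64_t add_then_widen(uint32_t n, uint32_t k)
{
    return (uint64_t)(uint32_t)(n + k);
}

/* Widen first, then add in 64 bits (cannot wrap for 32-bit inputs). */
uint64_t widen_then_add(uint32_t n, uint32_t k)
{
    return (uint64_t)n + k;
}

/* With n known to lie in [0, max_n], the two forms agree if this holds. */
int no_wrap_in_range(uint32_t max_n, uint32_t k)
{
    return max_n <= UINT32_MAX - k;
}

/* Distributing the conversion is safe once the range check has passed. */
uint64_t widen_with_range(uint32_t n, uint32_t k, uint32_t max_n)
{
    assert(n <= max_n && no_wrap_in_range(max_n, k));
    return widen_then_add(n, k);
}
```

With signed int the overflow is undefined, so the compiler may assume it does not happen; with unsigned it needs something like the range check above.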

[Bug middle-end/26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)

2020-05-11 Thread luoxhu at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
Bug 26163 depends on bug 91518, which changed state.

Bug 91518 Summary: [9 Regression] segfault when run CPU2006 465.tonto since r263875
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91518

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

[Bug target/91518] [9 Regression] segfault when run CPU2006 465.tonto since r263875

2020-05-11 Thread luoxhu at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91518

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|NEW |RESOLVED

--- Comment #12 from luoxhu at gcc dot gnu.org ---
Also fixed on gcc-9.

[Bug tree-optimization/83403] Missed register promotion opportunities in loop

2020-05-11 Thread luoxhu at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83403

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #9 from luoxhu at gcc dot gnu.org ---
Fixed on master.

[Bug tree-optimization/53947] [meta-bug] vectorizer missed-optimizations

2020-05-11 Thread luoxhu at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
Bug 53947 depends on bug 83403, which changed state.

Bug 83403 Summary: Missed register promotion opportunities in loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83403

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

[Bug tree-optimization/88842] missing optimization CSE, reassociation

2020-05-14 Thread luoxhu at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88842

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 CC||luoxhu at gcc dot gnu.org

--- Comment #3 from luoxhu at gcc dot gnu.org ---
One more case of missed optimization CSE, reassoc:

void foo(unsigned int a, unsigned int b, unsigned int c, unsigned int d, int
*res1, int*res2, int *res3)
{
  *res1 = a + b + c + d;
  *res2 = b + c;
  *res3 = a + d;
}

cat foo.s
.file   "foo.c"
.machine power8
.abiversion 2
.section".text"
.align 2
.p2align 4,,15
.globl foo
.type   foo, @function
foo:
.LFB0:
.cfi_startproc
add 10,5,6
add 10,10,4
add 4,4,5
add 10,10,3
add 3,3,6
stw 10,0(7)
stw 4,0(8)
stw 3,0(9)
blr
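
For comparison, a hand-reassociated C version (the function names are illustrative only). The listing above uses five add instructions; since unsigned addition is associative and commutative, b + c and a + d can be computed once and reused, which leaves three additions.

```c
#include <assert.h>

/* The form from the test case: four source-level additions. */
void foo_original(unsigned int a, unsigned int b, unsigned int c,
                  unsigned int d, int *res1, int *res2, int *res3)
{
    assert(res1 && res2 && res3);
    *res1 = a + b + c + d;
    *res2 = b + c;
    *res3 = a + d;
}

/* Reassociated by hand so the two partial sums are shared. */
void foo_reassoc(unsigned int a, unsigned int b, unsigned int c,
                 unsigned int d, int *res1, int *res2, int *res3)
{
    assert(res1 && res2 && res3);
    unsigned int bc = b + c;
    unsigned int ad = a + d;
    *res1 = ad + bc;
    *res2 = bc;
    *res3 = ad;
}
```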

[Bug rtl-optimization/37451] Extra addition for doloop in some cases

2020-05-14 Thread luoxhu at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=37451

--- Comment #11 from luoxhu at gcc dot gnu.org ---
fixed on master.

[Bug rtl-optimization/37451] Extra addition for doloop in some cases

2020-05-14 Thread luoxhu at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=37451

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #12 from luoxhu at gcc dot gnu.org ---
Close this.

[Bug target/70053] Returning a struct of _Decimal128 values generates extraneous stores and loads

2020-05-20 Thread luoxhu at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70053

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 CC||bergner at gcc dot gnu.org,
   ||luoxhu at gcc dot gnu.org

--- Comment #4 from luoxhu at gcc dot gnu.org ---
Is this just the difference between -O3 and -O2? Since -O3 is OK, maybe this
bug is no longer valid?

$ /opt/at10.0/bin/gcc -O3 -S pr70053.c
$ cat pr70053.s
.file   "pr70053.c"
.abiversion 2
.section".text"
.align 2
.p2align 4,,15
.globl D256_add_finite
.type   D256_add_finite, @function
D256_add_finite:
dcmpuq 7,4,6
beq 7,.L3
fmr 7,3
fmr 6,2
fmr 3,7
fmr 2,6
blr
.p2align 4,,15
.L3:
fmr 5,7
fmr 4,6
fmr 3,7
fmr 2,6
blr
.long 0
.byte 0,0,0,0,0,0,0,0
.size   D256_add_finite,.-D256_add_finite
.ident  "GCC: (GNU) 6.4.1 20170720 (Advance-Toolchain-at10.0) IBM AT 10
branch, based on subversion id 250395."
.section.note.GNU-stack,"",@progbits

[Bug target/30271] -mstrict-align can an store extra for struct agrument passing

2020-05-20 Thread luoxhu at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=30271

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 CC||luoxhu at gcc dot gnu.org

--- Comment #12 from luoxhu at gcc dot gnu.org ---
Fixed since at least GCC 4.9.4?
$ /opt/at8.0/bin/gcc -O3 -c -S  pr30271.c -mstrict-align
$ cat pr30271.s
.file   "pr30271.c"
.abiversion 2
.section".toc","aw"
.section".text"
.align 2
.p2align 4,,15
.globl f
.type   f, @function
f:
extsh 9,3
srawi 3,3,16
add 3,9,3
extsw 3,3
blr
.long 0
.byte 0,0,0,0,0,0,0,0
.size   f,.-f
.ident  "GCC: (GNU) 4.9.4 20150824 (Advance-Toolchain-at8.0)
[ibm/gcc-4_9-branch, revision: 227153 merged from gcc-4_9-branch, revision
227151]"
.section.note.GNU-stack,"",@progbits

[Bug target/69493] Poor code generation for return of struct containing vectors on PPC64LE

2020-05-20 Thread luoxhu at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69493

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 CC||luoxhu at gcc dot gnu.org

--- Comment #9 from luoxhu at gcc dot gnu.org ---
No load/store on Power9.
cat pr69493.s
.file   "pr69493.c"
.abiversion 2
.section".text"
.align 2
.p2align 4,,15
.globl test_big_double
.type   test_big_double, @function
test_big_double:
.LFB0:
.cfi_startproc
mfvsrd 7,1
mfvsrd 10,2
mfvsrd 8,3
mfvsrd 9,4
mtvsrdd 34,10,7
mtvsrdd 35,9,8
blr
.long 0
.byte 0,0,0,0,0,0,0,0
.cfi_endproc
.LFE0:
.size   test_big_double,.-test_big_double
.ident  "GCC: (GNU) 9.2.1 20191023 (Advance-Toolchain 13.0-1)
[aba1f4e8b6ac]"
.gnu_attribute 4, 5
.section.note.GNU-stack,"",@progbits

[Bug target/70053] Returning a struct of _Decimal128 values generates extraneous stores and loads

2020-05-25 Thread luoxhu at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70053

--- Comment #6 from luoxhu at gcc dot gnu.org ---
"-O2 -ftree-slp-vectorize" could also generate the expected simple fmrs.

The reason is that pass_cselim transforms conditional stores into unconditional
ones with PHI instructions when vectorization and if-conversion are enabled
(gcc/tree-ssa-phiopt.c:2482).

pr70053.c.108t.cdce:
D256_add_finite (_Decimal128 a, _Decimal128 b, _Decimal128 c)
{
  struct TDx2_t D.2914;

   [local count: 1073741824]:
  if (b_4(D) == c_5(D))
goto ; [34.00%]
  else
goto ; [66.00%]

   [local count: 365072224]:
  D.2914.td0 = c_5(D);
  D.2914.td1 = c_5(D);
  goto ; [100.00%]

   [local count: 708669601]:
  D.2914.td0 = a_3(D);
  D.2914.td1 = b_4(D);

   [local count: 1073741824]:
  return D.2914;

}

=> pr70053.c.109t.cselim:

D256_add_finite (_Decimal128 a, _Decimal128 b, _Decimal128 c)
{
  struct TDx2_t D.2914;
  _Decimal128 cstore_10;
  _Decimal128 cstore_11;

   [local count: 1073741824]:
  if (b_4(D) == c_5(D))
goto ; [34.00%]
  else
goto ; [66.00%]

   [local count: 708669601]:

   [local count: 1073741824]:
  # cstore_10 = PHI 
  # cstore_11 = PHI 
  D.2914.td1 = cstore_11;
  D.2914.td0 = cstore_10;
  return D.2914;

}

Then at the expand pass, the PHI instruction "cstore_10 = PHI " will be expanded to a move for "-O2 -ftree-slp-vectorize". If no such
PHI is generated, bb3 and bb4 in pr70053.c.108t.cdce will be expanded to
STORE/LOAD with TD->DI conversion, which finally causes a lot of st/ld conversions.
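
A source-level C sketch of the cselim effect shown in the two dumps. It uses double as a stand-in for _Decimal128 so that it compiles everywhere, and the struct and function names are hypothetical. The first function stores in both arms of the branch; the second selects the values first (playing the role of the PHIs) and then stores unconditionally.

```c
#include <assert.h>

/* Stand-in for struct TDx2_t; double replaces _Decimal128 (assumption). */
struct td_pair {
    double td0;
    double td1;
};

/* Conditional stores, as in the .cdce dump. */
struct td_pair ret_cond_stores(double a, double b, double c)
{
    struct td_pair r;
    if (b == c) {
        r.td0 = c;
        r.td1 = c;
    } else {
        r.td0 = a;
        r.td1 = b;
    }
    return r;
}

/* Unconditional stores of selected values, as in the .cselim dump. */
struct td_pair ret_phi_stores(double a, double b, double c)
{
    const int eq = (b == c);
    const double cstore0 = eq ? c : a;  /* plays the role of cstore_10 */
    const double cstore1 = eq ? c : b;  /* plays the role of cstore_11 */
    struct td_pair r;
    r.td1 = cstore1;
    r.td0 = cstore0;
    return r;
}

/* Self-check: both forms return the same pair. */
void check_same(double a, double b, double c)
{
    struct td_pair x = ret_cond_stores(a, b, c);
    struct td_pair y = ret_phi_stores(a, b, c);
    assert(x.td0 == y.td0 && x.td1 == y.td1);
}
```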

[Bug target/70053] Returning a struct of _Decimal128 values generates extraneous stores and loads

2020-05-25 Thread luoxhu at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70053

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 CC||segher at gcc dot gnu.org

--- Comment #7 from luoxhu at gcc dot gnu.org ---
When expanding "D.2914.td0 = c_5(D);" in expand_assignment (to=, from=, nontemporal=false) at
../../gcc-master/gcc/expr.c:5058

1) expr.c:5158:   to_rtx = expand_expr (tem, NULL_RTX, VOIDmode, EXPAND_WRITE);

gdb pr to_rtx
(mem/c:BLK (reg/f:DI 112 virtual-stack-vars) [2 D.2914+0 S32 A128])

...

2) expr.c:5167:   to_rtx = adjust_address (to_rtx, mode1, 0);

p mode1
$86 = E_TDmode
(gdb) pr to_rtx
(mem/c:TD (reg/f:DI 112 virtual-stack-vars) [2 D.2914+0 S16 A128])

to_rtx is generated with address conversion from DImode to TDmode here.

...

3) expr.c:5374:   result = store_field (to_rtx, bitsize,
bitpos,bitregion_start, bitregion_end, mode1, from, get_alias_set (to),
nontemporal, reversep);

then the assignment instruction is generated as below:

(insn 11 10 12 4 (set (mem/c:TD (reg/f:DI 112 virtual-stack-vars) [1
D.2914.td0+0 S16 A128]) (reg/v:TD 121 [ c ])) "pr70053.c":20:14 -1 (nil))


So if we need to remove the redundant store/load in expand, the conversion from
DImode to TDmode should be avoided for this case when using virtual-stack-vars
registers. (For PR65421, there is a similar DImode to DFmode conversion.)

pr70053.c.236r.expand with -O2:
1: NOTE_INSN_DELETED
6: NOTE_INSN_BASIC_BLOCK 2
2: r119:TD=%2:TD
3: r120:TD=%4:TD
4: r121:TD=%6:TD
5: NOTE_INSN_FUNCTION_BEG
8: r122:CCFP=cmp(r120:TD,r121:TD)
9: pc={(r122:CCFP!=0)?L16:pc}
  REG_BR_PROB 708669604
   10: NOTE_INSN_BASIC_BLOCK 4
   11: [r112:DI]=r121:TD
   12: r123:DI=r112:DI+0x10
   13: [r123:DI]=r121:TD
   14: pc=L21
   15: barrier
   16: L16:
   17: NOTE_INSN_BASIC_BLOCK 5
   18: [r112:DI]=r119:TD
   19: r124:DI=r112:DI+0x10
   20: [r124:DI]=r120:TD
   21: L21:
   22: NOTE_INSN_BASIC_BLOCK 6
   23: r125:TD=[r112:DI]
   24: r127:DI=r112:DI+0x10
   25: r126:TD=[r127:DI]
   26: r117:TD=r125:TD
   27: r118:TD=r126:TD
   31: %2:TD=r117:TD
   32: %4:TD=r118:TD
   33: use %2:TD
   34: use %4:TD

[Bug target/69493] Poor code generation for return of struct containing vectors on PPC64LE

2020-05-25 Thread luoxhu at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69493

--- Comment #10 from luoxhu at gcc dot gnu.org ---
In expand, Power8 will emit two register permute instructions to byte swap the
contents by rs6000_emit_le_vsx_move.

P9:
5: NOTE_INSN_BASIC_BLOCK 2
2: r129:TF=%1:TF
3: r130:TF=%3:TF
4: NOTE_INSN_FUNCTION_BEG
7: r117:DF=unspec[r129:TF,0] 70
8: r131:V2DF=r121:V2DF
9: r133:DF=vec_select(r131:V2DF,parallel)
   10: r131:V2DF=vec_concat(r117:DF,r133:DF)
   11: r122:V2DF=r131:V2DF
   12: r118:DF=unspec[r129:TF,0x1] 70
   13: r119:DF=unspec[r130:TF,0] 70
   14: r134:V2DF=r124:V2DF
   15: r136:DF=vec_select(r134:V2DF,parallel)
   16: r134:V2DF=vec_concat(r119:DF,r136:DF)
   17: r125:V2DF=r134:V2DF
   18: r120:DF=unspec[r130:TF,0x1] 70
   19: r137:V2DF=r122:V2DF
   20: r139:DF=vec_select(r137:V2DF,parallel)
   21: r137:V2DF=vec_concat(r139:DF,r118:DF)
   22: [r112:DI]=r137:V2DF
   23: r140:V2DF=r125:V2DF
   24: r142:DF=vec_select(r140:V2DF,parallel)
   25: r140:V2DF=vec_concat(r142:DF,r120:DF)
   26: [r112:DI+0x10]=r140:V2DF
   27: r143:V4SI=[r112:DI]
   28: r144:V4SI=[r112:DI+0x10]
   29: r127:V4SI=r143:V4SI
   30: r128:V4SI=r144:V4SI
   34: %2:V4SI=r127:V4SI
   35: %3:V4SI=r128:V4SI
   36: use %2:V4SI
   37: use %3:V4SI

P8:
5: NOTE_INSN_BASIC_BLOCK 2
2: r129:TF=%1:TF
3: r130:TF=%3:TF
4: NOTE_INSN_FUNCTION_BEG
7: r117:DF=unspec[r129:TF,0] 70
8: r131:V2DF=r121:V2DF
9: r133:DF=vec_select(r131:V2DF,parallel)
   10: r131:V2DF=vec_concat(r117:DF,r133:DF)
   11: r122:V2DF=r131:V2DF
   12: r118:DF=unspec[r129:TF,0x1] 70
   13: r119:DF=unspec[r130:TF,0] 70
   14: r134:V2DF=r124:V2DF
   15: r136:DF=vec_select(r134:V2DF,parallel)
   16: r134:V2DF=vec_concat(r119:DF,r136:DF)
   17: r125:V2DF=r134:V2DF
   18: r120:DF=unspec[r130:TF,0x1] 70
   19: r137:V2DF=r122:V2DF
   20: r139:DF=vec_select(r137:V2DF,parallel)
   21: r137:V2DF=vec_concat(r139:DF,r118:DF)
   22: r140:V2DF=vec_select(r137:V2DF,parallel)
   23: [r112:DI]=vec_select(r140:V2DF,parallel)
   24: r141:V2DF=r125:V2DF
   25: r143:DF=vec_select(r141:V2DF,parallel)
   26: r141:V2DF=vec_concat(r143:DF,r120:DF)
   27: r144:V2DF=vec_select(r141:V2DF,parallel)
   28: [r112:DI+0x10]=vec_select(r144:V2DF,parallel)
   29: r146:V4SI=vec_select([r112:DI],parallel)
   30: r145:V4SI=vec_select(r146:V4SI,parallel)
   31: r148:V4SI=vec_select([r112:DI+0x10],parallel)
   32: r147:V4SI=vec_select(r148:V4SI,parallel)
   33: r127:V4SI=r145:V4SI
   34: r128:V4SI=r147:V4SI
   38: %2:V4SI=r127:V4SI
   39: %3:V4SI=r128:V4SI
   40: use %2:V4SI
   41: use %3:V4SI

The difference starts at #22. Power8 emits two vec_select instructions for
stack store/load operations, but Power9 needs only one.

[Bug target/70053] Returning a struct of _Decimal128 values generates extraneous stores and loads

2020-05-31 Thread luoxhu at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70053

--- Comment #9 from luoxhu at gcc dot gnu.org ---
(In reply to Segher Boessenkool from comment #8)
> I see no conversion there?
> 
> But, why does it it store to memory at all?

Yes, there is no conversion for this case, only adjust_address to TImode.
mem/c:TD means a MEM that cannot trap.

Reason for the store to memory:
D.2914 is a local struct variable here. It seems we need some optimization to
sink the D.2914.td0 and D.2914.td1 stores from BB3&BB4 to BB5 to avoid the
store/load on the stack? Or does some Gimple pass already exist for this? Or
should this be optimized after the expander by some new pass like store sinking?

O2/pr70053.c.234t.optimized:
D256_add_finite (_Decimal128 a, _Decimal128 b, _Decimal128 c)
{
  struct TDx2_t D.2914;

   [local count: 1073741824]:
  if (b_4(D) == c_5(D))
goto ; [34.00%]
  else
goto ; [66.00%]

   [local count: 365072224]:
  D.2914.td0 = c_5(D);
  D.2914.td1 = c_5(D);
  goto ; [100.00%]

   [local count: 708669601]:
  D.2914.td0 = a_3(D);
  D.2914.td1 = b_4(D);

   [local count: 1073741824]:
  return D.2914;

}

[Bug rtl-optimization/89310] Poor code generation returning float field from a struct

2020-06-22 Thread luoxhu at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89310

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 CC||luoxhu at gcc dot gnu.org

--- Comment #3 from luoxhu at gcc dot gnu.org ---
rs6000.md:

(define_insn_and_split "movsf_from_si"
...
  "&& reload_completed
   && vsx_reg_sfsubreg_ok (operands[0], SFmode)
   && int_reg_operand_not_pseudo (operands[1], SImode)"
  [(const_int 0)
...
  /* Move SF value to upper 32-bits for xscvspdpn.  */
  emit_insn (gen_ashldi3 (op2, op1_di, GEN_INT (32)));
  emit_insn (gen_p8_mtvsrd_sf (op0, op2));
  emit_insn (gen_vsx_xscvspdpn_directmove (op0, op0));
  DONE


The split seems inevitable as reload_completed is true here. Can this
lshrdi3+ashldi3 be optimized by a peephole?

r9 is DImode; is there any benefit in using mtvsrw[az] instead of mtvsrd?

Or could we replace the 3 instructions with a better sequence?  Thanks.
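
As a sanity check on the lshrdi3+ashldi3 pair, a small C sketch (names are illustrative), assuming both shifts are by 32 bits. Shifting a 64-bit value right and then left by 32 simply clears the low word, which is the same as a single AND with a mask that keeps the upper 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Logical shift right by 32, then shift left by 32. */
uint64_t high_word_by_shifts(uint64_t x)
{
    return (x >> 32) << 32;
}

/* A single AND that keeps only the upper 32 bits. */
uint64_t high_word_by_mask(uint64_t x)
{
    return x & UINT64_C(0xFFFFFFFF00000000);
}

/* Self-check that the two forms agree for a given value. */
void check_high_word(uint64_t x)
{
    assert(high_word_by_shifts(x) == high_word_by_mask(x));
}
```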

[Bug rtl-optimization/89310] Poor code generation returning float field from a struct

2020-06-28 Thread luoxhu at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89310

--- Comment #5 from luoxhu at gcc dot gnu.org ---
Thanks.  I copied the code from movsf_from_si to make a define_insn_and_split
for "movsf_from_si2", but we don't have a define_insn for rldicr, so I use
gen_anddi3 instead. Any comments?

foo:
.LFB0:
.cfi_startproc
rldicr 3,3,0,31
mtvsrd 1,3
xscvspdpn 1,1
blr


diff --git a/gcc/config/rs6000/rs6000.md b/gcc/config/rs6000/rs6000.md
index 4fcd6a94022..92c237edfad 100644
--- a/gcc/config/rs6000/rs6000.md
+++ b/gcc/config/rs6000/rs6000.md
@@ -7593,6 +7593,48 @@ (define_insn_and_split "movsf_from_si"
"*,  *, p9v,   p8v,   *, *,
 p8v,p8v,   p8v,   *")])

+(define_insn_and_split "movsf_from_si2"
+  [(set (match_operand:SF 0 "nonimmediate_operand"
+   "=!r,   f, v, wa,m, Z,
+Z, wa,?r,!r")
+   (unspec:SF [
+(subreg:SI (ashiftrt:DI
+  (match_operand:DI 1 "input_operand"
+  "m, m, wY,Z, r, f,
+  wa,r, wa,r")
+ (const_int 32)) 0)]
+  UNSPEC_SF_FROM_SI))
+   (clobber (match_scratch:DI 2
+   "=X,X, X, X, X, X,
+ X, r, X, X"))]
+  "TARGET_NO_SF_SUBREG
+   && (register_operand (operands[0], SFmode)
+   || register_operand (operands[1], SImode))"
+   "#"
+  "&& !reload_completed
+   && vsx_reg_sfsubreg_ok (operands[0], SFmode)"
+  [(const_int 0)]
+{
+  rtx op0 = operands[0];
+  rtx op1 = operands[1];
+  rtx tmp = gen_reg_rtx (DImode);
+
+  emit_insn (gen_anddi3 (tmp, op1, GEN_INT(0xULL)));
+  emit_insn (gen_p8_mtvsrd_sf (op0, tmp));
+  emit_insn (gen_vsx_xscvspdpn_directmove (op0, op0));
+  DONE;
+})
+

[Bug rtl-optimization/89310] Poor code generation returning float field from a struct

2020-07-21 Thread luoxhu at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89310

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #10 from luoxhu at gcc dot gnu.org ---
Fixed on upstream.

[Bug lto/96343] LTO ICE on PPC64le

2020-07-27 Thread luoxhu at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96343

--- Comment #4 from luoxhu at gcc dot gnu.org ---
I tried to build both ADIOS2 and WarpX (with INTERPROCEDURAL_OPTIMIZATION) on a
Power8 machine with gcc 9.3.0 and 9.2.1; no LTO error was seen.

/usr/bin/cmake ../ -DCMAKE_C_COMPILER=/opt/at12.0/bin/gcc
-DCMAKE_CXX_COMPILER=/opt/at12.0/bin/g++ -DADIOS2_USE_Fortran=OFF
-DADIOS2_USE_ZeroMQ=OFF -DBUILD_SHARED_LIBS=ON -DCMAKE_BUILD_TYPE=Release
-DCMAKE_POSITION_INDEPENDENT_CODE=ON -DADIOS2_USE_SST=OFF
-DCMAKE_CXX_FLAGS="-flto -fno-fat-lto-objects ${CMAKE_CXX_FLAGS}"
 make -j50

Not sure whether there is any difference from your configuration?

Anyway, it would be much better if you could try a newer GCC or reduce this to
a smaller test case. BTW, I see that someone mentioned that it may be related
to conda and python: https://github.com/ornladios/ADIOS2/issues/1524#issue-458229988

[Bug rtl-optimization/71309] Copying fields within a struct followed by use results in load hit store

2020-08-04 Thread luoxhu at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71309

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 CC||luoxhu at gcc dot gnu.org
 Resolution|--- |FIXED
 Status|NEW |RESOLVED

--- Comment #5 from luoxhu at gcc dot gnu.org ---
Fixed on master.

[Bug testsuite/92398] [10 regression] error in update of gcc.target/powerpc/pr72804.c in r277872

2019-12-04 Thread luoxhu at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92398

--- Comment #10 from luoxhu at gcc dot gnu.org ---
Author: luoxhu
Revision: 278890
Modified property: svn:log

Modified: svn:log at Wed Dec  4 08:50:33 2019
--
--- svn:log (original)
+++ svn:log Wed Dec  4 08:50:33 2019
@@ -10,7 +10,7 @@

2019-12-02  Luo Xiong Hu  

-   testsuite/pr92398
+   PR testsuite/92398
* gcc.target/powerpc/pr72804.c: Split the store function to...
* gcc.target/powerpc/pr92398.h: ... this one.  New.
* gcc.target/powerpc/pr92398.p9+.c: New.

[Bug middle-end/93189] [10 regression] Many test case failures starting with r279942

2020-01-07 Thread luoxhu at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93189

--- Comment #3 from luoxhu at gcc dot gnu.org ---
Author: luoxhu
Revision: 279986
Modified property: svn:log

Modified: svn:log at Wed Jan  8 01:32:45 2020
--
--- svn:log (original)
+++ svn:log Wed Jan  8 01:32:45 2020
@@ -7,5 +7,6 @@

2020-01-08  Luo Xiong Hu  

+   PR middle-end/93189
* ipa-inline.c (caller_growth_limits): Restore the AND.

[Bug ipa/69678] Missed function specialization + partial devirtualization opportunity

2020-01-14 Thread luoxhu at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69678

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #4 from luoxhu at gcc dot gnu.org ---
Fixed.

[Bug middle-end/71509] Bitfield causes load hit store with larger store than load

2020-02-09 Thread luoxhu at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71509

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 CC||luoxhu at gcc dot gnu.org

--- Comment #11 from luoxhu at gcc dot gnu.org ---
(In reply to Anton Blanchard from comment #4)
> Created attachment 39683 [details]
> Another bitop LHS test case
> 
> Here's another issue found in the Linux kernel. Seems like this should be a
> single lwz/stw since the union of counter and the bitops completely overlap.
> 
> The half word store followed by word load is going to prevent it from store
> forwarding.
> 
>  :
>0: 00 00 03 81 lwz r8,0(r3)
>4: 20 00 89 78 clrldi  r9,r4,32
>8: c2 0f 2a 79 rldicl  r10,r9,33,31
>c: 00 f8 48 51 rlwimi  r8,r10,31,0,0
>   10: 5e 00 2a 55 rlwinm  r10,r9,0,1,15
>   14: 00 00 03 91 stw r8,0(r3)
>   18: 00 00 83 b0 sth r4,0(r3)
>   1c: 00 00 42 60 ori r2,r2,0
>   20: 00 00 23 81 lwz r9,0(r3)
>   24: 00 04 29 55 rlwinm  r9,r9,0,16,0
>   28: 78 53 29 7d or  r9,r9,r10
>   2c: 00 00 23 91 stw r9,0(r3)
>   30: 20 00 80 4e blr

Only this case is already fixed on the latest gcc 10 (the issues in case
__skb_decr_checksum_unnecessary from Anton Blanchard and test2 from Nicholas
Piggin still exist).

gcc version 10.0.1 20200210 

objdump -d set_page_slub_counters.o

set_page_slub_counters.o: file format elf64-powerpcle


Disassembly of section .text:

 :
   0:   22 84 89 78 rldicl  r9,r4,48,48
   4:   00 00 83 b0 sth r4,0(r3)
   8:   02 00 23 b1 sth r9,2(r3)
   c:   20 00 80 4e blr

[Bug lto/92599] [8/9 regression] ICE in speculative_call_info, at cgraph.c:1142

2020-02-09 Thread luoxhu at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92599

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 CC||luoxhu at gcc dot gnu.org

--- Comment #9 from luoxhu at gcc dot gnu.org ---
Fixed on master; could this be closed?

[Bug middle-end/71509] Bitfield causes load hit store with larger store than load

2020-02-10 Thread luoxhu at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71509

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 CC||linkw at gcc dot gnu.org

--- Comment #13 from luoxhu at gcc dot gnu.org ---
(In reply to Segher Boessenkool from comment #12)
> But it could do just
> 
>   stw r4,0(r3)
> 
> (on LE; and with a rotate first, on BE).

Thanks for catching that; this optimization is not related to load hit store.
I will investigate why the store-merging pass failed to merge the 2 half-word
stores.

[Bug middle-end/93582] [10 Regression] -Warray-bounds gives error: array subscript 0 is outside array bounds of struct E[1]

2020-02-18 Thread luoxhu at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93582

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 CC||luoxhu at gcc dot gnu.org

--- Comment #30 from luoxhu at gcc dot gnu.org ---
Hi Jakub, thanks for the information. It seems this test case from the Linux
kernel could be added to your future fix patch.

struct page
{
  union
  {
unsigned counters;
struct
{
  union
  {
struct
{
  unsigned inuse : 16;
  unsigned objects : 15;
  unsigned frozen : 1;
};
  };
};
  };
};

void
foo1 (struct page *page, unsigned long counters_new)
{
struct page tmp;
tmp.counters = counters_new;
page->inuse   = tmp.inuse;
page->objects = tmp.objects;
page->frozen  = tmp.frozen;
}

 Tried gcc(r10-6717) with -O3 on powerpcle, the asm is:
Disassembly of section .text:

 :
   0:   3e 84 89 54 rlwinm  r9,r4,16,16,31
   4:   00 00 83 b0 sth r4,0(r3)
   8:   02 00 23 b1 sth r9,2(r3)
   c:   20 00 80 4e blr
...
  1c:   00 00 42 60 ori r2,r2,0

It is expected to emit only a single stw store instruction (two half-word store
instructions are also emitted on x86 platforms!).
I am not sure whether the fre pass could check for consecutive stores and merge
them, similar to the store-merging pass, as the input parameter counters_new is
not a constant. :)
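
A C sketch of why a single word store is enough. It uses a flattened version of the struct above (the nesting is simplified here, which is an assumption, as are the function names) and relies on the bit-fields being packed into one 32-bit unit, as GCC does. The three bit-fields cover all 32 bits of counters, so copying each of them copies the whole word.

```c
#include <assert.h>

/* Simplified layout: the bit-fields overlay the 32-bit counters word. */
struct page {
    union {
        unsigned counters;
        struct {
            unsigned inuse : 16;
            unsigned objects : 15;
            unsigned frozen : 1;
        };
    };
};

/* Field-by-field copy, as in foo1 above. */
void set_fields(struct page *page, unsigned long counters_new)
{
    struct page tmp;
    tmp.counters = (unsigned)counters_new;
    page->inuse = tmp.inuse;
    page->objects = tmp.objects;
    page->frozen = tmp.frozen;
}

/* The single store the field copies are expected to merge into. */
void set_word(struct page *page, unsigned long counters_new)
{
    /* Holds only if the bit-fields fill exactly one unsigned word. */
    assert(sizeof(struct page) == sizeof(unsigned));
    page->counters = (unsigned)counters_new;
}
```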

[Bug lto/91287] LTO disables linking with scalar MASS library (Fortran only)

2019-08-13 Thread luoxhu at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91287

--- Comment #38 from luoxhu at gcc dot gnu.org ---
Author: luoxhu
Date: Wed Aug 14 02:18:33 2019
New Revision: 274411

URL: https://gcc.gnu.org/viewcvs?rev=274411&root=gcc&view=rev
Log:
Enable math functions linking with static library for LTO

In LTO mode, if static library and dynamic library contains same
function and both libraries are passed as arguments, linker will link
the function in dynamic library no matter the sequence.  This patch
will output LTO symbol node as UNDEF if BUILT_IN_NORMAL function FNDECL
is a math function, then the function in static library will be linked
first if its sequence is ahead of the dynamic library.

gcc/ChangeLog

2019-08-14  Xiong Hu Luo  

PR lto/91287
* builtins.c (builtin_with_linkage_p): New function.
* builtins.h (builtin_with_linkage_p): New function.
* symtab.c (write_symbol): Remove redundant assert.
* lto-streamer-out.c (symtab_node::output_to_lto_symbol_table_p):
Remove FIXME and use builtin_with_linkage_p.


Modified:
trunk/gcc/ChangeLog
trunk/gcc/builtins.c
trunk/gcc/builtins.h
trunk/gcc/lto-streamer-out.c
trunk/gcc/symtab.c

[Bug lto/91287] LTO disables linking with scalar MASS library (Fortran only)

2019-08-26 Thread luoxhu at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91287

--- Comment #39 from luoxhu at gcc dot gnu.org ---
Author: luoxhu
Date: Mon Aug 26 08:53:27 2019
New Revision: 274921

URL: https://gcc.gnu.org/viewcvs?rev=274921&root=gcc&view=rev
Log:
Backport r274411 from trunk to gcc-9-branch

Backport r274411 of "Enable math functions linking with static library
for LTO" from mainline to gcc-9-branch.

Bootstrapped/Regression-tested on Linux POWER8 LE.

gcc/ChangeLog
2019-08-26  Xiong Hu Luo  

Backport r274411 from trunk to gcc-9-branch.
2019-08-14  Xiong Hu Luo  

PR lto/91287
* builtins.c (builtin_with_linkage_p): New function.
* builtins.h (builtin_with_linkage_p): New function.
* symtab.c (write_symbol): Remove redundant assert.
* lto-streamer-out.c (symtab_node::output_to_lto_symbol_table_p):
Remove FIXME and use builtin_with_linkage_p.


Modified:
branches/gcc-9-branch/gcc/ChangeLog
branches/gcc-9-branch/gcc/builtins.c
branches/gcc-9-branch/gcc/builtins.h
branches/gcc-9-branch/gcc/lto-streamer-out.c
branches/gcc-9-branch/gcc/symtab.c

[Bug target/98914] [11 Regression] ICE in rs6000_expand_vector_set, at config/rs6000/rs6000.c:7198

2021-03-21 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98914

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

--- Comment #6 from luoxhu at gcc dot gnu.org ---
Fixed on master.

[Bug target/99718] [11 regression] ICE in new test case gcc.target/powerpc/pr98914.c for 32 bits

2021-03-22 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99718

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 CC||dje.gcc at gmail dot com,
   ||segher at gcc dot gnu.org

--- Comment #1 from luoxhu at gcc dot gnu.org ---
Confirmed.  Another case where I forgot to test m32 again :(

David mentioned there is no need for variable vec_insert support in the m32
build, so I think we should avoid generating IFN VEC_SET in
gimple-isel.c:gimple_expand_vec_set_expr, but it does not seem possible to check
"TARGET_P8_VECTOR && TARGET_DIRECT_MOVE_64BIT" in the common file or through
can_vec_set_var_idx_p. Any suggestions?

https://gcc.gnu.org/pipermail/gcc-patches/2021-January/564403.html

[Bug target/97329] POWER9 default cache and line sizes appear to be wrong

2021-03-22 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97329

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 CC||luoxhu at gcc dot gnu.org

--- Comment #9 from luoxhu at gcc dot gnu.org ---
Yes, it seems to be a copy-paste error for Power8, copied from Power7.  Is this
supposed to be fixed in GCC 12 stage 1?  And is any performance evaluation
required?


diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 616dae35bae..34c4edae20e 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -1055,7 +1055,7 @@ struct processor_costs power8_cost = {
   COSTS_N_INSNS (17),  /* ddiv */
   128, /* cache line size */
   32,  /* l1 cache */
-  256, /* l2 cache */
+  512, /* l2 cache */
   12,  /* prefetch streams */
   COSTS_N_INSNS (3),   /* SF->DF convert */
 };

[Bug target/99718] [11 regression] ICE in new test case gcc.target/powerpc/pr98914.c for 32 bits

2021-03-23 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99718

--- Comment #4 from luoxhu at gcc dot gnu.org ---
Thanks, Jakub.  The change below passes testing on both m32 and m64; is this a
reasonable fix?  @segher, I will turn it into a patch if so.


git diff
diff --git a/gcc/config/rs6000/predicates.md b/gcc/config/rs6000/predicates.md
index 859af75..0a5cae2 100644
--- a/gcc/config/rs6000/predicates.md
+++ b/gcc/config/rs6000/predicates.md
@@ -1920,6 +1920,12 @@
   return address_is_prefixed (XEXP (op, 0), mode, NON_PREFIXED_DEFAULT);
 })

+;; Return true if m64 on p8v and above for vec_set with variable index.
+(define_predicate "vec_set_index_operand"
+ (if_then_else (match_test "TARGET_P8_VECTOR && TARGET_DIRECT_MOVE_64BIT")
+  (match_operand 0 "reg_or_cint_operand")
+  (match_operand 0 "const_int_operand")))
+
 ;; Return true if the operand is a valid memory operand with a D-form
 ;; address that could be merged with the load of a PC-relative external address
 ;; with the PCREL_OPT optimization.  We don't check here whether or not the
diff --git a/gcc/config/rs6000/vector.md b/gcc/config/rs6000/vector.md
index e5191bd..3446b03 100644
--- a/gcc/config/rs6000/vector.md
+++ b/gcc/config/rs6000/vector.md
@@ -1227,7 +1227,7 @@
 (define_expand "vec_set"
   [(match_operand:VEC_E 0 "vlogical_operand")
(match_operand: 1 "register_operand")
-   (match_operand 2 "reg_or_cint_operand")]
+   (match_operand 2 "vec_set_index_operand")]
   "VECTOR_MEM_ALTIVEC_OR_VSX_P (mode)"
 {
   rs6000_expand_vector_set (operands[0], operands[1], operands[2]);
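
The if_then_else in the proposed predicate can be read as a two-way choice.  A
minimal C model of that choice follows; the enum and function names are
hypothetical and are not GCC code, they only mirror the structure of the
define_predicate in the diff above.

```c
#include <stdbool.h>

/* Hypothetical model of the proposed "vec_set_index_operand" predicate.
   None of these names exist in GCC.  */
enum operand_kind { OPERAND_REG, OPERAND_CONST_INT, OPERAND_MEM };

static bool
vec_set_index_ok (bool p8_vector, bool direct_move_64bit,
                  enum operand_kind kind)
{
  if (p8_vector && direct_move_64bit)
    /* Mirrors reg_or_cint_operand: a register or a constant integer.  */
    return kind == OPERAND_REG || kind == OPERAND_CONST_INT;

  /* Mirrors const_int_operand: only a constant index is accepted.  */
  return kind == OPERAND_CONST_INT;
}
```

With this shape, a variable (register) index is only accepted when both target
flags are set; every other configuration still accepts constant indexes.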

[Bug target/97329] POWER9 default cache and line sizes appear to be wrong

2021-03-24 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97329

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #11 from luoxhu at gcc dot gnu.org ---
Fixed with r11-7821-g08103e4d6ada9b57366f2df2a2b745babfab914c.

[Bug target/99718] [11 regression] ICE in new test case gcc.target/powerpc/pr98914.c for 32 bits

2021-03-25 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99718

--- Comment #11 from luoxhu at gcc dot gnu.org ---
Created attachment 50474
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50474&action=edit
32bit variable vec_insert

LLVM also generates a store-hit-load sequence:

addi 3, 1, -16
rlwinm 4, 5, 2, 28, 29
stvx 2, 0, 3
stwx 6, 3, 4
lvx 2, 0, 3
blr
.long   0
.quad   0
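
For readers less familiar with the sequence above, here is a rough C model of
what stvx/stwx/lvx do together.  It is only a sketch: vec4 and
insert_via_memory are made-up names, and the mapping of instructions to lines
is my reading of the assembly.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Made-up stand-in for a 128-bit vector of four 32-bit words.  */
typedef struct { uint32_t w[4]; } vec4;

static vec4
insert_via_memory (vec4 v, uint32_t val, unsigned int n)
{
  unsigned char slot[16];
  /* rlwinm 4,5,2,28,29: byte offset (n << 2) & 0xC, i.e. (n & 3) * 4.  */
  size_t off = ((size_t) n << 2) & 0xC;

  memcpy (slot, &v, sizeof slot);          /* stvx: store whole vector  */
  memcpy (slot + off, &val, sizeof val);   /* stwx: overwrite one word  */
  memcpy (&v, slot, sizeof slot);          /* lvx: reload whole vector  */
  return v;
}
```

The 16-byte reload depends on the 4-byte store that was just issued, which is
the store-hit-load pattern mentioned above.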

I didn't use "can't" in my reply; sorry for the confusion.  We thought it was
inefficient to move SF to SI in 32-bit mode, but it turns out to give a huge
performance gain as well (46.704s -> 4.369s).

Attached is the patch that also supports variable vec_insert for 32-bit; it was
tested on P8BE/P8LE/P9LE.  Could you please verify it on AIX?  I will refine it
and send it to the mailing list to fix this P1 issue fundamentally.

[Bug target/99718] [11 regression] ICE in new test case gcc.target/powerpc/pr98914.c for 32 bits

2021-03-25 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99718

--- Comment #12 from luoxhu at gcc dot gnu.org ---
I am not sure whether TARGET_DIRECT_MOVE_64BIT is the right macro to
differentiate m32 from m64.

[Bug target/99718] [11 regression] ICE in new test case gcc.target/powerpc/pr98914.c for 32 bits

2021-03-25 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99718

--- Comment #13 from luoxhu at gcc dot gnu.org ---
The performance data in #c11 is for int variable vec_insert in 32-bit mode.  The
float variable vec_insert in 32-bit mode is a bit slower, but still much better
than the original.  The extra cost is the stfs+lwz pair (insns 17 and 18 in the
expand dump) that moves the SF register to an SI register bit for bit:

46.677s -> 8.723s
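
As an illustration of the SF to SI move described above, the stfs+lwz pair
amounts to reinterpreting the bits of a float as a 32-bit integer by going
through memory.  A small C sketch, where float_bits is a name I made up:

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret the bits of a float as a 32-bit integer, going through
   memory much like the stfs (store float) + lwz (load word) pair.  */
static uint32_t
float_bits (float f)
{
  uint32_t bits;
  memcpy (&bits, &f, sizeof bits);
  return bits;
}
```

For example, float_bits (1.0f) gives 0x3f800000 with IEEE single precision.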

test.c

#include <altivec.h>
#define TYPE float

vector TYPE
test (vector TYPE u, TYPE i, signed int n){
return vec_insert (i, u, n);
}

Expand:
1: NOTE_INSN_DELETED
6: NOTE_INSN_BASIC_BLOCK 2
2: r122:V4SF=%2:V4SF
3: r123:SF=%1:SF
4: r124:SI=%3:SI
5: NOTE_INSN_FUNCTION_BEG
8: r120:V4SF=r122:V4SF
9: r125:SI=r124:SI&0x3
   10: r126:V4SF=r120:V4SF
   11: r128:SI=r125:SI<<0x2
   12: {r128:SI=0x14-r128:SI;clobber ca:SI;}
   13: r132:SI=high(`*.LC0')
   14: r131:SI=r132:SI+low(`*.LC0')
  REG_EQUAL `*.LC0'
   15: r130:V2DI=[r131:SI]
  REG_EQUAL const_vector
   16: r129:V16QI=r130:V2DI#0
   17: [r112:SI]=r123:SF
   18: r133:SI=[r112:SI]
   19: r136:DI#4=r133:SI
   22: {r137:SI=r133:SI>>0x1f;clobber ca:SI;}
   23: r136:DI#0=r137:SI
   24: r138:DI=0
   25: r135:V2DI=vec_concat(r136:DI,r138:DI)
   26: r134:V16QI=r135:V2DI#0
   27: r139:V16QI=unspec[r128:SI] 151
   28: r140:V16QI=unspec[r134:V16QI,r134:V16QI,r139:V16QI] 236
   29: r141:V16QI=unspec[r129:V16QI,r129:V16QI,r139:V16QI] 236
   30: r126:V4SF#0={(r141:V16QI!=const_vector)?r140:V16QI:r126:V4SF#0}
   31: r119:V4SF=r126:V4SF
   32: r120:V4SF=r119:V4SF

ASM:

.LFB0:
.cfi_startproc
stwu 1,-16(1)
.cfi_def_cfa_offset 16
lis 9,.LC0@ha
rlwinm 3,3,2,28,29
xxlxor 0,0,0
la 9,.LC0@l(9)
subfic 3,3,20
lxvd2x 33,0,9
lvsl 13,0,3
stfs 1,8(1)
vperm 1,1,1,13
ori 2,2,0
lwz 9,8(1)
addi 1,1,16
.cfi_def_cfa_offset 0
srawi 10,9,31
mtvsrwz 13,9
mtvsrwz 12,10
fmrgow 11,12,13
xxpermdi 32,11,0,0
vperm 0,0,0,13
xxsel 34,34,32,33
blr

[Bug target/99718] [11 regression] ICE in new test case gcc.target/powerpc/pr98914.c for 32 bits

2021-03-26 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99718

--- Comment #15 from luoxhu at gcc dot gnu.org ---
(In reply to Jakub Jelinek from comment #14)
> You still have:
>   if (VECTOR_MEM_VSX_P (mode))
> {
>   if (!CONST_INT_P (elt_rtx))
> {
>   if ((TARGET_P9_VECTOR && TARGET_POWERPC64) || width == 8)
> return ..._p9 (...);
>   else if (TARGET_P8_VECTOR)
> return ..._p8 (...);
> }
> 
>   if (mode == V2DFmode)
> insn = gen_vsx_set_v2df (target, target, val, elt_rtx);
> 
>   else if (mode == V2DImode)
> insn = gen_vsx_set_v2di (target, target, val, elt_rtx);
> 
>   else if (TARGET_P9_VECTOR && TARGET_POWERPC64)
> {
>   ...
> }
>   if (insn)
> return;
> }
> 
>   gcc_assert (CONST_INT_P (elt_rtx));
> 
> while the vector.md condition is VECTOR_MEM_ALTIVEC_OR_VSX_P (mode),
> i.e. true for TARGET_ALTIVEC for many modes already (V4SI, V8HI, V16QI, V4SF
> and
> for TARGET_VSX also V2DF and V2DI, right).
> I somehow don't see how this can work properly.
> Looking at vsx_set_v2df and vsx_set_v2di, neither of them will handle
> non-constant elt_rtx (it ICEs on anything but const0_rtx and const1_rtx).
> 
> So, questions:
> 1) does the rs6000_expand_vector_set_var_p9 routine for width == 8 (i.e.
> V2DImode or V2DFmode?)
> handle everything, even when TARGET_P9_VECTOR or TARGET_POWERPC64 is not
> true, plain old VSX?

Yes.  V2DI/V2DF on P8 {BE,LE} {m32,m64} will call
rs6000_expand_vector_set_var_p9 instead of xxx_p8.

Do you mean Power7 by plain old VSX?  I verified pr98914.c on Power7; it does
ICE on "gcc_assert (CONST_INT_P (elt_rtx));" for both m64 and m32.
The patch in #c11 does not fix this yet.

The builtin call in rs6000-c.c:altivec_build_resolved_builtin is guarded by
TARGET_P8_VECTOR, so Power7 did not generate IFN VEC_INSERT before.  This ICE
also comes from the internal optimization in
gimple-isel.c:gimple_expand_vec_set_expr: can_vec_set_var_idx_p does not return
false, because VECTOR_MEM_ALTIVEC_OR_VSX_P is true for Power7 VSX.  Changing
"if (VECTOR_MEM_VSX_P (mode))" to "if (VECTOR_MEM_ALTIVEC_OR_VSX_P (mode))" in
rs6000.c:rs6000_expand_vector_set and removing TARGET_P8_VECTOR from the else
branch fixes the ICE on P7 {m32,m64}.  So even P7 VSX could benefit from this
optimization, which differs from what was discussed before.


> 2) what happens if TARGET_P8_VECTOR is false and TARGET_VSX is true and mode
> is other than V2DI/V2DF? If I read the code right, it will fall through to
> gcc_assert (CONST_INT_P (elt_rtx));

Same as 1)?

> 3) what happens if !TARGET_VSX (more specifically, when VECTOR_MEM_VSX_P
> (mode) is false.
> I see there just the assertion that would fail right away.
> Perhaps I'm missing something obvious and those cases are impossible, but if
> that is the case, it would still be better to add further assertion at least
> to the if (...) else if (...) as else gcc_assert ...

Thanks for pointing this out.  The "gcc_assert (CONST_INT_P (elt_rtx));" should
be moved into the "if (!CONST_INT_P (elt_rtx))" condition as you said.
gen_vsx_set_v2df and gen_vsx_set_v2di are supposed to handle only constant
elt_rtx.

[Bug target/99718] [11 regression] ICE in new test case gcc.target/powerpc/pr98914.c for 32 bits

2021-03-26 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99718

--- Comment #19 from luoxhu at gcc dot gnu.org ---
https://gcc.gnu.org/pipermail/gcc-patches/2021-March/567395.html

This patch extends variable vec_insert to all 32bit VSX targets including
Power7{BE} {32,64}, Power8{BE}{32, 64}, Power8{LE}{64}, Power9{LE}{64}.  All of
them pass the power testcases, though AIX is not tested yet.  @Segher, please
review this one instead of the previous one, which disables 32-bit variable
vec_insert.  Thanks.

Altivec targets like power5/6/G4/G5 take the previous "vector store/scalar
store/vector load" code path.

-mcpu=power6 -O2 -maltivec -c -S

f2:
.LFB0:
.cfi_startproc
addi 10,1,-16
sldi 5,5,2
li 9,32
addi 8,1,-48
stvx 2,8,9
stwx 6,10,5
lvx 2,8,9
blr

[Bug target/99718] [11 regression] ICE in new test case gcc.target/powerpc/pr98914.c for 32 bits

2021-03-30 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99718

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #21 from luoxhu at gcc dot gnu.org ---
Fixed on master.

[Bug middle-end/90323] powerpc should convert equivalent sequences to vec_sel()

2021-04-07 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90323

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 CC||luoxhu at gcc dot gnu.org

--- Comment #8 from luoxhu at gcc dot gnu.org ---
Two minor updates for the case mentioned in #c2:

 for VEC_SEL (ARG1, ARG2, ARG3):

   Returns a vector containing the value of either ARG1 or ARG2 depending on
   the value of ARG3.


#include 
#include 
volatile vector unsigned orig = {0xebebebeb, 0x34343434, 0x76767676,
0x12121212};
volatile vector unsigned mask = {0x, 0, 0x, 0};
volatile vector unsigned fill = {0xfefefefe, 0x, 0x,
0x};
volatile vector unsigned expected = {0xfefefefe, 0x34343434, 0x,
0x12121212};
__attribute__ ((noinline))
vector unsigned without_sel(vector unsigned l, vector unsigned r, vector
unsigned mask) {
-l = l & ~r;
+l = l & ~mask;
l |= mask & r;
return l;
}

__attribute__ ((noinline))
vector unsigned with_sel(vector unsigned l, vector unsigned r, vector unsigned
mask) {
-return vec_sel(l, mask, r);
+return vec_sel(l, r, mask);
}

int main() {
vector unsigned res1 = without_sel(orig, fill, mask);
vector unsigned res2 = with_sel(orig, fill, mask);
if (!vec_all_eq(res1, expected)) printf ("error1\n");
if (!vec_all_eq(res2, expected)) printf ("error2\n");
return 0;
}


And the ASM would be:

without_sel:
xxlxor 35,34,35
xxland 35,35,36
xxlxor 34,34,35
blr
.long 0
.byte 0,0,0,0,0,0,0,0
with_sel:
xxsel 34,34,35,36
blr
.long 0
.byte 0,0,0,0,0,0,0,0
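
To make the equivalence explicit, here is a per-lane scalar sketch in C (the
function names are mine).  select_and_or is the corrected without_sel body,
and select_xor is the xor/and/xor form that the without_sel assembly above
computes.

```c
#include <stdint.h>

/* Corrected without_sel body: take r where mask is set, else l.  */
static uint32_t
select_and_or (uint32_t l, uint32_t r, uint32_t mask)
{
  return (l & ~mask) | (r & mask);
}

/* The xxlxor/xxland/xxlxor form: l ^ ((l ^ r) & mask).  */
static uint32_t
select_xor (uint32_t l, uint32_t r, uint32_t mask)
{
  return l ^ ((l ^ r) & mask);
}
```

Both return the bits of r where mask is 1 and the bits of l where mask is 0,
which is what vec_sel (l, r, mask) is documented to do.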

[Bug middle-end/90323] powerpc should convert equivalent sequences to vec_sel()

2021-04-07 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90323

--- Comment #9 from luoxhu at gcc dot gnu.org ---
Then we could optimize it in match.pd:

diff --git a/gcc/match.pd b/gcc/match.pd
index 036f92fa959..8944312c153 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -3711,6 +3711,17 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
(if (integer_all_onesp (@1) && integer_zerop (@2))
 @0

+#if GIMPLE
+(simplify
+ (bit_xor @0 (bit_and @2 (bit_xor @0 @1)))
+ (if (optimize_vectors_before_lowering_p () && types_match (@0, @1)
+  && types_match (@0, @2) && VECTOR_TYPE_P (TREE_TYPE (@0))
+  && VECTOR_TYPE_P (TREE_TYPE (@1)) && VECTOR_TYPE_P (TREE_TYPE (@2)))
+ (with { tree itype = truth_type_for (type); }
+ (vec_cond (convert:itype @2) @1 @0
+#endif

In pr90323.c.033t.forwprop1, it will be optimized to:

   :
  _1 = ~mask_3(D);
  l_5 = _1 & l_4(D);
  _2 = mask_3(D) & r_6(D);
  _8 = l_4(D) ^ r_6(D);
  _10 = mask_3(D) & _8;
  _11 = (vector(4) ) mask_3(D);
  l_7 = VEC_COND_EXPR <_11, r_6(D), l_4(D)>;
  return l_7;

Then in pr90323.c.243t.isel:

   [local count: 1073741824]:
  _6 = (vector(4) ) mask_1(D);
  l_4 = .VCOND_MASK (_6, r_3(D), l_2(D));
  return l_4;

final ASM:

without_sel:
.LFB11:
.cfi_startproc
xxsel 34,34,35,36
blr
.long 0
.byte 0,0,0,0,0,0,0,0
.cfi_endproc
.LFE11:
.size   without_sel,.-without_sel
.align 2
.p2align 4,,15
.globl with_sel
.type   with_sel, @function
with_sel:
.LFB12:
.cfi_startproc
xxsel 34,34,35,36
blr
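
One point that may be worth checking: the source expression is a bitwise
blend, while a vector condition selects whole lanes.  The sketch below models
both per lane in C.  The names are mine, and lane_select treating any nonzero
mask lane as true is an assumption on my side, not something taken from the
GCC sources.

```c
#include <stdint.h>

/* The matched pattern (bit_xor @0 (bit_and @2 (bit_xor @0 @1))).  */
static uint32_t
bitwise_select (uint32_t l, uint32_t r, uint32_t mask)
{
  return l ^ (mask & (l ^ r));
}

/* Assumed model of a lane-wise select: the whole lane comes from r if
   the mask lane is nonzero, otherwise from l.  */
static uint32_t
lane_select (uint32_t l, uint32_t r, uint32_t mask)
{
  return mask != 0 ? r : l;
}
```

The two agree when every mask lane is all-zeros or all-ones, but under this
model they differ for a partial mask such as 0x0000ffff.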


@segher, is this a reasonable fix?

[Bug middle-end/90323] powerpc should convert equivalent sequences to vec_sel()

2021-04-08 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90323

--- Comment #11 from luoxhu at gcc dot gnu.org ---
I noticed that you added the optimization below in commit
a62436c0a505155fc8becac07a8c0abe2c265bfe, but it does not handle this case:
when the cse1 pass calls simplify_binary_operation_1, both op0 and op1 are REGs
rather than AND operators.  Do you have a test case that covers that piece of
code?

__attribute__ ((noinline))
 long without_sel3( long l,  long r) {
long tmp = {0x0ff00fff};
l =  ( (l ^ r) & tmp) ^ l;
return l;
}


without_sel3:
xor 4,3,4
rlwinm 4,4,0,20,11
rldicl 4,4,0,36
xor 3,4,3
blr
.long 0
.byte 0,0,0,0,0,0,0,0


+2016-11-09  Segher Boessenkool  
+
+   * simplify-rtx.c (simplify_binary_operation_1): Simplify
+   (xor (and (xor A B) C) B) to (ior (and A C) (and B ~C)) and
+   (xor (and (xor A B) C) A) to (ior (and A ~C) (and B C)) if C
+   is a const_int.

diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c
index 5c3dea1a349..11a2e0267c7 100644
--- a/gcc/simplify-rtx.c
+++ b/gcc/simplify-rtx.c
@@ -2886,6 +2886,37 @@ simplify_binary_operation_1 (enum rtx_code code, machine_mode mode,
}
}

+  /* If we have (xor (and (xor A B) C) A) with C a constant we can instead
+do (ior (and A ~C) (and B C)) which is a machine instruction on some
+machines, and also has shorter instruction path length.  */
+  if (GET_CODE (op0) == AND
+ && GET_CODE (XEXP (op0, 0)) == XOR
+ && CONST_INT_P (XEXP (op0, 1))
+ && rtx_equal_p (XEXP (XEXP (op0, 0), 0), trueop1))
+   {
+ rtx a = trueop1;
+ rtx b = XEXP (XEXP (op0, 0), 1);
+ rtx c = XEXP (op0, 1);
+ rtx nc = simplify_gen_unary (NOT, mode, c, mode);
+ rtx a_nc = simplify_gen_binary (AND, mode, a, nc);
+ rtx bc = simplify_gen_binary (AND, mode, b, c);
+ return simplify_gen_binary (IOR, mode, a_nc, bc);
+   }
+  /* Similarly, (xor (and (xor A B) C) B) as (ior (and A C) (and B ~C)) */
+  else if (GET_CODE (op0) == AND
+ && GET_CODE (XEXP (op0, 0)) == XOR
+ && CONST_INT_P (XEXP (op0, 1))
+ && rtx_equal_p (XEXP (XEXP (op0, 0), 1), trueop1))
+   {
+ rtx a = XEXP (XEXP (op0, 0), 0);
+ rtx b = trueop1;
+ rtx c = XEXP (op0, 1);
+ rtx nc = simplify_gen_unary (NOT, mode, c, mode);
+ rtx b_nc = simplify_gen_binary (AND, mode, b, nc);
+ rtx ac = simplify_gen_binary (AND, mode, a, c);
+ return simplify_gen_binary (IOR, mode, ac, b_nc);
+   }
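
The two identities used by this code can be checked with plain integers.  A
small C sketch follows; the function names are mine and 64-bit unsigned values
stand in for DImode.

```c
#include <stdint.h>

/* (xor (and (xor A B) C) A)  */
static uint64_t
xor_form_a (uint64_t a, uint64_t b, uint64_t c)
{
  return ((a ^ b) & c) ^ a;
}

/* (ior (and A ~C) (and B C))  */
static uint64_t
ior_form_a (uint64_t a, uint64_t b, uint64_t c)
{
  return (a & ~c) | (b & c);
}

/* (xor (and (xor A B) C) B)  */
static uint64_t
xor_form_b (uint64_t a, uint64_t b, uint64_t c)
{
  return ((a ^ b) & c) ^ b;
}

/* (ior (and A C) (and B ~C))  */
static uint64_t
ior_form_b (uint64_t a, uint64_t b, uint64_t c)
{
  return (a & c) | (b & ~c);
}
```

Bit by bit: where C is 1 the first xor form yields B and the second yields A;
where C is 0 the first yields A and the second yields B.  The ior forms say
the same thing directly.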

[Bug middle-end/90323] powerpc should convert equivalent sequences to vec_sel()

2021-04-09 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90323

--- Comment #12 from luoxhu at gcc dot gnu.org ---

That code was called by the combine pass, but the result failed to match.

pr newpat
(set (reg:DI 125 [ l ])
(xor:DI (and:DI (xor:DI (reg/v:DI 120 [ l ])
(reg:DI 127))
(const_int 267390975 [0xff00fff]))
(reg/v:DI 120 [ l ])))


Trying 8, 10 -> 11:
8: r123:DI=r120:DI^r127:DI
  REG_DEAD r127:DI
   10: r118:DI=r123:DI&0xff00fff
  REG_DEAD r123:DI
   11: r125:DI=r118:DI^r120:DI
  REG_DEAD r120:DI
  REG_DEAD r118:DI
Failed to match this instruction:
(set (reg:DI 125 [ l ])
(ior:DI (and:DI (reg/v:DI 120 [ l ])
(const_int -267390976 [0xf00ff000]))
(and:DI (reg:DI 127)
(const_int 267390975 [0xff00fff]
Successfully matched this instruction:
(set (reg:DI 118 [ _2 ])
(and:DI (reg:DI 127)
(const_int 267390975 [0xff00fff])))
Failed to match this instruction:
(set (reg:DI 125 [ l ])
(ior:DI (and:DI (reg/v:DI 120 [ l ])
(const_int -267390976 [0xf00ff000]))
(reg:DI 118 [ _2 ])))

[Bug middle-end/90323] powerpc should convert equivalent sequences to vec_sel()

2021-04-12 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90323

--- Comment #15 from luoxhu at gcc dot gnu.org ---
(In reply to Segher Boessenkool from comment #14)
> (In reply to luoxhu from comment #12)
> > That code was called by combine pass but fail to match. 
> 
> > 
> > pr newpat
> > (set (reg:DI 125 [ l ])
> > (xor:DI (and:DI (xor:DI (reg/v:DI 120 [ l ])
> > (reg:DI 127))
> > (const_int 267390975 [0xff00fff]))
> > (reg/v:DI 120 [ l ])))
> 
> Note this is 0x0ff00fff, and this is not a valid mask for rlwimi.

OK, it also fails to combine for 0x0100.


.cfi_startproc
xor 4,3,4
rlwinm 4,4,0,7,7
xor 3,4,3
blr
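
For reference, the masks that rlwinm/rlwimi can encode are single runs of one
bits, possibly wrapping around, described by MB and ME in IBM bit numbering
(bit 0 is the most significant).  A C sketch with helper names of my own:

```c
#include <stdint.h>

/* Build the 32-bit mask for MB..ME (both in 0..31), IBM bit numbering.
   When MB > ME the run wraps around.  */
static uint32_t
ppc_mask32 (unsigned int mb, unsigned int me)
{
  uint32_t from_mb = 0xffffffffu >> mb;          /* bits mb..31  */
  uint32_t to_me = 0xffffffffu << (31 - me);     /* bits 0..me   */
  return mb <= me ? (from_mb & to_me) : (from_mb | to_me);
}

/* A mask is encodable if, viewed as a ring of 32 bits, it has at most
   one run of ones, i.e. at most two 0/1 transitions.  */
static int
is_rlw_mask (uint32_t m)
{
  uint32_t edges = m ^ ((m << 1) | (m >> 31));
  int transitions = 0;
  while (edges)
    {
      edges &= edges - 1;
      transitions++;
    }
  return m != 0 && transitions <= 2;
}
```

ppc_mask32 (7, 7) is 0x01000000, matching "rlwinm 4,4,0,7,7" above, and
ppc_mask32 (20, 11) is the wrapping mask 0xfff00fff from the earlier
"rlwinm 4,4,0,20,11".  0x0ff00fff has two separate runs, so is_rlw_mask
rejects it.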

[Bug target/97142] __builtin_fmod not optimized on POWER

2021-04-12 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97142

--- Comment #10 from luoxhu at gcc dot gnu.org ---

If not built with fast-math, gimple_has_side_effects returns true, so
expand_call_stmt fails to expand "_1 = fmod (x_2(D), y_3(D));" to an internal
function.  x86 also emits a call to fmod at -O3.


xlF expands fmod to the ASM below; is no FMA generated?


1900 :
1900:   8c 03 01 10 vspltisw v0,1
1904:   00 00 24 c8 lfd f1,0(r4)
1908:   00 00 03 c8 lfd f0,0(r3)
190c:   e2 03 40 f0 xvcvsxwdp vs2,vs32
1910:   c0 09 62 f0 xsdivdp vs3,vs2,vs1
1914:   80 19 80 f0 xsmuldp vs4,vs0,vs3
1918:   64 21 a0 f0 xsrdpiz vs5,vs4
191c:   88 2d 01 f0 xsnmsubadp vs0,vs1,vs5
1920:   18 00 20 fc frsp f1,f0
1924:   20 00 80 4e blr
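
My reading of the xlF sequence, written as a C sketch.  fmod_sketch is a
made-up name, and the cast through long long stands in for xsrdpiz, so it only
holds while the quotient fits in a long long.

```c
/* Sketch of the expansion above: x - y * trunc (x * (1.0 / y)).  */
static double
fmod_sketch (double x, double y)
{
  double recip = 1.0 / y;               /* xsdivdp: 1.0 / y            */
  double q = x * recip;                 /* xsmuldp                     */
  double t = (double) (long long) q;    /* xsrdpiz: round toward zero  */
  return x - y * t;                     /* xsnmsubadp: -(y * t - x)    */
}
```

As far as I can tell, xsnmsubadp is itself a fused negative multiply-subtract,
so the last step is one fused operation.  This shortcut is not a fully
accurate fmod for large quotients, which would fit with it only being used
under relaxed floating-point settings.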

[Bug tree-optimization/22326] promotions (from float to double) are not removed when they should be able to

2020-11-16 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=22326

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 CC||luoxhu at gcc dot gnu.org,
   ||pinskia at gcc dot gnu.org,
   ||segher at kernel dot crashing.org

--- Comment #4 from luoxhu at gcc dot gnu.org ---
float foo(float f, float x, float y) {
return (fabs(f)*x+y);
}

The input of fabs has float type, so fabsf is enough here.  I drafted a patch
that avoids the double promotion when generating gimple by replacing fabs with
fabsf when argument[0] has float type.


diff --git a/gcc/c/c-parser.c b/gcc/c/c-parser.c
index ecc3d2119fa..1a2d7e624cc 100644
--- a/gcc/c/c-parser.c
+++ b/gcc/c/c-parser.c
@@ -10470,6 +10470,20 @@ c_parser_postfix_expression_after_primary (c_parser *parser,
  && fndecl_built_in_p (expr.value, BUILT_IN_NORMAL)
  && vec_safe_length (exprlist) == 1)
warn_for_abs (expr_loc, expr.value, (*exprlist)[0]);
+
+ if (fndecl_built_in_p (expr.value, BUILT_IN_NORMAL)
+ && DECL_FUNCTION_CODE (expr.value) == BUILT_IN_FABS)
+   {
+ tree arg0 = (*exprlist)[0];
+ if (TYPE_PRECISION (TREE_TYPE (TREE_TYPE (expr.value)))
+   > TYPE_PRECISION (TREE_TYPE (arg0))
+ && TYPE_MODE (TREE_TYPE (arg0)) == E_SFmode)
+   {
+ tree abs_fun = get_identifier ("fabsf");
+ expr.value = build_external_ref (expr_loc, abs_fun, true,
+  &expr.original_type);
+   }
+   }
}

  start = expr.get_start ();


.006t.gimple:

__attribute__((noinline))
foo (float f, float x, float y)
{
  float D.4347;

  _1 = ABS_EXPR ;
  _2 = x * _1;
  D.4347 = y + _2;
  return D.4347;
}


foo:
.LFB0:
.cfi_startproc
fabs 1,1
fmadds 1,1,2,3
blr

[Bug tree-optimization/22326] promotions (from float to double) are not removed when they should be able to

2020-11-16 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=22326

--- Comment #5 from luoxhu at gcc dot gnu.org ---
With the above hack, changing argument x from float to double still generates
correct code, with a conversion of the fabsf result:

float foo(float f, double x, float y) {
return (fabs(f)*x+y);
}

006t.gimple
__attribute__((noinline))
foo (float f, double x, float y)
{
  float D.4347;

  _1 = ABS_EXPR ;
  _2 = (double) _1;
  _3 = x * _2;
  _4 = (double) y;
  _5 = _3 + _4;
  D.4347 = (float) _5;
  return D.4347;
}

[Bug tree-optimization/22326] promotions (from float to double) are not removed when they should be able to

2020-11-23 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=22326

--- Comment #9 from luoxhu at gcc dot gnu.org ---
(In reply to Andrew Pinski from comment #6)
> (In reply to luoxhu from comment #4)
> > float foo(float f, float x, float y) {
> > return (fabs(f)*x+y);
> > }
> > 
> > the input of fabs is float type, so use fabsf is enough here, drafted a
> > patch to avoid double promotion when generating gimple if fabs could be
> > replaced by fabsf as argument[0] is float type.
> 
> what about adding something to match.pd for:
> ABS<(float_convert)f> into (float_convert)ABS
> This is only valid prompting and not reducing the precision.

Thanks.  This is already implemented in fold-const.c, though without using
match.pd or an actual fabsf call.  The front end always converts the arguments
of fabs to double first.  There are three kinds of cases for this issue:

1) "return fabs(x);"
tree
fold_unary_loc (location_t loc, enum tree_code code, tree type, tree op0)
{
...
case ABS_EXPR:
  /* Convert fabs((double)float) into (double)fabsf(float).  */
  if (TREE_CODE (arg0) == NOP_EXPR
  && TREE_CODE (type) == REAL_TYPE)
{
  tree targ0 = strip_float_extensions (arg0);
  if (targ0 != arg0)
return fold_convert_loc (loc, type,
 fold_build1_loc (loc, ABS_EXPR,
  TREE_TYPE (targ0),
  targ0));
}
  return NULL_TREE;
...
}

This piece of code converts "(float)fabs((double)x)" to
"(float)(double)(float)fabs(x)"; match.pd can then remove the useless
conversions.
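
A small C illustration of why case 1) is safe.  It uses hand-written helpers
instead of libm's fabs/fabsf to stay self-contained, and all names are mine.

```c
/* Absolute value computed directly in float.  */
static float
abs_in_float (float x)
{
  return x < 0.0f ? -x : x;
}

/* Absolute value computed in double.  */
static double
abs_in_double (double x)
{
  return x < 0.0 ? -x : x;
}

/* What the front end produces for "return fabs (x);" with float x:
   promote to double, take the absolute value, narrow back.  */
static float
abs_promoted (float x)
{
  return (float) abs_in_double ((double) x);
}
```

Every float is exactly representable as a double and taking the absolute value
only changes the sign, so abs_promoted (x) and abs_in_float (x) agree for any
non-NaN float input.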

2) "return fabs(x)*y;"

The front end generates the expression "(float) (fabs((double) x) * (double) y)"
first.  Then fold-const.c:fold_unary_loc converts fabs((double)float) into
(double)fabsf(float), giving "(float)((double)fabs(x) * (double)y)".  Finally,
match.pd converts (outertype)((innertype0)a+(innertype1)b) into
((newtype)a+(newtype)b), which removes the double conversion.

3)"return fabs(x)*y + z;"

The front end produces: (float) ((fabs((double) x) * (double) y) + (double) z)

So what we need here is to match the MUL&ADD in match.pd as follows.  Any
comments?

+(simplify (convert (plus (mult (convert@3 (abs @0)) (convert@4 @1)) (convert@5 @2)))
+ (if (( flag_unsafe_math_optimizations
+   && types_match (type, float_type_node)
+   && types_match (TREE_TYPE(@0), float_type_node)
+   && types_match (TREE_TYPE(@1), float_type_node)
+   && types_match (TREE_TYPE(@2), float_type_node)
+   && element_precision (TREE_TYPE(@3)) > element_precision (TREE_TYPE (@0))
+   && element_precision (TREE_TYPE(@4)) > element_precision (TREE_TYPE (@1))
+   && element_precision (TREE_TYPE(@5)) > element_precision (TREE_TYPE (@2))
+   && ! HONOR_NANS (type)
+ && ! HONOR_INFINITIES (type)))
+  (plus (mult (abs @0) @1) @2) ))
+

Cases 1) and 2) do not generate a double conversion; only 3) has an frsp in
fast-math mode, and it can be removed by the above pattern.
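
To show what the pattern changes, here are the two evaluation orders for case
3) as a C sketch.  The function names are mine, and the absolute value is
written inline to avoid depending on libm.

```c
/* What the front end produces: promote everything to double,
   compute, then narrow the result (the frsp).  */
static float
madd_promoted (float x, float y, float z)
{
  double ax = x < 0.0f ? -(double) x : (double) x;
  return (float) (ax * (double) y + (double) z);
}

/* What the proposed pattern produces: compute directly in float.  */
static float
madd_narrow (float x, float y, float z)
{
  float ax = x < 0.0f ? -x : x;
  return ax * y + z;
}
```

The two can differ in the last bit for some inputs, because the promoted
version rounds the sum once in double and then once more when narrowing, while
the narrow version rounds in float at each step.  I assume that is why the
pattern is only suggested under flag_unsafe_math_optimizations.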

PS: convert_to_real_1 does not seem closely related here.  It converts
(float)sqrt((double)x), where x is float, into sqrtf(x), but it does so with a
recursive call to convert_to_real_1 and build_call_expr with a new
mathfn_built_in, so I suppose it would be a bit complicated to move to
match.pd?

The optimization should only apply in fast-math mode; is
flag_unsafe_math_optimizations enough to guard it?

[Bug tree-optimization/22326] promotions (from float to double) are not removed when they should be able to

2020-11-23 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=22326

--- Comment #10 from luoxhu at gcc dot gnu.org ---
Even if we could optimize fabs to fabsf, it does not help here, as y and z are
already promoted to double.  We would still need a large pattern to match the
MUL&PLUS expression in match.pd, so converting fabs to fabsf does not seem a
reasonable direction...

[Bug tree-optimization/22326] promotions (from float to double) are not removed when they should be able to

2020-11-26 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=22326

--- Comment #13 from luoxhu at gcc dot gnu.org ---
I tried an implementation with backprop and found that this model does not seem
suitable for removing double promotions with BACK propagation, e.g.:

1) mad1.c

float foo (float x, float y, float z)
{
   return ( y * fabs (x) + z ); 
}

mad1.c.098t.cunrolli:

foo (float x, float y, float z)
{
  double _1;
  float _2;
  double _3;
  double _4;
  double _5;
  double _6;
  float _10;

   [local count: 1073741824]:
  _1 = (double) y_7(D);
  _2 = ABS_EXPR ;
  _3 = (double) _2;
  _4 = _1 * _3;
  _5 = (double) z_9(D);
  _6 = _4 + _5;
  _10 = (float) _6;
  return _10;
}

mad1.c.099t.backprop:

[USE] _10 in return _10;
[USE] _6 in _10 = (float) _6;
  _6: convert from float to double not important
[DEF] Recording new information for _6 = _4 + _5;
  _6: convert from float to double not important
[USE] _5 in _6 = _4 + _5;
  _5: convert from float to double not important
[DEF] Recording new information for _5 = (double) z_9(D);
  _5: convert from float to double not important
[USE] _4 in _6 = _4 + _5;
  _4: convert from float to double not important
[DEF] Recording new information for _4 = _1 * _3;
  _4: convert from float to double not important
[USE] _3 in _4 = _1 * _3;
  _3: convert from float to double not important
[DEF] Recording new information for _3 = (double) _2;
  _3: convert from float to double not important
[USE] _2 in _3 = (double) _2;
  _2: convert from float to double not important
[DEF] Recording new information for _2 = ABS_EXPR ;
  _2: convert from float to double not important
[USE] _1 in _4 = _1 * _3;
  _1: convert from float to double not important
[DEF] Recording new information for _1 = (double) y_7(D);
  _1: convert from float to double not important

gimple_simplified to _10 = _13;
Deleting _6 = z_9(D) + _12;
Deleting _5 = (double) z_9(D);
Deleting _4 = _2 * y_7(D);
Deleting _3 = (double) _2;
Deleting _1 = (double) y_7(D);


__attribute__((noinline))
foo (float x, float y, float z)
{
  float _2;
  float _10;
  float _12;
  float _13;

   [local count: 1073741824]:
  _2 = ABS_EXPR ;
  _12 = _2 * y_7(D);
  _13 = z_9(D) + _12;
  _10 = _13;
  return _10;

}


All conversions and promotions can be removed.  But if float x is changed to
double x, it no longer works:

2)  mad2.c

float foo (double x, float y, float z)
{
   return ( y * fabs (x) + z ); 
}


mad2.c.098t.cunrolli:

foo (double x, float y, float z)
{
  double _1;
  double _2;
  double _3;
  double _4;
  double _5;
  float _9;

   [local count: 1073741824]:
  _1 = (double) y_6(D);
  _2 = ABS_EXPR ;
  _3 = _1 * _2;
  _4 = (double) z_8(D);
  _5 = _3 + _4;
  _9 = (float) _5;
  return _9;

}

mad2.c.099t.backprop:

[USE] _9 in return _9;
[USE] _5 in _9 = (float) _5;
  _5: convert from float to double not important
[DEF] Recording new information for _5 = _3 + _4;
  _5: convert from float to double not important
[USE] _4 in _5 = _3 + _4;
  _4: convert from float to double not important
[DEF] Recording new information for _4 = (double) z_8(D);
  _4: convert from float to double not important
[USE] _3 in _5 = _3 + _4;
  _3: convert from float to double not important
[DEF] Recording new information for _3 = _1 * _2;
  _3: convert from float to double not important
[USE] _2 in _3 = _1 * _2;
  _2: convert from float to double not important
[DEF] Recording new information for _2 = ABS_EXPR ;
  _2: convert from float to double not important
[USE] _1 in _3 = _1 * _2;
  _1: convert from float to double not important
[DEF] Recording new information for _1 = (double) y_6(D);
  _1: convert from float to double not important

Deleting _4 = (double) z_8(D);
Deleting _1 = (double) y_6(D);


EMERGENCY DUMP:

__attribute__((noinline))
foo (double x, float y, float z)
{
  double _2;
  double _3;
  double _5;
  float _9;

   [local count: 1073741824]:
  _2 = ABS_EXPR ;
  _3 = _2 * y_6(D);
  _5 = _3 + z_8(D);
  _9 = (float) _5;
  return _9;

}

[Bug tree-optimization/22326] promotions (from float to double) are not removed when they should be able to

2020-11-26 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=22326

--- Comment #14 from luoxhu at gcc dot gnu.org ---
(In reply to luoxhu from comment #13)
> 
> 2)  mad2.c
> 
> float foo (double x, float y, float z)
> {
>return ( y * fabs (x) + z ); 
> }
> 
> 
> mad2.c.098t.cunrolli:
> 
> foo (double x, float y, float z)
> {
>   double _1;
>   double _2;
>   double _3;
>   double _4;
>   double _5;
>   float _9;
> 
>[local count: 1073741824]:
>   _1 = (double) y_6(D);
>   _2 = ABS_EXPR ;
>   _3 = _1 * _2;
>   _4 = (double) z_8(D);
>   _5 = _3 + _4;
>   _9 = (float) _5;
>   return _9;
> 
> }
> 

Maybe we should use forward propagation here: save [_1, _2, _3 ... _9] to m_vars
and set an ignore_convert status in usage_info if the double conversion can be
removed from the rhs of the expression.  For a stmt with two rhs operands,
intersect the statuses by ANDing rhs1's ignore_convert with rhs2's
ignore_convert, so the status is cleared if either of them is false.  I am not
sure whether this works; it is also a bit more complicated than expected...
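
A toy C model of that forward propagation, to make the AND rule concrete.
Everything here is hypothetical: the statement kinds, the tables and the
propagate function are illustrations only and do not correspond to the real
backprop data structures.

```c
#include <stdbool.h>

/* Toy statement kinds; statements are listed in definition order.  */
enum stmt_kind
{
  FROM_FLOAT,    /* value originates from a float operand           */
  FROM_DOUBLE,   /* value originates from a genuinely double operand */
  UNARY,         /* one rhs: inherits its status                    */
  BINARY         /* two rhs: AND of both statuses                   */
};

struct stmt { enum stmt_kind kind; int rhs1; int rhs2; };

static void
propagate (const struct stmt *stmts, int n, bool *ignore_convert)
{
  for (int i = 0; i < n; i++)
    switch (stmts[i].kind)
      {
      case FROM_FLOAT:
        ignore_convert[i] = true;
        break;
      case FROM_DOUBLE:
        ignore_convert[i] = false;
        break;
      case UNARY:
        ignore_convert[i] = ignore_convert[stmts[i].rhs1];
        break;
      case BINARY:
        ignore_convert[i] = ignore_convert[stmts[i].rhs1]
                            && ignore_convert[stmts[i].rhs2];
        break;
      }
}

/* mad1.c: _1=(double)y, _2=ABS(float x), _3=(double)_2, _4=_1*_3,
   _5=(double)z, _6=_4+_5.  */
static const struct stmt mad1[] = {
  { FROM_FLOAT, 0, 0 }, { FROM_FLOAT, 0, 0 }, { UNARY, 1, 0 },
  { BINARY, 0, 2 }, { FROM_FLOAT, 0, 0 }, { BINARY, 3, 4 }
};

/* mad2.c: _1=(double)y, _2=ABS(double x), _3=_1*_2, _4=(double)z,
   _5=_3+_4.  */
static const struct stmt mad2[] = {
  { FROM_FLOAT, 0, 0 }, { FROM_DOUBLE, 0, 0 }, { BINARY, 0, 1 },
  { FROM_FLOAT, 0, 0 }, { BINARY, 2, 3 }
};
```

In this model the final sum of mad1 ends up with ignore_convert set, while for
mad2 the double x clears the status of _3 and therefore of _5, so the
conversions of y and z would have to stay.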

[Bug tree-optimization/22326] promotions (from float to double) are not removed when they should be able to

2020-11-30 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=22326

--- Comment #17 from luoxhu at gcc dot gnu.org ---
(In reply to rsand...@gcc.gnu.org from comment #16)
> > 2)  mad2.c
> > 
> > float foo (double x, float y, float z)
> > {
> >return ( y * fabs (x) + z ); 
> > }
> > 
> > 
> > mad2.c.098t.cunrolli:
> > 
> > foo (double x, float y, float z)
> > {
> >   double _1;
> >   double _2;
> >   double _3;
> >   double _4;
> >   double _5;
> >   float _9;
> > 
> >[local count: 1073741824]:
> >   _1 = (double) y_6(D);
> >   _2 = ABS_EXPR ;
> >   _3 = _1 * _2;
> >   _4 = (double) z_8(D);
> >   _5 = _3 + _4;
> >   _9 = (float) _5;
> >   return _9;
> > 
> > }
> > 
> > mad2.c.099t.backprop:
> > 
> > [USE] _9 in return _9;
> > [USE] _5 in _9 = (float) _5;
> >   _5: convert from float to double not important
> > [DEF] Recording new information for _5 = _3 + _4;
> >   _5: convert from float to double not important
> > [USE] _4 in _5 = _3 + _4;
> >   _4: convert from float to double not important
> > [DEF] Recording new information for _4 = (double) z_8(D);
> >   _4: convert from float to double not important
> > [USE] _3 in _5 = _3 + _4;
> >   _3: convert from float to double not important
> > [DEF] Recording new information for _3 = _1 * _2;
> >   _3: convert from float to double not important
> > [USE] _2 in _3 = _1 * _2;
> >   _2: convert from float to double not important
> > [DEF] Recording new information for _2 = ABS_EXPR ;
> >   _2: convert from float to double not important
> > [USE] _1 in _3 = _1 * _2;
> >   _1: convert from float to double not important
> > [DEF] Recording new information for _1 = (double) y_6(D);
> >   _1: convert from float to double not important
> > 
> > Deleting _4 = (double) z_8(D);
> > Deleting _1 = (double) y_6(D);
> > 
> > 
> > EMERGENCY DUMP:
> > 
> > __attribute__((noinline))
> > foo (double x, float y, float z)
> > {
> >   double _2;
> >   double _3;
> >   double _5;
> >   float _9;
> > 
> >[local count: 1073741824]:
> >   _2 = ABS_EXPR ;
> >   _3 = _2 * y_6(D);
> >   _5 = _3 + z_8(D);
> >   _9 = (float) _5;
> >   return _9;
> > 
> > }
> Maybe I'm misunderstanding the point, but isn't this
> just an issue with the way that the results of the
> analysis are applied to the IL, rather than a problem
> in the analysis itself?

Yes, how to apply the optimization to the GIMPLE is the uncertain part.
Do you mean adding a conversion from double to float at the proper place,
as below, to avoid the ICE caused by the type mismatch in verify_ssa?
Which one is better, and is it correct for all kinds of math operations
like pow/exp, etc. under fast-math?  If so, is the cancelling that Richi
mentioned no longer needed?

1) convert before ABS_EXPR:

foo (double x, float y, float z)
{
  float _9;
  float _11;
  float _12;
  float _13;
  float _14;

  <bb 2> [local count: 1073741824]:
  _11 = (float) x_6(D);
  _12 = ABS_EXPR <_11>;
  _13 = y_7(D) * _12;
  _14 = z_8(D) + _13;
  _9 = _14;
  return _9;

}

foo:
.LFB0:
.cfi_startproc
frsp 0,1
fabs 0,0
fmadds 1,2,0,3
blr


2) OR convert after ABS_EXPR:

foo (double x, float y, float z)
{
  double _1;
  float _9;
  float _11;
  float _12;
  float _13;

  <bb 2> [local count: 1073741824]:
  _1 = ABS_EXPR <x_6(D)>;
  _11 = (float) _1;
  _12 = y_7(D) * _11;
  _13 = z_8(D) + _12;
  _9 = _13;
  return _9;

}

foo:
.LFB0:
.cfi_startproc
fabs 0,1
frsp 0,0
fmadds 1,2,0,3
blr
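
A small scalar sketch of the two placements above, for comparing them in
isolation.  The function names are made up for illustration, and the claim
that both give the same value assumes the default round-to-nearest mode
(rounding is then symmetric around zero) and ignores NaN payloads.  It says
nothing about whether narrowing the multiply and add is valid.

```c
/* Option 1: narrow x to float first, then take the absolute value
   (frsp followed by fabs in the assembly above).  */
float abs_after_narrowing (double x)
{
  return __builtin_fabsf ((float) x);
}

/* Option 2: take the absolute value in double, then narrow to float
   (fabs followed by frsp in the assembly above).  */
float abs_before_narrowing (double x)
{
  return (float) __builtin_fabs (x);
}
```

Both are GCC builtins, so no libm call is needed; any difference between the
two options would have to come from a directed rounding mode.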

[Bug tree-optimization/98066] [11 Regression] ICE: Segmentation fault (in gsi_next)

2020-11-30 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98066

--- Comment #8 from luoxhu at gcc dot gnu.org ---
Thanks for the quick fix!

[Bug target/98093] ICE in gen_vsx_set_v2df, at config/rs6000/vsx.md:3276

2020-12-02 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98093

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 CC||luoxhu at gcc dot gnu.org

--- Comment #1 from luoxhu at gcc dot gnu.org ---
Confirmed, I will fix it. 

Actually I have a pending patch series that is not fully committed yet.
[PATCH 2/4], which generates the VIEW_CONVERT_EXPR, is not committed, but a V2DF
VIEW_CONVERT_EXPR is now converted to IFN VEC_SET in gimple-isel, which causes
the ICE.

VIEW_CONVERT_EXPR(t)[i_12] = x_6(D);

(https://gcc.gnu.org/pipermail/gcc-patches/2020-October/555906.html)

IFN VEC_SET is not expanded on Power8 yet; [PATCH 3/4] could fix this, but it
needs Segher's approval.

[Bug target/98093] ICE in gen_vsx_set_v2df, at config/rs6000/vsx.md:3276

2020-12-02 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98093

--- Comment #2 from luoxhu at gcc dot gnu.org ---
https://gcc.gnu.org/pipermail/gcc-patches/2020-October/555907.html

[PATCH 3/4] rs6000: Enable vec_insert for P8 with
rs6000_expand_vector_set_var_p8

[Bug tree-optimization/22326] promotions (from float to double) are not removed when they should be able to

2020-12-13 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=22326

--- Comment #22 from luoxhu at gcc dot gnu.org ---
https://gcc.gnu.org/pipermail/gcc/2020-December/234474.html

So this issue seems invalid, since "fabs(x)*y+z" or "fabs(x)+y+z" (x, y, z are
float) can sometimes produce +-Inf in float, while the same computation does
not overflow when done in double.  Float value range info is required here.
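
A minimal sketch of such an overflow, with hypothetical function names.  It
assumes IEEE single and double precision and no -ffast-math; the volatile
temporary is only there to force the product to be rounded to float.

```c
#include <float.h>

/* fabs(x)*y+z evaluated in double, narrowed once at the end.  */
float mad_via_double (float x, float y, float z)
{
  return (float) ((double) __builtin_fabsf (x) * (double) y + (double) z);
}

/* The same expression after demoting every operation to float.  */
float mad_via_float (float x, float y, float z)
{
  /* volatile forces rounding to float even on targets that evaluate
     float expressions in a wider format.  */
  volatile float product = __builtin_fabsf (x) * y;
  return product + z;
}
```

With x = FLT_MAX, y = 2 and z = -FLT_MAX, the double version yields FLT_MAX,
while the float version overflows in the product and returns +Inf.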



Quoting Richard's reply:

I still think that covering all "good" cases in match.pd will require excessive
matching and that it is better done in a pass (this would include removing
the frontend handling for math functions).  Note that for example
(float)((double)x + (double)y) with float x and y is also eligible to demotion
to float, likewise may some compositions like
(float)(sin((double)x)*cos ((double)y))
for float x and y since we can constrain ranges here.  Likewise
(float)((double)x + fabs ((double)y)) for float x and y.  The propagation
would need to stop when the range needed increases in unknown ways.

[Bug target/79251] PowerPC vec_insert generates store-hit-load if the element number is variable

2021-01-06 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79251

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 CC||luoxhu at gcc dot gnu.org

--- Comment #5 from luoxhu at gcc dot gnu.org ---

This patchset fixes this issue for both P8 and P9:

[PATCH 0/4] rs6000: Enable variable vec_insert with IFN VEC_SET

https://gcc.gnu.org/pipermail/gcc-patches/2020-October/555905.html
https://gcc.gnu.org/pipermail/gcc-patches/2020-October/555906.html
https://gcc.gnu.org/pipermail/gcc-patches/2020-October/555907.html

[Bug target/98065] [11 Regression] ICE in rs6000_expand_vector_set, at config/rs6000/rs6000.c:7024 since r11-5457

2021-01-19 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98065

--- Comment #4 from luoxhu at gcc dot gnu.org ---
Sorry, my patch
https://gcc.gnu.org/pipermail/gcc-patches/2020-October/555906.html
could fix this, but the two patches below are still pending approval; I have
pinged them 5 times since last October. @Segher :)

https://gcc.gnu.org/pipermail/gcc-patches/2020-October/555907.html
https://gcc.gnu.org/pipermail/gcc-patches/2020-October/555908.html

[Bug target/98799] [10 Regression] vector_set_var ICE

2021-01-25 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98799

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 CC||luoxhu at gcc dot gnu.org

--- Comment #5 from luoxhu at gcc dot gnu.org ---
(In reply to David Edelsohn from comment #4)
> Created attachment 50043 [details]
> patch
> 
> Updated patch, but the entire rs6000_expand_set_var() logic seems to be
> incomplete and missing some scenarios, i.e., P9 and P8 that assume PPC64 are
> not sufficient.

The ICE is caused by UNSPEC_SI_FROM_SF not being supported when
TARGET_DIRECT_MOVE_64BIT is false.
Thanks for the patch, but the change below is also needed to fix the ICE in
gcc.target/powerpc/fold-vec-insert-float-p8.c when building with -m32, so that
IFN VEC_SET is not generated for P8 BE 32-bit.
I am not sure what "P9 and P8 that assume PPC64 are not sufficient" means?


diff --git a/gcc/config/rs6000/rs6000-c.c b/gcc/config/rs6000/rs6000-c.c
index f6ee1e6..656cdb3 100644
--- a/gcc/config/rs6000/rs6000-c.c
+++ b/gcc/config/rs6000/rs6000-c.c
@@ -1600,7 +1600,7 @@ altivec_resolve_overloaded_builtin (location_t loc, tree fndecl,
  stmt = build1 (COMPOUND_LITERAL_EXPR, arg1_type, stmt);
}

-  if (TARGET_P8_VECTOR)
+  if (TARGET_P8_VECTOR && TARGET_DIRECT_MOVE_64BIT)
{
  stmt = build_array_ref (loc, stmt, arg2);
  stmt = fold_build2 (MODIFY_EXPR, TREE_TYPE (arg0), stmt,

[Bug target/98827] [11 regression] gcc.target/powerpc/vsx-builtin-7.c assembler counts off after r11-6857

2021-01-25 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98827

--- Comment #1 from luoxhu at gcc dot gnu.org ---
Strange that I only see xxpermdi fail; it should be 4 instead of 12.  rldic
passes for -m64; what is your configuration, please?



=== gcc tests ===

Schedule of variations:
unix/-m32
unix/-m64

Running target unix/-m32
Running /home/luoxhu/workspace/gcc/gcc/testsuite/gcc.target/powerpc/powerpc.exp
...
PASS: gcc.target/powerpc/vsx-builtin-7.c (test for excess errors)
PASS: gcc.target/powerpc/vsx-builtin-7.c scan-assembler-times \\mrldic\\M 0
PASS: gcc.target/powerpc/vsx-builtin-7.c scan-assembler-times vspltisb 2
PASS: gcc.target/powerpc/vsx-builtin-7.c scan-assembler-times vspltish 2
PASS: gcc.target/powerpc/vsx-builtin-7.c scan-assembler-times vspltisw 2
PASS: gcc.target/powerpc/vsx-builtin-7.c scan-assembler-times xxpermdi 4
=== gcc Summary for unix/-m32 ===

# of expected passes            6
Running target unix/-m64
Running /home/luoxhu/workspace/gcc/gcc/testsuite/gcc.target/powerpc/powerpc.exp
...
PASS: gcc.target/powerpc/vsx-builtin-7.c (test for excess errors)
PASS: gcc.target/powerpc/vsx-builtin-7.c scan-assembler-times \\mrldic\\M 64
PASS: gcc.target/powerpc/vsx-builtin-7.c scan-assembler-times vspltisb 2
PASS: gcc.target/powerpc/vsx-builtin-7.c scan-assembler-times vspltish 2
PASS: gcc.target/powerpc/vsx-builtin-7.c scan-assembler-times vspltisw 2
PASS: gcc.target/powerpc/vsx-builtin-7.c scan-assembler-times xxpermdi 4
=== gcc Summary for unix/-m64 ===

# of expected passes            6

=== gcc Summary ===

# of expected passes            12
/home/luoxhu/workspace/build/gcc/xgcc  version 11.0.0 20210125 (experimental)
(GCC)

luoxhu@bns:~/workspace/build$ gcc/xgcc -v
Using built-in specs.
COLLECT_GCC=gcc/xgcc
Target: powerpc64-unknown-linux-gnu
Configured with: ../gcc/configure --enable-languages=c,c++,fortran
--prefix=/home/luoxhu/local/gcc/ --disable-bootstrap --with-cpu=power7
--disable-libsanitizer : (reconfigured) ../gcc/configure
--prefix=/home/luoxhu/local/gcc/ --disable-bootstrap --with-cpu=power7
--disable-libsanitizer CC=/opt/gcc81/bin/gcc CXX=/opt/gcc81/bin/g++
--enable-languages=c,c++,fortran,lto --no-create --no-recursion
Thread model: posix
Supported LTO compression algorithms: zlib
gcc version 11.0.0 20210125 (experimental) (GCC)

[Bug target/98827] [11 regression] gcc.target/powerpc/vsx-builtin-7.c assembler counts off after r11-6857

2021-01-26 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98827

--- Comment #3 from luoxhu at gcc dot gnu.org ---
I understand now: r11-6858 changed the P8 code generation, so the latest
failures changed as well.

https://gcc.gnu.org/pipermail/gcc-testresults/2021-January/651154.html

current failures are:

FAIL: gcc.dg/vect/vect-outer-call-1.c scan-tree-dump vect "OUTER LOOP
VECTORIZED"
FAIL: gcc.dg/vect/vect-strided-a-u8-i2-gap.c -flto -ffat-lto-objects 
scan-tree-dump-times vect "vectorized 1 loops" 1
FAIL: gcc.dg/vect/vect-strided-a-u8-i2-gap.c scan-tree-dump-times vect
"vectorized 1 loops" 1
FAIL: gcc.target/powerpc/20050603-3.c scan-assembler-not mrldic
FAIL: gcc.target/powerpc/rlwimi-2.c scan-assembler-times (?n)^s+[a-z] 20217
FAIL: gcc.target/powerpc/vsx-builtin-7.c scan-assembler-times xxpermdi 11
XPASS: gcc.target/powerpc/ppc-fortran/ieee128-math.f90   -O  (test for excess
errors)

Only xxpermdi needs to be updated to 4.

[Bug target/98799] [11 Regression] vector_set_var ICE

2021-01-28 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98799

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #7 from luoxhu at gcc dot gnu.org ---
Fixed.

[Bug target/98065] [11 Regression] ICE in rs6000_expand_vector_set, at config/rs6000/rs6000.c:7024 since r11-5457

2021-01-28 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98065
Bug 98065 depends on bug 98799, which changed state.

Bug 98799 Summary: [11 Regression] vector_set_var ICE
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98799

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

[Bug target/79251] PowerPC vec_insert generates store-hit-load if the element number is variable

2021-01-28 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79251

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #7 from luoxhu at gcc dot gnu.org ---
Fixed on master.

[Bug target/98093] ICE in gen_vsx_set_v2df, at config/rs6000/vsx.md:3276

2021-02-01 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98093

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #7 from luoxhu at gcc dot gnu.org ---
Fixed.

[Bug target/98914] [11 Regression] ICE in rs6000_expand_vector_set, at config/rs6000/rs6000.c:7198

2021-02-02 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98914

--- Comment #1 from luoxhu at gcc dot gnu.org ---
The type of k in the test case should be "long" to reproduce the issue.

The ICE happens at

rs6000_expand_vector_set:  gcc_assert (GET_MODE (idx) == E_SImode);


The reason is that the vector index variable needs to be "signed int" in all
vec_insert prototypes.


ELFv2 ABI:
vector signed char vec_insert (signed char, vector signed char, signed int);
vector unsigned char vec_insert (unsigned char, vector unsigned char, signed int);
vector signed int vec_insert (signed int, vector signed int, signed int);
vector unsigned int vec_insert (unsigned int, vector unsigned int, signed int);
vector signed long long vec_insert (signed long long, vector signed long long, signed int);


Not sure whether other targets like X86/AArch64 have similar requirements, and
whether the fix below is reasonable.  It avoids generating IFN VEC_SET for a
stmt like

VIEW_CONVERT_EXPR(v)[k_7] = 170;


diff --git a/gcc/gimple-isel.cc b/gcc/gimple-isel.cc
index 2c78a08d3f1..dbbae270a36 100644
--- a/gcc/gimple-isel.cc
+++ b/gcc/gimple-isel.cc
@@ -77,6 +77,7 @@ gimple_expand_vec_set_expr (gimple_stmt_iterator *gsi)
   tree view_op0 = TREE_OPERAND (op0, 0);
   machine_mode outermode = TYPE_MODE (TREE_TYPE (view_op0));
   if (auto_var_in_fn_p (view_op0, cfun->decl)
+ && TYPE_MODE (TREE_TYPE (pos)) == E_SImode
  && !TREE_ADDRESSABLE (view_op0) && can_vec_set_var_idx_p (outermode))
{
  location_t loc = gimple_location (stmt);

[Bug target/98958] ICE in rs6000_expand_vector_set_var_p8, at config/rs6000/rs6000.c:7050

2021-02-03 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98958

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |DUPLICATE

--- Comment #1 from luoxhu at gcc dot gnu.org ---
dup

*** This bug has been marked as a duplicate of bug 98914 ***

[Bug target/98914] [11 Regression] ICE in rs6000_expand_vector_set, at config/rs6000/rs6000.c:7198

2021-02-03 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98914

--- Comment #4 from luoxhu at gcc dot gnu.org ---
*** Bug 98958 has been marked as a duplicate of this bug. ***

[Bug middle-end/90323] powerpc should convert equivalent sequences to vec_sel()

2021-04-29 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90323

--- Comment #16 from luoxhu at gcc dot gnu.org ---

> +2016-11-09  Segher Boessenkool  
> +
> +   * simplify-rtx.c (simplify_binary_operation_1): Simplify
> +   (xor (and (xor A B) C) B) to (ior (and A C) (and B ~C)) and
> +   (xor (and (xor A B) C) A) to (ior (and A ~C) (and B C)) if C
> +   is a const_int.


Is it a MUST that C be a constant here? For the case in PR90323, C is actually
not a constant.

l = l & ~mask;
l |= mask & r;

Trying 8, 9 -> 10:
8: r127:V4SI=r124:V4SI^r131:V4SI
  REG_DEAD r131:V4SI
9: r122:V4SI=r127:V4SI&r130:V4SI
  REG_DEAD r130:V4SI
  REG_DEAD r127:V4SI
   10: r128:V4SI=r124:V4SI^r122:V4SI
  REG_DEAD r124:V4SI
  REG_DEAD r122:V4SI
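
As a sanity check on the question above, a scalar sketch of the identity for
the insn sequence in the dump (A = r124, B = r131, C = r130), with made-up
function names and 32-bit integers standing in for V4SI lanes.  Nothing in the
bitwise reasoning depends on C being a constant; whether the rewrite is
profitable for non-constant C is a separate question.

```c
#include <stdint.h>

/* (xor (and (xor A B) C) A): insns 8, 9 and 10 in the dump.  */
uint32_t select_xor_form (uint32_t a, uint32_t b, uint32_t c)
{
  return ((a ^ b) & c) ^ a;
}

/* (ior (and A ~C) (and B C)): the form the ChangeLog entry rewrites to.
   Each result bit is B's bit where C is 1 and A's bit where C is 0.  */
uint32_t select_ior_form (uint32_t a, uint32_t b, uint32_t c)
{
  return (a & ~c) | (b & c);
}
```

For example, with a = 0xF0F0F0F0, b = 0x12345678 and c = 0x00FFFF00, both
functions return 0xF03456F0.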

[Bug middle-end/90323] powerpc should convert equivalent sequences to vec_sel()

2021-04-29 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90323

--- Comment #17 from luoxhu at gcc dot gnu.org ---
If the constant limitation is removed, it could be combined successfully with
my new patch for PR94613.

https://gcc.gnu.org/pipermail/gcc-patches/2021-April/569255.html

And what do you mean by "This is not canonical form on RTL, and it's not a
useful form either" in c#7, please? I don't understand the point...


Trying 11 -> 16:
   11: r124:V4SI=r127:V4SI&r129:V4SI|~r129:V4SI&r128:V4SI
  REG_DEAD r128:V4SI
  REG_DEAD r129:V4SI
  REG_DEAD r127:V4SI
   16: %v2:V4SI=r124:V4SI
  REG_DEAD r124:V4SI
Successfully matched this instruction:
(set (reg/i:V4SI 66 %v2)
(ior:V4SI (and:V4SI (reg:V4SI 127)
(reg:V4SI 129))
(and:V4SI (not:V4SI (reg:V4SI 129))
(reg:V4SI 128
allowing combination of insns 11 and 16
original costs 4 + 4 = 8
replacement cost 4
deferring deletion of insn with uid = 11.
modifying insn i316: %v2:V4SI=r127:V4SI&r129:V4SI|~r129:V4SI&r128:V4SI
  REG_DEAD r127:V4SI
  REG_DEAD r129:V4SI
  REG_DEAD r128:V4SI
deferring rescan insn with uid = 16.


diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c
index 571e2337e27..701f37eb03e 100644
--- a/gcc/simplify-rtx.c
+++ b/gcc/simplify-rtx.c
@@ -3405,7 +3405,6 @@ simplify_context::simplify_binary_operation_1 (rtx_code code,
 machines, and also has shorter instruction path length.  */
   if (GET_CODE (op0) == AND
  && GET_CODE (XEXP (op0, 0)) == XOR
- && CONST_INT_P (XEXP (op0, 1))
  && rtx_equal_p (XEXP (XEXP (op0, 0), 0), trueop1))
{
  rtx a = trueop1;
@@ -3419,7 +3418,6 @@ simplify_context::simplify_binary_operation_1 (rtx_code code,
   /* Similarly, (xor (and (xor A B) C) B) as (ior (and A C) (and B ~C))  */
   else if (GET_CODE (op0) == AND
  && GET_CODE (XEXP (op0, 0)) == XOR
- && CONST_INT_P (XEXP (op0, 1))
  && rtx_equal_p (XEXP (XEXP (op0, 0), 1), trueop1))
{
  rtx a = XEXP (XEXP (op0, 0), 0);

[Bug target/100085] Bad code for union transfer from __float128 to vector types

2021-05-23 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100085

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 CC||luoxhu at gcc dot gnu.org

--- Comment #7 from luoxhu at gcc dot gnu.org ---
(In reply to Segher Boessenkool from comment #3)
> The rotates in 6 and 7 are not merged, and neither are the vec_selects in
> 8 and 9.  Both should be pretty easy to do, there is no unspec in sight,
> etc.

Should this be done in pass bswaps or combine or by peephole2? :)

[Bug target/97142] __builtin_fmod not optimized on POWER

2021-05-26 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97142

--- Comment #12 from luoxhu at gcc dot gnu.org ---
Patch submitted:

https://gcc.gnu.org/pipermail/gcc-patches/2021-April/568143.html

[Bug target/94613] S/390, powerpc: Wrong code generated for vec_sel builtin

2021-05-26 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94613

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 CC||luoxhu at gcc dot gnu.org

--- Comment #14 from luoxhu at gcc dot gnu.org ---
Patch submitted:

https://gcc.gnu.org/pipermail/gcc-patches/2021-April/569255.html

[Bug target/100085] Bad code for union transfer from __float128 to vector types

2021-06-02 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100085

--- Comment #9 from luoxhu at gcc dot gnu.org ---
Patch sent; it could fix the __float128 to vector __int128 issue:

https://gcc.gnu.org/pipermail/gcc-patches/2021-June/571689.html


But for the __float128 to __int128 case mentioned in c#4, rs6000_modes_tieable_p
needs to be hacked to remove the stack operation in dse1.  I am not sure this is
*LEGAL*, though, since TImode is allocated to GPRs; it does not seem right to
access TImode from ALTIVEC or VSX registers without copying?

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index ad11b67b125..ee69463ac46 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -1974,6 +1974,9 @@ rs6000_modes_tieable_p (machine_mode mode1, machine_mode mode2)
   || mode2 == PTImode || mode2 == OOmode || mode2 == XOmode)
 return mode1 == mode2;

+  if (mode1 == TImode && ALTIVEC_OR_VSX_VECTOR_MODE (mode2))
+return true;
+


xxpermdi %vs0,%vs34,%vs34,3
mfvsrd %r4,%vs34
mfvsrd %r3,%vs0

[Bug target/100085] Bad code for union transfer from __float128 to vector types

2021-06-08 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100085

--- Comment #10 from luoxhu at gcc dot gnu.org ---
float128 to vector __int128 is fixed by:

https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=f700e4b0ee3ef53b48975cf89be26b9177e3a3f3

[Bug testsuite/101020] [12 regression] Several test case failures after r12-1316

2021-06-10 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101020

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 CC||segher at gcc dot gnu.org,
   ||segher at kernel dot crashing.org

--- Comment #1 from luoxhu at gcc dot gnu.org ---
Confirmed. The BE-m32 test is a nightmare to me... :(

For float128-call.c, we need to check whether the target is BE or LE.
For pr100085.c, vector __int128 is not supported with -m32, so just skip it.
OK for trunk?


[PATCH] rs6000: Fix test case failures by PR100085 [PR101020]

gcc/testsuite/ChangeLog:

PR target/101020
* gcc.target/powerpc/float128-call.c: Adjust.
* gcc.target/powerpc/pr100085.c: Likewise.
---
 gcc/testsuite/gcc.target/powerpc/float128-call.c | 6 --
 gcc/testsuite/gcc.target/powerpc/pr100085.c  | 2 +-
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/gcc/testsuite/gcc.target/powerpc/float128-call.c b/gcc/testsuite/gcc.target/powerpc/float128-call.c
index a1f09df..b64ffc6 100644
--- a/gcc/testsuite/gcc.target/powerpc/float128-call.c
+++ b/gcc/testsuite/gcc.target/powerpc/float128-call.c
@@ -21,5 +21,7 @@
 TYPE one (void) { return ONE; }
 void store (TYPE a, TYPE *p) { *p = a; }

-/* { dg-final { scan-assembler "lvx 2"  } } */
-/* { dg-final { scan-assembler "stvx 2" } } */
+/* { dg-final { scan-assembler {\mlxvd2x 34\M} {target be} } } */
+/* { dg-final { scan-assembler {\mstxvd2x 34\M} {target be} } } */
+/* { dg-final { scan-assembler {\mlvx 2\M} {target le} } }  */
+/* { dg-final { scan-assembler {\mstvx 2\M} {target le} } } */
diff --git a/gcc/testsuite/gcc.target/powerpc/pr100085.c b/gcc/testsuite/gcc.target/powerpc/pr100085.c
index 7d8b147..b6738ea 100644
--- a/gcc/testsuite/gcc.target/powerpc/pr100085.c
+++ b/gcc/testsuite/gcc.target/powerpc/pr100085.c
@@ -1,4 +1,4 @@
-/* { dg-do compile } */
+/* { dg-do compile {target lp64} } */
 /* { dg-options "-O2 -mdejagnu-cpu=power8" } */

[Bug target/100866] PPC: Inefficient code for vec_revb(vector unsigned short) < P9

2021-06-15 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100866

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 CC||luoxhu at gcc dot gnu.org

--- Comment #2 from luoxhu at gcc dot gnu.org ---
But this only works for V8HImode.  Is there no better code generation for other
modes like V4SI/V2DI/V1TI, where the byte swap cannot be done with just the two
instructions vspltish+vrlh?

  unsigned int swap1[16] = {15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0};
  unsigned int swap2[16] = {7,6,5,4,3,2,1,0,15,14,13,12,11,10,9,8};
  unsigned int swap4[16] = {3,2,1,0,7,6,5,4,11,10,9,8,15,14,13,12};
  unsigned int swap8[16] = {1,0,3,2,5,4,7,6,9,8,11,10,13,12,15,14};

For V4SI, for example, we need to swap within halfwords first and then swap the
halfwords within each word, which does not seem as straightforward as vperm?
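
For reference, a sketch of how the four selectors above can be derived from
the element count.  The function name and the "elems" parameter are
hypothetical, and this is not what swap_endian_selector_for_mode does
internally; it only reproduces the index patterns.

```c
#include <stdint.h>

/* Fill sel[0..15] with the byte-reversal selector for a 16-byte vector
   split into `elems` equal elements (elems is 1, 2, 4 or 8, matching
   swap1, swap2, swap4 and swap8 above).  */
void make_swap_selector (unsigned elems, uint8_t sel[16])
{
  unsigned width = 16 / elems;            /* bytes per element */
  for (unsigned i = 0; i < 16; i++)
    {
      unsigned base = i / width * width;  /* first byte of this element */
      sel[i] = (uint8_t) (base + (width - 1 - i % width));
    }
}
```

Calling make_swap_selector (4, sel) gives {3,2,1,0,7,6,5,4,...}, i.e. swap4.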

[Bug target/100866] PPC: Inefficient code for vec_revb(vector unsigned short) < P9

2021-06-15 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100866

--- Comment #3 from luoxhu at gcc dot gnu.org ---

diff --git a/gcc/config/rs6000/altivec.md b/gcc/config/rs6000/altivec.md
index 097a127be07..35b3f1a0e1a 100644
--- a/gcc/config/rs6000/altivec.md
+++ b/gcc/config/rs6000/altivec.md
@@ -1932,7 +1932,7 @@ (define_insn "altivec_vpkuum_direct"
 }
   [(set_attr "type" "vecperm")])

-(define_insn "*altivec_vrl"
+(define_insn "altivec_vrl"
   [(set (match_operand:VI2 0 "register_operand" "=v")
 (rotate:VI2 (match_operand:VI2 1 "register_operand" "v")
(match_operand:VI2 2 "register_operand" "v")))]
diff --git a/gcc/config/rs6000/vsx.md b/gcc/config/rs6000/vsx.md
index 8c5865b8c34..88b34a2285a 100644
--- a/gcc/config/rs6000/vsx.md
+++ b/gcc/config/rs6000/vsx.md
@@ -5849,9 +5849,18 @@ (define_expand "revb_"
   /* Want to have the elements in reverse order relative
 to the endian mode in use, i.e. in LE mode, put elements
 in BE order.  */
-  rtx sel = swap_endian_selector_for_mode(mode);
-  emit_insn (gen_altivec_vperm_ (operands[0], operands[1],
-  operands[1], sel));
+  if (mode == V8HImode)
+   {
+ rtx splt = gen_reg_rtx (V8HImode);
+ emit_insn (gen_altivec_vspltish (splt, GEN_INT (8)));
+ emit_insn (gen_altivec_vrlh (operands[0], operands[1], splt));
+   }
+  else
+   {
+ rtx sel = swap_endian_selector_for_mode ( mode);
+ emit_insn (gen_altivec_vperm_ (operands[0], operands[1],
+  operands[1], sel));
+   }
 }


With the above change, it could generate the expected code:

revb:
.LFB0:
.cfi_startproc
vspltisw 0,8
vrlw 2,2,0
blr

[Bug testsuite/101020] [12 regression] Several test case failures after r12-1316

2021-06-15 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101020

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|NEW |RESOLVED

--- Comment #4 from luoxhu at gcc dot gnu.org ---
Fixed.

[Bug target/100866] PPC: Inefficient code for vec_revb(vector unsigned short) < P9

2021-06-15 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100866

--- Comment #5 from luoxhu at gcc dot gnu.org ---
(In reply to Segher Boessenkool from comment #4)
> This PR is specifically about the vec_revb builtin.  But yes, we should
> look at what is generated for all other code (having only the builtin
> generate good code is suboptimal for a generic thing like this), and for
> other sizes as well.

Sorry, I don't quite understand what you mean.  IMO vec_revb is expanded by
CODE_FOR_revb_v8hi through the revb_ pattern, so this is where we should make
the change for better code generation...
For V8HI, it is natural to use vspltish 8 + vrlh to turn
{0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15} to
{1,0,3,2,5,4,7,6,9,8,11,10,13,12,15,14}.

But for V4SI, we need to use vspltish+vrlh to turn it to
{1,0,3,2,5,4,7,6,9,8,11,10,13,12,15,14} first, and then a "vrlw 16" to turn it
to {3,2,1,0,7,6,5,4,11,10,9,8,15,14,13,12}.  I am not sure whether this is
better than lvx+xxlnor+vperm, especially for V2DI and V1TI, which need an
additional "vrld 32" or "vrld 32"+"vrlq 64"?  (Those are all operations on
registers, without a load from memory like lvx.)
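
The rotate chain can be modelled on scalars, one lane at a time.  This is only
a sketch with made-up helper names, not the vector code itself: each rotl16
stands for one halfword lane of vrlh and each rotl32 for one word lane of vrlw.

```c
#include <stdint.h>

static uint16_t rotl16 (uint16_t x, unsigned n)
{
  n &= 15;
  return (uint16_t) ((x << n) | (x >> ((16 - n) & 15)));
}

static uint32_t rotl32 (uint32_t x, unsigned n)
{
  n &= 31;
  return (x << n) | (x >> ((32 - n) & 31));
}

/* V8HI lane: a rotate by 8 swaps the two bytes.  */
uint16_t revb_halfword (uint16_t h)
{
  return rotl16 (h, 8);
}

/* V4SI lane: swap bytes inside each halfword, then rotate the word by 16.  */
uint32_t revb_word (uint32_t w)
{
  uint16_t hi = rotl16 ((uint16_t) (w >> 16), 8);
  uint16_t lo = rotl16 ((uint16_t) w, 8);
  return rotl32 (((uint32_t) hi << 16) | lo, 16);
}

/* V2DI lane: the word steps above, then a rotate by 32 (swap the words).  */
uint64_t revb_doubleword (uint64_t d)
{
  uint64_t hi = revb_word ((uint32_t) (d >> 32));
  uint64_t lo = revb_word ((uint32_t) d);
  return (lo << 32) | hi;
}
```

Each wider element size adds one more rotate step, which is why the
instruction count grows from V8HI to V4SI to V2DI.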


bt 5
#0  gen_revb_v8hi (operand0=0x74d4ce40, operand1=0x74d4cf60) at
../../gcc/gcc/config/rs6000/vsx.md:5858
#1  0x10b05360 in insn_gen_fn::operator()
(this=0x130ab188 ) at../../gcc/gcc/recog.h:407
#2  0x11aa1e30 in rs6000_expand_unop_builtin (icode=CODE_FOR_revb_v8hi,
exp=
, target=0x74d4ce40) at ../../gcc/gcc/config/rs6000/rs6000-call.c:9451
#3  0x11ab27a4 in rs6000_expand_builtin (exp=, target=0x74d4ce40, subtarget=0x0, mode=E_V8HImode,
ignore=0) at ../../gcc/gcc/config/rs6000/rs6000-call.c:13157
#4  0x10815268 in expand_builtin (exp=,
target=0x74d4ce40, subtarget=0x0, mode=E_V8HImode, ignore=0) at
../../gcc/gcc/builtins.c:9559

[Bug target/93571] PPC: fmr gets used instead of faster xxlor

2021-06-15 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93571

luoxhu at gcc dot gnu.org changed:

   What|Removed |Added

 CC||luoxhu at gcc dot gnu.org

--- Comment #2 from luoxhu at gcc dot gnu.org ---
It is generated by "*mov_hardfloat64" (i.e. {*movdf_hardfloat64}).  Switching
the constraints of fmr and xxlor could generate the expected code; is that correct?

[Bug target/93571] PPC: fmr gets used instead of faster xxlor

2021-06-15 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93571

--- Comment #3 from luoxhu at gcc dot gnu.org ---
BTW, I didn't see a performance difference between fmr and xxlor in a small
benchmark.

        Max Ops Per Cycle   Latency (Min)   Latency (Max)

fmr     -   -   ALU FPR 4   2   2   1   R   -   -   -   -   Floating Move Register

xxlor   -   -   ALU VSR 2   2   2   1   V   -   1   S   -   -   VSX Vector Logical OR

[Bug target/100866] PPC: Inefficient code for vec_revb(vector unsigned short) < P9

2021-06-17 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100866

--- Comment #6 from luoxhu at gcc dot gnu.org ---
For V4SI, it is also better to use vector splat and vector rotate operations.

revb:
.LFB0:
.cfi_startproc
vspltish %v1,8
vspltisw %v0,-16
vrlh %v2,%v2,%v1
vrlw %v2,%v2,%v0
blr


Performance improved from 7.322s to 2.445s in a small benchmark because the
load instruction is replaced.
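
On the "vspltisw %v0,-16" above: the splat immediate is a 5-bit signed value
(-16..15), so 16 cannot be encoded directly, but vrlw only uses the low 5 bits
of each count word, and -16 has the same low 5 bits as 16.  A scalar sketch of
one word lane, with a hypothetical function name:

```c
#include <stdint.h>

/* One word lane of "vspltisw vT,sim ; vrlw vD,vA,vT": the splat
   sign-extends the immediate, the rotate masks the count to 5 bits.  */
uint32_t rotate_by_splat_imm (uint32_t x, int sim)
{
  unsigned n = (uint32_t) sim & 31;   /* -16 becomes 16 */
  return (x << n) | (x >> ((32 - n) & 31));
}
```

So rotate_by_splat_imm (x, -16) is the rotate by 16 that swaps the two
halfwords of the word.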

But for V2DI, we don't have "vspltisd" to splat {32,32} to vector register
before Power9, so lvx is still required?

vector unsigned long long revb_pwr7_l(vector unsigned long long a)
{
 return vec_rl(a, vec_splats((unsigned long long)32));
} 

generates:

revb_pwr7_l:
.LFB1:
.cfi_startproc
.LCF1:
0:  addis 2,12,.TOC.-.LCF1@ha
addi 2,2,.TOC.-.LCF1@l
.localentry revb_pwr7_l,.-revb_pwr7_l
addis %r9,%r2,.LC0@toc@ha
addi %r9,%r9,.LC0@toc@l
lvx %v0,0,%r9
vrld %v2,%v2,%v0
blr
.LC0:
.quad   32
.quad   32
.align 4

[Bug target/100866] PPC: Inefficient code for vec_revb(vector unsigned short) < P9

2021-06-20 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100866

--- Comment #8 from luoxhu at gcc dot gnu.org ---
(In reply to Jens Seifert from comment #7)
> Regarding vec_revb for vector unsigned int. I agree that
> revb:
> .LFB0:
> .cfi_startproc
> vspltish %v1,8
> vspltisw %v0,-16
> vrlh %v2,%v2,%v1
> vrlw %v2,%v2,%v0
> blr
> 
> works. But in this case, I would prefer the vperm approach assuming that the
> loaded constant for the permute vector can be re-used multiple times.
> But please get rid of the xxlnor 32,32,32. That does not make sense after
> loading a constant. Change the constant that need to be loaded.

xxlnor is an LE-specific requirement (it does not exist when building with
-mbig).  We need to turn the index {0,1,2,3} into {31,30,29,28} for vperm; it
is required, otherwise the result is incorrect:

 6|0x1630 <+16>:lvx v0,0,r9
 7+>   0x1634 <+20>:xxlnor  vs32,vs32,vs32
 8|0x1638 <+24>:vperm   v2,v2,v2,v0
 9|0x163c <+28>:blr

(gdb)
0x1634 in revb ()
2: /x $vs34.uint128 = 0x42345678323456782234567812345678
5: /x $vs32.uint128 = 0xc0d0e0f08090a0b0405060700010203
(gdb) si
0x1638 in revb ()
2: /x $vs34.uint128 = 0x42345678323456782234567812345678
5: /x $vs32.uint128 = 0xf3f2f1f0f7f6f5f4fbfaf9f8fffefdfc
(gdb) si
0x163c in revb ()
2: /x $vs34.uint128 = 0x78563442785634327856342278563412
5: /x $vs32.uint128 = 0xf3f2f1f0f7f6f5f4fbfaf9f8fffefdfc



Quoted from the ISA:

vperm VRT,VRA,VRB,VRC

vsrc.qword[0] ← VSR[VRA+32]
vsrc.qword[1] ← VSR[VRB+32]
do i = 0 to 15
index ← VSR[VRC+32].byte[i].bit[3:7]
VSR[VRT+32].byte[i] ← src.byte[index]
end

Let the source vector be the concatenation of the
contents of VSR[VRA+32] followed by the contents of
VSR[VRB+32].
For each integer value i from 0 to 15, do the following.
Let index be the value specified by bits 3:7 of byte
element i of VSR[VRC+32].
The contents of byte element index of src are
placed into byte element i of VSR[VRT+32].
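
A scalar model of the pseudocode above, as a sketch only.  The arrays use the
ISA's big-endian byte numbering (element 0 is the leftmost byte), and the
function names are made up.

```c
#include <stdint.h>
#include <string.h>

/* vperm VRT,VRA,VRB,VRC on 16-byte arrays in ISA (big-endian) byte order.  */
void vperm_model (const uint8_t vra[16], const uint8_t vrb[16],
                  const uint8_t vrc[16], uint8_t vrt[16])
{
  uint8_t src[32];
  memcpy (src, vra, 16);       /* src.qword[0] */
  memcpy (src + 16, vrb, 16);  /* src.qword[1] */
  for (int i = 0; i < 16; i++)
    vrt[i] = src[vrc[i] & 0x1f];  /* bits 3:7 are the low 5 bits */
}

/* Effect of xxlnor on one selector byte as vperm sees it: complementing
   the byte and keeping the low 5 bits maps index i to 31 - i.  */
uint8_t complement_index (uint8_t index)
{
  return (uint8_t) (~index & 0x1f);
}
```

This matches the gdb session: selector byte 0x0c becomes 0xf3 after xxlnor,
and 0xf3 & 0x1f is 19, which is 31 - 12.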

[Bug target/100866] PPC: Inefficient code for vec_revb(vector unsigned short) < P9

2021-06-21 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100866

--- Comment #13 from luoxhu at gcc dot gnu.org ---
This is not visible to combine because the constant data lives in *.LC0 and the
permute is an UNSPEC_VPERM. I will shelve this and switch to other
higher-priority issues.

pr100866.c.277r.combine:

(note 4 0 20 2 [bb 2] NOTE_INSN_BASIC_BLOCK)
(insn 20 4 2 2 (set (reg:V8HI 126)
(reg:V8HI 66 %v2 [ a ])) "pr100866.c":18:1 1132 {vsx_movv8hi_64bit}
 (expr_list:REG_DEAD (reg:V8HI 66 %v2 [ a ])
(nil)))
(note 2 20 3 2 NOTE_INSN_DELETED)
(note 3 2 6 2 NOTE_INSN_FUNCTION_BEG)
(insn 6 3 18 2 (set (reg/f:DI 122)
(unspec:DI [
(symbol_ref/u:DI ("*.LC0") [flags 0x82])
(reg:DI 2 %r2)
] UNSPEC_TOCREL)) "pr100866.c":19:13 719 {*tocrefdi}
 (expr_list:REG_EQUAL (symbol_ref/u:DI ("*.LC0") [flags 0x82])
(nil)))
(insn 18 6 9 2 (set (reg:V16QI 123)
(mem/u/c:V16QI (and:DI (reg/f:DI 122)
(const_int -16 [0xfff0])) [0  S16 A128]))
"pr100866.c":19:13 1131 {vsx_movv16qi_64bit}
 (expr_list:REG_DEAD (reg/f:DI 122)
(nil)))
(insn 9 18 10 2 (set (reg:V16QI 124)
(not:V16QI (reg:V16QI 123))) "pr100866.c":19:13 508 {one_cmplv16qi2}
 (expr_list:REG_DEAD (reg:V16QI 123)
(nil)))
(note 10 9 15 2 NOTE_INSN_DELETED)
(insn 15 10 16 2 (set (reg/i:V8HI 66 %v2)
(unspec:V8HI [
(reg:V8HI 126) repeated x2
(reg:V16QI 124)
] UNSPEC_VPERM)) "pr100866.c":20:1 1830 {altivec_vperm_v8hi_direct}
 (expr_list:REG_DEAD (reg:V16QI 124)
(expr_list:REG_DEAD (reg:V8HI 126)
(nil))))
(insn 16 15 0 2 (use (reg/i:V8HI 66 %v2)) "pr100866.c":20:1 -1
 (nil))

;; Combiner totals: 12 attempts, 12 substitutions (2 requiring new space),

[Bug middle-end/101250] New: adjust_iv_update_pos update the iv statement unexpectedly cause memory address offset mismatch

2021-06-29 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101250

Bug ID: 101250
   Summary: adjust_iv_update_pos update the iv statement
unexpectedly cause memory address offset mismatch
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: luoxhu at gcc dot gnu.org
  Target Milestone: ---

Test case:

unsigned int foo (unsigned char *ip, unsigned char *ref, unsigned int maxlen)
{
  unsigned int len = 2;
  do {
  len++;
  }while(len < maxlen && ip[len] == ref[len]);
  return len;
}


ivopts:

   [local count: 1014686026]:
  _3 = MEM[(unsigned char *)ip_10(D) + ivtmp.16_15 * 1];
  ivtmp.16_16 = ivtmp.16_15 + 1;
  _19 = ref_12(D) + 18446744073709551615;
  _6 = MEM[(unsigned char *)_19 + ivtmp.16_16 * 1];
  if (_3 == _6)
goto ; [94.50%]
  else
goto ; [5.50%]

Disabling adjust_iv_update_pos produces:

   [local count: 1014686026]:
  _3 = MEM[(unsigned char *)ip_10(D) + ivtmp.16_15 * 1];
  _6 = MEM[(unsigned char *)ref_12(D) + ivtmp.16_15 * 1];
  ivtmp.16_16 = ivtmp.16_15 + 1;
  if (_3 == _6)
goto ; [94.50%]
  else
goto ; [5.50%]
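
The two ivopts results can be written back as C to make the difference visible. This is only a sketch of the dumps, not compiler output: foo_plain and foo_iv_moved are hypothetical names, and the (ref - 1) base is expressed as an index adjustment so the C stays well defined.

```c
#include <stddef.h>

/* Shape of the dump with adjust_iv_update_pos disabled: both loads use
   the old iv value and the same kind of address.  */
unsigned int foo_plain(const unsigned char *ip, const unsigned char *ref,
                       unsigned int maxlen)
{
    unsigned int len = 2;
    size_t iv = 3;
    for (;;) {
        len++;
        if (len >= maxlen)
            break;
        unsigned char a = ip[iv];
        unsigned char b = ref[iv];
        iv = iv + 1;
        if (a != b)
            break;
    }
    return len;
}

/* Shape of the dump with the iv update moved between the loads: the
   second load uses the new iv value and needs a base of ref - 1.  */
unsigned int foo_iv_moved(const unsigned char *ip, const unsigned char *ref,
                          unsigned int maxlen)
{
    unsigned int len = 2;
    size_t iv = 3;
    for (;;) {
        len++;
        if (len >= maxlen)
            break;
        unsigned char a = ip[iv];
        size_t iv_next = iv + 1;             /* update sits between the loads */
        unsigned char b = ref[iv_next - 1];  /* MEM[(ref - 1) + iv_next] */
        iv = iv_next;
        if (a != b)
            break;
    }
    return len;
}
```

Both functions return the same value; the only difference is that the second form needs the extra ref - 1 base, so the two loads no longer share one address form.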


discussions:
https://gcc.gnu.org/pipermail/gcc-patches/2021-June/573709.html

[Bug lto/105133] New: lto/gold: lto failed to link --start-lib/--end-lib in gold

2022-04-01 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105133

Bug ID: 105133
   Summary: lto/gold: lto failed to link --start-lib/--end-lib in
gold
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: lto
  Assignee: unassigned at gcc dot gnu.org
  Reporter: luoxhu at gcc dot gnu.org
CC: marxin at gcc dot gnu.org
  Target Milestone: ---

Hi, the gold linker supports --start-lib and --end-lib, which "mimics the
semantics of static libraries, but without needing to actually create
the archive file." (https://reviews.llvm.org/D66848).  A large application
may pull in multiple libraries from different repositories that contain the
same source code, and they all end up linked into one binary.  I recently hit
a link error with gold as the linker and reduced it to the example below:

cat hello.c
extern int hello(int a);
int main(void)
{
  return 0; /* hello(10); */
}

cat ./B/libhello.c
#include <stdio.h>
int hello(int a)
{
   puts("Hello");
   return 0;
}

cat ./C/libhello.c
#include <stdio.h>
int hello(int a)
{
   puts("Hello");
   return 0;
}


(1) Non-LTO link with gold is OK:

gcc -O2 -o ./B/libhello.c.o   -c ./B/libhello.c
gcc-ar qc ./B/libhello.a  ./B/libhello.c.o
gcc-ranlib ./B/libhello.a
gcc -O2 -o ./C/libhello.c.o   -c ./C/libhello.c
gcc-ar qc ./C/libhello.a  ./C/libhello.c.o
gcc-ranlib ./C/libhello.a
gcc hello.c -o hello.o -c -O2
gcc -o hellow hello.o -Wl,--start-lib ./B/libhello.c.o  -Wl,--end-lib \
  -Wl,--start-lib ./C/libhello.c.o -Wl,--end-lib -O2 -fuse-ld=gold


(2) LTO link with gold fails with a redefinition error:
gcc -O2 -flto  -o ./B/libhello.c.o   -c ./B/libhello.c
gcc-ar qc ./B/libhello.a  ./B/libhello.c.o
gcc-ranlib ./B/libhello.a
gcc -O2 -flto  -o ./C/libhello.c.o   -c ./C/libhello.c
gcc-ar qc ./C/libhello.a  ./C/libhello.c.o
gcc-ranlib ./C/libhello.a
gcc hello.c -o hello.o -c -O2 -flto
gcc -o hellow hello.o -Wl,--start-lib ./B/libhello.c.o  -Wl,--end-lib \
  -Wl,--start-lib ./C/libhello.c.o -Wl,--end-lib -O2 -flto -fuse-ld=gold


./B/libhello.c:5:5: error: 'hello' has already been defined
5 | int hello(int a)
  | ^
./B/libhello.c:5:5: note: previously defined here
lto1: fatal error: errors during merging of translation units
compilation terminated.
lto-wrapper: fatal error: gcc returned 1 exit status
compilation terminated.
/usr/bin/ld.gold: fatal error: lto-wrapper failed
collect2: error: ld returned 1 exit status

The error is raised in gcc/lto/lto-symtab.c:lto_symtab_resolve_symbols.
Simply removing the error_at call makes the link succeed, but that may not be
a reasonable fix.

  /* Find the single non-replaceable prevailing symbol and
     diagnose ODR violations.  */
  for (e = first; e; e = e->next_sharing_asm_name)
    {
      if (!lto_symtab_resolve_can_prevail_p (e))
        continue;

      /* If we have a non-replaceable definition it prevails.  */
      if (!lto_symtab_resolve_replaceable_p (e))
        {
          if (prevailing)
            {
              error_at (DECL_SOURCE_LOCATION (e->decl),
                        "%qD has already been defined", e->decl);
              inform (DECL_SOURCE_LOCATION (prevailing->decl),
                      "previously defined here");
            }
          prevailing = e;
        }
    }


cat hellow.res
3
hello.o 2
192 ccb9165e03755470 PREVAILING_DEF main
197 ccb9165e03755470 PREVAILING_DEF_IRONLY s
./B/libhello.c.o 1
205 68e0b97e93a52d7a PREEMPTED_REG hello
./C/libhello.c.o 1
205 18fe2d3482bfb511 PREEMPTED_REG hello


Secondly, if hello.c actually calls hello(10), no error is reported.
The difference is that the resolution type changes from PREEMPTED_REG to
RESOLVED_IR/PREVAILING_DEF_IRONLY:

3
hello.o 3
192 19ef867d12f62129 PREVAILING_DEF main
197 19ef867d12f62129 PREVAILING_DEF_IRONLY s
201 19ef867d12f62129 RESOLVED_IR hello
./B/libhello.c.o 1
205 23c5c855935478ce PREVAILING_DEF_IRONLY hello
./C/libhello.c.o 1
205 abbf050f5c23b448 PREEMPTED_REG hello
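
My reading of why the two resolution files behave differently can be sketched as a toy model. This is an assumption-laden simplification, not the GCC code: the enum and both functions are hypothetical names, and it assumes that a definition marked as prevailing by the linker resolves the chain before the duplicate check runs, while two PREEMPTED_REG definitions both reach that check.

```c
/* Resolution kinds as they appear in the .res files above (subset).  */
enum resolution {
    PREEMPTED_REG,
    PREVAILING_DEF,
    PREVAILING_DEF_IRONLY,
    RESOLVED_IR
};

/* Hypothetical helper: did the linker mark this definition as prevailing?  */
static int is_prevailing(enum resolution r)
{
    return r == PREVAILING_DEF || r == PREVAILING_DEF_IRONLY;
}

/* Toy model for two definitions of the same symbol (the B and C copies
   of hello).  Returns the number of "has already been defined" errors.
   The RESOLVED_IR entry in hello.o is a reference, not a definition, so
   it is not an input here.  */
int duplicate_definition_errors(enum resolution b_copy, enum resolution c_copy)
{
    /* A linker-chosen prevailing definition resolves the chain.  */
    if (is_prevailing(b_copy) || is_prevailing(c_copy))
        return 0;
    /* Otherwise the second definition found is diagnosed.  */
    return 1;
}
```

With the first .res file both copies are PREEMPTED_REG and the model reports one error; with the second, the B copy is PREVAILING_DEF_IRONLY and nothing is diagnosed.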


Is this a valid bug? Thanks.

[Bug lto/105133] lto/gold: lto failed to link --start-lib/--end-lib in gold for duplicate libraries

2022-04-05 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105133

--- Comment #2 from luoxhu at gcc dot gnu.org ---
(In reply to Richard Biener from comment #1)
> (In reply to luoxhu from comment #0)
> > 
> > cat hellow.res
> > 3
> > hello.o 2
> > 192 ccb9165e03755470 PREVAILING_DEF main
> > 197 ccb9165e03755470 PREVAILING_DEF_IRONLY s
> > ./B/libhello.c.o 1
> > 205 68e0b97e93a52d7a PREEMPTED_REG hello
> > ./C/libhello.c.o 1
> > 205 18fe2d3482bfb511 PREEMPTED_REG hello
> 
> This looks like a gold bug - we have 'hello' pre-empted twice but no
> prevailing
> symbol in the IR - are you ending up with fat LTO objects?

These are not fat LTO objects, since I did not pass -ffat-lto-objects when
generating the library:

nm libhello.a

libhello.c.o:
nm: libhello.c.o: plugin needed to handle lto object
0001 C __gnu_lto_slim


> 
> OTOH PREEMPTED_REG seems then handled wrongly by LTO as well - it should
> throw away both copies since the linker told us it found a preempting
> definition in a non-IR object file.  So I'd expect a unresolved reference
> to 'hello' rather than LTO complaining about multiple definitions ...

Will you fix it? :)

> 
> Note gold is really unmaintained, so you should probably avoid using it.

Thanks. Will try lld instead.

[Bug tree-optimization/106293] [13 Regression] 456.hmmer at -Ofast -march=native regressed by 19% on zen2 and zen3 in July 2022

2022-07-25 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106293

--- Comment #4 from luoxhu at gcc dot gnu.org ---
(In reply to Richard Biener from comment #2)
> I can reproduce a regression with -Ofast -march=znver2 running on Haswell as
> well.  -fopt-info doesn't reveal anything interesting besides
> 
> -fast_algorithms.c:133:19: optimized: loop with 2 iterations completely
> unrolled (header execution count 32987933)
> +fast_algorithms.c:133:19: optimized: loop with 2 iterations completely
> unrolled (header execution count 129072791)
> 
> obviously the slowdown is in P7Viterbi.  There's only minimal changes on the
> GIMPLE side, one notable:
> 
> (left column, before)
> 
>   niters_vector_mult_vf.205_2406 = niters.203_442 & 429496729
>   _2408 = (int) niters_vector_mult_vf.205_2406;
>   tmp.206_2407 = k_384 + _2408;
>   _2300 = niters.203_442 & 3;
>   if (_2300 == 0)
>     goto ; [25.00%]
>   else
>     goto ; [75.00%]
> 
>    [local count: 41646173]:
>   # k_2403 = PHI
>   # DEBUG k => k_2403
> 
> (right column, after)
> 
>   _2041 = niters.203_438 & 3;
>   if (_2041 == 0)
>     goto <bb 66>; [25.00%]
>   else
>     goto <bb 36>; [75.00%]
> 
>    [local count: 177683003]:
>   niters_vector_mult_vf.205_2409 = niters.203_438 & 429496729
>   _2411 = (int) niters_vector_mult_vf.205_2409;
>   tmp.206_2410 = k_382 + _2411;
> 
>    [local count: 162950122]:
>   # k_2406 = PHI
> 
> the sink pass now does the transform where it did not do so before.
> 
> That's apparently because of
> 
>   /* If BEST_BB is at the same nesting level, then require it to have
>      significantly lower execution frequency to avoid gratuitous movement.  */
>   if (bb_loop_depth (best_bb) == bb_loop_depth (early_bb)
>       /* If result of comparsion is unknown, prefer EARLY_BB.
>          Thus use !(...>=..) rather than (...<...)  */
>       && !(best_bb->count * 100 >= early_bb->count * threshold))
>     return best_bb;
> 
>   /* No better block found, so return EARLY_BB, which happens to be the
>      statement's original block.  */
>   return early_bb;
> 
> where the SRC count is 96726596 before, 236910671 after and the
> destination count is 72544947 before, 177683003 at the destination after.
> The edge probabilities are 75% vs 25% and param_sink_frequency_threshold
> is exactly 75 as well.  Since 236910671*0.75
> is rounded down it passes the test while the previous state has an exact
> match defeating it.
> 
> It's a little bit of an arbitrary choice,
> 
> diff --git a/gcc/tree-ssa-sink.cc b/gcc/tree-ssa-sink.cc
> index 2e744d6ae50..9b368e13463 100644
> --- a/gcc/tree-ssa-sink.cc
> +++ b/gcc/tree-ssa-sink.cc
> @@ -230,7 +230,7 @@ select_best_block (basic_block early_bb,
>if (bb_loop_depth (best_bb) == bb_loop_depth (early_bb)
>/* If result of comparsion is unknown, prefer EARLY_BB.
>  Thus use !(...>=..) rather than (...<...)  */
> -  && !(best_bb->count * 100 >= early_bb->count * threshold))
> +  && !(best_bb->count * 100 > early_bb->count * threshold))
>  return best_bb;
>  
>/* No better block found, so return EARLY_BB, which happens to be the
> 
> fixes the missed sinking but not the regression :/
> 
> The count differences start to appear in when LC PHI blocks are added
> only for virtuals and then pre-existing 'Invalid sum of incoming counts'
> eventually lead to mismatches.  The 'Invalid sum of incoming counts'
> start with the loop splitting pass.
> 
> fast_algorithms.c:145:10: optimized: loop split
> 
> Xionghu Lou did profile count updates there, not sure if that made things
> worse in this case.
> 
> At least with broken BB counts splitting/unsplitting an edge can propagate
> bogus counts elsewhere it seems.

:( Could you please try reverting cd5ae148c47c6dee05adb19acd6a523f7187be7f and
see whether the performance comes back?
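
The threshold arithmetic from comment #2 can be checked with plain integers. This is only a sketch: the real code compares profile_count values rather than raw 64-bit integers, and sinks_with_ge and sinks_with_gt are hypothetical names for the current test and the test from the proposed diff.

```c
#include <stdbool.h>
#include <stdint.h>

/* Current test shape: sink only if best_bb's count is strictly below
   threshold percent of early_bb's count.  */
bool sinks_with_ge(uint64_t best, uint64_t early, uint64_t threshold)
{
    return !(best * 100 >= early * threshold);
}

/* Test shape from the proposed diff: an exact match also sinks.  */
bool sinks_with_gt(uint64_t best, uint64_t early, uint64_t threshold)
{
    return !(best * 100 > early * threshold);
}
```

With the counts from the comment, the "before" state gives 72544947 * 100 = 96726596 * 75 = 7254494700, an exact match, so sinks_with_ge is false. The "after" state gives 177683003 * 100 = 17768300300 against 236910671 * 75 = 17768300325, so sinks_with_ge is true and the statement is sunk. sinks_with_gt is true in both states.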

[Bug tree-optimization/106293] [13 Regression] 456.hmmer at -Ofast -march=native regressed by 19% on zen2 and zen3 in July 2022

2022-07-25 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106293

--- Comment #5 from luoxhu at gcc dot gnu.org ---
r12-6086

[Bug target/106069] [12/13 Regression] wrong code with -O -fno-tree-forwprop -maltivec on ppc64le

2022-07-25 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106069

--- Comment #12 from luoxhu at gcc dot gnu.org ---
Created attachment 53352
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53352&action=edit
combine

[Bug target/106069] [12/13 Regression] wrong code with -O -fno-tree-forwprop -maltivec on ppc64le

2022-07-25 Thread luoxhu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106069

--- Comment #13 from luoxhu at gcc dot gnu.org ---
Created attachment 53353
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53353&action=edit
after combine
