from:"linkw at gcc dot gnu.org"

[Bug tree-optimization/90332] New test case gcc.dg/vect/slp-reduc-sad-2.c in r270847 fails

2020-03-11 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90332

Kewen Lin  changed:

   What|Removed |Added

 CC||linkw at gcc dot gnu.org

--- Comment #7 from Kewen Lin  ---
(In reply to Richard Biener from comment #5)
> I don't see a vec_initv16qiv8qi on power either, so that might be it -
> there's no
> effective target for building a vector from halves (and I wonder how
> code-generation fares here).
> 
> So an option is to simply xfail for all but x86_64-*-* and i?86-*-* ...
> 
> Or try more fancy code-generation options (build from two large integer
> modes,
> but I don't see vec_initv2didi either).

It's weird, I found that rs6000 already supports vec_initv2didi:
gcc/insn-opinit.c:  { 0x2f0a36, CODE_FOR_vec_initv2didi },

[Bug testsuite/94023] [9 regression] gcc.dg/vect/slp-perm-12.c fails starting with r9-5008

2020-03-16 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94023

Kewen Lin  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #5 from Kewen Lin  ---
Fixed on trunk and backported.

[Bug testsuite/94019] [9 regression] gcc.dg/vect/vect-over-widen-17.c fails starting with g:370c2ebe8fa20e0812cd2d533d4ed38ee2d37c85, r9-1590

2020-03-16 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94019

Kewen Lin  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #7 from Kewen Lin  ---
Fixed on trunk and backported.

[Bug tree-optimization/94043] [9/10 Regression] ICE in superloop_at_depth, at cfgloop.c:78

2020-03-17 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94043

Kewen Lin  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |linkw at gcc dot gnu.org
 Status|NEW |ASSIGNED

--- Comment #3 from Kewen Lin  ---
Yes, it very likely just exposes a latent bug. Anyway, I'll take a first look.

[Bug tree-optimization/94043] [9/10 Regression] ICE in superloop_at_depth, at cfgloop.c:78

2020-03-19 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94043

--- Comment #4 from Kewen Lin  ---
This was only exposed by my commit; it can also be reproduced without my
commit by using -fno-vect-cost-model.

Some loops we have for this case:
;; Loop 1
;;  header 3, latch 10
;;  depth 1, outer 0
;;  nodes: 3 10 8 23 25 34 35 26 29 32 33 38 4 11 37 31

;; Loop 2
;;  header 4, latch 11
;;  depth 2, outer 1
;;  nodes: 4 11

;; Loop 4
;;  header 26, latch 29
;;  depth 2, outer 1
;;  nodes: 26 29


When we version loop 4, as required for the aliasing check, the related
scalar_loop_iters is based on e2.2_31, which is defined in BB 4, that is:

  <bb 4> [local count: 4343773762]:
  # e2.2_31 = PHI <_15(11), 1(37)>
  # ivtmp_14 = PHI 


For this code:

if ((def_bb = gimple_bb (SSA_NAME_DEF_STMT (USE_FROM_PTR (use_p))))
    && flow_bb_inside_loop_p (outermost, def_bb))
  outermost = superloop_at_depth (loop, bb_loop_depth (def_bb) + 1);

bb_loop_depth is 2, so the +1 makes the assertion in superloop_at_depth fail,
since the current loop 4 only has depth 2. I think the existing code assumes
that all operands of the stmts in cond_expr_stmt_list are defined in some
outer loop of the current loop, but that assumption breaks in this case.

I guess the current scalar_loop_iters is valid? Then the fix could be:

--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -3312,7 +3312,13 @@ vect_loop_versioning (loop_vec_info loop_vinfo,
       FOR_EACH_SSA_USE_OPERAND (use_p, stmt, iter, SSA_OP_USE)
         if ((def_bb = gimple_bb (SSA_NAME_DEF_STMT (USE_FROM_PTR (use_p))))
             && flow_bb_inside_loop_p (outermost, def_bb))
-          outermost = superloop_at_depth (loop, loop_depth (outermost) + 1);
+          {
+            /* Def block can be in either one outer loop of loop_to_version or
+               one sibling of outer loop of loop_to_version.  */
+            class loop *common_loop
+              = find_common_loop (def_bb->loop_father, loop);
+            outermost = superloop_at_depth (loop, loop_depth (common_loop) + 1);
+          }
     }
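
To make the depth arithmetic concrete, here is a minimal sketch of the loop
tree from the dump above. It is a toy model under stated assumptions, not
GCC's implementation: struct toy_loop and every toy_* name are hypothetical,
and the toy lookup returns NULL where the real superloop_at_depth asserts.

```c
#include <stddef.h>

/* Toy stand-in for a loop-tree node; not GCC's real "class loop".  */
struct toy_loop
{
  int num;                 /* loop number as printed in the dump */
  int depth;               /* nesting depth, 0 for the function root */
  struct toy_loop *outer;  /* enclosing loop, NULL for the root */
};

/* The loops from the dump: loops 2 and 4 are siblings inside loop 1.  */
static struct toy_loop toy_root = { 0, 0, NULL };
static struct toy_loop toy_loop1 = { 1, 1, &toy_root };
static struct toy_loop toy_loop2 = { 2, 2, &toy_loop1 };
static struct toy_loop toy_loop4 = { 4, 2, &toy_loop1 };

/* Return the ancestor of LOOP (or LOOP itself) at DEPTH.  The toy version
   returns NULL when DEPTH exceeds LOOP's own depth; the real function
   asserts instead, which is the ICE discussed here.  */
static struct toy_loop *
toy_superloop_at_depth (struct toy_loop *loop, int depth)
{
  if (depth < 0 || depth > loop->depth)
    return NULL;
  while (loop->depth > depth)
    loop = loop->outer;
  return loop;
}

/* Return the innermost loop that contains both A and B.  */
static struct toy_loop *
toy_find_common_loop (struct toy_loop *a, struct toy_loop *b)
{
  while (a->depth > b->depth)
    a = a->outer;
  while (b->depth > a->depth)
    b = b->outer;
  while (a != b)
    {
      a = a->outer;
      b = b->outer;
    }
  return a;
}
```

With the def block in loop 2, toy_superloop_at_depth (&toy_loop4,
toy_loop2.depth + 1) asks for depth 3 above a depth-2 loop and gets NULL.
Going through toy_find_common_loop (&toy_loop2, &toy_loop4) first gives
loop 1, so the request becomes depth 2 and resolves to loop 4 itself.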

[Bug tree-optimization/94043] [9/10 Regression] ICE in superloop_at_depth, at cfgloop.c:78

2020-03-20 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94043

--- Comment #6 from Kewen Lin  ---
(In reply to rguent...@suse.de from comment #5)
> On Fri, 20 Mar 2020, linkw at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94043
> > 
> > --- Comment #4 from Kewen Lin  ---
> > This was just exposed from my commit, it can also be reproduced without my
> > commit but with -fno-vect-cost-model.
> > 
> > Some loops we have for this case:
> > ;; Loop 1
> > ;;  header 3, latch 10
> > ;;  depth 1, outer 0
> > ;;  nodes: 3 10 8 23 25 34 35 26 29 32 33 38 4 11 37 31
> > 
> > ;; Loop 2
> > ;;  header 4, latch 11
> > ;;  depth 2, outer 1
> > ;;  nodes: 4 11
> > 
> > ;; Loop 4
> > ;;  header 26, latch 29
> > ;;  depth 2, outer 1
> > ;;  nodes: 26 29
> > 
> > 
> > When we are doing versioning for loop4 required for aliasing check, the 
> > related
> >  scalar_loop_iters is based on e2.2_31, which is defined in BB 4, that is:
> > 
> >[local count: 4343773762]:
> >   # e2.2_31 = PHI <_15(11), 1(37)>
> >   # ivtmp_14 = PHI 
> > 
> > 
> > For the codes:
> > 
> > if ((def_bb = gimple_bb (SSA_NAME_DEF_STMT (USE_FROM_PTR (use_p
> > && flow_bb_inside_loop_p (outermost, def_bb))
> >   outermost = superloop_at_depth (loop, bb_loop_depth (def_bb) + 1)
> > 
> > bb_loop_depth is 2, the +1 make the assertion in superloop_at_depth fail 
> > since
> > the current loop 4 only has the depth 2. I think the existing code has the
> > assumption that all operands in stmts of cond_expr_stmt_list are defined in
> > some outer loop of current, but the assumption breaks in this case.
> > 
> > I guess the current scalar_loop_iters is valid? Then the fix can be:
> 
> What is not valid I think is that e2.2_31 should have a loop-closed PHI
> node which would place it in an outer loop.  You'd have to see why
> either the loop-closed PHI is not present or why the aliasing check
> doesn't use that (it's more likely this)
> 

Thanks for the confirmation Richi! There is a loop-closed PHI for it in bb 33:

  <bb 4> [local count: 35145078524]:
  # e2.2_31 = PHI <_15(11), 1(31)>
  # ivtmp_14 = PHI 
  _11 = (integer(kind=8)) e2.2_31;
  _12 = _10 + _11;
  _13 = _12 + -7;
  hx[_13] = 0;
  _15 = e2.2_31 + 1;
  ivtmp_23 = ivtmp_14 - 1;
  if (ivtmp_23 == 0)
    goto <bb 33>; [11.00%]
  else
    goto <bb 11>; [89.00%]

  <bb 33> [local count: 3865958617]:
  # _51 = PHI <_15(4)>

I'll further investigate why the scalar_loop_iters is constructed directly from
e2.2_31 instead of _51.

[Bug tree-optimization/94043] [9/10 Regression] ICE in superloop_at_depth, at cfgloop.c:78

2020-03-22 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94043

--- Comment #8 from Kewen Lin  ---
> It's most likely either SCEV or expand_simple_operations looking through
> the single-arg PHI (which we should avoid for LC PHI nodes)

Thanks Richi. I found that the loop-closed PHI form was broken after we
finished vectorizing loop 2: BB 38 was inserted, because
gimple_find_edge_insert_loc yields a new BB if the destination has PHIs, even
if they are unrelated.

;; basic block 4, loop depth 2
;;  pred:   11
;;  37
...
_15 = e2.2_31 + 1;
...
if (ivtmp_59 >= 1)
  goto ; [100.00%]
else
  goto ; [0.00%]
;;  succ:   38
;;  11

;; basic block 38, loop depth 1
;;  pred:   4
_40 = BIT_FIELD_REF ;
;;  succ:   33

;; basic block 33, loop depth 1
;;  pred:   38
# _51 = PHI <_15(38)> 
;;  succ:   34

The alternatives seem to be: 1) extend the current
gimple_find_edge_insert_loc to handle the PHI nodes: if the PHIs aren't
related, just insert there; otherwise, insert some PHIs for the uses of those
stmts, remove the related PHIs, and create new assignments after those new
stmts; or 2) call rewrite_into_loop_closed_ssa after each successful
vectorization.

[Bug tree-optimization/94043] [9/10 Regression] ICE in superloop_at_depth, at cfgloop.c:78

2020-03-23 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94043

--- Comment #10 from Kewen Lin  ---
(In reply to Richard Biener from comment #9)
> OK, so it's indeed vectorizable_live_operation not paying attention to
> loop-closed SSA form.
> 
> What it should do before building the lane extract is create a _new_
> loop-closed PHI node for the vectorized def (vec_lhs), and then
> demote the loop-closed PHI node for the scalar def (lhs) which should
> _always_ exist to a copy.  So from
> 
> 
>  loop;
> 
> # lhs' = PHI 
> 
> 
> go to
> 
>   loop;
> 
> # vec_lhs' = PHI 
> new_tree = BIT_FIELD_REF ;
> lhs' = new_tree;
> 
> I think you can assert that the block of the loop-closed PHI
> (single_exit()->dest) always has a single predecessor, otherwise
> things will be more complicated.
> 
> Can you try rework the code in this way?  If that's too much just tell
> me and I'll take care of this.

Thanks Richi, I'll give it a shot!
I'd like to confirm my understanding: with the proposed fix,
single_exit()->dest is guaranteed to be the correct BB to insert into, with no
chance of getting a new BB as with gimple_find_edge_insert_loc. Is that right?

[Bug testsuite/93935] [9/10 regression] gcc.dg/vect/bb-slp-over-widen-2.c fails starting with r262371 (r10-6856)

2020-03-24 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93935

Kewen Lin  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #5 from Kewen Lin  ---
Should be fixed on both trunk and gcc-9.

[Bug tree-optimization/94043] [9/10 Regression] ICE in superloop_at_depth, at cfgloop.c:78

2020-03-25 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94043

--- Comment #12 from Kewen Lin  ---
Created attachment 48122
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48122&action=edit
ppc64le tested patch

Thanks Richi!

A draft patch is attached to make sure I'm on the right track; it has also
been bootstrapped and regression-tested. I tried to reproduce the case where
the stmts for the lane extraction are empty (due to folding) using the test
cases associated with that old commit, but failed. I think we don't need to
deal with it? The new copy assignment that replaces the PHI would not be
caught by the LC-PHI check in expand_simple_operations.

[Bug tree-optimization/94043] [9/10 Regression] ICE in superloop_at_depth, at cfgloop.c:78

2020-03-26 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94043

--- Comment #14 from Kewen Lin  ---
(In reply to Richard Biener from comment #13)

> 
> +  /* Find all SSA NAMEs in stmts which is defined in current loop,
> create
> +PHIs for them, and replace them with phi results accordingly.  */
> +  for (gsi = gsi_start (stmts); !gsi_end_p (gsi); gsi_next (&gsi))
> +   {
> + gimple *stmt = gsi_stmt (gsi);
> + update_stmt (stmt);
> +
> ...
> 
> should not be necessary.  What's missing in your patch is that when the
> current code has computed vec_lhs it needs to create a LC PHI node for it
> _before_ computing the lane extraction and instead use vec_lhs' there.

OK, I was thinking that the mask in the LOOP_VINFO_FULLY_MASKED_P part is
probably an SSA name and could live out of the loop; per your comments, that
looks impossible. I will update the patch and send it for review after
testing. Thanks again!

[Bug tree-optimization/94043] [9/10 Regression] ICE in superloop_at_depth, at cfgloop.c:78

2020-03-26 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94043

Kewen Lin  changed:

   What|Removed |Added

  Attachment #48122|0   |1
is obsolete||

--- Comment #16 from Kewen Lin  ---
Created attachment 48125
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48125&action=edit
untested patch

[Bug tree-optimization/90332] New test case gcc.dg/vect/slp-reduc-sad-2.c in r270847 fails

2020-03-27 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90332

Kewen Lin  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

--- Comment #9 from Kewen Lin  ---
Should be fixed by latest trunk on ppc64le P9.

[Bug tree-optimization/94401] pr92420.c fails on aarch64 since r10-7415

2020-03-30 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94401

Kewen Lin  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |linkw at gcc dot gnu.org
 CC||linkw at gcc dot gnu.org

--- Comment #1 from Kewen Lin  ---
Thanks for reporting this! Do I need any special arch configuration options
for the gcc build to reproduce this on an aarch64 machine in CFarm, or is the
default fine?

[Bug tree-optimization/94401] [10 Regression] pr92420.c fails on aarch64 since r10-7415

2020-03-30 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94401

Kewen Lin  changed:

   What|Removed |Added

 CC||segher at gcc dot gnu.org,
   ||wschmidt at gcc dot gnu.org
 Status|NEW |ASSIGNED

--- Comment #4 from Kewen Lin  ---
My commit extends the existing elimination of scalar epilogue peeling for
gaps, which lets this case use int for the vector construction. But it reveals
that the existing handling fails to cover the VMAT_CONTIGUOUS_REVERSE case: it
currently assumes the overrun happens at the high-address end, which is true
for almost all cases, but in this case it is at the low-address end. So for
VMAT_CONTIGUOUS_REVERSE we have to load the high part and put it in the latter
part of the constructed vector.

The IR before/after the commit looks like:

good:
  vect__9.16_80 = MEM  [(int *)vectp_y.14_78];
  vect__9.17_81 = VEC_PERM_EXPR ;
  vect__9.18_82 = VEC_PERM_EXPR ;

bad:
  _30 = MEM[(int *)vectp_y.12_34];
  _20 = {_30, 0};
  vect__9.14_19 = VIEW_CONVERT_EXPR(_20);
  vect__9.15_61 = VEC_PERM_EXPR ;
  vect__9.16_54 = VEC_PERM_EXPR ;
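
The lane placement described above can be sketched with plain arrays. This is
only an illustration under assumptions: a 4-lane int vector with a 2-lane
half, hypothetical toy_* names, and a lane reversal standing in for the
reversing VEC_PERM_EXPR. It is not the vectorizer's actual code.

```c
#include <string.h>

enum { TOY_LANES = 4, TOY_HALF = 2 };

/* Reverse the lane order, standing in for the reversing permutation.  */
static void
toy_reverse (int v[TOY_LANES])
{
  for (int i = 0; i < TOY_LANES / 2; i++)
    {
      int t = v[i];
      v[i] = v[TOY_LANES - 1 - i];
      v[TOY_LANES - 1 - i] = t;
    }
}

/* Half load that assumes the unused lanes sit at the high-address end:
   read p[0..1] into the low lanes, zero the rest, then reverse.  For a
   reverse access this picks the wrong elements.  */
void
toy_reverse_load_low_half (const int *p, int out[TOY_LANES])
{
  memset (out, 0, TOY_LANES * sizeof out[0]);
  memcpy (out, p, TOY_HALF * sizeof out[0]);
  toy_reverse (out);
}

/* Half load for the reverse case: read the high part p[2..3] into the
   latter lanes, zero the rest, then reverse.  */
void
toy_reverse_load_high_half (const int *p, int out[TOY_LANES])
{
  memset (out, 0, TOY_LANES * sizeof out[0]);
  memcpy (out + TOY_HALF, p + TOY_HALF, TOY_HALF * sizeof out[0]);
  toy_reverse (out);
}
```

For p pointing at {10, 11, 12, 13}, the high-half variant ends with 13 and 12
in the first two lanes, which is what a reversed access expects, while the
low-half variant leaves zeros there and moves 11 and 10 into the last lanes.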

[Bug tree-optimization/94401] [10 Regression] pr92420.c fails on aarch64 since r10-7415

2020-03-30 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94401

--- Comment #5 from Kewen Lin  ---
Created attachment 48150
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48150&action=edit
untested patch

This can fix the REG failures on aarch64.

[Bug tree-optimization/94043] [9/10 Regression] ICE in superloop_at_depth, at cfgloop.c:78

2020-03-31 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94043

Kewen Lin  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #19 from Kewen Lin  ---
Should be fixed on trunk now.

[Bug tree-optimization/94043] [9 Regression] ICE in superloop_at_depth, at cfgloop.c:78

2020-04-01 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94043

--- Comment #21 from Kewen Lin  ---
(In reply to Richard Biener from comment #20)
> Re-open.  It's marked as broken in GCC 9 so a backport is in order (if the
> issue really reproduces there).

Thanks for pointing it out.  I'll backport it in two weeks if no regressions
are found on trunk.

[Bug tree-optimization/94443] [10 Regression] 510.parest_r and 526.blender_r ICE: verify_ssa failed since r10-7491-gbd0f22a8d5caea8905f38ff1fafce31c1b7d33ad

2020-04-01 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94443

Kewen Lin  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |linkw at gcc dot gnu.org

--- Comment #3 from Kewen Lin  ---
Thanks for reporting this, confirmed.

[Bug middle-end/94449] [10 Regression] FAIL: gcc.c-torture/execute/pr92904.c gcc.dg/torture/pr48731.c

2020-04-01 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94449

Kewen Lin  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |linkw at gcc dot gnu.org

--- Comment #7 from Kewen Lin  ---
Thanks for reporting this; it looks like a duplicate of pr94443.

[Bug tree-optimization/94451] [10 Regression] April 1st 2020 GCC does not compile spec 2017 gcc_r benchmark with -O3

2020-04-01 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94451

Kewen Lin  changed:

   What|Removed |Added

 CC||linkw at gcc dot gnu.org
 Status|UNCONFIRMED |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |linkw at gcc dot gnu.org
   Last reconfirmed||2020-04-02
 Ever confirmed|0   |1

--- Comment #3 from Kewen Lin  ---
Thanks for reporting this, Mike.  It looks like a duplicate of pr94443.

[Bug tree-optimization/94443] [10 Regression] 510.parest_r and 526.blender_r ICE: verify_ssa failed since r10-7491-gbd0f22a8d5caea8905f38ff1fafce31c1b7d33ad

2020-04-01 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94443

--- Comment #4 from Kewen Lin  ---
This case has a conversion insn generated after the bit_field_ref. My patch
introduced a stupid mistake: it used gsi_insert_before instead of
gsi_insert_seq_before, so the conversion insn was missed.  The patch below
makes it work. It also polishes the copy-related code a bit, although that is
not really necessary to make this case pass.

diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index c9b6534..4c2c9f7 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -8050,7 +8050,7 @@ vectorizable_live_operation (stmt_vec_info stmt_info,
   if (stmts)
 {
   gimple_stmt_iterator exit_gsi = gsi_after_labels (exit_bb);
-  gsi_insert_before (&exit_gsi, stmts, GSI_CONTINUE_LINKING);
+  gsi_insert_seq_before (&exit_gsi, stmts, GSI_SAME_STMT);

   /* Remove existing phi from lhs and create one copy from new_tree.  */
   tree lhs_phi = NULL_TREE;
@@ -8060,10 +8060,10 @@ vectorizable_live_operation (stmt_vec_info stmt_info,
  gimple *phi = gsi_stmt (gsi);
  if ((gimple_phi_arg_def (phi, 0) == lhs))
{
- remove_phi_node (&gsi, false);
  lhs_phi = gimple_phi_result (phi);
  gimple *copy = gimple_build_assign (lhs_phi, new_tree);
- gsi_insert_after (&exit_gsi, copy, GSI_CONTINUE_LINKING);
+ gsi_insert_after (&exit_gsi, copy, GSI_NEW_STMT);
+ remove_phi_node (&gsi, false);
  break;
}
}
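
As a rough illustration of why inserting a single statement drops the
conversion, here is a toy model. It assumes a sequence is handed around as a
pointer to its first statement; the struct layout, the toy_* names and the
statement strings are all hypothetical and do not reflect GCC's real gimple
data structures.

```c
#include <stddef.h>

/* Toy statement node; a "sequence" is a pointer to its first node.  */
struct toy_stmt
{
  const char *text;
  struct toy_stmt *next;
};

/* Toy basic block holding up to eight statements.  */
struct toy_block
{
  struct toy_stmt *stmts[8];
  size_t n;
};

/* Insert exactly one statement, ignoring anything chained after it.  */
static void
toy_insert_one (struct toy_block *bb, struct toy_stmt *stmt)
{
  if (bb->n < 8)
    bb->stmts[bb->n++] = stmt;
}

/* Insert every statement of the sequence, in order.  */
static void
toy_insert_seq (struct toy_block *bb, struct toy_stmt *seq)
{
  for (; seq != NULL; seq = seq->next)
    toy_insert_one (bb, seq);
}

/* Build a two-statement sequence (lane extract followed by a conversion)
   and report how many statements reach the block.  */
size_t
toy_count_inserted (int whole_sequence)
{
  struct toy_stmt convert = { "convert extracted lane", NULL };
  struct toy_stmt extract = { "extract lane", &convert };
  struct toy_block bb = { { 0 }, 0 };

  if (whole_sequence)
    toy_insert_seq (&bb, &extract);
  else
    toy_insert_one (&bb, &extract);
  return bb.n;
}
```

In this model toy_count_inserted (0) places only the extract, so anything
that uses the converted value refers to a definition that never reached the
block, while toy_count_inserted (1) places both statements.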

[Bug middle-end/94449] [10 Regression] FAIL: gcc.c-torture/execute/pr92904.c gcc.dg/torture/pr48731.c

2020-04-01 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94449

Kewen Lin  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED

--- Comment #8 from Kewen Lin  ---
May I ask for the configuration option? 

I used x86_64 machine in CFarm with cpuinfo

processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model   : 45
model name  : Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
...

I was unable to reproduce it with the default configuration, but I was able
to see the ICE for these failures with -march=znver2 specified.
I suspect there is some base arch setting in your configuration? If so, I'm
wondering whether there is a more suitable configuration option for an E5
machine; it would help to catch regression failures like this. Thanks in
advance!

[Bug middle-end/94449] [10 Regression] FAIL: gcc.c-torture/execute/pr92904.c gcc.dg/torture/pr48731.c

2020-04-01 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94449

--- Comment #10 from Kewen Lin  ---
(In reply to H.J. Lu from comment #9)
> (In reply to Kewen Lin from comment #8)
> > May I ask for the configuration option? 
> > 
> > I used x86_64 machine in CFarm with cpuinfo
> > 
> 
> I used
> 
> --prefix=/usr/10.0.1 --enable-clocale=gnu --with-system-zlib --enable-shared
> --enable-cet --with-demangler-in-ld --with-fpmath=sse

Thanks, but it didn't work on my side. I guess it's due to a different native
arch.

gcc -march=native -Q --help=target|grep march
  -march=   corei7-avx

$ ./t.sh -march=znver2
internal compiler error: verify_ssa failed

$ ./t.sh -march=icelake-server
internal compiler error: verify_ssa failed

$ ./t.sh -march=corei7-avx  ==> works fine.

I guess I can't just pass an arch option like --with-arch=znver2 to
configure, since the native arch probably lacks support for some znver2
instructions? I'm not familiar with x86 arches; is that possible?

[Bug tree-optimization/94443] [10 Regression] 510.parest_r and 526.blender_r ICE: verify_ssa failed since r10-7491-gbd0f22a8d5caea8905f38ff1fafce31c1b7d33ad

2020-04-01 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94443

Kewen Lin  changed:

   What|Removed |Added

 CC||hjl.tools at gmail dot com

--- Comment #5 from Kewen Lin  ---
*** Bug 94449 has been marked as a duplicate of this bug. ***

[Bug middle-end/94449] [10 Regression] FAIL: gcc.c-torture/execute/pr92904.c gcc.dg/torture/pr48731.c

2020-04-01 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94449

Kewen Lin  changed:

   What|Removed |Added

 Resolution|--- |DUPLICATE
 Status|ASSIGNED|RESOLVED

--- Comment #11 from Kewen Lin  ---
Verified that the patch in pr94443 fixes these failures as well.

*** This bug has been marked as a duplicate of bug 94443 ***

[Bug middle-end/94449] [10 Regression] FAIL: gcc.c-torture/execute/pr92904.c gcc.dg/torture/pr48731.c

2020-04-01 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94449

--- Comment #12 from Kewen Lin  ---
Sorry, correction: corei7-avx is from the system gcc. With my built gcc, it's
sandybridge. But it makes no difference to the pass/fail result.

[Bug tree-optimization/94443] [10 Regression] 510.parest_r and 526.blender_r ICE: verify_ssa failed since r10-7491-gbd0f22a8d5caea8905f38ff1fafce31c1b7d33ad

2020-04-02 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94443

--- Comment #7 from Kewen Lin  ---
Yes, thanks Richi! I had the same update locally but didn't post it here. The
latest full patch is:

diff --git a/gcc/testsuite/gcc.dg/vect/pr94443.c b/gcc/testsuite/gcc.dg/vect/pr94443.c
new file mode 100644
index 000..f8cbaf1
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr94443.c
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-march=znver2" { target { x86_64-*-* i?86-*-* } } } */
+
+/* Check it to be compiled successfully without any ICE.  */
+
+int a;
+unsigned *b;
+
+void foo()
+{
+  for (unsigned i; i <= a; ++i, ++b)
+;
+}
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index c9b6534..b621f89 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -8050,7 +8050,7 @@ vectorizable_live_operation (stmt_vec_info stmt_info,
   if (stmts)
 {
   gimple_stmt_iterator exit_gsi = gsi_after_labels (exit_bb);
-  gsi_insert_before (&exit_gsi, stmts, GSI_CONTINUE_LINKING);
+  gsi_insert_seq_before (&exit_gsi, stmts, GSI_SAME_STMT);

   /* Remove existing phi from lhs and create one copy from new_tree.  */
   tree lhs_phi = NULL_TREE;
@@ -8060,10 +8060,10 @@ vectorizable_live_operation (stmt_vec_info stmt_info,
  gimple *phi = gsi_stmt (gsi);
  if ((gimple_phi_arg_def (phi, 0) == lhs))
{
- remove_phi_node (&gsi, false);
  lhs_phi = gimple_phi_result (phi);
  gimple *copy = gimple_build_assign (lhs_phi, new_tree);
- gsi_insert_after (&exit_gsi, copy, GSI_CONTINUE_LINKING);
+ gsi_insert_before (&exit_gsi, copy, GSI_SAME_STMT);
+ remove_phi_node (&gsi, false);
  break;
}
}

Still waiting for regression testing result.

[Bug tree-optimization/94443] [10 Regression] 510.parest_r and 526.blender_r ICE: verify_ssa failed since r10-7491-gbd0f22a8d5caea8905f38ff1fafce31c1b7d33ad

2020-04-02 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94443

--- Comment #8 from Kewen Lin  ---
> 
> > + remove_phi_node (&gsi, false);
> 
> I prefer to have the PHI removed before you re-use its LHS.
> 

Oops, I missed this; I will move it back when posting to the mailing list.

[Bug tree-optimization/94451] [10 Regression] April 1st 2020 GCC does not compile spec 2017 gcc_r benchmark with -O3

2020-04-02 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94451

Kewen Lin  changed:

   What|Removed |Added

 Resolution|DUPLICATE   |FIXED

--- Comment #6 from Kewen Lin  ---
Reproduced and verified with the proposed fix in pr94443, sorry for the
trouble.

[Bug tree-optimization/94443] [10 Regression] 510.parest_r and 526.blender_r ICE: verify_ssa failed since r10-7491-gbd0f22a8d5caea8905f38ff1fafce31c1b7d33ad

2020-04-02 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94443

Kewen Lin  changed:

   What|Removed |Added

 CC||clyon at gcc dot gnu.org

--- Comment #10 from Kewen Lin  ---
*** Bug 94456 has been marked as a duplicate of this bug. ***

[Bug tree-optimization/94456] ICE in aarch64/sve/pr87815.c since r10-7491

2020-04-02 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94456

Kewen Lin  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 CC||linkw at gcc dot gnu.org
 Resolution|--- |DUPLICATE

--- Comment #1 from Kewen Lin  ---
Thanks for reporting; this should be a duplicate, judging by the symptom.

*** This bug has been marked as a duplicate of bug 94443 ***

[Bug tree-optimization/94443] [10 Regression] 510.parest_r and 526.blender_r ICE: verify_ssa failed since r10-7491-gbd0f22a8d5caea8905f38ff1fafce31c1b7d33ad

2020-04-02 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94443

--- Comment #13 from Kewen Lin  ---
(In reply to Khem Raj from comment #11)
> this patch seems to be causing gcc ICE on ARM when compiling lz4 sources in
> kernel, lz4, vlc almost identical ICE is seen
> 
> attached is the test case please compile it with -O3
> 
> during GIMPLE pass: vect
> lz4.c: In function 'LZ4_compress_fast_extState':
> lz4.c:1180:5: internal compiler error: Segmentation fault
>  1180 | int LZ4_compress_fast_extState(void* state, const char* source,
> char* dest, int inputSize, int maxOutputSize, int acceleration)
>   | ^~
> Please submit a full bug report,

Same symptom:

for SSA_NAME: _1689 in statement:
op_1747 = _1689;
during GIMPLE pass: vect
lz4.c:1180:5: internal compiler error: verify_ssa failed
0x100d0ab verify_ssa(bool, bool)

Verified that it is fixed by the patch posted to the gcc-patches ML:
https://gcc.gnu.org/pipermail/gcc-patches/2020-April/543137.html

[Bug tree-optimization/94401] [10 Regression] pr92420.c fails on aarch64 since r10-7415

2020-04-02 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94401

Kewen Lin  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #7 from Kewen Lin  ---
Should be fixed now.

[Bug tree-optimization/94451] [10 Regression] April 1st 2020 GCC does not compile spec 2017 gcc_r benchmark with -O3

2020-04-03 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94451

Kewen Lin  changed:

   What|Removed |Added

 Resolution|FIXED   |DUPLICATE

--- Comment #7 from Kewen Lin  ---
Correcting the status; it looks like it was updated by mistake somehow.

*** This bug has been marked as a duplicate of bug 94443 ***

[Bug tree-optimization/94443] [10 Regression] 510.parest_r and 526.blender_r ICE: verify_ssa failed since r10-7491-gbd0f22a8d5caea8905f38ff1fafce31c1b7d33ad

2020-04-03 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94443

--- Comment #15 from Kewen Lin  ---
*** Bug 94451 has been marked as a duplicate of this bug. ***

[Bug middle-end/26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)

2020-04-03 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
Bug 26163 depends on bug 94443, which changed state.

Bug 94443 Summary: [10 Regression] 510.parest_r and 526.blender_r ICE: 
verify_ssa failed since r10-7491-gbd0f22a8d5caea8905f38ff1fafce31c1b7d33ad
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94443

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

[Bug tree-optimization/94443] [10 Regression] 510.parest_r and 526.blender_r ICE: verify_ssa failed since r10-7491-gbd0f22a8d5caea8905f38ff1fafce31c1b7d33ad

2020-04-03 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94443

Kewen Lin  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #17 from Kewen Lin  ---
(In reply to Martin Liška from comment #16)
> Can we close it as fixed?

I guess so. Although this commit should be part of the backport for PR94393, I
guess I can just leave that bug open and close this one? Please feel free to
correct me.

[Bug testsuite/94079] gfortran.dg/vect/pr83232.f90 fails on power 7

2020-04-08 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94079

Kewen Lin  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|UNCONFIRMED |RESOLVED

commit r10-7646-ge7c4084d11b957d925ba586f86db2f346fb3bfe0
Author: Kewen Lin 
Date:   Wed Apr 8 21:52:00 2020 -0500

[testsuite] Fix PR94079 by respecting vect_hw_misalign [PR94079]

This is another vect case which requires special handling with
vect_hw_misalign.  The alignment of the second part requires
misaligned vector access supports.  This patch is to adjust
the related guard condition and comments.

Verified it on ppc64-redhat-linux (Power7 BE).

2020-04-09  Kewen Lin  

gcc/testsuite/ChangeLog

PR testsuite/94023
* gfortran.dg/vect/pr83232.f90: Expect 2 rather than 3 times SLP on
non-vect_hw_misalign targets.


Wrong PR in the commit, have fixed it. Manually pasted here.

[Bug tree-optimization/94043] [9 Regression] ICE in superloop_at_depth, at cfgloop.c:78

2020-04-17 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94043

Kewen Lin  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #24 from Kewen Lin  ---
Backported via r9-8506 and its related r9-8507.

[Bug tree-optimization/96451] [11 Regression] gcc.dg/pr68766.c ICE since r11-2453

2020-08-04 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96451

Kewen Lin  changed:

   What|Removed |Added

 Ever confirmed|0   |1
   Assignee|unassigned at gcc dot gnu.org  |linkw at gcc dot gnu.org
   Last reconfirmed||2020-08-04
 Status|UNCONFIRMED |ASSIGNED

--- Comment #1 from Kewen Lin  ---
Thanks for reporting! I will have a look at it.

[Bug tree-optimization/96451] [11 Regression] gcc.dg/pr68766.c ICE since r11-2453

2020-08-04 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96451

--- Comment #3 from Kewen Lin  ---
(In reply to Richard Biener from comment #2)
> possibly a latent issue since the patch is supposed to be cost-only

Yes, this case will hit the ICE too with -fno-vect-cost-model, even without
the culprit commit.

Without that commit, the costing says it's not profitable to vectorize the
epilogue further, while with it we are able to vectorize the epilogue. The
forced option -fdbg-cnt=vect_loop:1 only allows us to vectorize one loop, so
the epilogue is skipped, even though it contains the scalar mask_store
statement from if-cvt and has already been chosen for vectorization.

I'm not sure what the dbg counter should count for loop vectorization. If it
counts original scalar loops, then both the main vectorized loop and its
epilogue loop should be vectorized together. The fix could be:

diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
index 26a1846..150bdcf 100644
--- a/gcc/tree-vectorizer.c
+++ b/gcc/tree-vectorizer.c
@@ -1066,7 +1066,7 @@ try_vectorize_loop_1 (hash_table
*&simduid_to_vf_htab,
   return ret;
 }

-  if (!dbg_cnt (vect_loop))
+  if (!LOOP_VINFO_EPILOGUE_P (loop_vinfo) && !dbg_cnt (vect_loop))
 {
   /* Free existing information if loop is analyzed with some
 assumptions.  */

If the dbg counter is meant to count all kinds of loops (main or epilogue),
the fix seems to be: add an interface to the dbg counter framework to query
the remaining allowed count, compare that number with the number of epilogue
loops in vect_do_peeling, and remove the excess epilogue loops there.
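
The two readings of the counter can be compared with a small toy model. It is
a sketch under assumptions: the counter here simply allows its first LIMIT
calls, which simplifies what the real dbg_cnt framework does, and all toy_*
names are hypothetical.

```c
#include <stdbool.h>

/* Toy debug counter: the first LIMIT calls succeed, later ones fail.  */
struct toy_dbg_counter
{
  unsigned used;
  unsigned limit;
};

/* Consume one count and report whether it was still within the limit.  */
static bool
toy_dbg_cnt (struct toy_dbg_counter *c)
{
  return ++c->used <= c->limit;
}

/* Decide whether a loop may be vectorized.  When EPILOGUES_FOLLOW_MAIN is
   true, an epilogue loop does not consult the counter at all, mirroring
   the "counter is for the original scalar loop" reading.  */
static bool
toy_may_vectorize (struct toy_dbg_counter *c, bool is_epilogue,
                   bool epilogues_follow_main)
{
  if (epilogues_follow_main && is_epilogue)
    return true;
  return toy_dbg_cnt (c);
}

/* Count how many of {main loop, its epilogue} get vectorized.  The
   epilogue is only considered once the main loop has been accepted.  */
unsigned
toy_count_vectorized (unsigned limit, bool epilogues_follow_main)
{
  struct toy_dbg_counter c = { 0, limit };
  unsigned n = 0;

  if (toy_may_vectorize (&c, false, epilogues_follow_main))
    {
      n++;
      if (toy_may_vectorize (&c, true, epilogues_follow_main))
        n++;
    }
  return n;
}
```

With a limit of 1, counting every loop vectorizes only the main loop and
leaves the epilogue behind, which is the situation that ICEs here. Letting
the epilogue follow its main loop vectorizes both.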

[Bug tree-optimization/96451] [11 Regression] gcc.dg/pr68766.c ICE since r11-2453

2020-08-05 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96451

--- Comment #5 from Kewen Lin  ---
Created attachment 49000
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49000&action=edit
untested patch

I just noticed that dbgcnt supports several intervals. If we want to count
epilogue loops, we probably need to call dbgcnt in vect_do_peeling. An untested
patch is attached to show the idea.

[Bug tree-optimization/96451] [11 Regression] gcc.dg/pr68766.c ICE since r11-2453

2020-08-05 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96451

--- Comment #6 from Kewen Lin  ---
(In reply to rguent...@suse.de from comment #4)
> On Wed, 5 Aug 2020, linkw at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96451
> > 
> > --- Comment #3 from Kewen Lin  ---
> > (In reply to Richard Biener from comment #2)
> > > possibly a latent issue since the patch is supposed to be cost-only
> > 
> > Yes, this case will hit ICE too with -fno-vect-cost-model even without the
> > culprit commit.
> > 
> > Without that commit, the costing says it's not profitable to vectorize the
> > epilogue further, while with that we are able to vectorize the epilogue. 
> > With
> > the forced option -fdbg-cnt=vect_loop:1, it only allows us to vectorize one
> > loop, so it skips the epilogue which has the scalar mask_store statement 
> > from
> > if-cvt and is determined to be vectorized.
> > 
> > I'm not sure what the dbg counter should mean for loop vect. If it's for the
> > original scalar loop, then the main vectorized loop and the epilogue loop 
> > to be
> > vectorized should be vectorized. The fix could be: 
> > 
> > diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
> > index 26a1846..150bdcf 100644
> > --- a/gcc/tree-vectorizer.c
> > +++ b/gcc/tree-vectorizer.c
> > @@ -1066,7 +1066,7 @@ try_vectorize_loop_1 (hash_table
> > *&simduid_to_vf_htab,
> >return ret;
> >  }
> > 
> > -  if (!dbg_cnt (vect_loop))
> > +  if (!LOOP_VINFO_EPILOGUE_P (loop_vinfo) && !dbg_cnt (vect_loop))
> >  {
> >/* Free existing information if loop is analyzed with some
> >  assumptions.  */
> > 
> > If the dbg counter is for all kinds of loop (main or epilogue), the fix 
> > seems
> > to be: add one interface for dbg counter framework to query the remaining
> > allowed count, compare the remaining number and the number of epilogue 
> > loops in
> > vect_do_peeling, then remove the exceeding epilogue loops there.
> 
> I think the above patch is OK and is what was originally intended.
> 
> Care to push it to master?

Thanks for the confirmation! I'll proceed with one formal patch.

[Bug tree-optimization/96451] [11 Regression] gcc.dg/pr68766.c ICE since r11-2453

2020-08-05 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96451

Kewen Lin  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #8 from Kewen Lin  ---
The proposed fix has been committed in
r11-2585-gea858d09571f3f6dcce92d8bfaf077f9d44c6ad6

Sorry, I forgot to put the PR number in the ChangeLog.

[Bug target/94077] gcc.dg/gomp/pr82374.c fails on power 7

2020-08-11 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94077

Kewen Lin  changed:

   What|Removed |Added

   Last reconfirmed||2020-08-12
 Ever confirmed|0   |1
 CC||linkw at gcc dot gnu.org
 Status|UNCONFIRMED |ASSIGNED

--- Comment #1 from Kewen Lin  ---
This issue only exists on gcc8 and gcc9; it's gone with gcc10 and trunk.

The main difference is listed below:

With gcc8/gcc9, the cost model says vectorization is not profitable because of
the high cost of realigning vector loads/stores in the vectorized loop body:

gcc/testsuite/gcc.dg/gomp/pr82374.c:27:3: note: Cost model analysis:
  Vector inside of loop cost: 32
  Vector prologue cost: 6
  Vector epilogue cost: 0
  Scalar iteration cost: 4
  Scalar outside cost: 0
  Vector outside cost: 6
  prologue iterations: 0
  epilogue iterations: 0
gcc/testsuite/gcc.dg/gomp/pr82374.c:27:3: note: cost model: the vector
iteration cost = 32 divided by the scalar iteration cost = 4 is greater or
equal to the vectorization factor = 4.
gcc/testsuite/gcc.dg/gomp/pr82374.c:27:3: note: not vectorized: vectorization
not profitable.
gcc/testsuite/gcc.dg/gomp/pr82374.c:27:3: note: not vectorized: vector version
will never be profitable.


While with gcc10 and trunk, the information looks like:

gcc/testsuite/gcc.dg/gomp/pr82374.c:27:3: note:  Cost model analysis:
  Vector inside of loop cost: 6
  Vector prologue cost: 0
  Vector epilogue cost: 0
  Scalar iteration cost: 6
  Scalar outside cost: 0
  Vector outside cost: 0
  prologue iterations: 0
  epilogue iterations: 0
  Calculated minimum iters for profitability: 0
gcc/testsuite/gcc.dg/gomp/pr82374.c:27:3: note:Runtime profitability
threshold = 4
gcc/testsuite/gcc.dg/gomp/pr82374.c:27:3: note:Static estimate
profitability threshold = 4
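
The two verdicts can be reproduced with a simplified form of the check quoted
in the first dump. This is a sketch only: never_profitable is a hypothetical
helper, the comparison is a paraphrase of the dump message rather than the
vectorizer source, and the vectorization factor of 4 for the gcc10 case is an
assumption, since that dump does not print it.

```c
#include <assert.h>
#include <stdbool.h>

/* The vector body can never win if one vector iteration costs at least
   as much as VF scalar iterations.  Hypothetical helper.  */
static bool never_profitable (int vec_inside_cost, int scalar_iter_cost,
                              int vf)
{
  assert (scalar_iter_cost > 0 && vf > 0);
  return vec_inside_cost >= scalar_iter_cost * vf;
}
```

With the gcc8/gcc9 numbers, 32 >= 4 * 4 holds and the loop is rejected. With
the gcc10 numbers, 6 >= 6 * 4 does not hold, so vectorization goes ahead.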

By tracing back, I noticed the difference comes from:

gcc8/gcc9
  can't force alignment of ref: a[i_12]

gcc10/trunk:
  force alignment of a[i_12]

I guess it's not a good idea to backport a patch just to get the alignment
forced (probably risky?). Instead, I think we can either add the option
-mefficient-unaligned-vsx alongside -mvsx to ensure we can use unaligned vector
load/store, or change the target requirement to powerpc_vsx_ok &&
vect_hw_misalign. Both meet the original testing purpose.

Hi @Jakub, what do you think of this?

[Bug target/94077] gcc.dg/gomp/pr82374.c fails on power 7

2020-08-12 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94077

--- Comment #2 from Kewen Lin  ---
To be more specific, what makes the alignment forcing possible is the default
setting of -fcommon: -fno-common has been the default since GCC10, which makes
decl_binds_to_current_def_p return true.

I can observe this case fail with an explicit -fcommon.
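
For readers less familiar with -fcommon, the kind of declaration involved is a
tentative definition. The sketch below is an illustration only, not the
pr82374.c test case; the array size N and the helper fill are made up.

```c
#include <assert.h>
#include <stddef.h>

#define N 64

/* Tentative definition: no initializer and no storage class.  With
   -fcommon it may be emitted as a common symbol that the linker can merge
   with a definition from another translation unit.  With -fno-common it
   is an ordinary definition owned by this translation unit.  */
int a[N];

/* A simple store loop over the global array, the kind of access whose
   alignment the vectorizer would like to force.  */
static void fill (int value, size_t n)
{
  assert (n <= N);
  for (size_t i = 0; i < n; i++)
    a[i] = value;
}
```

Compiling such a file with -fcommon and with -fno-common and comparing the
vectorizer dumps should show the "can't force alignment" versus "force
alignment" difference described above, though the exact dump text depends on
the GCC version.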

[Bug target/94077] gcc.dg/gomp/pr82374.c fails on power 7

2020-08-12 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94077

--- Comment #3 from Kewen Lin  ---
> 
> I can observe this case fail if with explicit -fcommon.

I mean even with gcc10 or trunk.

[Bug target/94077] gcc.dg/gomp/pr82374.c fails on power 7

2020-08-12 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94077

--- Comment #6 from Kewen Lin  ---
(In reply to Jakub Jelinek from comment #5)
> I mean -fno-common, sorry.

Good idea, that works!  I'll send a patch by adding -fno-common into
dg-options.  Thanks for your suggestion!

[Bug testsuite/94077] gcc.dg/gomp/pr82374.c fails on power 7

2020-08-12 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94077

Kewen Lin  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #9 from Kewen Lin  ---
Should be fixed now.

[Bug tree-optimization/96789] New: x264: sub4x4_dct() improves when vectorization is disabled

2020-08-25 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

Bug ID: 96789
   Summary: x264: sub4x4_dct() improves when vectorization is
disabled
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: linkw at gcc dot gnu.org
  Target Milestone: ---

One of my workmates found that disabling vectorization for the SPEC2017
525.x264_r function sub4x4_dct in source file x264_src/common/dct.c, with the
explicit function attribute __attribute__((optimize("no-tree-vectorize"))),
gives a 4% speedup.

The options used are: -O3 -mcpu=power9 -fcommon -fno-strict-aliasing
-fgnu89-inline

I confirmed this finding, and it can be narrowed down further to SLP
vectorization with attribute __attribute__((optimize("no-tree-slp-vectorize"))).
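
For reference, this is how such an attribute is attached to a single function.
The function below is a made-up stand-in, not the x264 source; toy_sub4 and its
parameters are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* GCC-specific: compile only this function with SLP vectorization
   disabled, leaving the rest of the file at the command-line options.  */
static void __attribute__ ((optimize ("no-tree-slp-vectorize")))
toy_sub4 (int16_t diff[4], const uint8_t *pix1, const uint8_t *pix2)
{
  assert (diff && pix1 && pix2);
  for (int i = 0; i < 4; i++)
    diff[i] = (int16_t) (pix1[i] - pix2[i]);
}
```

Using the attribute rather than a file-wide option keeps the experiment local
to one function, which is what makes the 4% difference attributable to
sub4x4_dct alone.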

I also checked this particular file with the r11-0 commit: the performance stays
unchanged with or without the vectorization attribute. So I think it's a trunk
regression, which probably exposes an SLP flaw or a cost modeling issue.

[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

2020-08-26 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #2 from Kewen Lin  ---
Created attachment 49124
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49124&action=edit
sub4x4_dct SLP dumping

[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

2020-08-26 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #3 from Kewen Lin  ---
Bisection shows it started to fail from r11-205.

[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

2020-08-30 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

Kewen Lin  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |linkw at gcc dot gnu.org

--- Comment #6 from Kewen Lin  ---
(In reply to Richard Biener from comment #4)
> This delays some checks to eventually support part of the BB vectorization
> which is what succeeds here.  I suspect that w/o vectorization we manage
> to elide the tmp[] array but with the part vectorization that occurs we
> fail to do that.
> 
> On the cost side there would be a lot needed to make the vectorization
> not profitable:
> 
>   Vector inside of basic block cost: 8
>   Vector prologue cost: 36
>   Vector epilogue cost: 0
>   Scalar cost of basic block: 64
> 
> the thing to double-check is
> 
> 0x123b1ff0  1 times vec_construct costs 17 in prologue
> 
> that is the cost of the V16QI construct
> 
>  _813 = {_437, _448, _459, _470, _490, _501, _512, _523, _543, _554, _565,
> _576, _125, _143, _161, _179}; 
> 

Thanks Richard!  I did some cost adjustment experiments last year and the cost
for v16qi did look off, but at that time tweaking this cost didn't change SPEC
performance. I guessed we just happened not to have a case like this to trip
over. I'll have a look and re-evaluate it for this.

[Bug target/96933] New: inefficient code for char/short vec CTOR

2020-09-04 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933

Bug ID: 96933
   Summary: inefficient code for char/short vec CTOR
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: linkw at gcc dot gnu.org
  Target Milestone: ---

While investigating the vectorization cost for vec_construct, I happened to
find that the code generated for vector construction is inefficient when
DIRECT_MOVE is supported.

The test case looks like:

vector unsigned char test_char(unsigned char f1, unsigned char f2,
   unsigned char f3, unsigned char f4,
   unsigned char f5, unsigned char f6,
   unsigned char f7, unsigned char f8,
   unsigned char f9, unsigned char f10,
   unsigned char f11, unsigned char f12,
   unsigned char f13, unsigned char f14,
   unsigned char f15, unsigned char f16) {

  vector unsigned char v = {f1, f2,  f3,  f4,  f5,  f6,  f7,  f8,
                            f9, f10, f11, f12, f13, f14, f15, f16};
  return v;
}

The generated code currently with -mcpu=power9:

 :
   0:   e8 ff a1 fb std r29,-24(r1)
   4:   f0 ff c1 fb std r30,-16(r1)
   8:   f8 ff e1 fb std r31,-8(r1)
   c:   60 00 a1 8b lbz r29,96(r1)
  10:   68 00 c1 8b lbz r30,104(r1)
  14:   70 00 e1 8b lbz r31,112(r1)
  18:   d1 ff 81 98 stb r4,-47(r1)
  1c:   d2 ff a1 98 stb r5,-46(r1)
  20:   78 00 81 89 lbz r12,120(r1)
  24:   80 00 01 88 lbz r0,128(r1)
  28:   88 00 61 89 lbz r11,136(r1)
  2c:   90 00 81 88 lbz r4,144(r1)
  30:   98 00 a1 88 lbz r5,152(r1)
  34:   d0 ff 61 98 stb r3,-48(r1)
  38:   d3 ff c1 98 stb r6,-45(r1)
  3c:   d4 ff e1 98 stb r7,-44(r1)
  40:   d8 ff a1 9b stb r29,-40(r1)
  44:   d5 ff 01 99 stb r8,-43(r1)
  48:   d6 ff 21 99 stb r9,-42(r1)
  4c:   d7 ff 41 99 stb r10,-41(r1)
  50:   d9 ff c1 9b stb r30,-39(r1)
  54:   da ff e1 9b stb r31,-38(r1)
  58:   db ff 81 99 stb r12,-37(r1)
  5c:   dc ff 01 98 stb r0,-36(r1)
  60:   dd ff 61 99 stb r11,-35(r1)
  64:   de ff 81 98 stb r4,-34(r1)
  68:   df ff a1 98 stb r5,-33(r1)
  6c:   e8 ff a1 eb ld  r29,-24(r1)
  70:   f0 ff c1 eb ld  r30,-16(r1)
  74:   f8 ff e1 eb ld  r31,-8(r1)
  78:   d9 ff 41 f4 lxv vs34,-48(r1)
  7c:   20 00 80 4e blr

But it can be more efficient with direct move and vector merge, such as:

   0:   67 01 43 7c mtvsrd  vs34,r3
   4:   68 00 61 80 lwz r3,104(r1)
   8:   60 00 61 81 lwz r11,96(r1)
   c:   67 01 64 7c mtvsrd  vs35,r4
  10:   70 00 81 80 lwz r4,112(r1)
  14:   67 01 03 7d mtvsrd  vs40,r3
  18:   78 00 61 80 lwz r3,120(r1)
  1c:   67 01 85 7c mtvsrd  vs36,r5
  20:   67 01 a6 7c mtvsrd  vs37,r6
  24:   67 01 07 7c mtvsrd  vs32,r7
  28:   67 01 28 7c mtvsrd  vs33,r8
  2c:   67 01 24 7d mtvsrd  vs41,r4
  30:   80 00 81 80 lwz r4,128(r1)
  34:   0c 10 43 10 vmrghb  v2,v3,v2
  38:   67 01 63 7c mtvsrd  vs35,r3
  3c:   88 00 61 80 lwz r3,136(r1)
  40:   67 01 eb 7c mtvsrd  vs39,r11
  44:   0c 20 85 10 vmrghb  v4,v5,v4
  48:   67 01 a4 7c mtvsrd  vs37,r4
  4c:   90 00 81 80 lwz r4,144(r1)
  50:   0c 00 01 10 vmrghb  v0,v1,v0
  54:   67 01 23 7c mtvsrd  vs33,r3
  58:   98 00 61 80 lwz r3,152(r1)
  5c:   67 01 c9 7c mtvsrd  vs38,r9
  60:   0c 38 e8 10 vmrghb  v7,v8,v7
  64:   67 01 04 7d mtvsrd  vs40,r4
  68:   0c 48 63 10 vmrghb  v3,v3,v9
  6c:   67 01 23 7d mtvsrd  vs41,r3
  70:   0c 28 a1 10 vmrghb  v5,v1,v5
  74:   67 01 2a 7c mtvsrd  vs33,r10
  78:   0c 40 09 11 vmrghb  v8,v9,v8
  7c:   0c 30 21 10 vmrghb  v1,v1,v6
  80:   4c 11 44 10 vmrglh  v2,v4,v2
  84:   4c 39 63 10 vmrglh  v3,v3,v7
  88:   4c 29 88 10 vmrglh  v4,v8,v5
  8c:   4c 01 a1 10 vmrglh  v5,v1,v0
  90:   8c 19 64 10 vmrglw  v3,v4,v3
  94:   8c 11 45 10 vmrglw  v2,v5,v2
  98:   57 13 43 f0 xxmrgld vs34,vs35,vs34

[Bug target/96933] rs6000: inefficient code for char/short vec CTOR

2020-09-04 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933

Kewen Lin  changed:

   What|Removed |Added

 Status|UNCONFIRMED |ASSIGNED
 CC||bergner at gcc dot gnu.org,
   ||linkw at gcc dot gnu.org,
   ||segher at gcc dot gnu.org,
   ||wschmidt at gcc dot gnu.org
Summary|inefficient code for|rs6000: inefficient code
   |char/short vec CTOR |for char/short vec CTOR
   Last reconfirmed||2020-09-04
 Target||powerpc
 Ever confirmed|0   |1
   Assignee|unassigned at gcc dot gnu.org  |linkw at gcc dot gnu.org

[Bug target/96933] rs6000: inefficient code for char/short vec CTOR

2020-09-04 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933

--- Comment #2 from Kewen Lin  ---
(In reply to Segher Boessenkool from comment #1)
> Is that actually faster though?  The original has shorter dependency
> chains.  Or is this to avoid some LHS/SHL?

Yes, I tested it with a constructed case: the original version takes 18.20s
while the optimized version takes 8.40s. And yes, I guess it's due to LHS/SHL,
similar to the vec_insert issue xionghu is working on.

[Bug target/96933] rs6000: inefficient code for char/short vec CTOR

2020-09-06 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933

--- Comment #5 from Kewen Lin  ---
(In reply to Segher Boessenkool from comment #4)
> Yes, timing suggests there is some SHL/LHS flush.
> 
> On p9 and later we can use mtvsrdd instead of mtvsrd (moving two
> bytes into place at one), which reduces the number of moves from
> 16 to 8, and the number of merges from 15 to 7 (and reduces path
> length by 1).  This sounds like a no-brainer win with that :-)

Good idea, it looks better on P9. One thing to double-check: currently there
are no instructions like vmrgob and vmrgoh, so do the merges you mentioned,
from vector bytes to vector short and from vector short to vector word, need an
artificial control vector?

[Bug target/96933] rs6000: inefficient code for char/short vec CTOR

2020-09-07 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933

--- Comment #6 from Kewen Lin  ---
(In reply to Kewen Lin from comment #5)
> (In reply to Segher Boessenkool from comment #4)
> > Yes, timing suggests there is some SHL/LHS flush.
> > 
> > On p9 and later we can use mtvsrdd instead of mtvsrd (moving two
> > bytes into place at one), which reduces the number of moves from
> > 16 to 8, and the number of merges from 15 to 7 (and reduces path
> > length by 1).  This sounds like a no-brainer win with that :-)
> 
> Good idea, it looks better on P9. One thing to double confirm, currently
> there are no instructions like vmrgob and vmrgoh, so of the mergings you
> mentioned from vector bytes to vector short and vector short to vector word
> needs artificial control vector?

I improved the patch to support mtvsrdd; the asm for char looks like:

 :
   0:   00 00 4c 3c addis   r2,r12,0
0: R_PPC64_REL16_HA .TOC.
   4:   00 00 42 38 addi    r2,r2,0
4: R_PPC64_REL16_LO .TOC.+0x4
   8:   e8 ff a1 fb std r29,-24(r1)
   c:   00 00 a2 3f addis   r29,r2,0
c: R_PPC64_TOC16_HA .rodata.cst16
  10:   f0 ff c1 fb std r30,-16(r1)
  14:   f8 ff e1 fb std r31,-8(r1)
  18:   67 1b 24 7c mtvsrdd vs33,r4,r3
  1c:   67 3b 28 7d mtvsrdd vs41,r8,r7
  20:   68 00 c1 8b lbz r30,104(r1)
  24:   78 00 e1 8b lbz r31,120(r1)
  28:   00 00 bd 3b addi    r29,r29,0
28: R_PPC64_TOC16_LO .rodata.cst16
  2c:   60 00 81 89 lbz r12,96(r1)
  30:   70 00 61 89 lbz r11,112(r1)
  34:   80 00 81 88 lbz r4,128(r1)
  38:   88 00 61 88 lbz r3,136(r1)
  3c:   90 00 01 89 lbz r8,144(r1)
  40:   98 00 e1 88 lbz r7,152(r1)
  44:   67 2b 46 7c mtvsrdd vs34,r6,r5
  48:   67 4b aa 7d mtvsrdd vs45,r10,r9
  4c:   09 00 9d f5 lxv vs44,0(r29)
  50:   67 63 5e 7d mtvsrdd vs42,r30,r12
  54:   67 5b 1f 7c mtvsrdd vs32,r31,r11
  58:   e8 ff a1 eb ld  r29,-24(r1)
  5c:   f0 ff c1 eb ld  r30,-16(r1)
  60:   67 23 63 7d mtvsrdd vs43,r3,r4
  64:   f8 ff e1 eb ld  r31,-8(r1)
  68:   3b 0b 42 10 vpermr  v2,v2,v1,v12
  6c:   67 43 27 7c mtvsrdd vs33,r7,r8
  70:   3b 4b ad 11 vpermr  v13,v13,v9,v12
  74:   3b 53 00 10 vpermr  v0,v0,v10,v12
  78:   3b 5b 21 10 vpermr  v1,v1,v11,v12
  7c:   97 11 4d f0 xxmrglw vs34,vs45,vs34
  80:   97 01 01 f0 xxmrglw vs32,vs33,vs32
  84:   57 13 40 f0 xxmrgld vs34,vs32,vs34
  88:   20 00 80 4e blr

For:
  1) mtvsrdd under TARGET_DIRECT_MOVE_128
  2) mtvsrd under  TARGET_DIRECT_MOVE
  3) original

The time evaluation on Power9 looks like
  1) 7.28s
  2) 7.41s
  3) 18.19s

[Bug target/96933] rs6000: inefficient code for char/short vec CTOR

2020-09-07 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933

--- Comment #8 from Kewen Lin  ---
(In reply to Segher Boessenkool from comment #7)
> There are vmrglb and vrghb etc.?

But these only handle the low or high part separately. With mtvsrdd, both the
low and high parts (doublewords) hold values, and we don't have Vector Merge
Even/Odd for char or short to merge them. For now I used an artificial control
vector for the merging; correct me if I missed something.

[Bug target/96933] rs6000: inefficient code for char/short vec CTOR

2020-09-08 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933

--- Comment #10 from Kewen Lin  ---
(In reply to Segher Boessenkool from comment #9)
> I'm not sure what you mean.
> 
> vmrglb merges the vectors
>   abcdefghijklmnop
> and
>   ABCDEFGHIJKLMNOP
> to
>   iIjJkKlLmMnNoOpP
> 
> ... ah, I see what you mean I guess.
> 
> So, use something else instead?  How about vpku*um?
> 
> First vpkudum, xforming
>   xxxAxxxB
> and
>   xxxCxxxD
> into
>   xxxAxxxBxxxCxxxD
> 
> and then vpkuwum:
>   xxxAxxxBxxxCxxxD
> and
>   xxxExxxFxxxGxxxH
> into
>   xAxBxCxDxExFxGxH
> 
> and finally vpkuhum:
>   xAxBxCxDxExFxGxH
> and
>   xIxJxKxLxMxNxOxP
> into
>   ABCDEFGHIJKLMNOP
> 
> ?

Great, it works! Thanks for the advice. In testing, for type char it's on par
with the artificial control vector version, 7.30s vs. 7.28s, while for type
short it's better, 28.66s vs. 31.52s. I will update the sent patch to V2.
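
The pack sequence suggested in comment #9 can be checked with a scalar model.
This is a sketch under assumptions: bytes are numbered left to right as in the
diagrams of comment #9, vec128, model_mtvsrdd, model_pack and model_build are
hypothetical names, and it models only the data movement, not the code the
patch actually emits.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct { uint8_t b[16]; } vec128;   /* b[0] is the leftmost byte */

/* Model of mtvsrdd for byte inputs: the first value lands in the low byte
   of doubleword 0, the second in the low byte of doubleword 1.  */
static vec128 model_mtvsrdd (uint8_t first, uint8_t second)
{
  vec128 v;
  memset (v.b, 0, sizeof v.b);
  v.b[7] = first;
  v.b[15] = second;
  return v;
}

/* Model of vpkudum/vpkuwum/vpkuhum (ELEM_SIZE 8/4/2): keep the low-order
   half of every ELEM_SIZE-byte element of A, then of B.  */
static vec128 model_pack (vec128 a, vec128 b, int elem_size)
{
  assert (elem_size == 8 || elem_size == 4 || elem_size == 2);
  const vec128 *src[2] = { &a, &b };
  vec128 out;
  int half = elem_size / 2, pos = 0;
  for (int s = 0; s < 2; s++)
    for (int e = 0; e < 16; e += elem_size)
      for (int k = 0; k < half; k++)
        out.b[pos++] = src[s]->b[e + half + k];
  return out;
}

/* Build a 16-byte vector from 16 scalars with 8 moves and 7 packs.  */
static vec128 model_build (const uint8_t f[16])
{
  vec128 v[8];
  for (int i = 0; i < 8; i++)
    v[i] = model_mtvsrdd (f[2 * i], f[2 * i + 1]);
  for (int i = 0; i < 4; i++)           /* 4 x vpkudum */
    v[i] = model_pack (v[2 * i], v[2 * i + 1], 8);
  for (int i = 0; i < 2; i++)           /* 2 x vpkuwum */
    v[i] = model_pack (v[2 * i], v[2 * i + 1], 4);
  return model_pack (v[0], v[1], 2);    /* 1 x vpkuhum */
}
```

The model uses 8 moves and 7 packs, matching the counts given in comment #4,
and the result holds the 16 inputs in their original order.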

[Bug target/97019] New: rs6000:redundant rldicr fed to lvx/stvx

2020-09-11 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97019

Bug ID: 97019
   Summary: rs6000:redundant rldicr fed to lvx/stvx
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: linkw at gcc dot gnu.org
  Target Milestone: ---

When we do the early expansion for the altivec built-in functions
vec_ld/vec_st, we can leave behind some redundant rldicr x,y,0,59 instructions,
which AND the vector access address with -16. Since lvx/stvx already AND the
address with -16 themselves, these instructions are useless.
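
The redundancy is just the idempotence of the mask. A small scalar model, with
hypothetical helper names (model_rldicr_0_59, model_lvx_ea,
check_mask_redundant) that stand in for the instructions rather than being any
GCC or hardware interface:

```c
#include <assert.h>
#include <stdint.h>

/* rldicr x,y,0,59: rotate by 0 and keep bits 0..59 (IBM bit numbering),
   which clears the low four bits, i.e. ANDs with -16.  */
static uint64_t model_rldicr_0_59 (uint64_t addr)
{
  return addr & ~UINT64_C (15);
}

/* Effective address used by lvx/stvx: the low four bits of base + offset
   are ignored.  */
static uint64_t model_lvx_ea (uint64_t base, uint64_t offset)
{
  return (base + offset) & ~UINT64_C (15);
}

/* Masking the address first does not change what lvx/stvx accesses.  */
static void check_mask_redundant (uint64_t addr)
{
  assert (model_lvx_ea (model_rldicr_0_59 (addr), 0)
          == model_lvx_ea (addr, 0));
}
```

Note the model only covers the case where the already-summed address is masked
and then used with a zero offset, which is the shape the early expansion
produces here.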

= test case 

extern int a, b, c;
extern vector unsigned long long ev5, ev6, ev7, ev8;

int test(unsigned char *pe) {
  vector unsigned long long v1, v2, v3, v4, v9;
  vector unsigned long long v5 = ev5;
  vector unsigned long long v6 = ev6;
  vector unsigned long long v7 = ev7;
  vector unsigned long long v8 = ev8;

  unsigned char *e = pe;

  do {
    if (a) {
      asm("memory");
      v1 = __builtin_vec_ld(16, (unsigned long long *)e);
      v2 = __builtin_vec_ld(32, (unsigned long long *)e);
      v3 = __builtin_vec_ld(48, (unsigned long long *)e);
      e = e + 8;
      for (int i = 0; i < a; i++) {
        v4 = v5;
        v5 = __builtin_crypto_vpmsumd(v1, v6);
        v6 = __builtin_crypto_vpmsumd(v2, v7);
        v7 = __builtin_crypto_vpmsumd(v3, v8);
        e = e + 8;
      }
    }
    v5 = __builtin_vec_ld(16, (unsigned long long *)e);
    v6 = __builtin_vec_ld(32, (unsigned long long *)e);
    v7 = __builtin_vec_ld(48, (unsigned long long *)e);
    if (c)
      b = 1;
  } while (b);

  v9 = v4;

  int p = __builtin_unpack_vector_int128((vector __int128_t)v9, 0);

  return p;
}

 command 
  -m64 -O2 -mcpu=power8

Currently the function find_alignment_op in the RTL swaps pass only handles the
case where there is one single AND operation definition. We can extend it to
check that all definitions are AND operations that align to 16 bytes (AND with
-16).

[Bug target/97019] rs6000:redundant rldicr fed to lvx/stvx

2020-09-11 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97019

Kewen Lin  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |linkw at gcc dot gnu.org
 Status|UNCONFIRMED |ASSIGNED
 CC||bergner at gcc dot gnu.org,
   ||segher at gcc dot gnu.org,
   ||wschmidt at gcc dot gnu.org
   Keywords||missed-optimization
 Ever confirmed|0   |1
   Last reconfirmed||2020-09-11
 Target||powerpc

[Bug target/97019] rs6000:redundant rldicr fed to lvx/stvx

2020-09-15 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97019

Kewen Lin  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #3 from Kewen Lin  ---
Should be fixed on latest trunk now.

[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

2020-09-16 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

Kewen Lin  changed:

   What|Removed |Added

   Last reconfirmed||2020-09-16
 Status|UNCONFIRMED |ASSIGNED
 Ever confirmed|0   |1

--- Comment #7 from Kewen Lin  ---
Two questions in mind, need to dig into it further:
  1) From the assembly of the scalar/vector code, I don't see any stores needed
into the temp array d (array diff in pixel_sub_wxh), but when modeling we
consider the stores. On Power the two vector stores cost 2 while the 16 scalar
stores cost 16; it seems wrong to cost model something useless. Later, for the
vector version we need 16 vector halfword extractions from these two halfword
vectors, while in the scalar version the values are already in GPRs, so the
vector version looks inefficient.
  2) On Power, the conversion from unsigned char to unsigned short is a nop
conversion, but it's counted when we compute the scalar cost, which adds a
total cost of 32 onto the scalar cost. Meanwhile, the conversion from unsigned
short to signed short should be counted but it's not (need to check why
further).  The nop conversion costing looks like something we can handle in
function rs6000_adjust_vect_cost_per_stmt. I tried to use the generic function
tree_nop_conversion_p, but it's only for same mode/precision conversions. Will
find/check something else.
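
To see how much these items could matter, here is a rough arithmetic sketch
using the numbers from comment #4 (vector inside 8, prologue 36, epilogue 0,
scalar 64). It assumes a simplified comparison, total vector cost below scalar
cost, and it assumes the 32 units from the nop conversions are part of the
reported scalar cost of 64, which the thread does not confirm. bb_cost and
bb_vect_profitable are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

struct bb_cost
{
  int vec_inside;
  int vec_prologue;
  int vec_epilogue;
  int scalar;
};

/* Simplified basic-block check: vectorize only if the total vector cost
   is below the scalar cost.  */
static bool bb_vect_profitable (struct bb_cost c)
{
  assert (c.scalar >= 0);
  return c.vec_inside + c.vec_prologue + c.vec_epilogue < c.scalar;
}
```

Under those assumptions, 8 + 36 + 0 = 44 is below 64, so the block is
vectorized, while dropping 32 from the scalar side would leave 44 against 32
and flip the decision.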

[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

2020-09-16 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #9 from Kewen Lin  ---
(In reply to Richard Biener from comment #8)
> (In reply to Kewen Lin from comment #7)
> > Two questions in mind, need to dig into it further:
> >   1) from the assembly of scalar/vector code, I don't see any stores needed
> > into temp array d (array diff in pixel_sub_wxh), but when modeling we
> > consider the stores.
> 
> Because when modeling they are still there.  There's no good way around this.
> 

I noticed the stores get eliminated during FRE.  Can we consider running FRE
once just before SLP? Or is that a bad idea due to compilation time?

[Bug tree-optimization/97075] [11 regression] powerpc64 vector tests fails after r11-3230

2020-09-16 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97075

Kewen Lin  changed:

   What|Removed |Added

 Ever confirmed|0   |1
 Status|UNCONFIRMED |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |linkw at gcc dot gnu.org
 CC||linkw at gcc dot gnu.org
   Last reconfirmed||2020-09-17

--- Comment #1 from Kewen Lin  ---
I'll take a look at this.

[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

2020-09-16 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #11 from Kewen Lin  ---
(In reply to Richard Biener from comment #10)
> (In reply to Kewen Lin from comment #9)
> > (In reply to Richard Biener from comment #8)
> > > (In reply to Kewen Lin from comment #7)
> > > > Two questions in mind, need to dig into it further:
> > > >   1) from the assembly of scalar/vector code, I don't see any stores 
> > > > needed
> > > > into temp array d (array diff in pixel_sub_wxh), but when modeling we
> > > > consider the stores.
> > > 
> > > Because when modeling they are still there.  There's no good way around 
> > > this.
> > > 
> > 
> > I noticed the stores get eliminated during FRE.  Can we consider running FRE
> > once just before SLP? a bad idea due to compilation time?
> 
> Yeah, we already run FRE a lot and it is one of the more expensive passes.
> 
> Note there's one point we could do better which is the embedded SESE FRE
> run from cunroll which is only run before we consider peeling an outer loop
> and thus not for the outermost unrolled/peeled code (but the question would
> be from where / up to what to apply FRE to).  On x86_64 this would apply to
> the unvectorized but then unrolled outer loop from pixel_sub_wxh which feeds
> quite bad IL to the SLP pass (but that shouldn't matter too much, maybe it
> matters for costing though).

Thanks for the explanation! I'll look at it after checking 2). IIUC, the
advantage of eliminating the stores here is that whatever feeds the stores and
the stores' consumers can be bundled together, so more things get SLP-ed if
available?

> 
> I think I looked at this or a related testcase some time ago and split out
> some PRs (can't find those right now).  For example we are not considering
> to simplify
> 

> 
> the load permutations suggest that splitting the group into 4-lane pieces
> would avoid doing permutes but then that would require target support
> for V4QI and V4HI vectors.  At least the loads could be considered
> to be vectorized with strided-SLP, yielding 'int' loads and a vector
> build from 4 ints.  I'd need to analyze why we do not consider this.

Good idea! I'm curious: is there some port where an int load doesn't work well
on a 1-byte aligned address, e.g. it traps?

[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

2020-09-16 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #12 from Kewen Lin  ---

> Thanks for the explanation! I'll look at it after checking 2). IIUC, the
> advantage to eliminate stores here looks able to get those things which is
> fed to stores and stores' consumers bundled, then get more things SLP-ed if
> available?

Hmm, I think I was wrong: if both the feeding chain and the consuming chain of
the stores are SLP-ed, a later FRE would be able to fuse them.

[Bug tree-optimization/97075] [11 regression] powerpc64 vector tests fails after r11-3230

2020-09-17 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97075

--- Comment #3 from Kewen Lin  ---
(In reply to akrl from comment #2)
> Thanks Kewen, unfortunately I've no Power setup.  Sorry for the
> inconvenience.

My pleasure! If you are interested in running on Power machines, you can apply
for and use some Power8/Power9 machines in the CFarm machine pool
https://cfarm.tetaneutral.net/machines/list/.

[Bug tree-optimization/97075] [11 regression] powerpc64 vector tests fails after r11-3230

2020-09-17 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97075

--- Comment #4 from Kewen Lin  ---
> gcc.target/powerpc/p9-vec-length-full-6.c

This is a test case issue: with Andrea's improvement, the 64bit/32bit pairs
will use full vectors instead of partial vectors.

> gcc.target/powerpc/p9-vec-length-epil-7.c

It exposed one problem: vect_need_peeling_or_partial_vectors_p is called from
function vect_analyze_loop_2 in the analysis stage. If the loop is an epilogue
loop, its loop_vinfo hasn't been fixed up yet (e.g. LOOP_VINFO_INT_NITERS), so
the function can give the wrong answer.  For some 64bit type functions of this
failed case, it returns false for the epilogue loops although the remaining
iterations can't actually fill a full vector.

One simple fix is to exclude epilogue loops from this check.

diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index ab627fbf029..7273e998a99 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -2278,7 +2278,8 @@ start_over:
 {
   /* Don't use partial vectors if we don't need to peel the loop.  */
   if (param_vect_partial_vector_usage == 0
-  || !vect_need_peeling_or_partial_vectors_p (loop_vinfo))
+  || (!LOOP_VINFO_EPILOGUE_P (loop_vinfo)
+  && !vect_need_peeling_or_partial_vectors_p (loop_vinfo)))
 LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) = false;
   else if (vect_verify_full_masking (loop_vinfo)
|| vect_verify_loop_lens (loop_vinfo))

Testing is ongoing.

[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

2020-09-18 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #13 from Kewen Lin  ---
>   2) on Power, the conversion from unsigned char to unsigned short is nop
> conversion, when we counting scalar cost, it's counted, then add costs 32
> totally onto scalar cost. Meanwhile, the conversion from unsigned short to
> signed short should be counted but it's not (need to check why further). 

vect_nop_conversion_p returns true for the UH to SH conversion, so it's not
even put into the cost vector.

The comment on tree_nop_conversion_p says:

/* Return true iff conversion from INNER_TYPE to OUTER_TYPE generates
   no instruction.  */

I may be missing something here, but the UH to SH conversion does need one
explicit extend instruction *extsh*, so the precision/mode equality check looks
wrong for this conversion.
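
A small C illustration of the point, as a sketch only (uh_to_sh, widen_sh and
widen_uh are hypothetical helpers): the 16-bit pattern is unchanged by the
conversion, but once the value lives in a wider register the signed result
needs a sign extension.

```c
#include <assert.h>
#include <stdint.h>

/* Same 16-bit precision, so only the type changes.  For values above
   INT16_MAX the result is implementation-defined in ISO C; GCC wraps
   modulo 2^16.  */
static int16_t uh_to_sh (uint16_t x)
{
  return (int16_t) x;
}

/* Widening to a full register makes the difference visible: the signed
   value has to be sign-extended (extsh on Power), the unsigned one is
   zero-extended.  */
static int64_t widen_sh (uint16_t x)
{
  return uh_to_sh (x);
}

static int64_t widen_uh (uint16_t x)
{
  return x;
}
```

For an input of 0xFFFF the low 16 bits agree, while the widened values are
65535 and -1 respectively on GCC.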

[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

2020-09-18 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789

--- Comment #15 from Kewen Lin  ---
(In reply to rguent...@suse.de from comment #14)
> On Fri, 18 Sep 2020, linkw at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789
> > 
> > --- Comment #13 from Kewen Lin  ---
> > >   2) on Power, the conversion from unsigned char to unsigned short is nop
> > > conversion, when we counting scalar cost, it's counted, then add costs 32
> > > totally onto scalar cost. Meanwhile, the conversion from unsigned short to
> > > signed short should be counted but it's not (need to check why further). 
> > 
> > UH to SH conversion is true when calling vect_nop_conversion_p, so it's not
> > even put into the cost vector. 
> > 
> > tree_nop_conversion_p's comments saying:
> > 
> > /* Return true iff conversion from INNER_TYPE to OUTER_TYPE generates
> >    no instruction.  */
> > 
> > I may miss something here, but UH to SH conversion does need one explicit
> > extend instruction *extsh*, the precision/mode equality check looks wrong
> > for this conversion.
> 
> Well, it isn't a RTL predicate and it only needs extension because
> there's never a HImode pseudo but always SImode subregs.

Thanks Richi! Should we take care of this case, or neglect this kind of
extension as "no instruction"? I intended to handle it in target-specific
code, but it isn't recorded in the cost vector, and revisiting the bb_info
slp_instances in finish_cost seems too heavy.

[Bug target/92132] new test case gcc.dg/vect/vect-cond-reduc-4.c fails with its introduction in r277067

2019-10-21 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92132

--- Comment #3 from Kewen Lin  ---
PowerPC already supports vcond where A and B are in the same mode or in
modes of the same size. As Richard pointed out, this case requires some packs:
it requires PowerPC to support vec_cmpv2dfv2di and vcond_mask_v4siv4si, where
the comparison generates the mask, which is then converted to V4SI for use in
the condition selection.
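
As a rough scalar model of that sequence (a sketch only; the function name select_by_fp_compare and the per-lane layout are assumptions for illustration, not the rs6000 patterns): a comparison on double lanes yields a 64-bit all-ones or all-zeros mask, the mask is narrowed to 32 bits, and the narrowed mask selects between two int values.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of "compare in one mode, select in another":
   compare doubles, narrow the 64-bit mask to 32 bits, then use the
   narrowed mask to choose between two 32-bit integers.  */
void
select_by_fp_compare (const double *a, const double *b,
                      const int32_t *x, const int32_t *y,
                      int32_t *out, size_t n)
{
  for (size_t i = 0; i < n; i++)
    {
      int64_t mask64 = a[i] < b[i] ? -1 : 0;   /* models the compare result */
      int32_t mask32 = (int32_t) mask64;       /* models the pack to 32 bits */
      out[i] = (x[i] & mask32) | (y[i] & ~mask32);
    }
}
```

Each output lane takes x[i] where a[i] < b[i] holds and y[i] otherwise.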

[Bug tree-optimization/92185] New: ICE when perform condition reduction vectorization on uchar ind var

2019-10-23 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92185

Bug ID: 92185
   Summary: ICE when perform condition reduction vectorization on
uchar ind var
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: linkw at gcc dot gnu.org
  Target Milestone: ---

TESTCASE:

#include "tree-vect.h"

extern void abort (void) __attribute__ ((noreturn));

#define N 27


unsigned char
condition_reduction (short *a, short min_v)
{
  unsigned char last = 0;

  for (unsigned char i = 0; i < 27; i++)
    if (a[i] < min_v)
      last = i;

  return last;
}

int
main (void)
{
  short a[27] = {
  11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
  1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
  21, 22, 23, 24, 25, 26, 27
  };

  check_vect ();

  int ret = condition_reduction (a, 10);
  if (ret != 18)
    abort ();

  return 0;
}

BTW, tree-vect.h is from gcc/testsuite/gcc.dg/vect/tree-vect.h

Options: -Ofast -fno-inline -fdump-tree-vect-details
-fvect-cost-model=unlimited

ICE backtrace:
   13 | condition_reduction (short *a, short min_v)
  | ^~~
0x115016ff vect_create_epilog_for_reduction
/home/linkw/gcc/gcc-git-fix/gcc/tree-vect-loop.c:4252
0x1150ccb3 vectorizable_live_operation(_stmt_vec_info*, gimple_stmt_iterator*,
_slp_tree*, _slp_instance*, int, bool, vec*)
/home/linkw/gcc/gcc-git-fix/gcc/tree-vect-loop.c:7478
0x114df9bf can_vectorize_live_stmts
/home/linkw/gcc/gcc-git-fix/gcc/tree-vect-stmts.c:10578
0x114e1933 vect_transform_stmt(_stmt_vec_info*, gimple_stmt_iterator*,
_slp_tree*, _slp_instance*)
/home/linkw/gcc/gcc-git-fix/gcc/tree-vect-stmts.c:11031
0x1150e9d7 vect_transform_loop_stmt
/home/linkw/gcc/gcc-git-fix/gcc/tree-vect-loop.c:7918
0x1150f73f vect_transform_loop(_loop_vec_info*)
/home/linkw/gcc/gcc-git-fix/gcc/tree-vect-loop.c:8133
0x1154acc7 try_vectorize_loop_1
/home/linkw/gcc/gcc-git-fix/gcc/tree-vectorizer.c:982
0x1154aff3 try_vectorize_loop
/home/linkw/gcc/gcc-git-fix/gcc/tree-vectorizer.c:1035
0x1154b243 vectorize_loops()
/home/linkw/gcc/gcc-git-fix/gcc/tree-vectorizer.c:1115
0x1132976f execute
/home/linkw/gcc/gcc-git-fix/gcc/tree-ssa-loop.c:414

[Bug tree-optimization/92185] ICE when perform condition reduction vectorization on uchar ind var

2019-10-23 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92185

--- Comment #3 from Kewen Lin  ---
(In reply to Richard Biener from comment #2)
> Hmm, I can't reproduce this, I tried ppc64le and x86_64.

Sorry, my local codebase is on r277221, trying latest trunk.

[Bug tree-optimization/92162] [10 Regression] ICE in vect_create_epilog_for_reduction, at tree-vect-loop.c:4252

2019-10-23 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92162

Kewen Lin  changed:

   What|Removed |Added

 CC||linkw at gcc dot gnu.org

--- Comment #6 from Kewen Lin  ---
*** Bug 92185 has been marked as a duplicate of this bug. ***

[Bug tree-optimization/92185] ICE when perform condition reduction vectorization on uchar ind var

2019-10-23 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92185

Kewen Lin  changed:

   What|Removed |Added

 Status|RESOLVED|CLOSED
 Resolution|FIXED   |DUPLICATE

--- Comment #5 from Kewen Lin  ---
Confirmed that the latest trunk has already fixed it, and bisecting shows the
same result as Martin pointed out (thanks Martin).

*** This bug has been marked as a duplicate of bug 92162 ***

[Bug middle-end/26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)

2019-10-23 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
Bug 26163 depends on bug 92074, which changed state.

Bug 92074 Summary: [10 regression] 26% performance regression on Spec2017 548.exchange2_r
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92074

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

[Bug ipa/92074] [10 regression] 26% performance regression on Spec2017 548.exchange2_r

2019-10-23 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92074

Kewen Lin  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED
   Assignee|unassigned at gcc dot gnu.org  |hubicka at gcc dot gnu.org

--- Comment #7 from Kewen Lin  ---
Verified and confirmed that the commit recovers the number.

[Bug testsuite/92127] [10 regression] gcc.dg/vect/costmodel/ppc/costmodel-fast-math-vect-pr29925.c fails after r276645 on power7

2019-10-30 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92127

Kewen Lin  changed:

   What|Removed |Added

 CC||linkw at gcc dot gnu.org

--- Comment #3 from Kewen Lin  ---
(In reply to Richard Biener from comment #2)
> I suggest to make the test less dependent on unrolling by placing
> 
> #pragma GCC unroll 0
> 
> before the inner loop which is likely unrolled now.  I wonder whether
> the test tests profitability of outer loop vectorization (likely
> not profitable)?  I see rs6000 adjusts unroll parameters as well.

Confirmed that the inner loop is completely unrolled after the suspected
commit. I checked the dump details: the test checks whether vectorizing the
inner loop is profitable, while the outer loop vectorization fails long before
the profitability determination.

/home/linkw/gcc/gcc-git-base/gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-fast-math-vect-pr29925.c:18:20:
missed:   versioning for alias required: can't determine dependence between *_7
and *_11
consider run-time aliasing test between *_7 and *_11
/home/linkw/gcc/gcc-git-base/gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-fast-math-vect-pr29925.c:18:20:
missed:   runtime alias check not supported for outer loop.
/home/linkw/gcc/gcc-git-base/gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-fast-math-vect-pr29925.c:13:4:
missed:  bad data dependence.
/home/linkw/gcc/gcc-git-base/gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-fast-math-vect-pr29925.c:13:4:
missed: couldn't vectorize loop
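
For reference, a minimal sketch of where the suggested pragma goes (the kernel row_sums and its flat matrix layout are made up for illustration and are not the test case itself). The GCC-specific pragma must sit immediately before the loop it applies to; per the GCC documentation, an unroll factor of 0 or 1 blocks unrolling of that loop.

```c
#define COLS 4

/* Hypothetical kernel: sum each row of a ROWS x COLS matrix stored
   row-major in a flat array.  */
void
row_sums (const int *m, int rows, int *sums)
{
  for (int i = 0; i < rows; i++)
    {
      int acc = 0;
      /* Ask GCC not to unroll the inner loop, so that later passes still
         see it as a loop.  */
#pragma GCC unroll 0
      for (int j = 0; j < COLS; j++)
        acc += m[i * COLS + j];
      sums[i] = acc;
    }
}
```

The pragma only affects the loop that follows it, so the outer loop is left to the usual heuristics.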

[Bug testsuite/92127] [10 regression] gcc.dg/vect/costmodel/ppc/costmodel-fast-math-vect-pr29925.c fails after r276645 on power7

2019-11-01 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92127

--- Comment #4 from Kewen Lin  ---
Author: linkw
Date: Fri Nov  1 07:11:12 2019
New Revision: 277704

URL: https://gcc.gnu.org/viewcvs?rev=277704&root=gcc&view=rev
Log:
2019-11-01  Kewen Lin  

  PR testsuite/92127
  * gcc.dg/vect/costmodel/ppc/costmodel-pr37194.c: Disable unroll.
  * gcc.dg/vect/costmodel/ppc/costmodel-fast-math-vect-pr29925.c: Likewise.


Modified:
trunk/gcc/testsuite/ChangeLog
   
trunk/gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-fast-math-vect-pr29925.c
trunk/gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-pr37194.c

[Bug testsuite/92127] [10 regression] gcc.dg/vect/costmodel/ppc/costmodel-fast-math-vect-pr29925.c fails after r276645 on power7

2019-11-01 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92127

Kewen Lin  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

--- Comment #5 from Kewen Lin  ---
Test case fix has been committed.

[Bug ipa/92074] [10 regression] 26% performance regression on Spec2017 548.exchange2_r

2019-11-01 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92074

Kewen Lin  changed:

   What|Removed |Added

 Status|RESOLVED|CLOSED

--- Comment #8 from Kewen Lin  ---
Closed it.

[Bug testsuite/87306] test case gcc.dg/vect/bb-slp-pow-1.c fails with its introduction in r263290

2019-11-04 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87306

--- Comment #6 from Kewen Lin  ---
Author: linkw
Revision: 268003
Modified property: svn:log

Modified: svn:log at Tue Nov  5 02:26:38 2019
--
--- svn:log (original)
+++ svn:log Tue Nov  5 02:26:38 2019
@@ -1,3 +1,5 @@
+[PATCH, rs6000, testsuite] Fix PR87306
+
 PR target/87306
 * gcc.dg/vect/bb-slp-pow-1.c: Modify to reflect that
 the loop is not vectorized on POWER unless hardware

[Bug testsuite/92127] [10 regression] gcc.dg/vect/costmodel/ppc/costmodel-fast-math-vect-pr29925.c fails after r276645 on power7

2019-11-04 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92127

--- Comment #6 from Kewen Lin  ---
Author: linkw
Revision: 277704
Modified property: svn:log

Modified: svn:log at Tue Nov  5 02:36:58 2019
--
--- svn:log (original)
+++ svn:log Tue Nov  5 02:36:58 2019
@@ -1,4 +1,6 @@
-2019-11-01  Kewen Lin  
+  PR testsuite/92127: Disable unrolling for some vect code model cases
+
+  2019-11-01  Kewen Lin  

   PR testsuite/92127
   * gcc.dg/vect/costmodel/ppc/costmodel-pr37194.c: Disable unroll.

[Bug target/92132] new test case gcc.dg/vect/vect-cond-reduc-4.c fails with its introduction in r277067

2019-11-07 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92132

--- Comment #4 from Kewen Lin  ---
Author: linkw
Date: Fri Nov  8 07:37:07 2019
New Revision: 277947

URL: https://gcc.gnu.org/viewcvs?rev=277947&root=gcc&view=rev
Log:
  [rs6000]Fix PR92132 by adding vec_cmp and vcond_mask supports

  To support full condition reduction vectorization, we have to define
  vec_cmp* and vcond_mask_*.  This patch is to add related expands.
  Also add the missing vector fp comparison RTL pattern supports
  like: ungt, unge, unlt, unle, ne, lt and le.

gcc/ChangeLog

2019-11-08  Kewen Lin  

PR target/92132
* config/rs6000/predicates.md
(signed_or_equality_comparison_operator): New predicate.
(unsigned_or_equality_comparison_operator): Likewise.
* config/rs6000/rs6000.md (one_cmpl2): Remove expand.
(one_cmpl3_internal): Rename to one_cmpl2.
* config/rs6000/vector.md
(vcond_mask_ for VEC_I and VEC_I): New expand.
(vec_cmp for VEC_I and VEC_I): Likewise.
(vec_cmpu for VEC_I and VEC_I): Likewise.
(vcond_mask_ for VEC_F): New expand for float
vector modes and same-size integer vector modes.
(vec_cmp for VEC_F): Likewise.
(vector_lt for VEC_F): New expand.
(vector_le for VEC_F): Likewise.
(vector_ne for VEC_F): Likewise.
(vector_unge for VEC_F): Likewise.
(vector_ungt for VEC_F): Likewise.
(vector_unle for VEC_F): Likewise.
(vector_unlt for VEC_F): Likewise.
(vector_uneq): Expose name.
(vector_ltgt): Likewise.
(vector_unordered): Likewise.
(vector_ordered): Likewise.

gcc/testsuite/ChangeLog

2019-11-08  Kewen Lin  

PR target/92132
* gcc.target/powerpc/pr92132-fp-1.c: New test.
* gcc.target/powerpc/pr92132-fp-2.c: New test.
* gcc.target/powerpc/pr92132-int-1.c: New test.
* gcc.target/powerpc/pr92132-int-2.c: New test.


Added:
trunk/gcc/testsuite/gcc.target/powerpc/pr92132-fp-1.c
trunk/gcc/testsuite/gcc.target/powerpc/pr92132-fp-2.c
trunk/gcc/testsuite/gcc.target/powerpc/pr92132-int-1.c
trunk/gcc/testsuite/gcc.target/powerpc/pr92132-int-2.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/config/rs6000/predicates.md
trunk/gcc/config/rs6000/rs6000.md
trunk/gcc/config/rs6000/vector.md
trunk/gcc/testsuite/ChangeLog

[Bug target/92132] new test case gcc.dg/vect/vect-cond-reduc-4.c fails with its introduction in r277067

2019-11-07 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92132

Kewen Lin  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

--- Comment #5 from Kewen Lin  ---
Fixed on trunk.

[Bug testsuite/92464] [10 regression] r278033 breaks gcc.dg/vect/costmodel/ppc/costmodel-vect-76b.c

2019-11-12 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92464

Kewen Lin  changed:

   What|Removed |Added

 Status|UNCONFIRMED |ASSIGNED
   Last reconfirmed||2019-11-12
 Ever confirmed|0   |1

--- Comment #1 from Kewen Lin  ---
Before the regressed commit, the cost view looks like:
  0x13135eb0 ic[i_35] 2 times vector_stmt costs 2 in prologue
  0x13135eb0 ic[i_35] 1 times vector_stmt costs 1 in prologue
  0x13135eb0 ic[i_35] 1 times vector_load costs 1 in body
  0x13135eb0 ic[i_35] 1 times vec_perm costs 3 in body
  0x13135eb0 _5 1 times vector_store costs 1 in body
  .c:21:3: note:  not using a fully-masked loop.
  cost model: prologue peel iters set to vf/2.
  cost model: epilogue peel iters set to vf/2 because peeling for alignment is
unknown.
  0x13135eb0  1 times cond_branch_taken costs 3 in prologue
  0x13135eb0  1 times cond_branch_not_taken costs 1 in prologue
  0x13135eb0  1 times cond_branch_taken costs 3 in epilogue
  0x13135eb0  1 times cond_branch_not_taken costs 1 in epilogue
  0x13135eb0 ic[i_35] 2 times scalar_load costs 2 in prologue
  0x13135eb0 ic[i_35] 2 times scalar_load costs 2 in epilogue
  0x13135eb0 _5 2 times scalar_store costs 2 in prologue
  0x13135eb0 _5 2 times scalar_store costs 2 in epilogue
  .c:21:3: note:  Cost model analysis:
Vector inside of loop cost: 5
Vector prologue cost: 11
Vector epilogue cost: 8
Scalar iteration cost: 2
Scalar outside cost: 0
Vector outside cost: 19
prologue iterations: 2
epilogue iterations: 2
Calculated minimum iters for profitability: 19

With the commit, the cost view is changed to:
  0x13135eb0 ic[i_35] 2 times vector_stmt costs 2 in prologue
  0x13135eb0 ic[i_35] 1 times vector_stmt costs 1 in prologue
  0x13135eb0 ic[i_35] 1 times vector_load costs 2 in body
  0x13135eb0 ic[i_35] 1 times vec_perm costs 3 in body
  0x13135eb0 _5 1 times vector_store costs 1 in body
  .c:21:3: note:  not using a fully-masked loop.
  cost model: prologue peel iters set to vf/2.
  cost model: epilogue peel iters set to vf/2 because peeling for alignment is
unknown.
  0x13135eb0  1 times cond_branch_taken costs 3 in prologue
  0x13135eb0  1 times cond_branch_not_taken costs 1 in prologue
  0x13135eb0  1 times cond_branch_taken costs 3 in epilogue
  0x13135eb0  1 times cond_branch_not_taken costs 1 in epilogue
  0x13135eb0 ic[i_35] 2 times scalar_load costs 4 in prologue
  0x13135eb0 ic[i_35] 2 times scalar_load costs 4 in epilogue
  0x13135eb0 _5 2 times scalar_store costs 2 in prologue
  0x13135eb0 _5 2 times scalar_store costs 2 in epilogue
  .c:21:3: note:  Cost model analysis:
Vector inside of loop cost: 6
Vector prologue cost: 13
Vector epilogue cost: 10
Scalar iteration cost: 3
Scalar outside cost: 0
Vector outside cost: 23
prologue iterations: 2
epilogue iterations: 2
Calculated minimum iters for profitability: 12

The cost changes are expected: scalar and vector loads cost more. This makes
the profitable min iteration count smaller.

I ran both the before and after executables (10 invocations, 10 times); the
measured times are very close, and both average 65.23s. It means the cost
adjustment doesn't make this case worse.

One possible fix is to adjust the test case iteration count to 11, lower than
the current profitable min iteration count.
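
The two thresholds can be reproduced from the numbers in the dumps with the following sketch. It is a simplified model of the vectorizer's threshold computation rather than the actual GCC source; the function name and parameter names are made up here, and a vectorization factor of 4 is an assumption inferred from the prologue/epilogue peel count of vf/2 = 2.

```c
/* Simplified, hypothetical model of the profitability threshold.
   Returns -1 when one vector iteration is not cheaper than VF scalar
   iterations, i.e. vectorization can never pay off.  */
int
min_profitable_iters (int vec_inside, int vec_outside, int scalar_outside,
                      int scalar_iter, int vf,
                      int peel_prologue, int peel_epilogue)
{
  int saving_per_vec_iter = scalar_iter * vf - vec_inside;
  if (saving_per_vec_iter <= 0)
    return -1;

  int outside = (vec_outside - scalar_outside) * vf;
  int iters = (outside - vec_inside * peel_prologue
               - vec_inside * peel_epilogue) / saving_per_vec_iter;

  /* Bump by one if the scalar loop is still no more expensive than the
     vector loop at ITERS iterations.  */
  if (scalar_iter * vf * iters <= vec_inside * iters + outside)
    iters++;
  return iters;
}
```

With the "before" costs (inside 5, outside 19, scalar iteration 2) this gives 19, and with the "after" costs (inside 6, outside 23, scalar iteration 3) it gives 12, matching the two dumps.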

[Bug testsuite/92464] [10 regression] r278033 breaks gcc.dg/vect/costmodel/ppc/costmodel-vect-76b.c

2019-11-12 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92464

--- Comment #3 from Kewen Lin  ---
(In reply to Segher Boessenkool from comment #2)
> What is the testcase testing?  Whether we can properly vectorize this
> code, right?  And for p7 we now do it correctly, but thought it was
> too expensive before?

On Power7, it verifies that the cost model treats the loop as not profitable,
due to the high overhead of peeling to get a vector-aligned address, and so
does not vectorize the loop. The related patch changes the cost of load insns
on Power7, which changes the profitable min iteration count from 19 to 12.
Unluckily, the case happens to use 12 as its iteration count (N-OFF), so it
hits the threshold. According to the actual runtime performance evaluation of
this case (result mentioned above), the vectorized version performs on par with
the non-vectorized version (before), so I believe the cost change is innocent
for this case. One simple fix is to lower the loop bound N to 15 instead of 16.

[Bug testsuite/92464] [10 regression] r278033 breaks gcc.dg/vect/costmodel/ppc/costmodel-vect-76b.c

2019-11-12 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92464

--- Comment #4 from Kewen Lin  ---
By the way, if I remove the check_vect and result verification code, the
vectorized version performs very slightly better than the non-vectorized
version. And yes, I think it was a bit off before.

[Bug testsuite/92464] [10 regression] r278033 breaks gcc.dg/vect/costmodel/ppc/costmodel-vect-76b.c

2019-11-13 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92464

--- Comment #5 from Kewen Lin  ---
Author: linkw
Date: Thu Nov 14 05:57:12 2019
New Revision: 278195

URL: https://gcc.gnu.org/viewcvs?rev=278195&root=gcc&view=rev
Log:
  [testsuite] Fix PR92464 by adjust test case loop bound

  The recent vectorization cost adjustment on load leads
  the profitable min iteration count to change from 19 to 12.
  The case happens to hit the threshold.  This patch is to
  adjust the loop bound from 16 to 14.

  gcc/testsuite/ChangeLog

  2019-11-14  Kewen Lin  

PR target/92464
* gcc.dg/vect/costmodel/ppc/costmodel-vect-76b.c: Adjust
loop bound due to load cost adjustment.


Modified:
trunk/gcc/testsuite/ChangeLog
trunk/gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-vect-76b.c

[Bug testsuite/92464] [10 regression] r278033 breaks gcc.dg/vect/costmodel/ppc/costmodel-vect-76b.c

2019-11-13 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92464

Kewen Lin  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #6 from Kewen Lin  ---
Fixed on trunk by r278195.

[Bug target/92566] rs6000_preferred_simd_mode isn't very good

2019-11-18 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92566

Kewen Lin  changed:

   What|Removed |Added

 Status|UNCONFIRMED |ASSIGNED
   Last reconfirmed||2019-11-19
 CC||linkw at gcc dot gnu.org
   Assignee|unassigned at gcc dot gnu.org  |linkw at gcc dot gnu.org
 Ever confirmed|0   |1

--- Comment #1 from Kewen Lin  ---
Currently we guard V2DImode under TARGET_VSX && TARGET_P8_VECTOR in rs6000.c.

[Bug target/92566] rs6000_preferred_simd_mode isn't very good

2019-11-18 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92566

--- Comment #2 from Kewen Lin  ---
Created attachment 47295
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47295&action=edit
Guard V2DImode and V1TImode under VSX and P8VECTOR

[Bug target/92534] [10 regression] gcc.dg/vect/bb-slp-42.c fails after r278262

2019-11-19 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92534

Kewen Lin  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |linkw at gcc dot gnu.org

--- Comment #3 from Kewen Lin  ---
I'd like to triage this one.

[Bug target/92566] rs6000_preferred_simd_mode isn't very good

2019-11-19 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92566

Kewen Lin  changed:

   What|Removed |Added

  Attachment #47295|0   |1
is obsolete||

--- Comment #4 from Kewen Lin  ---
Created attachment 47306
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47306&action=edit
Get possible mode and query by VECTOR_UNIT_NONE_P

Updated per Segher's comment.

[Bug target/92534] [10 regression] gcc.dg/vect/bb-slp-42.c fails after r278262

2019-11-20 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92534

Kewen Lin  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED

--- Comment #4 from Kewen Lin  ---
This is related to the realign vector load. It only fails with
-mno-allow-movmisalign (which is disabled from Power8); I can't see the abort
if I specify the option explicitly on Power8.

The generated IR below is incorrect:

  vectp.43_73 = &MEM[(int *)b_17(D) + 4B];
  vect__160.44_21 = __builtin_altivec_mask_for_load (vectp.43_73);
==> Here we use the vectp.43_73 (b+4), this is unexpected.
  vectp.46_20 = &MEM[(int *)b_17(D) + 4B];
  vectp.46_19 = vectp.46_20 + 18446744073709551612;
==> Here we use the vectp.46_19 (b)
  vectp.46_18 = vectp.46_19 & -16B;
  vect__160.47_206 = MEM  [(int *)vectp.46_18];
  vectp.46_207 = vectp.46_19 + 15;
==> Here we use the vectp.46_19 (b) + 15
  vectp.46_208 = vectp.46_207 & -16B;
  vect__160.48_209 = MEM  [(int *)vectp.46_208];
  vect__160.49_210 = REALIGN_LOAD ;
  vect__144.50_211 = VEC_PERM_EXPR ;

If I adjust the code as below, the test passes.

  msq = vect_setup_realignment (first_stmt_info_for_drptr && !slp_perm
? first_stmt_info_for_drptr
: first_stmt_info, gsi, &realignment_token,
alignment_support_scheme, NULL_TREE,
&at_loop);

I need more time to figure out whether it's reasonable.
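
To make the failure mode concrete, here is a small emulation of a realigning load (a sketch only; realign_load, VEC_BYTES and the byte-buffer model are assumptions for illustration and do not correspond to vectorizer functions). Two aligned 16-byte blocks are loaded and the wanted bytes are picked out using a shift derived from the address. If the shift is derived from a different address than the one the aligned loads are based on, as in the IR above where the mask uses b+4 but the loads use b, the combined result is wrong.

```c
#include <stddef.h>
#include <string.h>

#define VEC_BYTES 16

/* Hypothetical emulation of a realigning load of VEC_BYTES bytes starting
   at buf[off], where buf[0] is treated as a 16-byte-aligned address.
   MASK_OFF is the offset the shift amount is derived from; it should be
   equal to OFF.  The caller must ensure that buf extends VEC_BYTES bytes
   past the start of the second aligned block.  */
void
realign_load (const unsigned char *buf, size_t off, size_t mask_off,
              unsigned char out[VEC_BYTES])
{
  unsigned char msq[VEC_BYTES], lsq[VEC_BYTES];
  size_t lo = off & ~(size_t) (VEC_BYTES - 1);
  size_t hi = (off + VEC_BYTES - 1) & ~(size_t) (VEC_BYTES - 1);
  size_t shift = mask_off & (VEC_BYTES - 1);

  memcpy (msq, buf + lo, VEC_BYTES);   /* first aligned load */
  memcpy (lsq, buf + hi, VEC_BYTES);   /* second aligned load */

  /* Combine the two aligned vectors according to the shift amount.  */
  for (size_t i = 0; i < VEC_BYTES; i++)
    out[i] = shift + i < VEC_BYTES ? msq[shift + i]
                                   : lsq[shift + i - VEC_BYTES];
}
```

Calling it with matching offsets returns the bytes starting at buf[off]; calling it with off 0 and mask_off 4 returns bytes starting at buf[4] followed by wrapped-around data, which mirrors the kind of mismatch described above.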

[Bug target/92534] [10 regression] gcc.dg/vect/bb-slp-42.c fails after r278262

2019-11-20 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92534

Kewen Lin  changed:

   What|Removed |Added

 CC||rguenth at gcc dot gnu.org

--- Comment #5 from Kewen Lin  ---
Seurer told me this has passed since a recent commit, r278495 (thanks
Seurer!). I noticed it guards uniform_vector_p, so the case is no longer
vectorized. I'm wondering whether, for other cases that reach that code path,
the code below is safe enough:

  msq = vect_setup_realignment (first_stmt_info_for_drptr
? first_stmt_info_for_drptr
: first_stmt_info

That is, would the situation here, where first_stmt_info is expected even
though first_stmt_info_for_drptr gets assigned (the behavior of this test case
before commit r278495), never happen?

I may suffer from imaginary fears, but my concern is that commit r278495
possibly just conceals a bug that this case exposed before.

Hi Richard B., since you are also the author of commit r275798, you might be
the best person who can answer that?  Thanks in advance!

[Bug target/92534] [10 regression] gcc.dg/vect/bb-slp-42.c fails after r278262

2019-11-21 Thread linkw at gcc dot gnu.org

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92534

--- Comment #7 from Kewen Lin  ---
Thanks for your confirmation and notes! Yes, the realignment code won't take
effect from Power8 on, which supports unaligned vector load/store. I'll study
the code, follow your suggestion and cook some patches later.
