[Bug tree-optimization/95199] New: Remove extra variable created for memory reference in loop vectorization.

2020-05-18 Thread zhoukaipeng3 at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95199

            Bug ID: 95199
           Summary: Remove extra variable created for memory reference in
                    loop vectorization.
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: zhoukaipeng3 at huawei dot com
  Target Milestone: ---

The function vect_create_data_ref_ptr creates two identical pointer variables
for two identical memory references.

gcc version 11.0.0 20200515 (experimental) (GCC)
Target: aarch64-unknown-linux-gnu
Configured with: ../configure
Command: gcc -O2 -march=armv8.2-a+fp+sve -ftree-vectorize test.c -S

Testcase:
void
foo (double *a, double *b, double m, int inc_x, int inc_y) {
  int ix = 0, iy = 0;
  for (int i = 0; i < 1000; ++i)
    {
      a[ix] += m * b[iy];
      ix += inc_x;
      iy += inc_y;
    }
  return ;
}

Assembly code
.L5:
  ld1d z3.d, p0/z, [x5, z2.d, lsl 3]
  ld1d z1.d, p0/z, [x3, z4.d, lsl 3]
  fmad z1.d, p1/m, z0.d, z3.d
  st1d z1.d, p0, [x2, z2.d, lsl 3]
  incd x1
  add  x5, x5, x6
  add  x3, x3, x4
  add  x2, x2, x6
  whilelo  p0.d, x1, x0
  b.any   .L5

x2 holds the same value as x5, because vectorizable_load and vectorizable_store
each call vect_create_data_ref_ptr for a[ix].

Dump Log in test.c.161.vect
test.c:4:2: note:  create real_type-pointer variable to type: double 
vectorizing a pointer ref: *a_16(D)
test.c:4:2: note:  created a_16(D)
test.c:4:2: note:  add new stmt: vect__4.5_94 = .MASK_GATHER_LOAD
(vectp_a.3_91, _90, 8, { 0.0, ... }, loop_mask_93);
...
test.c:4:2: note:  create real_type-pointer variable to type: double 
vectorizing a pointer ref: *a_16(D)
test.c:4:2: note:  created a_16(D)
test.c:4:2: note:  add new stmt: .MASK_SCATTER_STORE (vectp_a.11_117, _116, 8,
vect__10.10_108, loop_mask_93);

I plan to add a hash_map to loop_vec_info that maps each dr to the pointer
created for it by vect_create_data_ref_ptr.  If dr->ref has already been
handled, the existing pointer is returned instead of creating a new one.
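
The caching idea can be illustrated with a small standalone sketch.  This is
a toy model only, not GCC code: struct ref_ptr_cache, get_or_create_ptr and
the use of a string as the key are all hypothetical stand-ins for the
proposed hash_map in loop_vec_info.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for the proposed cache; none of these names exist in GCC.  */
#define MAX_REFS 16

struct ref_ptr_cache
{
  const char *ref[MAX_REFS];  /* Textual form of the memory reference.  */
  int ptr_id[MAX_REFS];       /* Id of the pointer created for it.  */
  size_t count;               /* Number of entries in use.  */
  int next_ptr_id;            /* Id handed out on the next miss.  */
};

/* Return the pointer id for REF, creating a new one only on first use.  */
static int
get_or_create_ptr (struct ref_ptr_cache *cache, const char *ref)
{
  for (size_t i = 0; i < cache->count; i++)
    if (strcmp (cache->ref[i], ref) == 0)
      return cache->ptr_id[i];  /* Already handled: reuse the pointer.  */

  assert (cache->count < MAX_REFS);
  cache->ref[cache->count] = ref;
  cache->ptr_id[cache->count] = cache->next_ptr_id++;
  return cache->ptr_id[cache->count++];
}
```

In this model the load and the store of a[ix] both look up "*a_16(D)" and get
the same id back, so only one pointer increment would remain in the loop.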

Optimized assembly code
.L3:
  ld1d    z2.d, p0/z, [x0, z1.d, lsl 3]
  ld1d    z0.d, p0/z, [x1, z4.d, lsl 3]
  fmad    z0.d, p1/m, z3.d, z2.d
  st1d    z0.d, p0, [x0, z1.d, lsl 3]
  incd    x2
  add x0, x0, x5
  add x1, x1, x4
  whilelo p0.d, w2, w3
  b.any   .L3
  ret

[Bug tree-optimization/95199] Remove extra variable created for memory reference in loop vectorization.

2020-05-19 Thread zhoukaipeng3 at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95199

--- Comment #2 from Kaipeng Zhou  ---
It seems that IVOPTs cannot handle the case where TREE_CODE (iv_step) is
SSA_NAME.

[Bug tree-optimization/95199] Remove extra variable created for memory reference in loop vectorization.

2020-05-21 Thread zhoukaipeng3 at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95199

--- Comment #4 from Kaipeng Zhou  ---
Sorry for not explaining this clearly.

I have debugged the testcase you provided.  The failure to eliminate them is
not caused by the IFN.  The relevant code is in the "get_computation_aff_1"
function.

In IVOPTs, the IV_STEPs must be checked by the function "constant_multiple_of"
before one IV variable can be used to eliminate another.  But if the tree_code
of the input IV_STEP is SSA_NAME, the function returns false.  In your
testcase, the tree_code of IV_STEP is MULT_EXPR, so it returns true.

Gimple for my testcase:
   [local count: 8589933]:
  _83 = (sizetype) inc_y_22(D);
  _84 = _83 * POLY_INT_CST [16, 16];
  _85 = (long unsigned int) inc_y_22(D);
  _86 = _85 * 8;
  _87 = (ssizetype) _86;
  _88 = _87 /[ex] 8;
  _89 = (long unsigned int) _88;
  _90 = VEC_SERIES_EXPR <0, _89>;
  vect_cst__95 = [vec_duplicate_expr] m_17(D);
  _97 = (sizetype) inc_x_20(D);
  _98 = _97 * POLY_INT_CST [16, 16];
  _99 = (long unsigned int) inc_x_20(D);
  _100 = _99 * 8;
  _101 = (ssizetype) _100;
  _102 = _101 /[ex] 8;
  _103 = (long unsigned int) _102;
  _104 = VEC_SERIES_EXPR <0, _103>;
  _109 = (sizetype) inc_x_20(D);
  _110 = _109 * POLY_INT_CST [16, 16];
  _111 = (long unsigned int) inc_x_20(D);
  _112 = _111 * 8;
  _113 = (ssizetype) _112;
  _114 = _113 /[ex] 8;
  _115 = (long unsigned int) _114;
  _116 = VEC_SERIES_EXPR <0, _115>;
  max_mask_123 = .WHILE_ULT (0, 1000, { 0, ... });

   [local count: 429496649]:
  # vectp_b.3_91 = PHI 
  # vectp_a.7_105 = PHI 
  # vectp_a.11_117 = PHI 
  # ivtmp_120 = PHI 
  # loop_mask_93 = PHI 
  vect__4.5_94 = .MASK_GATHER_LOAD (vectp_b.3_91, _90, 8, { 0.0, ... },
loop_mask_93);
  vect__5.6_96 = vect__4.5_94 * vect_cst__95;
  vect__9.9_107 = .MASK_GATHER_LOAD (vectp_a.7_105, _104, 8, { 0.0, ... },
loop_mask_93);
  vect__10.10_108 = vect__5.6_96 + vect__9.9_107;
  .MASK_SCATTER_STORE (vectp_a.11_117, _116, 8, vect__10.10_108, loop_mask_93);
  vectp_b.3_92 = vectp_b.3_91 + _84;
  vectp_a.7_106 = vectp_a.7_105 + _98;
  vectp_a.11_118 = vectp_a.11_117 + _110;
  ivtmp_121 = ivtmp_120 + POLY_INT_CST [2, 2];
  _122 = (unsigned int) ivtmp_121;
  next_mask_124 = .WHILE_ULT (_122, 1000, { 0, ... });
  if (next_mask_124 != { 0, ... })
goto ; [98.00%]
  else
goto ; [2.00%]

_98 and _110 are IV_STEPs.  Both are SSA_NAMEs, so IVOPTs currently cannot
eliminate one IV in favor of the other.
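
The behaviour described above can be modelled with a small standalone sketch.
This is a simplified toy, not the real "constant_multiple_of": the toy_expr
type and every toy_* name are hypothetical, and the model only covers the
cases discussed here (equal operands, a multiplication by a constant, and an
opaque name).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy expression nodes; a simplified model, not GCC's tree type.  */
enum toy_code { TOY_NAME, TOY_CONST, TOY_MULT };

struct toy_expr
{
  enum toy_code code;
  int id;                      /* TOY_NAME: which name this is.  */
  long value;                  /* TOY_CONST: the constant.  */
  const struct toy_expr *op0;  /* TOY_MULT: first operand.  */
  const struct toy_expr *op1;  /* TOY_MULT: second operand.  */
};

/* Structural equality: two names are equal only if they are the same name.  */
static bool
toy_equal (const struct toy_expr *a, const struct toy_expr *b)
{
  if (a->code != b->code)
    return false;
  switch (a->code)
    {
    case TOY_NAME:
      return a->id == b->id;
    case TOY_CONST:
      return a->value == b->value;
    case TOY_MULT:
      return toy_equal (a->op0, b->op0) && toy_equal (a->op1, b->op1);
    }
  return false;
}

/* Set *MUL so that TOP == *MUL * BOT if that can be shown structurally.  */
static bool
toy_constant_multiple_of (const struct toy_expr *top,
                          const struct toy_expr *bot, long *mul)
{
  assert (top != NULL && bot != NULL && mul != NULL);
  if (toy_equal (top, bot))
    {
      *mul = 1;
      return true;
    }
  switch (top->code)
    {
    case TOY_MULT:
      {
        /* (x * c) is a multiple of BOT if x is, scaled by c.  */
        long inner;
        if (top->op1->code != TOY_CONST
            || !toy_constant_multiple_of (top->op0, bot, &inner))
          return false;
        *mul = inner * top->op1->value;
        return true;
      }
    default:
      /* A different opaque name: nothing to look through, so give up.  */
      return false;
    }
}
```

In this model two distinct names fail the check even when they were computed
from the same expression, while two structurally identical multiplications
pass it, which mirrors the difference between the two testcases.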

I am not sure my analysis is right; please correct me if it is wrong.  Can you
also suggest how to solve this problem?  Should I try to enhance the
"constant_multiple_of" function?

[Bug tree-optimization/95199] Remove extra variable created for memory reference in loop vectorization.

2020-05-22 Thread zhoukaipeng3 at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95199

--- Comment #8 from Kaipeng Zhou  ---
(In reply to bin cheng from comment #7)
> (In reply to rguent...@suse.de from comment #6)
> > On Thu, 21 May 2020, zhoukaipeng3 at huawei dot com wrote:
> > 
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95199
> > > 
> > > --- Comment #4 from Kaipeng Zhou  ---
> > > Sorry for not expressing clearly.
> > > 
> > > I have debugged the testcase you provided.  Not eliminating them is not
> > > caused by IFN.  The relevant code is in the "get_computation_aff_1"
> > > function.
> > > 
> > > In IVOPTs the IV_STEPs must be checked by function "constant_multiple_of"
> > > before using an IV variable to eliminate the other.  But if the tree_code
> > > of input IV_STEP is SSA_NAME, the function will return false.  In your
> > > testcase, the tree_code of IV_STEP is MULT_EXPR, so it return true.
> > > 
> > > Gimple for my testcase:
> > >    [local count: 8589933]:
> > >   _83 = (sizetype) inc_y_22(D);
> > >   _84 = _83 * POLY_INT_CST [16, 16];
> > >   _85 = (long unsigned int) inc_y_22(D);
> > >   _86 = _85 * 8;
> > >   _87 = (ssizetype) _86;
> > >   _88 = _87 /[ex] 8;
> > >   _89 = (long unsigned int) _88;
> > >   _90 = VEC_SERIES_EXPR <0, _89>;
> > >   vect_cst__95 = [vec_duplicate_expr] m_17(D);
> > >   _97 = (sizetype) inc_x_20(D);
> > >   _98 = _97 * POLY_INT_CST [16, 16];
> > >   _99 = (long unsigned int) inc_x_20(D);
> > >   _100 = _99 * 8;
> > >   _101 = (ssizetype) _100;
> > >   _102 = _101 /[ex] 8;
> > >   _103 = (long unsigned int) _102;
> > >   _104 = VEC_SERIES_EXPR <0, _103>;
> > >   _109 = (sizetype) inc_x_20(D);
> > >   _110 = _109 * POLY_INT_CST [16, 16];
> > >   _111 = (long unsigned int) inc_x_20(D);
> > 
> > The issue is you have two copies of
> > (sizetype) inc_x_20(D) * POLY_INT_CST [16, 16];
> > and IVOPTs does not perform CSE.  vinfo->ivexpr_map is supposed to
> > catch those "IV base and/or step expressions".  So look where
> > they are inserted and check the CSE map is used.  Alternatively
> > fixup hashing/comparing to handle POLY_INT_CST [16, 16] if that
> > is the reason for the missed CSE.
> > 
> Yes, it's because cse_and_gimplify_to_preheader is not called for
> gathering/scattering.  Should be easily fixed by following patch:
> 
> diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
> index e7822c44951..ba9ee5c4996 100644
> --- a/gcc/tree-vect-stmts.c
> +++ b/gcc/tree-vect-stmts.c
> @@ -2961,6 +2961,7 @@ vect_get_strided_load_store_ops (stmt_vec_info stmt_info,
>    tree bump = size_binop (MULT_EXPR,
>                            fold_convert (sizetype, unshare_expr (DR_STEP (dr))),
>                            size_int (TYPE_VECTOR_SUBPARTS (vectype)));
> +  bump = cse_and_gimplify_to_preheader (loop_vinfo, bump);
>    *dataref_bump = force_gimple_operand (bump, &stmts, true, NULL_TREE);
>    if (stmts)
>      gsi_insert_seq_on_edge_immediate (loop_preheader_edge (loop), stmts);

I tested this patch and it worked fine on this testcase.

Thanks a lot.

[Bug tree-optimization/95199] Remove extra variable created for memory reference in loop vectorization.

2020-06-11 Thread zhoukaipeng3 at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95199

--- Comment #9 from Kaipeng Zhou  ---
Created attachment 48717
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48717&action=edit
Remove extra variable created for memory reference in loop vectorization.

It looks like no one is preparing this patch, so I have put one together.

Bootstrapped and tested with DejaGnu on x86 and aarch64, with no new failures.

Is it OK?

[Bug tree-optimization/95854] New: ICE in find_bswap_or_nop_1 of pass store-merging

2020-06-23 Thread zhoukaipeng3 at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95854

            Bug ID: 95854
           Summary: ICE in find_bswap_or_nop_1 of pass store-merging
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: zhoukaipeng3 at huawei dot com
  Target Milestone: ---

ICE log:

during GIMPLE pass: store-merging
pr87132.c: In function ‘main’:
pr87132.c:5:5: internal compiler error: in tree_to_uhwi, at tree.c:7332
    5 | int main()
      |     ^~~~
0xfc2fcf tree_to_uhwi(tree_node const*)
        ../.././gcc/tree.c:7332
0xfc2fcf tree_to_uhwi(tree_node const*)
        ../.././gcc/tree.c:7330
0x169420b find_bswap_or_nop_1
        ../.././gcc/gimple-ssa-store-merging.c:602
0x1696c1b find_bswap_or_nop_1
        ../.././gcc/gimple-ssa-store-merging.c:589
0x1696c1b process_store
        ../.././gcc/gimple-ssa-store-merging.c:4773
0x1696c1b execute
        ../.././gcc/gimple-ssa-store-merging.c:4996


Command: gcc -S -march=armv8.5-a+sve2 -fno-vect-cost-model -fno-tree-scev-cprop
 -O3 -ftracer pr87132.c

GCC version: gcc version 11.0.0 20200618 (experimental) (GCC)

The problem occurs in find_bswap_or_nop_1.
The stmt is "_27 = BIT_FIELD_REF ". 
So "tree_to_uhwi (TREE_OPERAND (rhs1, 2))" failed.

I plan to add a check before that call to make sure both
TREE_OPERAND (rhs1, 1) and TREE_OPERAND (rhs1, 2) are INTEGER_CSTs.
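
The shape of such a guard can be shown with a small standalone sketch.  This
is a toy model, not the actual fix: toy_tree, toy_to_uhwi and
toy_get_bit_range are hypothetical names, and toy_to_uhwi merely imitates a
conversion that asserts when its operand is not an integer constant.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy operand kinds; TOY_OTHER stands for anything that is not a plain
   integer constant.  */
enum toy_kind { TOY_INTEGER_CST, TOY_OTHER };

struct toy_tree
{
  enum toy_kind kind;
  unsigned long value;  /* Meaningful only for TOY_INTEGER_CST.  */
};

/* Toy BIT_FIELD_REF with only its size and position operands.  */
struct toy_bit_field_ref
{
  struct toy_tree size;  /* Plays the role of operand 1.  */
  struct toy_tree pos;   /* Plays the role of operand 2.  */
};

/* Conversion that is valid only for integer constants.  */
static unsigned long
toy_to_uhwi (const struct toy_tree *t)
{
  assert (t->kind == TOY_INTEGER_CST);
  return t->value;
}

/* Extract size and position, giving up early if either operand is not an
   integer constant instead of tripping the assertion above.  */
static bool
toy_get_bit_range (const struct toy_bit_field_ref *ref,
                   unsigned long *size, unsigned long *pos)
{
  if (ref->size.kind != TOY_INTEGER_CST || ref->pos.kind != TOY_INTEGER_CST)
    return false;
  *size = toy_to_uhwi (&ref->size);
  *pos = toy_to_uhwi (&ref->pos);
  return true;
}
```

With the guard in place the caller simply treats the statement as not
matching, rather than aborting inside the conversion.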

[Bug tree-optimization/96053] New: Miss optimization:Finding SLP sequences from reductions sometimes is better than finding from reduction chains

2020-07-03 Thread zhoukaipeng3 at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96053

            Bug ID: 96053
           Summary: Miss optimization:Finding SLP sequences from
                    reductions sometimes is better than finding from
                    reduction chains
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: zhoukaipeng3 at huawei dot com
  Target Milestone: ---

command:
gcc -S -O2 -ftree-vectorize test.c -funsafe-math-optimizations 
-fno-tree-reassoc -march=armv8.2-a+sve -msve-vector-bits=128

gcc version 11.0.0 20200629

In vectorization, finding SLP sequences from reduction chains takes priority
over finding them from reductions.  But sometimes starting from reductions
vectorizes better than starting from reduction chains.

testcase:
double f(double *a, double *b)
{
  double res1 = 0;
  double res0 = 0;
  for (int i = 0 ; i < 1000; i+=4) {
    res0 += a[i] * b[i];
    res1 += a[i+1] * b[i*1];
    res0 += a[i+2] * b[i+2];
    res1 += a[i+3] * b[i+3];
  }
  return res0 + res1;
}

I have two imperfect solutions: one is to add a control option, and the other
is to use the cost model to evaluate which approach is better.  The first is
difficult for users to use, and the second is difficult to implement.

Does anyone have a better suggestion?

[Bug tree-optimization/96053] Miss optimization:Finding SLP sequences from reductions sometimes is better than finding from reduction chains

2020-07-06 Thread zhoukaipeng3 at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96053

--- Comment #2 from Kaipeng Zhou  ---
For now, I will try to write a patch that gives the user an additional control
option such as

#pragma GCC vect [no-]reduc-chain

I will post an update here if there is any progress or any problem.

[Bug tree-optimization/96053] Miss optimization:Finding SLP sequences from reductions sometimes is better than finding from reduction chains

2020-07-20 Thread zhoukaipeng3 at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96053

--- Comment #3 from Kaipeng Zhou  ---
Created attachment 48896
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48896&action=edit
Patch to add #pragma GCC no_reduc_chain

[Bug tree-optimization/96053] Miss optimization:Finding SLP sequences from reductions sometimes is better than finding from reduction chains

2020-07-20 Thread zhoukaipeng3 at huawei dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96053

--- Comment #4 from Kaipeng Zhou  ---
This patch adds #pragma GCC no_reduc_chain.  So far only the C front end is
implemented.

With the pragma, the testcase successfully skips SLP discovery from reduction
chains.  Without #pragma GCC no_reduc_chain, the testcase fails to
vectorize.
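
For illustration, this is how the testcase might look with the pragma
applied.  The placement directly before the loop is an assumption, modelled
on existing loop pragmas such as #pragma GCC ivdep; the attached patch
defines the actual rules.  A compiler without the patch ignores the unknown
pragma, possibly with a warning.

```c
/* Testcase from the report with the proposed pragma applied.  */
double
f (double *a, double *b)
{
  double res1 = 0;
  double res0 = 0;
  /* Assumed placement: immediately before the loop it applies to.  */
#pragma GCC no_reduc_chain
  for (int i = 0; i < 1000; i += 4)
    {
      res0 += a[i] * b[i];
      res1 += a[i+1] * b[i*1];
      res0 += a[i+2] * b[i+2];
      res1 += a[i+3] * b[i+3];
    }
  return res0 + res1;
}
```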

Please check whether there are any problems.  If not, I will go on to
implement the front ends of the remaining languages.