[x86,PATCH] Simple performance tuning for SLM.

2013-09-10 Thread Yuri Rumyantsev
Hi All,

Here is a simple fix which improves SLM performance by avoiding memory
operands in conditional move instructions. Our measurements have shown a
significant speedup (~+10%) for the routelookup benchmark from the EEMBC 2.0 suite.

Bootstrapping and regression testing were successful for x86-64.

Is it OK for trunk?

ChangeLog:

2013-09-10  Yuri Rumyantsev  

* config/i386/x86-tune.def: Turn on X86_TUNE_AVOID_MEM_OPND_FOR_CMOVE
for SLM.
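For reference, the change amounts to adding m_SLM to the selector mask of
the corresponding entry in x86-tune.def, roughly as below (the description
string and any other selector bits are assumptions here, not the committed
hunk):

/* Sketch only: the flag forbids memory operands in cmov on Silvermont,
   forcing the operand to be loaded into a register first.  */
DEF_TUNE (X86_TUNE_AVOID_MEM_OPND_FOR_CMOVE, "avoid_mem_opnd_for_cmove",
          m_SLM)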


patch
Description: Binary data


[x86,PATCH] Simple fix for Atom LEA splitting.

2013-09-16 Thread Yuri Rumyantsev
Hi All,

Here is a one-line fix which improves Atom performance through better distance
estimation. We got a ~20% speedup on one of the benchmarks from the EEMBC 2.0
suite with this fix.

Bootstrapping and regression testing were successful for x86-64.

Is it OK for trunk?

ChangeLog:

2013-09-16  Yuri Rumyantsev  

* config/i386/i386.c (distance_agu_use_in_bb): Properly initialize
the 'prev' variable to get a better distance estimation.


patch
Description: Binary data


Re: [x86,PATCH] Simple fix for Atom LEA splitting.

2013-09-16 Thread Yuri Rumyantsev
Uros,

Thanks for your review.

I attached the modified patch - is it OK for you now?

2013/9/16 Uros Bizjak :
> On Mon, Sep 16, 2013 at 2:50 PM, Yuri Rumyantsev  wrote:
>> Hi All,
>>
>> Here is a one-line fix which improves Atom performance through better distance
>> estimation. We got a ~20% speedup on one of the benchmarks from the EEMBC 2.0
>> suite with this fix.
>>
>> Bootstrapping and regression testing were successful for x86-64.
>>
>> Is it OK for trunk?
>>
>> ChangeLog:
>>
>> 2013-09-16  Yuri Rumyantsev  
>>
>> * config/i386/i386.c (distance_agu_use_in_bb): Properly initialize
>> the 'prev' variable to get a better distance estimation.
>
> Please rewrite this part to:
>
>   basic_block bb = NULL;
>   rtx next = start;
>   rtx prev = NULL;
>
>   if (start != NULL_RTX)
> {
>   bb = BLOCK_FOR_INSN (start);
>   if (start != BB_HEAD (bb))
> prev = insn;
> }
>
> A comment would be also nice here. It took me quite some time to
> decipher the code.
>
> OK with these changes.
>
> Thanks,
> Uros.


patch1
Description: Binary data


Re: [x86,PATCH] Simple fix for Atom LEA splitting.

2013-09-17 Thread Yuri Rumyantsev
Here is the final patch with the fixed comment.

2013/9/16 Uros Bizjak :
> On Mon, Sep 16, 2013 at 5:01 PM, Yuri Rumyantsev  wrote:
>
>> Does this comment look good to you:
>>
>>   if (start != NULL_RTX)
>> {
>>   bb = BLOCK_FOR_INSN (start);
>>   if (start != BB_HEAD (bb))
>> /* Initialize prev to insn if insn and start belong to the same bb;
>>   in this case increase_distance can increment distance to 1.  */
>> prev = insn;
>
> I'd say something in the lines of:
>
> If insn and start belong to the same bb, set prev to insn, so the call
> to increase_distance will increase the distance between insns by 1.
>
> Best regards,
> Uros.
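For reference, the hunk in distance_agu_use_in_bb as it ends up after both
rounds of review would look roughly like this (reconstructed from the
snippets above; the committed version is in the attachment):

  basic_block bb = NULL;
  rtx next = start;
  rtx prev = NULL;

  if (start != NULL_RTX)
    {
      bb = BLOCK_FOR_INSN (start);
      if (start != BB_HEAD (bb))
        /* If insn and start belong to the same bb, set prev to insn,
           so the call to increase_distance will increase the distance
           between insns by 1.  */
        prev = insn;
    }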


fixed_patch
Description: Binary data


Additional fix for pre-reload schedule on x86 targets.

2012-10-23 Thread Yuri Rumyantsev
Hi All,

This fix is aimed at removing stability issues with using the pre-reload
scheduler for x86 targets, caused by cross-block motion of function
arguments passed in likely-spilled HW registers. We found one more
issue in the process of more detailed testing of the pre-reload scheduler
on all x86 platforms.

The fix was fully tested on all applicable suites and with GCC
bootstrapping with the pre-reload scheduler turned on.

Tested for i386 and x86-64, ok for trunk?

ChangeLog:

2012-10-23  Yuri Rumyantsev  

* config/i386/i386.c (insn_is_function_arg): Add check for CALL
instruction.
(ix86_dependencies_evaluation_hook): Insert dependencies in all
predecessors of call block for non-trivial region avoiding creation
of loop-carried dependency to avoid cross-block motion of HW registers.


i386-pre-reload-scheduler-fix.diff
Description: Binary data


[gomp4 simd, RFC] Simple fix to override vectorization cost estimation.

2013-10-31 Thread Yuri Rumyantsev
Hi All,

Here is a simple fix which allows vectorizing a loop marked with
'#pragma omp simd' even if the cost model tells us that vectorization is
not profitable.
I checked that it works as expected on a simple test case.

Is it Ok for trunk?

ChangeLog:

2013-10-31  Yuri Rumyantsev  

* tree-vect-loop.c (vect_estimate_min_profitable_iters): Override
cost estimation for loops marked as vectorizable.
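A minimal sketch of the override in vect_estimate_min_profitable_iters (the
exact flag checked on the loop - spelled force_vect here - and its placement
are assumptions, not the committed hunk):

  /* If the loop was explicitly marked with #pragma omp simd, report it
     as profitable for any iteration count instead of consulting the
     cost model.  */
  if (LOOP_VINFO_LOOP (loop_vinfo)->force_vect)
    {
      *ret_min_profitable_niters = 0;
      *ret_min_profitable_estimate = 0;
      return;
    }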


patch
Description: Binary data


Re: [PATCH PR68542]

2016-01-22 Thread Yuri Rumyantsev
Richard,

I fixed all the issues you pointed out in the vectorizer part of the patch.
Could you take a look at the modified patch?

Uros,

Could you please review the i386 part of the patch, related to support
for conditional branches with vector comparison.

Bootstrap and regression testing did not show any new failures.
Is it OK for trunk?

Thanks.
Yuri.

ChangeLog:

2016-01-22  Yuri Rumyantsev  

PR middle-end/68542
* config/i386/i386.c (ix86_expand_branch): Add support for conditional
branch with vector comparison.
* config/i386/sse.md (define_expand "cbranch<mode>4"): Add define_expand
for vector comparison with eq/ne only.
* tree-vect-loop.c (optimize_mask_stores): New function.
* tree-vect-stmts.c (vectorizable_mask_load_store): Initialize
has_mask_store field of vect_info.
* tree-vectorizer.c (vectorize_loops): Invoke optimize_mask_stores for
vectorized loops having masked stores after vec_info destroy.
* tree-vectorizer.h (loop_vec_info): Add new has_mask_store field and
corresponding macros.
(optimize_mask_stores): Add prototype.

gcc/testsuite/ChangeLog:
* gcc.dg/vect/vect-mask-store-move-1.c: New test.
* gcc.target/i386/avx2-vect-mask-store-move1.c: New test.

2016-01-20 15:24 GMT+03:00 Richard Biener :
> On Mon, Jan 18, 2016 at 3:50 PM, Yuri Rumyantsev  wrote:
>> Richard,
>>
>> Here is the second part of the patch which actually moves masked stores and
>> all statements related to them to a new basic block guarded by a test on
>> zero mask. A new test is also added.
>>
>> Is it OK for trunk?
>
> +  /* Pick up all masked stores in loop if any.  */
> +  for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
> +{
> +  stmt = gsi_stmt (gsi);
>
> you fail to iterate over all BBs of the loop here.  Please follow
> other uses in the
> vectorizer.
>
> +  while (!worklist.is_empty ())
> +{
> +  gimple *last, *last_store;
> +  edge e, efalse;
> +  tree mask;
> +  basic_block store_bb, join_bb;
> +  gimple_stmt_iterator gsi_to;
> +  /* tree arg3; */
>
> remove
>
> +  tree vdef, new_vdef;
> +  gphi *phi;
> +  bool first_dump;
> +  tree vectype;
> +  tree zero;
> +
> +  last = worklist.pop ();
> +  mask = gimple_call_arg (last, 2);
> +  /* Create new bb.  */
>
> bb should be initialized from gimple_bb (last), not loop->header
>
> +  e = split_block (bb, last);
>
> +   gsi_from = gsi_for_stmt (stmt1);
> +   gsi_to = gsi_start_bb (store_bb);
> +   gsi_move_before (&gsi_from, &gsi_to);
> +   update_stmt (stmt1);
>
> I think the update_stmt is redundant and you should be able to
> keep two gsis throughout the loop, from and to, no?
>
> +   /* Put other masked stores with the same mask to STORE_BB.  */
> +   if (worklist.is_empty ()
> +   || gimple_call_arg (worklist.last (), 2) != mask
> +   || !is_valid_sink (worklist.last (), last_store))
>
> as I understand the code the last check is redundant with the invariant
> you track if you verify the stmt you broke from the inner loop is
> actually equal to worklist.last () and you add a flag to track whether
> you did visit a load (vuse) in the sinking loop you didn't end up sinking.
>
> + /* Issue different messages depending on FIRST_DUMP.  */
> + if (first_dump)
> +   {
> + dump_printf_loc (MSG_NOTE, vect_location,
> +  "Move MASK_STORE to new bb#%d\n",
> +  store_bb->index);
> + first_dump = false;
> +   }
> + else
> +   dump_printf_loc (MSG_NOTE, vect_location,
> +"Move MASK_STORE to created bb\n");
>
> just add a separate dump when you create the BB, "Created new bb#%d for ..."
> to avoid this.
>
> Note that I can't comment on the x86 backend part so that will need to
> be reviewed by somebody
> else.
>
> Thanks,
> Richard.
>
>> Thanks.
>> Yuri.
>>
>> 2016-01-18  Yuri Rumyantsev  
>>
>> PR middle-end/68542
>> * config/i386/i386.c (ix86_expand_branch): Implement integral vector
>> comparison with boolean result.
>> * config/i386/sse.md (define_expand "cbranch<mode>4"): Add define_expand
>> for vector comparison with eq/ne only.
>> * tree-vect-loop.c (is_valid_sink): New function.
>> (optimize_mask_stores): Likewise.
>> * tree-vect-stmts.c (vectorizable_mask_load_store): Initialize
>> has_mask_store field of vect_info.
>> * tree-vectorizer.c (vectorize_loops): Invoke optimize_mask_stores for
>> vectorized loops having masked stores.

Re: [PATCH PR68542]

2016-01-28 Thread Yuri Rumyantsev
Thanks Richard.

Uros,

Could you please review the back-end part of this patch?

Thanks.
Yuri.

2016-01-28 16:26 GMT+03:00 Richard Biener :
> On Fri, Jan 22, 2016 at 3:29 PM, Yuri Rumyantsev  wrote:
>> Richard,
>>
>> I fixed all the issues you pointed out in the vectorizer part of the patch.
>> Could you take a look at the modified patch?
>
> +   if (is_gimple_call (stmt1))
> + lhs = gimple_call_lhs (stmt1);
> +   else
> + if (gimple_code (stmt1) == GIMPLE_ASSIGN)
> +   lhs = gimple_assign_lhs (stmt1);
> + else
> +   break;
>
> use
>
>   lhs = gimple_get_lhs (stmt1);
>   if (!lhs)
> break;
>
> you forget to free bbs, I suggest to do it after seeding the worklist.
>
> Ok with those changes if the backend changes are approved.
>
> Sorry for the delay in reviewing this.
>
> Thanks,
> Richard.
>
>> Uros,
>>
>> Could you please review i386 part of patch related to support of
>> conditional branches with vector comparison.
>>
>> Bootstrap and regression testing did not show any new failures.
>> Is it OK for trunk?
>>
>> Thanks.
>> Yuri.
>>
>> ChangeLog:
>>
>> 2016-01-22  Yuri Rumyantsev  
>>
>> PR middle-end/68542
>> * config/i386/i386.c (ix86_expand_branch): Add support for conditional
>> branch with vector comparison.
>> * config/i386/sse.md (define_expand "cbranch<mode>4"): Add define_expand
>> for vector comparison with eq/ne only.
>> * tree-vect-loop.c (optimize_mask_stores): New function.
>> * tree-vect-stmts.c (vectorizable_mask_load_store): Initialize
>> has_mask_store field of vect_info.
>> * tree-vectorizer.c (vectorize_loops): Invoke optimize_mask_stores for
>> vectorized loops having masked stores after vec_info destroy.
>> * tree-vectorizer.h (loop_vec_info): Add new has_mask_store field and
>> corresponding macros.
>> (optimize_mask_stores): Add prototype.
>>
>> gcc/testsuite/ChangeLog:
>> * gcc.dg/vect/vect-mask-store-move-1.c: New test.
>> * gcc.target/i386/avx2-vect-mask-store-move1.c: New test.
>>
>> 2016-01-20 15:24 GMT+03:00 Richard Biener :
>>> On Mon, Jan 18, 2016 at 3:50 PM, Yuri Rumyantsev  wrote:
>>>> Richard,
>>>>
>>>> Here is the second part of the patch which actually moves masked stores and
>>>> all statements related to them to a new basic block guarded by a test on
>>>> zero mask. A new test is also added.
>>>>
>>>> Is it OK for trunk?
>>>
>>> +  /* Pick up all masked stores in loop if any.  */
>>> +  for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
>>> +{
>>> +  stmt = gsi_stmt (gsi);
>>>
>>> you fail to iterate over all BBs of the loop here.  Please follow
>>> other uses in the
>>> vectorizer.
>>>
>>> +  while (!worklist.is_empty ())
>>> +{
>>> +  gimple *last, *last_store;
>>> +  edge e, efalse;
>>> +  tree mask;
>>> +  basic_block store_bb, join_bb;
>>> +  gimple_stmt_iterator gsi_to;
>>> +  /* tree arg3; */
>>>
>>> remove
>>>
>>> +  tree vdef, new_vdef;
>>> +  gphi *phi;
>>> +  bool first_dump;
>>> +  tree vectype;
>>> +  tree zero;
>>> +
>>> +  last = worklist.pop ();
>>> +  mask = gimple_call_arg (last, 2);
>>> +  /* Create new bb.  */
>>>
>>> bb should be initialized from gimple_bb (last), not loop->header
>>>
>>> +  e = split_block (bb, last);
>>>
>>> +   gsi_from = gsi_for_stmt (stmt1);
>>> +   gsi_to = gsi_start_bb (store_bb);
>>> +   gsi_move_before (&gsi_from, &gsi_to);
>>> +   update_stmt (stmt1);
>>>
>>> I think the update_stmt is redundant and you should be able to
>>> keep two gsis throughout the loop, from and to, no?
>>>
>>> +   /* Put other masked stores with the same mask to STORE_BB.  */
>>> +   if (worklist.is_empty ()
>>> +   || gimple_call_arg (worklist.last (), 2) != mask
>>> +   || !is_valid_sink (worklist.last (), last_store))
>>>
>>> as I understand the code the last check is redundant with the invariant
>>> you track if you verify the stmt you broke from the inner loop is
>>> actually equal to worklist.last () and you add a flag to track whether

Re: [off-list] Re: [PATCH PR68542]

2016-01-29 Thread Yuri Rumyantsev
Uros,

Here is an updated patch which includes (1) a couple of changes proposed by
Richard in tree-vect-loop.c and (2) the back-end changes proposed
by you.

Is it OK for trunk?
Bootstrap and regression testing did not show any new failures.

ChangeLog:

2016-01-29  Yuri Rumyantsev  

PR middle-end/68542
* config/i386/i386.c (ix86_expand_branch): Add support for conditional
branch with vector comparison.
* config/i386/sse.md (Vi48_AVX): New mode iterator.
(define_expand "cbranch<mode>4"): Add support for conditional branch
with vector comparison.
* tree-vect-loop.c (optimize_mask_stores): New function.
* tree-vect-stmts.c (vectorizable_mask_load_store): Initialize
has_mask_store field of vect_info.
* tree-vectorizer.c (vectorize_loops): Invoke optimize_mask_stores for
vectorized loops having masked stores after vec_info destroy.
* tree-vectorizer.h (loop_vec_info): Add new has_mask_store field and
corresponding macros.
(optimize_mask_stores): Add prototype.

gcc/testsuite/ChangeLog:
* gcc.dg/vect/vect-mask-store-move-1.c: New test.
* gcc.target/i386/avx2-vect-mask-store-move1.c: Likewise.
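For reference, the shape of the new ix86_expand_branch lowering, assembled
from the snippets in this thread (a sketch: the CCZmode/UNSPEC_PTEST details
are assumptions, and the committed hunk is in the attachment):

  /* EQ/NE comparison of two integer vectors: compute tmp = op0 ^ op1,
     test it for all-zero with ptest and branch on the zero flag.  */
  if (GET_MODE_CLASS (mode) == MODE_VECTOR_INT)
    {
      rtx flag = gen_rtx_REG (CCZmode, FLAGS_REG);
      rtx tmp = gen_reg_rtx (mode);

      gcc_assert (code == EQ || code == NE);
      /* We cannot rely on one operand being a zero vector, so XOR the
         operands and test the result instead.  */
      emit_insn (gen_rtx_SET (tmp, gen_rtx_XOR (mode, op0, op1)));
      emit_insn (gen_rtx_SET (flag,
                              gen_rtx_UNSPEC (CCZmode,
                                              gen_rtvec (2, tmp, tmp),
                                              UNSPEC_PTEST)));
      /* "code" itself selects je vs. jne, so the arms need no swap.  */
      tmp = gen_rtx_fmt_ee (code, VOIDmode, flag, const0_rtx);
      tmp = gen_rtx_IF_THEN_ELSE (VOIDmode, tmp,
                                  gen_rtx_LABEL_REF (VOIDmode, label),
                                  pc_rtx);
      emit_jump_insn (gen_rtx_SET (pc_rtx, tmp));
      return;
    }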

2016-01-29 15:26 GMT+03:00 Uros Bizjak :
> On Fri, Jan 29, 2016 at 1:20 PM, Yuri Rumyantsev  wrote:
>> Uros,
>>
>> Thanks for your comments.
>> I deleted swap of operands as you told.
>> Let me explain my point in adding support for conditional branches
>> with vector comparison.
>> This feature is used to put vectorized masked stores and their
>> producers under a guard that checks that the mask is not zero, i.e. if the mask,
>> which is the result of other vector computations, is zero we don't need to
>> execute the corresponding masked store and its producers if they don't
>> have other uses. It means that only integer 128-bit and 256-bit
>> vectors must be accepted as operands of cbranch. I did not introduce a
>> new iterator but simply used the existing iterator V48_AVX2. BTW you
>> proposed to add a new iterator VI_AVX but it would be better to add
>> VI48_AVX as
>>
>> (define_mode_iterator Vi48_AVX
>> [(V4SI "TARGET_AVX") (V2DI "TARGET_AVX")
>> (V8SI "TARGET_AVX") (V4DI "TARGET_AVX")])
>>
>> I also don't think that we need to add support in expand_compare since
>> such comparisons are not  generated.
>
> OK with me. If there is no need for cstore pattern, then the
> comparison can be integrated with existing code in expand_branch by
> using ""goto simple" as is already the case there.
>
> BR,
> Uros.
>
>> 2016-01-28 20:08 GMT+03:00 Uros Bizjak :
>>> Yuri,
>>>
>>> please find attached a target-dependent patch that illustrates my
>>> review remarks. The patch is lightly tested, and it produces desired
>>> ptest insns on the testcases you provided.
>>>
>>> Some further remarks:
>>>
>>> +  tmp = gen_rtx_fmt_ee (code, VOIDmode, flag, const0_rtx);
>>> +  if (code == EQ)
>>> +tmp = gen_rtx_IF_THEN_ELSE (VOIDmode, tmp,
>>> +gen_rtx_LABEL_REF (VOIDmode, label), pc_rtx);
>>> +  else
>>> +tmp = gen_rtx_IF_THEN_ELSE (VOIDmode, tmp,
>>> +pc_rtx, gen_rtx_LABEL_REF (VOIDmode, label));
>>> +  emit_jump_insn (gen_rtx_SET (pc_rtx, tmp));
>>> +  return;
>>>
>>> The above code is IMO wrong. You don't need to swap the arms of the
>>> target, since "code" will generate je or jne. Please see the attached
>>> patch.
>>>
>>> BTW: Maybe we can introduce a corresponding cstore pattern to use ptest
>>> in order to more efficiently vectorize code like:
>>>
>>> --cut here--
>>> int a[256];
>>>
>>> int foo (void)
>>> {
>>>   int ret = 0;
>>>   int i;
>>>
>>>   for (i = 0; i < 256; i++)
>>> {
>>>   if (a[i] != 0)
>>> ret = 1;
>>> }
>>>   return ret;
>>> }
>>> --cut here--
>>>
>>> Uros.


PR68542.patch.3
Description: Binary data


[PATCH PR69652, Regression]

2016-02-04 Thread Yuri Rumyantsev
Hi All,

Here is a patch that cures the issues with incorrect vuse for scalar
statements during code motion, i.e. if the vuse of a scalar statement is
the vdef of a masked store which has been sunk to a new basic block, we must
fix it up.  The patch also fixes almost all remarks pointed out by
Jakub.

Bootstrapping and regression testing on x86-64 did not show any new failures.
Is it OK for trunk?

ChangeLog:
2016-02-04  Yuri Rumyantsev  

PR tree-optimization/69652
* tree-vect-loop.c (optimize_mask_stores): Move declaration of STMT1
to nested loop, introduce new SCALAR_VUSE vector to keep vuse of all
skipped scalar statements, introduce variable LAST_VUSE that has
vuse of LAST_STORE, add assertion that SCALAR_VUSE is empty at the
beginning of current masked store processing, do source re-formatting,
skip parsing of debug gimples, stop processing when a call or a gimple
with a volatile operand has been encountered, save scalar statement
with vuse in SCALAR_VUSE, skip processing debug statements in IMM_USE
iterator, change vuse of all saved scalar statements to LAST_VUSE if
it makes sense.

gcc/testsuite/ChangeLog:
* gcc.dg/torture/pr69652.c: New test.


patch
Description: Binary data


Re: [PATCH PR69652, Regression]

2016-02-05 Thread Yuri Rumyantsev
Hi All,

Here is the updated patch - I went back to moving call statements as well,
since masked loads are represented by internal calls. I also assume that for
the following simple loop
  for (i = 0; i < n; i++)
    if (b1[i])
      a1[i] = sqrtf(a2[i] * a2[i] + a3[i] * a3[i]);
motion must be done for all vector statements in the semi-hammock, including SQRT.

Bootstrap and regression testing did not show any new failures.
Is it OK for trunk?

ChangeLog:

2016-02-05  Yuri Rumyantsev  

PR tree-optimization/69652
* tree-vect-loop.c (optimize_mask_stores): Move declaration of STMT1
to nested loop, introduce new SCALAR_VUSE vector to keep vuse of all
skipped scalar statements, introduce variable LAST_VUSE to keep
vuse of LAST_STORE, add assertion that SCALAR_VUSE is empty at the
beginning of current masked store processing, do source re-formatting,
skip parsing of debug gimples, stop processing if a gimple with
volatile operand has been encountered, save scalar statement
with vuse in SCALAR_VUSE, skip processing debug statements in IMM_USE
iterator, change vuse of all saved scalar statements to LAST_VUSE if
it makes sense.

gcc/testsuite/ChangeLog:
* gcc.dg/torture/pr69652.c: New test.

2016-02-04 19:40 GMT+03:00 Jakub Jelinek :
> On Thu, Feb 04, 2016 at 05:46:27PM +0300, Yuri Rumyantsev wrote:
>> Here is a patch that cures the issues with incorrect vuse for scalar
>> statements during code motion, i.e. if the vuse of a scalar statement is
>> the vdef of a masked store which has been sunk to a new basic block, we must
>> fix it up.  The patch also fixes almost all remarks pointed out by
>> Jakub.
>>
>> Bootstrapping and regression testing on x86-64 did not show any new failures.
>> Is it OK for trunk?
>>
>> ChangeLog:
>> 2016-02-04  Yuri Rumyantsev  
>>
>> PR tree-optimization/69652
>> * tree-vect-loop.c (optimize_mask_stores): Move declaration of STMT1
>> to nested loop, introduce new SCALAR_VUSE vector to keep vuse of all
>> skipped scalar statements, introduce variable LAST_VUSE that has
>> vuse of LAST_STORE, add assertion that SCALAR_VUSE is empty at the
>> beginning of current masked store processing, do source re-formatting,
>> skip parsing of debug gimples, stop processing when a call or a gimple
>> with a volatile operand has been encountered, save scalar statement
>> with vuse in SCALAR_VUSE, skip processing debug statements in IMM_USE
>> iterator, change vuse of all saved scalar statements to LAST_VUSE if
>> it makes sense.
>>
>> gcc/testsuite/ChangeLog:
>> * gcc.dg/torture/pr69652.c: New test.
>
> Your mailer breaks ChangeLog formatting, so it is hard to check the
> formatting of the ChangeLog entry.
>
> diff --git a/gcc/testsuite/gcc.dg/torture/pr69652.c 
> b/gcc/testsuite/gcc.dg/torture/pr69652.c
> new file mode 100644
> index 000..91f30cf
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/torture/pr69652.c
> @@ -0,0 +1,14 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -ffast-math -ftree-vectorize " } */
> +/* { dg-additional-options "-mavx" { target { i?86-*-* x86_64-*-* } } } */
> +
> +void fn1(double **matrix, int column, int row, int n)
> +{
> +  int k;
> +  for (k = 0; k < n; k++)
> +if (matrix[row][k] != matrix[column][k])
> +  {
> +   matrix[column][k] = -matrix[column][k];
> +   matrix[row][k] = matrix[row][k] - matrix[column][k];
> +  }
> +}
> \ No newline at end of file
>
> Please make sure the last line of the test is a new-line.
>
> @@ -6971,6 +6972,8 @@ optimize_mask_stores (struct loop *loop)
>gsi_next (&gsi))
> {
>   stmt = gsi_stmt (gsi);
> + if (is_gimple_debug (stmt))
> +   continue;
>   if (is_gimple_call (stmt)
>   && gimple_call_internal_p (stmt)
>   && gimple_call_internal_fn (stmt) == IFN_MASK_STORE)
>
> This is not needed, you do something only for is_gimple_call,
> which is never true if is_gimple_debug, so the code used to be fine as is.
>
> + /* Skip debug sstatements.  */
>
> s/ss/s/
>
> + if (is_gimple_debug (gsi_stmt (gsi)))
> +   continue;
> + stmt1 = gsi_stmt (gsi);
> + /* Do not consider writing to memory,volatile and call
>
> Missing space after ,
>
> + /* Skip scalar statements.  */
> + if (!VECTOR_TYPE_P (TREE_TYPE (lhs)))
> +   {
> + /* If scalar statement has vuse we need to modify it
> +when another masked store will be sunk.  */
> + if (gimple_vuse (stmt1))
> +   scalar_vuse.safe_push (stmt1);
>   continue;
> + 

Re: [Patch] Gate vect-mask-store-move-1.c correctly, and actually output the dump

2016-02-08 Thread Yuri Rumyantsev
Hi James,

Thanks for reporting this issue.
I prepared a slightly different patch, since we don't need to add the
tree-vect dump option - it is on by default for all tests in the /vect
directory.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/vect-mask-store-move-1.c: Gate dump with x86 target.
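The resulting directive would look roughly like this (the expected count is
an assumption; the dump string itself comes from the pass, see the review
thread above):

/* { dg-final { scan-tree-dump-times "Move MASK_STORE to new bb" 1 "vect" { target { i?86-*-* x86_64-*-* } } } } */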

2016-02-08 16:07 GMT+03:00 James Greenhalgh :
>
> Hi,
>
> As far as I can tell, this testcase will only vectorize for x86_64/i?86
> targets, so it should be gated to only check for vectorization on those.
>
> Additionally, this test wants to scan the vectorizer dumps, so we ought
> to add -fdump-tree-vect-all to the options.
>
> Checked on aarch64 (cross/native) and x86 with no issues.
>
> OK?
>
> Thanks,
> James
>
> ---
> 2016-02-08  James Greenhalgh  
>
> * gcc.dg/vect/vect-mask-store-move-1.c: Add dump option, and gate
> check on x86_64/i?86.
>


patch-test
Description: Binary data


Re: [Patch] Gate vect-mask-store-move-1.c correctly, and actually output the dump

2016-02-08 Thread Yuri Rumyantsev
Sorry for the trouble.
One line must be excluded from test:
-/* { dg-options "-O3" } */

Here is the updated patch.

Best regards.
Yuri.

2016-02-08 16:40 GMT+03:00 James Greenhalgh :
> On Mon, Feb 08, 2016 at 04:29:31PM +0300, Yuri Rumyantsev wrote:
>> Hi James,
>>
>> Thanks for reporting this issue.
>> I prepared a slightly different patch, since we don't need to add the
>> tree-vect dump option - it is on by default for all tests in the /vect
>> directory.
>
> Hm, I added that line as my test runs were showing:
>
>   UNRESOLVED: gcc.dg/vect/vect-mask-store-move-1.c: dump file does not exist
>
> I would guess the explicit
>
>   /* { dg-options "-O3" } */
>
> is clobbering the vect.exp setup of flags?
>
> This also affects the x86-64 results H.J. Lu is sending out:
>
>   https://gcc.gnu.org/ml/gcc-testresults/2016-02/msg00824.html
>
> Thanks,
> James
>
>>
>> gcc/testsuite/ChangeLog:
>>
>> * gcc.dg/vect/vect-mask-store-move-1.c: Gate dump with x86 target.
>>
>> 2016-02-08 16:07 GMT+03:00 James Greenhalgh :
>> >
>> > Hi,
>> >
>> > As far as I can tell, this testcase will only vectorize for x86_64/i?86
>> > targets, so it should be gated to only check for vectorization on those.
>> >
>> > Additionally, this test wants to scan the vectorizer dumps, so we ought
>> > to add -fdump-tree-vect-all to the options.
>> >
>> > Checked on aarch64 (cross/native) and x86 with no issues.
>> >
>> > OK?
>> >
>> > Thanks,
>> > James
>> >
>> > ---
>> > 2016-02-08  James Greenhalgh  
>> >
>> > * gcc.dg/vect/vect-mask-store-move-1.c: Add dump option, and gate
>> > check on x86_64/i?86.
>> >
>
>


patch-test
Description: Binary data


Re: [PATCH PR69652, Regression]

2016-02-10 Thread Yuri Rumyantsev
Thanks, Richard, for your comments.
I changed the algorithm to remove dead scalar statements as you proposed.

Bootstrap and regression testing did not show any new failures on x86-64.
Is it OK for trunk?

Changelog:
2016-02-10  Yuri Rumyantsev  

PR tree-optimization/69652
* tree-vect-loop.c (optimize_mask_stores): Move declaration of STMT1
to nested loop, do source re-formatting, skip debug statements,
add check on statements with volatile operands, remove dead scalar
statements.

gcc/testsuite/ChangeLog:
* gcc.dg/torture/pr69652.c: New test.


2016-02-09 15:33 GMT+03:00 Richard Biener :
> On Fri, Feb 5, 2016 at 3:54 PM, Yuri Rumyantsev  wrote:
>> Hi All,
>>
>> Here is the updated patch - I went back to moving call statements as well,
>> since masked loads are represented by internal calls. I also assume that for
>> the following simple loop
>>   for (i = 0; i < n; i++)
>>     if (b1[i])
>>       a1[i] = sqrtf(a2[i] * a2[i] + a3[i] * a3[i]);
>> motion must be done for all vector statements in the semi-hammock, including SQRT.
>>
>> Bootstrap and regression testing did not show any new failures.
>> Is it OK for trunk?
>
> The patch is incredibly hard to parse due to the re-indenting.  Please
> consider sending
> diffs with -b.
>
> This issue exposes that you are moving (masked) stores across loads without
> checking aliasing.  In the specific case those loads are dead and thus
> this is safe
> but in general I thought we were checking that we are using the same VUSE
> during the sinking operation.
>
> Thus, I'd rather have
>
> + /* Check that LHS does not have uses outside of STORE_BB.  */
> + res = true;
> + FOR_EACH_IMM_USE_FAST (use_p, imm_iter, lhs)
> +   {
> + gimple *use_stmt;
> + use_stmt = USE_STMT (use_p);
> + if (is_gimple_debug (use_stmt))
> +   continue;
> + if (gimple_bb (use_stmt) != store_bb)
> +   {
> + res = false;
> + break;
> +   }
> +   }
>
> also check for the dead code case and DCE those stmts here.  Like so:
>
>    if (has_zero_uses (lhs))
> {
>   gsi_remove (&gsi_from, true);
>   continue;
> }
>
> before the above loop.
>
> Richard.
>
>> ChangeLog:
>>
>> 2016-02-05  Yuri Rumyantsev  
>>
>> PR tree-optimization/69652
>> * tree-vect-loop.c (optimize_mask_stores): Move declaration of STMT1
>> to nested loop, introduce new SCALAR_VUSE vector to keep vuse of all
>> skipped scalar statements, introduce variable LAST_VUSE to keep
>> vuse of LAST_STORE, add assertion that SCALAR_VUSE is empty at the
>> beginning of current masked store processing, do source re-formatting,
>> skip parsing of debug gimples, stop processing if a gimple with
>> volatile operand has been encountered, save scalar statement
>> with vuse in SCALAR_VUSE, skip processing debug statements in IMM_USE
>> iterator, change vuse of all saved scalar statements to LAST_VUSE if
>> it makes sense.
>>
>> gcc/testsuite/ChangeLog:
>> * gcc.dg/torture/pr69652.c: New test.
>>
>> 2016-02-04 19:40 GMT+03:00 Jakub Jelinek :
>>> On Thu, Feb 04, 2016 at 05:46:27PM +0300, Yuri Rumyantsev wrote:
>>>> Here is a patch that cures the issues with incorrect vuse for scalar
>>>> statements during code motion, i.e. if the vuse of a scalar statement is
>>>> the vdef of a masked store which has been sunk to a new basic block, we must
>>>> fix it up.  The patch also fixes almost all remarks pointed out by
>>>> Jakub.
>>>>
>>>> Bootstrapping and regression testing on x86-64 did not show any new
>>>> failures.
>>>> Is it OK for trunk?
>>>>
>>>> ChangeLog:
>>>> 2016-02-04  Yuri Rumyantsev  
>>>>
>>>> PR tree-optimization/69652
>>>> * tree-vect-loop.c (optimize_mask_stores): Move declaration of STMT1
>>>> to nested loop, introduce new SCALAR_VUSE vector to keep vuse of all
>>>> skipped scalar statements, introduce variable LAST_VUSE that has
>>>> vuse of LAST_STORE, add assertion that SCALAR_VUSE is empty at the
>>>> beginning of current masked store processing, do source re-formatting,
>>>> skip parsing of debug gimples, stop processing when a call or a gimple
>>>> with a volatile operand has been encountered, save scalar statement
>>>> with vuse in SCALAR_VUSE, skip processing debug statements in IMM_USE
>>>> iterator, change vuse of all saved scalar stat

[PATCH PR69942] Fix test problem

2016-02-29 Thread Yuri Rumyantsev
Hi All,

Here is a simple patch for the gcc.dg/ifcvt5.c test - also detect the "6 basic
blocks" string in the rtl dump, to accept speculative motion of the
else-part of an if-stmt before the test-part (aka IF-CASE-2).

Is it OK for trunk?

ChangeLog:
2016-02-29  Yuri Rumyantsev  

PR rtl-optimization/69942
gcc/testsuite/ChangeLog:
* gcc.dg/ifcvt5.c: Also detect '6 basic blocks' in the rtl dump.
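A sketch of the intended directive change (the dump pass name "ce1" and the
pre-existing count are assumptions, not the committed hunk):

/* Before: only the plain if-converted shape was accepted.  */
/* { dg-final { scan-rtl-dump "5 basic blocks" "ce1" } } */
/* After: also accept the 6-basic-block shape produced by IF-CASE-2.  */
/* { dg-final { scan-rtl-dump "[56] basic blocks" "ce1" } } */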


patch
Description: Binary data


Re: [PATCH PR69652, Regression]

2016-02-29 Thread Yuri Rumyantsev
This test simply checks that an ICE does not occur; it does not check any
vectorization behavior.

Best regards.
Yuri.

2016-02-28 20:29 GMT+03:00 H.J. Lu :
> On Wed, Feb 10, 2016 at 2:26 AM, Yuri Rumyantsev  wrote:
>> Thanks Richard for your comments.
>> I changed the algorithm to remove dead scalar statements as you proposed.
>>
>> Bootstrap and regression testing did not show any new failures on x86-64.
>> Is it OK for trunk?
>>
>> Changelog:
>> 2016-02-10  Yuri Rumyantsev  
>>
>> PR tree-optimization/69652
>> * tree-vect-loop.c (optimize_mask_stores): Move declaration of STMT1
>> to nested loop, do source re-formatting, skip debug statements,
>> add check on statements with volatile operands, remove dead scalar
>> statements.
>>
>> gcc/testsuite/ChangeLog:
>> * gcc.dg/torture/pr69652.c: New test.
>>
>>
>
> /* { dg-do compile } */
> /* { dg-options "-O2 -ffast-math -ftree-vectorize " } */
> ^^^
>
> Any particular reason why it should be in gcc.dg/torture, not in
> gcc.dg/vect? -O2 here is overridden by -Ox.
>
> /* { dg-additional-options "-mavx" { target { i?86-*-* x86_64-*-* } } } */
>
>
>
> --
> H.J.


Re: [PATCH PR69652, Regression]

2016-02-29 Thread Yuri Rumyantsev
Jakub!

Here is a patch and ChangeLog to move pr69652.c to the /vect directory.

Is it OK for trunk?

Thanks.
Yuri.

ChangeLog:
2016-02-29  Yuri Rumyantsev  

PR tree-optimization/69652
gcc/testsuite/ChangeLog:
* gcc.dg/torture/pr69652.c: Delete test.
* gcc.dg/vect/pr69652.c: New test.




2016-02-29 16:05 GMT+03:00 Jakub Jelinek :
> On Mon, Feb 29, 2016 at 05:01:52AM -0800, H.J. Lu wrote:
>> On Mon, Feb 29, 2016 at 3:53 AM, Yuri Rumyantsev  wrote:
>> > This test simply checks that ICE is not occurred but not any
>> > vectorization issues.
>>
>> Can we remove
>>
>>  /* { dg-options "-O2 -ffast-math -ftree-vectorize " } */
>>
>> then?
>
> Well, I bet -ffast-math -ftree-vectorize are needed to reproduce the ICE
> with broken compiler.  But, e.g. -O0 -ftree-vectorize doesn't make much
> sense to test.
> So, either put it into gcc.dg/pr69652.c with the above mentioned options,
> or into gcc.dg/vect/ with dg-additional-options "-ffast-math".
>
> Jakub


patch.2
Description: Binary data


[ipa PATCH] Fix bug introduced by r202567

2013-12-02 Thread Yuri Rumyantsev
Hi All,

Attached is an evident fix found in the process of investigating PR 58721.
Note that this fix does not resolve it.

Is it OK for trunk?

ChangeLog:

2013-11-02  Yuri Rumyantsev  

* ipa-inline.c (check_callers): Add missing pointer dereference.


ipa-inline-fix
Description: Binary data


[x86 PATCH] Fix cost model for Silvermont.

2013-12-13 Thread Yuri Rumyantsev
Hi All,

Attached is a one-line fix which reduces the cost of integer
multiplication for HI mode, since the imul instruction design was improved
in SLM (in comparison with Atom).

Is it OK for trunk?

ChangeLog:

2013-12-13  Yuri Rumyantsev  
* config/i386/i386.c (slm_cost): Fix imul cost for HI.
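For reference, the shape of the affected entries in the slm_cost table in
config/i386/i386.c (the concrete values below are placeholders, not the
committed numbers; only the HImode line changes):

  COSTS_N_INSNS (3),   /* cost of starting multiply for QImode */
  COSTS_N_INSNS (3),   /* HImode - lowered by this fix, since imul
                          was improved on SLM relative to Atom */
  COSTS_N_INSNS (3),   /* SImode */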


x86-SLM-imul-fix
Description: Binary data


[PATCH] Unswitching outer loops.

2015-07-10 Thread Yuri Rumyantsev
Hi All,

Here is a simple transformation which tries to hoist a check on zero trip
count for the inner loop out of the outer loop. This is a very restricted
transformation since it accepts only outer loops with a very simple cfg,
as for example:
acc = 0;
for (i = 1; i <= m; i++) {
  for (j = 0; j < n; j++)
    if (l[j] == i) { v[j] = acc; acc++; }
  acc <<= 1;
}

Note that a degenerate outer loop (without the inner loop) will be
completely deleted as dead code.
The main goal of this transformation was to convert the outer loop to a form
accepted by outer-loop vectorization (such a test case is also included
in the patch).
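The intended effect, sketched by hand (not compiler output): the
zero-trip-count check on the inner loop becomes a guard hoisted above the
outer loop:

acc = 0;
if (n > 0)                      /* hoisted guard */
  for (i = 1; i <= m; i++) {
    for (j = 0; j < n; j++)
      if (l[j] == i) { v[j] = acc; acc++; }
    acc <<= 1;
  }
/* The outer-loop copy for n <= 0 only shifts acc, which is still zero,
   so it dies as dead code.  */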

Bootstrap and regression testing did not show any new failures.

Is it OK for trunk?

ChangeLog:
2015-07-10  Yuri Rumyantsev  

* tree-ssa-loop-unswitch.c: Include "tree-cfgcleanup.h" and
"gimple-iterator.h", add prototype for tree_unswitch_outer_loop.
(tree_ssa_unswitch_loops): Add invoke of tree_unswitch_outer_loop.
(tree_unswitch_outer_loop): New function.

gcc/testsuite/ChangeLog:
* gcc.dg/tree-ssa/unswitch-outer-loop-1.c: New test.
* gcc.dg/vect/vect-outer-simd-3.c: New test.


patch.3
Description: Binary data


Re: [PATCH] Simple optimization for MASK_STORE.

2015-07-20 Thread Yuri Rumyantsev
Hi Jeff!

Thanks for your detailed comments.

You asked:
How are conditionals on vectors usually handled?  You should try to
mimic that and report, with detail, why that's not working.

In the current implementation of the vectorization pass all comparisons are
handled through COND_EXPR, i.e. the vect-pattern pass transforms all
comparisons producing bool values into conditional expressions like
a[i] != 0  -->  a[i] != 0 ? 1 : 0, which the vectorizer transforms into a
vector COND_EXPR. So we don't have operations with vector operands and a
scalar (bool) result.
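The distinction is easy to see with GNU C vector extensions (a hand-written
analogy, not vectorizer output): a vector comparison yields a vector of
per-lane results, so a single scalar bool only falls out after an explicit
reduction:

typedef int v4si __attribute__ ((vector_size (16)));

int
all_zero_p (v4si m)
{
  v4si zero = { 0, 0, 0, 0 };
  v4si cmp = m != zero;   /* per lane: 0 or -1 - still a vector */
  return cmp[0] == 0 && cmp[1] == 0 && cmp[2] == 0 && cmp[3] == 0;
}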
To implement such operations I introduced a target hook. Could you
propose another way to implement it?

Thanks.
Yuri.

2015-07-10 8:51 GMT+03:00 Jeff Law :
> On 06/18/2015 08:32 AM, Yuri Rumyantsev wrote:
>>
>> Richard,
>>
>> Here is updated patch which does not include your proposal related to
>> the target hook deletion.
>> You wrote:
>>>
>>> I still don't understand why you need the new target hook.  If we have a
>>> masked
>>> load/store then the mask is computed by an assignment with a
>>> VEC_COND_EXPR
>>> (in your example) and thus a test for a zero mask is expressible as just
>>>
>>   > if (vect__ifc__41.17_167 == { 0, 0, 0, 0... })
>>>
>>>
>>> or am I missing something?
>>
>> Such a vector compare produces a vector and does not set up the cc flags
>> required for a conditional branch (GIMPLE_COND).
>> If we use such a comparison for GIMPLE_COND we get an error message, so I
>> used a target hook which does set up the cc flags (aka the ptest
>> instruction), and I left this part.
>
> I think we need to look for something better.  I really don't like the idea
> of using a target hook like this within the gimple optimizers unless it's
> absolutely necessary.
>
> How are conditionals on vectors usually handled?  You should try to mimic
> that and report, with detail, why that's not working.
>
> I'm also not happy with the mechanisms to determine whether or not we should
> make this transformation.  I'm willing to hash through other details and
> come back to this issue once we've got some of the other stuff figured out.
> I guess a new flag with the target adjusting is the fallback if we can't
> come up with some reasonable way to select this on or off.
>
> The reason I don't like having the target files make this kind of decision
> is it makes more gimple transformations target dependent. Based on the
> history with RTL, that ultimately leads to an unwieldy mess.
>
> And yes, I know gimple isn't 100% target independent -- but that doesn't
> mean we should keep adding more target dependencies.  Each one we add needs
> to be well vetted.
>
>
> patch.3
>
> diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
> index 44a8624..e90de32 100644
> --- a/gcc/config/i386/i386.c
> +++ b/gcc/config/i386/i386.c
> @@ -100,6 +100,8 @@ along with GCC; see the file COPYING3.  If not see
>  #include "tree-iterator.h"
>  #include "tree-chkp.h"
>  #include "rtl-chkp.h"
> +#include "stringpool.h"
> +#include "tree-ssanames.h"
> So ideally we'll figure out why you're unable to generate a suitable
> conditional at the gimple level and the need for the x86 backend to generate
> the vector zero test will go away.  And if it does go away, we want those
> #includes to disappear.
>
> Additionally, instead of including stringpool.h & tree-ssanames.h, include
> "ssa.h" -- as a general rule.
>
>
>  static rtx legitimize_dllimport_symbol (rtx, bool);
>  static rtx legitimize_pe_coff_extern_decl (rtx, bool);
> @@ -41100,6 +41102,47 @@ ix86_vectorize_builtin_gather (const_tree
> mem_vectype,
>return ix86_get_builtin (code);
>  }
>
> +/* Returns true if SOURCE type is supported by builtin ptest.
> +   NAME is lhs of created ptest call.  All created statements are added
> +   to GS.  */
> +
> +static bool
> +ix86_vectorize_build_zero_vector_test (tree source, tree name,
>
> Given the stated goal of not doing this in the target files, I'm not going
> to actually look at this routine or any of the infrastructure for this
> target hook.
>
>
> diff --git a/gcc/params.def b/gcc/params.def
> index 3e4ba3a..9e8de11 100644
> --- a/gcc/params.def
> +++ b/gcc/params.def
> @@ -1069,6 +1069,12 @@ DEFPARAM (PARAM_MAX_STORES_TO_SINK,
>"Maximum number of conditional store pairs that can be sunk",
>2, 0, 0)
>
> +/* Enable inserion test on zero mask for masked stores if non-zero.  */
> s/inserion/insertion/
>
> +DEFPARAM (PARAM_ZERO_TEST_FOR_STORE_MASK,
> +  

[PATCH PR66926,PR66951] Simple fix for ICE.

2015-07-22 Thread Yuri Rumyantsev
Hi All,

Here is a simple fix for PR66926 and PR66951 - it fixes the condition for
renaming virtual operands that determines whether a statement is outside of
the loop.

Bootstrap and regression testing did not show any new failures.

Is it OK for trunk?

gcc/ChangeLog
2015-07-22  Yuri Rumyantsev  

PR tree-optimization/66926
PR tree-optimization/66951
* tree-vect-loop-manip.c (slpeel_tree_peel_loop_to_edge): Delete
INNER_LOOP and fix up condition for renaming virtual operands.


gcc/testsuite/ChangeLog
* gcc.dg/vect/pr66951.c: New test.


patch
Description: Binary data


Re: [PATCH] Unswitching outer loops.

2015-07-23 Thread Yuri Rumyantsev
Hi Richard,

I checked that both test-cases from 23855 are successfully unswitched
by the proposed patch. I understand that it does not catch a deeper loop
nest such as
   for (i=0; i<10; i++)
     for (j=0; j<...; j++)
       for (k=0; k<20; k++)
         ...
but duplication of the middle loop does not look reasonable.

Here is the dump for your second test-case:

void foo(int *ie, int *je, double *x)
{
  int i, j;
  for (j=0; j<*je; ++j)
    for (i=0; i<*ie; ++i)
      x[i+j] = 0.0;
}
grep -i unswitch t6.c.119t.unswitch
;; Unswitching outer loop

2015-07-14 14:06 GMT+03:00 Richard Biener :
> On Fri, Jul 10, 2015 at 12:02 PM, Yuri Rumyantsev  wrote:
>> Hi All,
>>
>> Here is presented simple transformation which tries to hoist out of
>> outer-loop a check on zero trip count for inner-loop. This is very
>> restricted transformation since it accepts outer-loops with very
>> simple cfg, as for example:
>> acc = 0;
>>for (i = 1; i <= m; i++) {
>>   for (j = 0; j < n; j++)
>>  if (l[j] == i) { v[j] = acc; acc++; };
>>   acc <<= 1;
>>}
>>
>> Note that degenerative outer loop (without inner loop) will be
>> completely deleted as dead code.
>> The main goal of this transformation was to convert outer-loop to form
>> accepted by outer-loop vectorization (such test-case is also included
>> to patch).
>>
>> Bootstrap and regression testing did not show any new failures.
>>
>> Is it OK for trunk?
>
> I think this is
>
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=23855
>
> as well.  It has a patch adding a invariant loop guard hoisting
> phase to loop-header copying.  Yeah, it needs updating to
> trunk again I suppose.  It's always non-stage1 when I come
> back to that patch.
>
> Your patch seems to be very specific and only handles outer
> loops of innermost loops.
>
> Richard.
>
>> ChangeLog:
>> 2015-07-10  Yuri Rumyantsev  
>>
>> * tree-ssa-loop-unswitch.c: Include "tree-cfgcleanup.h" and
>> "gimple-iterator.h", add prototype for tree_unswitch_outer_loop.
>> (tree_ssa_unswitch_loops): Add invoke of tree_unswitch_outer_loop.
>> (tree_unswitch_outer_loop): New function.
>>
>> gcc/testsuite/ChangeLog:
>> * gcc.dg/tree-ssa/unswitch-outer-loop-1.c: New test.
>> * gcc.dg/vect/vect-outer-simd-3.c: New test.


Re: [PATCH] Simple optimization for MASK_STORE.

2015-07-24 Thread Yuri Rumyantsev
Jeff,

The goal of this transformation is to not execute a masked store when
possible, i.e. when its mask is equal to the zero vector, and a conditional
expression is not suitable for that - we must generate a semi-hammock with
a conditional branch.

Yuri.

2015-07-23 23:03 GMT+03:00 Jeff Law :
> On 07/20/2015 09:05 AM, Yuri Rumyantsev wrote:
>>
>> Hi Jeff!
>>
>> Thanks for your details comments.
>>
>> You asked:
>> How are conditionals on vectors usually handled?  You should try to
>> mimic that and report, with detail, why that's not working.
>>
>> In current implementation of vectorization pass all comparisons are
>> handled through COND_EXPR, i.e. vect-pattern pass transforms all
>> comparisons producing bool values to conditional expressions like a[i]
>> != 0  --> a[i]!=0? 1: 0 which vectorizers transforms to
>> vect-cond-expr. So we don't have operations with vector operands and
>> scalar (bool) result.
>> To implement such operations I introduced target-hook. Could you
>> propose another solution implementing it?
>
> Is there any rationale given anywhere for the transformation into
> conditional expressions?  ie, is there any reason why we can't have a
> GIMPLE_COND where the expression is a vector condition?
>
> Thanks,
>
> Jeff
>


Re: [PATCH] Unswitching outer loops.

2015-07-31 Thread Yuri Rumyantsev
Hi Richard,

I studied your updated patch for 23855 and it is more general in
comparison with mine.
I'd like to propose a compromise - let's consider my patch only
for force-vectorized outer loops, to allow outer-loop
vectorization. Note that your approach will not hoist invariant
guards if the loop contains something else besides the inner loop, i.e. it
won't be empty for the taken branch.
I also would like to answer your last question - CFG cleanup is
invoked to perform deletion of single-argument phi nodes from the tail
block through substitution - such phis prevent outer-loop
vectorization. But it is clear that such a transformation can be done by
another pass.

What is your opinion?

Yuri.

2015-07-28 13:50 GMT+03:00 Richard Biener :
> On Thu, Jul 23, 2015 at 4:45 PM, Yuri Rumyantsev  wrote:
>> Hi Richard,
>>
>> I checked that both test-cases from 23855 are successfully unswitched
>> by the proposed patch. I understand that it does not catch a deeper loop
>> nest such as
>>    for (i=0; i<10; i++)
>>      for (j=0; j<...; j++)
>>        for (k=0; k<20; k++)
>>   ...
>> but duplication of middle-loop does not look reasonable.
>>
>> Here is dump for your second test-case:
>>
>> void foo(int *ie, int *je, double *x)
>> {
>>   int i, j;
>>   for (j=0; j<*je; ++j)
>>     for (i=0; i<*ie; ++i)
>>       x[i+j] = 0.0;
>> }
>> grep -i unswitch t6.c.119t.unswitch
>> ;; Unswitching outer loop
>
> I was saying: why go with a limited approach when a patch (in
> unknown state...)
> is available that does it more generally?  Also unswitching is quite
> expensive compared
> to "moving" the invariant condition.
>
> In your patch:
>
> +  if (!nloop->force_vectorize)
> +nloop->force_vectorize = true;
> +  if (loop->safelen != 0)
> +nloop->safelen = loop->safelen;
>
> I see no guard on force_vectorize so = true looks bogus here.  Please just use
> copy_loop_info.
>
> +  if (integer_nonzerop (cond_new))
> +gimple_cond_set_condition_from_tree (cond_stmt, boolean_true_node);
> +  else if (integer_zerop (cond_new))
> +gimple_cond_set_condition_from_tree (cond_stmt, boolean_false_node);
>
> gimple_cond_make_true/false (cond_stmt);
>
> btw, seems odd that we have to recompute which loop is the true / false 
> variant
> when we just fed a guard condition to loop_version.  Can't we statically
> determine whether loop or nloop has the in-loop condition true or false?
>
> +  /* Clean-up cfg to remove useless one-argument phi in exit block of
> + outer-loop.  */
> +  cleanup_tree_cfg ();
>
> I know unswitching is already O(number-of-unswitched-loops * size-of-function)
> because it updates SSA form after each individual unswitching (and it does 
> that
> because it invokes itself recursively on unswitched loops).  But do you really
> need to invoke CFG cleanup here?
>
> Richard.
>
>> Yuri.
>>
>> 2015-07-14 14:06 GMT+03:00 Richard Biener :
>>> On Fri, Jul 10, 2015 at 12:02 PM, Yuri Rumyantsev  
>>> wrote:
>>>> Hi All,
>>>>
>>>> Here is presented simple transformation which tries to hoist out of
>>>> outer-loop a check on zero trip count for inner-loop. This is very
>>>> restricted transformation since it accepts outer-loops with very
>>>> simple cfg, as for example:
>>>> acc = 0;
>>>> for (i = 1; i <= m; i++) {
>>>>   for (j = 0; j < n; j++)
>>>>     if (l[j] == i) { v[j] = acc; acc++; }
>>>>   acc <<= 1;
>>>> }
>>>>
>>>> Note that degenerative outer loop (without inner loop) will be
>>>> completely deleted as dead code.
>>>> The main goal of this transformation was to convert outer-loop to form
>>>> accepted by outer-loop vectorization (such test-case is also included
>>>> to patch).
>>>>
>>>> Bootstrap and regression testing did not show any new failures.
>>>>
>>>> Is it OK for trunk?
>>>
>>> I think this is
>>>
>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=23855
>>>
>>> as well.  It has a patch adding a invariant loop guard hoisting
>>> phase to loop-header copying.  Yeah, it needs updating to
>>> trunk again I suppose.  It's always non-stage1 when I come
>>> back to that patch.
>>>
>>> Your patch seems to be very specific and only handles outer
>>> loops of innermost loops.
>>>
>>> Richard.
>>>
>>>> ChangeLog:
>>>> 2015-07-10  Yuri Rumyantsev  
>>>>
>>>> * tree-ssa-loop-unswitch.c: Include "tree-cfgcleanup.h" and
>>>> "gimple-iterator.h", add prototype for tree_unswitch_outer_loop.
>>>> (tree_ssa_unswitch_loops): Add invoke of tree_unswitch_outer_loop.
>>>> (tree_unswitch_outer_loop): New function.
>>>>
>>>> gcc/testsuite/ChangeLog:
>>>> * gcc.dg/tree-ssa/unswitch-outer-loop-1.c: New test.
>>>> * gcc.dg/vect/vect-outer-simd-3.c: New test.


Re: [PATCH] Simple optimization for MASK_STORE.

2015-08-06 Thread Yuri Rumyantsev
Hi All,

Here is an updated patch which implements Richard's proposal to use a vector
comparison with boolean result instead of a target hook. Support for it
was added to ix86_expand_branch.

Any comments will be appreciated.

Bootstrap and regression testing did not show any new failures.

ChangeLog:
2015-08-06  Yuri Rumyantsev  

* config/i386/i386.c (ix86_expand_branch): Implement vector
comparison with boolean result.
* config/i386/sse.md (define_expand "cbranch<mode>4"): Add define_expand
for vector comparison.
* fold-const.c (fold_relational_const): Add handling of vector
comparison with boolean result.
* params.def (PARAM_ZERO_TEST_FOR_STORE_MASK): New DEFPARAM.
* params.h (ENABLE_ZERO_TEST_FOR_STORE_MASK): new macros.
* tree-cfg.c (verify_gimple_comparison): Add test for vector
comparison with boolean result.
* tree-ssa-forwprop.c (forward_propagate_into_comparison_1): Do not
propagate vector comparison with boolean result.
* tree-vect-stmts.c (vectorizable_mask_load_store): Initialize
has_mask_store field of vect_info.
* tree-vectorizer.c: Include files ssa.h, cfghooks.h and params.h.
(is_valid_sink): New function.
(optimize_mask_stores): New function.
(vectorize_loops): Invoke optimize_mask_stores for loops having masked
stores.
* tree-vectorizer.h (loop_vec_info): Add new has_mask_store field and
correspondent macros.

gcc/testsuite/ChangeLog:
* gcc.target/i386/avx2-vect-mask-store-move1.c: New test.
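A minimal source pattern of the kind such a test exercises (hypothetical
sketch; the committed test body may differ): with AVX2 the conditional
update becomes a masked load/store pair, and the whole semi-hammock can be
skipped when no lane of the condition is true.

int c[1024], p1[1024];

void
foo (int n)
{
  int i;
  for (i = 0; i < n; i++)
    if (c[i])
      p1[i] += 1;
}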


2015-07-27 11:48 GMT+03:00 Richard Biener :
> On Fri, Jul 24, 2015 at 9:11 PM, Jeff Law  wrote:
>> On 07/24/2015 03:16 AM, Richard Biener wrote:
>>>>
>>>> Is there any rationale given anywhere for the transformation into
>>>> conditional expressions?  ie, is there any reason why we can't have a
>>>> GIMPLE_COND where the expression is a vector condition?
>>>
>>>
>>> No rationale for equality compare which would have the semantic of
>>> having all elements equal or not equal.  But you can't define a sensible
>>> ordering (that HW implements) for other compare operators and you
>>> obviously need a single boolean result, not a vector of element comparison
>>> results.
>>
>> Right.  EQ/NE only as others just don't have any real meaning.
>>
>>
>>> I've already replied that I'm fine allowing ==/!= whole-vector compares.
>>> But one needs to check whether expansion does anything sensible
>>> with them (either expand to integer subreg compares or add optabs
>>> for the compares).
>>
>> Agreed, EQ/NE for whole vector compares only would be fine for me too under
>> the same conditions.
>
> Btw, you can already do this on GIMPLE by doing
>
>   TImode vec_as_int = VIEW_CONVERT_EXPR <TImode> (vec_2);
>   if (vec_as_int == 0)
> ...
>
> which is what the RTL will look like in the end.  So not sure if making this
> higher-level in GIMPLE is good or required.
>
> Richard.
>
>> jeff


patch.4
Description: Binary data


Re: [PATCH] Unswitching outer loops.

2015-09-30 Thread Yuri Rumyantsev
Hi Richard,

I re-designed outer-loop unswitching using the basic idea of the 23855 patch -
hoist the invariant guard if the loop is empty without the guard. Note that
this was added to the loop unswitching pass with simple modifications - using
another loop iterator etc.

Bootstrap and regression testing did not show any new failures.
What is your opinion?

Thanks.

ChangeLog:
2015-09-30  Yuri Rumyantsev  

* tree-ssa-loop-unswitch.c: Include "gimple-iterator.h" and
"cfghooks.h", add prototypes for introduced new functions.
(tree_ssa_unswitch_loops): Use from-innermost loop iterator, move all
checks on ability of loop unswitching to tree_unswitch_single_loop;
invoke tree_unswitch_single_loop or tree_unswitch_outer_loop depending
on innermost loop check.
(tree_unswitch_single_loop): Add all required checks on ability of
loop unswitching under zero recursive level guard.
(tree_unswitch_outer_loop): New function.
(find_loop_guard): Likewise.
(empty_bb_without_guard_p): Likewise.
(used_outside_loop_p): Likewise.
(hoist_guard): Likewise.
(check_exit_phi): Likewise.

gcc/testsuite/ChangeLog:
* gcc.dg/loop-unswitch-2.c: New test.

2015-09-16 11:26 GMT+03:00 Richard Biener :
> Yeah, as said, the patch wasn't fully ready and it also felt odd to do
> this hoisting in loop header copying.  Integrating it
> with LIM would be a better fit eventually.
>
> Note that we did agree to go forward with your original patch just
> making it more "generically" perform outer loop
> unswitching.  Did you explore that idea further?
>
>
>
> On Tue, Sep 15, 2015 at 6:00 PM, Yuri Rumyantsev  wrote:
>> Thanks Richard.
>>
>> I found one more issue that could not be fixed simply. In 23855 you
>> consider the following test-case:
>> void foo(int *ie, int *je, double *x)
>> {
>>   int i, j;
>>   for (j=0; j<*je; ++j)
>>     for (i=0; i<*ie; ++i)
>>       x[i+j] = 0.0;
>> }
>> and proposed to hoist up a check on *ie out of loop. It requires
>> memref alias analysis since in general x and ie can alias (if their
>> types are compatible - int *ie & int * x). Such analysis is performed
>> by pre or lim passes. Without such analysis we can not hoist a test on
>> non-zero for *ie out of the loop using the 23855 patch.
>>  The second concern is that the proposed copy-header algorithm changes the
>> loop structure significantly and it is not accepted by the vectorizer
>> since the latch is not empty (such a transformation assumes loop peeling for
>> one iteration). So I can propose to implement simple guard hoisting
>> without copying header and tail blocks (if it is possible).
>>
>> I will appreciate any advice or help, since without such
>> hoisting we are not able to perform outer-loop vectorization for
>> an important benchmark.
>>
>> 2015-09-15 14:22 GMT+03:00 Richard Biener :
>>> On Thu, Sep 3, 2015 at 6:32 PM, Yuri Rumyantsev  wrote:
>>>> Hi Richard,
>>>>
>>>> I started learning, tuning and debugging patch proposed in 23855 and
>>>> discovered that it does not work properly.
>>>> So I wonder, is it a tested patch and should it work?
>>>
>>> I don't remember, but as it wasn't committed it certainly wasn't ready.
>>>
>>>> Should it accept for hoisting the following loop nest
>>>>   for (i=0; i<n; i++) {
>>>>     s = 0;
>>>>     for (j=0; j<m; j++)
>>>>       s += a[i] * b[j];
>>>>     c[i] = s;
>>>>   }
>>>> Note that the i-loop will not be empty if m is equal to 0.
>>>
>>> if m is equal to 0 then we still have the c[i] = s store, no?  Of course
>>> we could unswitch the outer loop on m == 0 but simple hoisting wouldn't 
>>> work.
>>>
>>> Richard.
>>>
>>>> 2015-08-03 10:27 GMT+03:00 Richard Biener :
>>>>> On Fri, Jul 31, 2015 at 1:17 PM, Yuri Rumyantsev  
>>>>> wrote:
>>>>>> Hi Richard,
>>>>>>
>>>>>> I studied your updated patch for 23855 and it is more general in
>>>>>> comparison with mine.
>>>>>> I'd like to propose a compromise - let's consider my patch only
>>>>>> for force-vectorized outer loops, to allow outer-loop
>>>>>> vectorization.
>>>>>
>>>>> I don't see why we should special-case that if the approach in 23825
>>>>> is sensible.
>>>>>
>>>>>> Note that your approach will not hoist invariant
>>>>>> guards if the loop contains something else besides the inner loop, i.e. it
>>>>>> won't 

Re: [PATCH] Unswitching outer loops.

2015-10-05 Thread Yuri Rumyantsev
Thanks Richard.
I'd like to answer on your last comment related to using of exit edge
argument for edge that skips loop.
Let's consider the following test-case:

#include <stdlib.h>
#define N 32
float *foo(int ustride, int size, float *src)
{
   float *buffer, *p;
   int i, k;

   if (!src)
     return NULL;

   buffer = (float *) malloc(N * size * sizeof(float));

   if (buffer)
     for (i=0, p=buffer; i<size; i++)
       for (k=0; k<N; k++)
         *p++ = src[k];

   return buffer;
}

Before adding the new edge we have in the post-header bb:
  <bb>:
  # _6 = PHI <0B(8), buffer_20(16)>
  return _6;

It is clear that we must preserve the function semantics and transform it to
_6 = PHI <0B(12), buffer_19(9), buffer_19(4)>


2015-10-05 13:57 GMT+03:00 Richard Biener :
> On Wed, Sep 30, 2015 at 12:46 PM, Yuri Rumyantsev  wrote:
>> Hi Richard,
>>
>> I re-designed outer loop unswitching using basic idea of 23855 patch -
>> hoist invariant guard if loop is empty without guard. Note that this
>> was added to loop unswitching pass with simple modifications - using
>> another loop iterator etc.
>>
>> Bootstrap and regression testing did not show any new failures.
>> What is your opinion?
>
> Overall it looks good.  Some comments below - a few more testcases would
> be nice as well.
>
> +  /* Loop must not be infinite.  */
> +  if (!finite_loop_p (loop))
> +return false;
>
> why's that?
>
> +  body = get_loop_body_in_dom_order (loop);
> +  for (i = 0; i < loop->num_nodes; i++)
> +{
> +  if (body[i]->loop_father != loop)
> +   continue;
> +  if (!empty_bb_without_guard_p (loop, body[i]))
>
> I wonder if there is a better way to iterate over the interesting
> blocks and PHIs
> we need to check for side-effects (and thus we maybe can avoid gathering
> the loop in DOM order).
>
> +  FOR_EACH_SSA_TREE_OPERAND (name, stmt, op_iter, SSA_OP_DEF)
> +   {
> + if (may_be_used_outside
>
> may_be_used_outside can be hoisted above the loop.  I wonder if we can take
> advantage of loop-closed SSA form here (and the fact we have a single exit
> from the loop).  Iterating over exit dest PHIs and determining whether the
> exit edge DEF is inside the loop part it may not be should be enough.
>
> +  gcc_assert (single_succ_p (pre_header));
>
> that should be always true.
>
> +  gsi_remove (&gsi, false);
> +  bb = guard->dest;
> +  remove_edge (guard);
> +  /* Update dominance for destination of GUARD.  */
> +  if (EDGE_COUNT (bb->preds) == 0)
> +{
> +  basic_block s_bb;
> +  gcc_assert (single_succ_p (bb));
> +  s_bb = single_succ (bb);
> +  delete_basic_block (bb);
> +  if (single_pred_p (s_bb))
> +   set_immediate_dominator (CDI_DOMINATORS, s_bb, single_pred (s_bb));
>
> all this massaging should be simplified by leaving it to CFG cleanup by
> simply adjusting the CONDs condition to always true/false.  There is
> gimple_cond_make_{true,false} () for this (would be nice to have a variant
> taking a bool).
>
> +  new_edge = make_edge (pre_header, exit->dest, flags);
> +  if (fix_dom_of_exit)
> +set_immediate_dominator (CDI_DOMINATORS, exit->dest, pre_header);
> +  update_stmt (gsi_stmt (gsi));
>
> the update_stmt should be not necessary, it's done by gsi_insert_after 
> already.
>
> +  /* Add NEW_ADGE argument for all phi in post-header block.  */
> +  bb = exit->dest;
> +  for (gphi_iterator gsi = gsi_start_phis (bb);
> +   !gsi_end_p (gsi); gsi_next (&gsi))
> +{
> +  gphi *phi = gsi.phi ();
> +  /* edge_iterator ei; */
> +  tree arg;
> +  if (virtual_operand_p (gimple_phi_result (phi)))
> +   {
> + arg = PHI_ARG_DEF_FROM_EDGE (phi, loop_preheader_edge (loop));
> + add_phi_arg (phi, arg, new_edge, UNKNOWN_LOCATION);
> +   }
> +  else
> +   {
> + /* Use exit edge argument.  */
> + arg = PHI_ARG_DEF_FROM_EDGE (phi, exit);
> + add_phi_arg (phi, arg, new_edge, UNKNOWN_LOCATION);
>
> Hum.  How is it ok to use the exit edge argument for the edge that skips
> the loop?  Why can't you always use the pre-header edge value?
> That is, if we have
>
>  for (i=0; i<n; ++i)
>    {
>      if (n > 0)
>        {
>          for (;;)
>            {
>            }
>        }
>    }
>  ... = i;
>
> then i is used after the loop and the correct value to use if
> n > 0 is false is '0'.  Maybe this way we can also relax
> what check_exit_phi does?  IMHO the only restriction is
> if sth defined inside the loop before the header check for
> the inner loop is used after the loop.
>
> Thanks,
> Richard.
>
>> Thanks.
>>
>> ChangeLog:
>> 2015-09-30  Yuri Rumyantsev  
>>
>> * tree-ssa-loop-unswitch.c: Include "gimple

Re: [PATCH] Unswitching outer loops.

2015-10-06 Thread Yuri Rumyantsev
Richard,

Here is an updated patch which reflects almost all of your remarks:
1. Use ordinary get_loop_body.
2. Delete useless asserts.
3. Use a check on the iterated loop instead of finite_loop_p.
4. Do not update the CFG manually; instead adjust the COND's condition to
   always true/false and leave the rest to CFG cleanup (see the sketch
   after this list).
5. Add a couple of tests.
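
To make item 4 concrete, here is a minimal hand-written sketch (assuming
GUARD is the edge that skips the loop body; a sketch of the idea, not the
exact patch code):

  /* Force the in-loop test so that the skip edge GUARD is never taken;
     CFG cleanup then removes the dead edge and any unreachable blocks.  */
  gcond *cond = as_a <gcond *> (last_stmt (guard->src));
  if (guard->flags & EDGE_TRUE_VALUE)
    gimple_cond_make_false (cond);
  else
    gimple_cond_make_true (cond);
  update_stmt (cond);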

ChangeLog:
2015-10-06  Yuri Rumyantsev  

* tree-ssa-loop-unswitch.c: Include "gimple-iterator.h" and
"cfghooks.h", add prototypes for introduced new functions.
(tree_ssa_unswitch_loops): Use the from-innermost loop iterator, move all
checks on ability of loop unswitching to tree_unswitch_single_loop;
invoke tree_unswitch_single_loop or tree_unswitch_outer_loop depending
on innermost loop check.
(tree_unswitch_single_loop): Add all required checks on ability of
loop unswitching under zero recursive level guard.
(tree_unswitch_outer_loop): New function.
(find_loop_guard): Likewise.
(empty_bb_without_guard_p): Likewise.
(used_outside_loop_p): Likewise.
(hoist_guard): Likewise.
(check_exit_phi): Likewise.

   gcc/testsuite/ChangeLog:
* gcc.dg/loop-unswitch-2.c: New test.
* gcc.dg/loop-unswitch-3.c: Likewise.
* gcc.dg/loop-unswitch-4.c: Likewise.

2015-10-06 10:59 GMT+03:00 Richard Biener :
> On Mon, Oct 5, 2015 at 3:13 PM, Yuri Rumyantsev  wrote:
>> Thanks Richard.
>> I'd like to answer on your last comment related to using of exit edge
>> argument for edge that skips loop.
>> Let's consider the following test-case:
>>
>> #include 
>> #define N 32
>> float *foo(int ustride, int size, float *src)
>> {
>>float *buffer, *p;
>>int i, k;
>>
>>if (!src)
>> return NULL;
>>
>>buffer = (float *) malloc(N * size * sizeof(float));
>>
>>if(buffer)
>>   for(i=0, p=buffer; i> for(k=0; k>  *p++ = src[k];
>>
>>return buffer;
>> }
>>
>> Before adding new edge we have in post-header bb:
>>   :
>>   # _6 = PHI <0B(8), buffer_20(16)>
>>   return _6;
>>
>> It is clear that we must preserve function semantic and transform it to
>> _6 = PHI <0B(12), buffer_19(9), buffer_19(4)>
>
> Ah, yeah.  I was confusing the loop exit of the inner vs. the outer loop.
>
> Richard.
>
>>
>> 2015-10-05 13:57 GMT+03:00 Richard Biener :
>>> On Wed, Sep 30, 2015 at 12:46 PM, Yuri Rumyantsev  
>>> wrote:
>>>> Hi Richard,
>>>>
>>>> I re-designed outer loop unswitching using basic idea of 23855 patch -
>>>> hoist invariant guard if loop is empty without guard. Note that this
>>>> was added to loop unswitching pass with simple modifications - using
>>>> another loop iterator etc.
>>>>
>>>> Bootstrap and regression testing did not show any new failures.
>>>> What is your opinion?
>>>
>>> Overall it looks good.  Some comments below - a few more testcases would
>>> be nice as well.
>>>
>>> +  /* Loop must not be infinite.  */
>>> +  if (!finite_loop_p (loop))
>>> +return false;
>>>
>>> why's that?
>>>
>>> +  body = get_loop_body_in_dom_order (loop);
>>> +  for (i = 0; i < loop->num_nodes; i++)
>>> +{
>>> +  if (body[i]->loop_father != loop)
>>> +   continue;
>>> +  if (!empty_bb_without_guard_p (loop, body[i]))
>>>
>>> I wonder if there is a better way to iterate over the interesting
>>> blocks and PHIs
>>> we need to check for side-effects (and thus we maybe can avoid gathering
>>> the loop in DOM order).
>>>
>>> +  FOR_EACH_SSA_TREE_OPERAND (name, stmt, op_iter, SSA_OP_DEF)
>>> +   {
>>> + if (may_be_used_outside
>>>
>>> may_be_used_outside can be hoisted above the loop.  I wonder if we can take
>>> advantage of loop-closed SSA form here (and the fact we have a single exit
>>> from the loop).  Iterating over exit dest PHIs and determining whether the
>>> exit edge DEF is inside the loop part it may not be should be enough.
>>>
>>> +  gcc_assert (single_succ_p (pre_header));
>>>
>>> that should be always true.
>>>
>>> +  gsi_remove (&gsi, false);
>>> +  bb = guard->dest;
>>> +  remove_edge (guard);
>>> +  /* Update dominance for destination of GUARD.  */
>>> +  if (EDGE_COUNT (bb->preds) == 0)
>>> +{
>>> +  basic_block s_bb;
>>> +  gcc_assert (single_succ_p (bb));
>>> +  s_bb = single_succ (bb);
>>> +  delete_basic_block (bb);
>>> +  if (single_pred_p (s_bb))
>

Re: [PATCH] Unswitching outer loops.

2015-10-07 Thread Yuri Rumyantsev
Richard,

I've fixed the adding of the virtual phi argument and added a check on
irreducible basic blocks.
A new patch is attached.
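
For reference, a hand-written sketch of the shape I expect
get_vop_from_header (named in the ChangeLog below) to have, following
Richard's suggestion to take the virtual operand from the loop header
PHIs; treat it as an illustration, not the literal patch hunk:

  static tree
  get_vop_from_header (struct loop *loop)
  {
    for (gphi_iterator gsi = gsi_start_phis (loop->header);
         !gsi_end_p (gsi); gsi_next (&gsi))
      {
        gphi *phi = gsi.phi ();
        if (!virtual_operand_p (gimple_phi_result (phi)))
          continue;
        /* Use the value flowing in from the preheader for the new edge.  */
        return PHI_ARG_DEF_FROM_EDGE (phi, loop_preheader_edge (loop));
      }
    return NULL_TREE;
  }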

I checked it for bootstrap and regression testing, no new failures.

ChangeLog:
2015-10-07  Yuri Rumyantsev  

* tree-ssa-loop-unswitch.c: Include "gimple-iterator.h" and
"cfghooks.h", add prototypes for introduced new functions.
(tree_ssa_unswitch_loops): Use the from-innermost loop iterator, move all
checks on ability of loop unswitching to tree_unswitch_single_loop;
invoke tree_unswitch_single_loop or tree_unswitch_outer_loop depending
on innermost loop check.
(tree_unswitch_single_loop): Add all required checks on ability of
loop unswitching under zero recursive level guard.
(tree_unswitch_outer_loop): New function.
(find_loop_guard): Likewise.
(empty_bb_without_guard_p): Likewise.
(used_outside_loop_p): Likewise.
(get_vop_from_header): Likewise.
(hoist_guard): Likewise.
(check_exit_phi): Likewise.

   gcc/testsuite/ChangeLog:
* gcc.dg/loop-unswitch-2.c: New test.
* gcc.dg/loop-unswitch-3.c: Likewise.
* gcc.dg/loop-unswitch-4.c: Likewise.


2015-10-06 15:21 GMT+03:00 Richard Biener :
> On Tue, Oct 6, 2015 at 1:41 PM, Yuri Rumyantsev  wrote:
>> Richard,
>>
>> Here is updated patch which reflects almost all your remarks:
>> 1. Use ordinary get_loop_body.
>> 2. Delete useless asserts.
>> 3. Use check on iterated loop instead of finite_loop_p.
>> 4. Do not update CFG by adjusting the CONDs condition to always true/false.
>> 5. Add couple tests.
>
> +  /* Add NEW_ADGE argument for all phi in post-header block.  */
> +  bb = exit->dest;
> +  for (gphi_iterator gsi = gsi_start_phis (bb);
> +   !gsi_end_p (gsi); gsi_next (&gsi))
> +{
> +  gphi *phi = gsi.phi ();
> +  /* edge_iterator ei; */
> +  tree arg;
> +  if (virtual_operand_p (gimple_phi_result (phi)))
> +   {
> + arg = PHI_ARG_DEF_FROM_EDGE (phi, loop_preheader_edge (loop));
> + add_phi_arg (phi, arg, new_edge, UNKNOWN_LOCATION);
>
> now I know what confused me - here you are looking at a loop exit PHI
> but querying with the preheader edge index.  I think you need to walk
> the loop header PHIs to find the PHI for the virtual operand and use that
> to get the PHI arg from?
>
> The side-effect / used-outside code is still the same.  What matters
> is side-effects outside of the loop-header protected code region, not
> blocks excluding the inner loop.  Say,
>
>   for (;;)
> {
>   if (invariant-guard)
> {
>printf ("Blah");
>for (;;)
>  ;
> }
> }
>
> would still ok to be unswitched.  So instead of
>
> +  if (body[i]->loop_father != loop)
> +   continue;
>
> it would be
>
>if (dominated_by_p (CDI_DOMINATORS, body[i], header)
>&& !dominated_by_p (CDI_DOMINATORS, body[i], fe->dest))
>
> with the obvious improvement to the patch to not only consider header checks
> in the outer loop header but in the pre-header block of the inner loop.
>
> And I still think you should walk the exit PHIs args to see whether they
> are defined in the non-guarded region of the outer loop instead of walking
> all uses of all defs.
>
> Note that I think you miss endless loops as side-effects if that endless
> loop occurs through a irreducible region (thus not reflected in the
> loop tree).  Thus you should reject BB_IRREDUCIBLE_LOOP blocks
> in the non-guarded region as well.
>
> It seems to me that protecting adjacent loops with a single guard is
> also eligible for hoisting thus the restriction on loop->inner->next
> should become a restriction on no loops (or irreducible regions)
> in the non-guarded region.
>
> Most things can be improved as followup, but at least the
> virtual PHI arg thing needs to be sorted out.
>
> Thanks,
> Richard.
>
>
>> ChangeLog:
>> 2015-10-06  Yuri Rumyantsev  
>>
>> * tree-ssa-loop-unswitch.c: Include "gimple-iterator.h" and
>> "cfghooks.h", add prototypes for introduced new functions.
>> (tree_ssa_unswitch_loops): Use from innermost loop iterator, move all
>> checks on ability of loop unswitching to tree_unswitch_single_loop;
>> invoke tree_unswitch_single_loop or tree_unswitch_outer_loop depending
>> on innermost loop check.
>> (tree_unswitch_single_loop): Add all required checks on ability of
>> loop unswitching under zero recursive level guard.
>> (tree_unswitch_outer_loop): New function.
>> (find_loop_guard): Likewise.
>> (empty_bb_without_guard_p): Likewise.
>> (used_outside_loop_p): Likewise.
>> (hoist_guard): Likewise.
>> (check_exit_phi): Likewise.
>>

Re: [PATCH] Unswitching outer loops.

2015-10-07 Thread Yuri Rumyantsev
Richard,

I noticed that the 'gimple' type was changed, so I'm sending you an updated patch.

Thanks.
Yuri.

2015-10-07 12:53 GMT+03:00 Yuri Rumyantsev :
> Richard,
>
> I've fixed adding virtual phi argument and add check on irreducible basic 
> block.
> New patch is attached.
>
> I checked it for bootstrap and regression testing, no new failures.
>
> ChangeLog:
> 2015-10-07  Yuri Rumyantsev  
>
> * tree-ssa-loop-unswitch.c: Include "gimple-iterator.h" and
> "cfghooks.h", add prototypes for introduced new functions.
> (tree_ssa_unswitch_loops): Use from innermost loop iterator, move all
> checks on ability of loop unswitching to tree_unswitch_single_loop;
> invoke tree_unswitch_single_loop or tree_unswitch_outer_loop depending
> on innermost loop check.
> (tree_unswitch_single_loop): Add all required checks on ability of
> loop unswitching under zero recursive level guard.
> (tree_unswitch_outer_loop): New function.
> (find_loop_guard): Likewise.
> (empty_bb_without_guard_p): Likewise.
> (used_outside_loop_p): Likewise.
> (get_vop_from_header): Likewise.
> (hoist_guard): Likewise.
> (check_exit_phi): Likewise.
>
>gcc/testsuite/ChangeLog:
> * gcc.dg/loop-unswitch-2.c: New test.
> * gcc.dg/loop-unswitch-3.c: Likewise.
> * gcc.dg/loop-unswitch-4.c: Likewise.
>
>
> 2015-10-06 15:21 GMT+03:00 Richard Biener :
>> On Tue, Oct 6, 2015 at 1:41 PM, Yuri Rumyantsev  wrote:
>>> Richard,
>>>
>>> Here is updated patch which reflects almost all your remarks:
>>> 1. Use ordinary get_loop_body.
>>> 2. Delete useless asserts.
>>> 3. Use check on iterated loop instead of finite_loop_p.
>>> 4. Do not update CFG by adjusting the CONDs condition to always true/false.
>>> 5. Add couple tests.
>>
>> +  /* Add NEW_ADGE argument for all phi in post-header block.  */
>> +  bb = exit->dest;
>> +  for (gphi_iterator gsi = gsi_start_phis (bb);
>> +   !gsi_end_p (gsi); gsi_next (&gsi))
>> +{
>> +  gphi *phi = gsi.phi ();
>> +  /* edge_iterator ei; */
>> +  tree arg;
>> +  if (virtual_operand_p (gimple_phi_result (phi)))
>> +   {
>> + arg = PHI_ARG_DEF_FROM_EDGE (phi, loop_preheader_edge (loop));
>> + add_phi_arg (phi, arg, new_edge, UNKNOWN_LOCATION);
>>
>> now I know what confused me - here you are looking at a loop exit PHI
>> but querying with the preheader edge index.  I think you need to walk
>> the loop header PHIs to find the PHI for the virtual operand and use that
>> to get the PHI arg from?
>>
>> The side-effect / used-outside code is still the same.  What matters
>> is side-effects outside of the loop-header protected code region, not
>> blocks excluding the inner loop.  Say,
>>
>>   for (;;)
>> {
>>   if (invariant-guard)
>> {
>>printf ("Blah");
>>for (;;)
>>  ;
>> }
>> }
>>
>> would still ok to be unswitched.  So instead of
>>
>> +  if (body[i]->loop_father != loop)
>> +   continue;
>>
>> it would be
>>
>>if (dominated_by_p (CDI_DOMINATORS, body[i], header)
>>&& !dominated_by_p (CDI_DOMINATORS, body[i], fe->dest))
>>
>> with the obvious improvement to the patch to not only consider header checks
>> in the outer loop header but in the pre-header block of the inner loop.
>>
>> And I still think you should walk the exit PHIs args to see whether they
>> are defined in the non-guarded region of the outer loop instead of walking
>> all uses of all defs.
>>
>> Note that I think you miss endless loops as side-effects if that endless
>> loop occurs through a irreducible region (thus not reflected in the
>> loop tree).  Thus you should reject BB_IRREDUCIBLE_LOOP blocks
>> in the non-guarded region as well.
>>
>> It seems to me that protecting adjacent loops with a single guard is
>> also eligible for hoisting thus the restriction on loop->inner->next
>> should become a restriction on no loops (or irreducible regions)
>> in the non-guarded region.
>>
>> Most things can be improved as followup, but at least the
>> virtual PHI arg thing needs to be sorted out.
>>
>> Thanks,
>> Richard.
>>
>>
>>> ChangeLog:
>>> 2015-10-06  Yuri Rumyantsev  
>>>
>>> * tree-ssa-loop-unswitch.c: Include "gimple-iterator.h" and
>>> "cfghooks.h", add prototypes for introduced new functions.
>>> (tree_ssa_unswitch_loop

[PATCH] Simple 2-lines fix for outer-loop vectorization.

2015-10-08 Thread Yuri Rumyantsev
Hi All,

Here is a simple patch which allows a phi with virtual operands in the
inner loop's loop-closed exit phi during outer-loop vectorization (i.e. a
phi in the tail of the outer loop).
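
For illustration only, a loop nest of roughly this shape is what the fix
targets (hand-written example, not the committed vect-outer-simd-3.c
testcase):

  void foo (float * restrict a, float * restrict b, int n)
  {
    for (int i = 0; i < n; i++)       /* outer loop: vectorized */
      {
        float s = 0.0f;
        for (int k = 0; k < 8; k++)   /* inner loop */
          s += b[i + k];
        a[i] = s;  /* store -> virtual operand, loop-closed exit phi */
      }
  }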

Bootstrap and regression testing did not show any new failures.

Is it OK for trunk?

ChangeLog:

2015-10-08  Yuri Rumyantsev  

* tree-vect-loop.c (vect_analyze_loop_operations): Skip virtual phi
in the tail of outer-loop.

   gcc/testsuite/ChangeLog:
* gcc.dg/vect/vect-outer-simd-3.c: New test.


patch.outer-vec
Description: Binary data


[PATCH PR67909 PR67947]

2015-10-13 Thread Yuri Rumyantsev
Hi All,

Here is a simple patch for unswitching outer loop through guard-edge
hoisting. The check that guard-edge is around the inner loop was
missed.

Bootstrapping and regression testing did not show new failures.

Is it OK for trunk?

ChangeLog:
2015-10-13  Yuri Rumyantsev  

PR tree-optimization/67909, 67947
* tree-ssa-loop-unswitch.c (find_loop_guard): Add check that GUARD_EDGE
really skip the inner loop.

gcc/testsuite/ChangeLog
* gcc.dg/torture/pr67947.c: New test.


patch
Description: Binary data


Re: [PATCH PR67909 PR67947]

2015-10-13 Thread Yuri Rumyantsev
Here is an updated patch with the long line split.
The patch is attached.

Yuri.

2015-10-13 15:38 GMT+03:00 H.J. Lu :
> On Tue, Oct 13, 2015 at 4:57 AM, Yuri Rumyantsev  wrote:
>> Hi All,
>>
>> Here is a simple patch for unswitching outer loop through guard-edge
>> hoisting. The check that guard-edge is around the inner loop was
>> missed.
>>
>> Bootstrapping and regression testing did not show new failures.
>>
>> Is it OK for trunk?
>>
>> ChangeLog:
>> 2015-10-13  Yuri Rumyantsev  
>>
>> PR tree-optimization/67909, 67947
>> * tree-ssa-loop-unswitch.c (find_loop_guard): Add check that GUARD_EDGE
>> really skip the inner loop.
>>
>> gcc/testsuite/ChangeLog
>> * gcc.dg/torture/pr67947.c: New test.
>
> +  /* Guard edge must skip inner loop.  */
> +  if (!dominated_by_p (CDI_DOMINATORS, loop->inner->header,
> +  guard_edge == fe ? te->dest : fe->dest))
>   It should line up with "CDI_DOMINATORS".
>
> + fprintf (dump_file, "Guard edge %d --> %d is not around the
> loop!\n",guard_edge->src->index,guard_edge->dest->index);
>
> Please break long line.
>
> --
> H.J.


patch1
Description: Binary data


[RFC] Combine vectorized loops with its scalar remainder.

2015-10-28 Thread Yuri Rumyantsev
Hi All,

Here is a preliminary patch to combine a vectorized loop with its scalar
remainder, a draft of which was proposed by Kirill Yukhin a month ago:
https://gcc.gnu.org/ml/gcc-patches/2015-09/msg01435.html
It was tested with the '-mavx2' option to run on a Haswell processor.
The main goal is to improve the performance of vectorized loops for AVX512.
Note that only loads/stores and simple reductions with binary operations are
converted to masked form, e.g. load --> masked load, and a reduction like
r1 = f <op> r2 --> t = f <op> r2; r1 = m ? t : r2. Masking is performed through
creation of a new vector induction variable initialized with consecutive values
from 0..VF-1, a new constant vector upper bound which contains the number of
iterations, and the result of their comparison, which is treated as the mask vector.
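
For illustration, a hand-written pseudo-GIMPLE sketch (not actual compiler
output) of a masked body for VF = 4:

  vect_iv  = { 0, 1, 2, 3 };      /* new vector induction variable */
  vect_nit = { n, n, n, n };      /* number of iterations, broadcast */
  loop:
    mask = vect_iv < vect_nit;    /* active lanes */
    f  = MASK_LOAD (p, mask);     /* load --> masked load */
    t  = f <op> r2;
    r1 = mask ? t : r2;           /* reduction result kept per lane */
    MASK_STORE (q, mask, r1);
    vect_iv = vect_iv + { 4, 4, 4, 4 };
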
This implementation has several restrictions:

1. Multiple types are not supported.
2. SLP is not supported.
3. Gather/scatter operations are also not supported.
4. Vectorization of loops with a low trip count is not implemented yet since
   it requires additional design and tuning.

We are planning to eliminate all these restrictions in GCCv7.

This patch will be extended to include a cost model to reject unprofitable
transformations, e.g. the new vector body cost will be evaluated through a new
target hook which estimates the cost of masking different vector statements. A
new threshold parameter will be introduced which determines the permissible
cost increase; it will be tuned on an AVX512 machine.
This patch is not in sync with the changes of Ilya Enkovich for AVX512 masked
load/store support, since only part of them is in the trunk compiler.

Any comments will be appreciated.


remainder.patch.1
Description: Binary data


Re: [PATCH] Simple optimization for MASK_STORE.

2015-11-02 Thread Yuri Rumyantsev
Hi Richard,

I've come back to this optimization and tried to implement your proposal
for comparison:
> Btw, you didn't try the simpler alternative of
>
> tree type = type_for_mode (int_mode_for_mode (TYPE_MODE (vectype)));
> build2 (EQ_EXPR, boolean_type_node,
>  build1 (VIEW_CONVERT, type, op0), build1 (VIEW_CONVERT, type, op1));
>
> ?  That is, use the GIMPLE level equivalent of
>  (cmp (subreg:TI reg:V4SI) (subreg:TI reg:V4SI))

using the following code:

  vectype = TREE_TYPE (mask);
  ext_mode = mode_for_size (GET_MODE_BITSIZE (TYPE_MODE (vectype)),
                            MODE_INT, 0);
  ext_type = lang_hooks.types.type_for_mode (ext_mode, 1);

but I get a zero type for it. Am I missing something?

Any help will be appreciated.
Yuri.


2015-08-13 14:40 GMT+03:00 Richard Biener :
> On Thu, Aug 13, 2015 at 1:32 PM, Yuri Rumyantsev  wrote:
>> Hi Richard,
>>
>> Did you have a chance to look at updated patch?
>
> Having a quick look now.  Btw, you didn't try the simpler alternative of
>
>  tree type = type_for_mode (int_mode_for_mode (TYPE_MODE (vectype)));
>  build2 (EQ_EXPR, boolean_type_node,
>build1 (VIEW_CONVERT, type, op0), build1 (VIEW_CONVERT, type, op1));
>
> ?  That is, use the GIMPLE level equivalent of
>
>  (cmp (subreg:TI reg:V4SI) (subreg:TI reg:V4SI))
>
> ?  That should be supported by the expander already, though again not sure if
> the target(s) have compares that match this.
>
> Btw, the tree-cfg.c hook wasn't what was agreed on - the restriction
> on EQ/NE_EXPR
> is missing.  Operand type equality is tested anyway.
>
> Why do you need to restrict forward_propagate_into_comparison_1?
>
> Otherwise this looks better, but can you try with the VIEW_CONVERT as well?
>
> Thanks,
> Richard.
>
>
>> Thanks.
>> Yuri.
>>
>> 2015-08-06 14:07 GMT+03:00 Yuri Rumyantsev :
>>> HI All,
>>>
>>> Here is updated patch which implements Richard proposal to use vector
>>> comparison with boolean result instead of target hook. Support for it
>>> was added to ix86_expand_branch.
>>>
>>> Any comments will be appreciated.
>>>
>>> Bootstrap and regression testing did not show any new failures.
>>>
>>> ChangeLog:
>>> 2015-08-06  Yuri Rumyantsev  
>>>
>>> * config/i386/i386.c (ix86_expand_branch): Implement vector
>>> comparison with boolean result.
* config/i386/sse.md (define_expand "cbranch<mode>4"): Add define
for vector comparison.
>>> * fold-const.c (fold_relational_const): Add handling of vector
>>> comparison with boolean result.
>>> * params.def (PARAM_ZERO_TEST_FOR_STORE_MASK): New DEFPARAM.
>>> * params.h (ENABLE_ZERO_TEST_FOR_STORE_MASK): new macros.
>>> * tree-cfg.c (verify_gimple_comparison): Add test for vector
comparison with boolean result.
>>> * tree-ssa-forwprop.c (forward_propagate_into_comparison_1): Do not
propagate vector comparison with boolean result.
>>> * tree-vect-stmts.c (vectorizable_mask_load_store): Initialize
>>> has_mask_store field of vect_info.
>>> * tree-vectorizer.c: Include files ssa.h, cfghooks.h and params.h.
>>> (is_valid_sink): New function.
>>> (optimize_mask_stores): New function.
(vectorize_loops): Invoke optimize_mask_stores for loops having masked
>>> stores.
>>> * tree-vectorizer.h (loop_vec_info): Add new has_mask_store field and
>>> correspondent macros.
>>>
>>> gcc/testsuite/ChangeLog:
>>> * gcc.target/i386/avx2-vect-mask-store-move1.c: New test.
>>>
>>>
>>> 2015-07-27 11:48 GMT+03:00 Richard Biener :
>>>> On Fri, Jul 24, 2015 at 9:11 PM, Jeff Law  wrote:
>>>>> On 07/24/2015 03:16 AM, Richard Biener wrote:
>>>>>>>
>>>>>>> Is there any rationale given anywhere for the transformation into
>>>>>>> conditional expressions?  ie, is there any reason why we can't have a
>>>>>>> GIMPLE_COND where the expression is a vector condition?
>>>>>>
>>>>>>
>>>>>> No rationale for equality compare which would have the semantic of
>>>>>> having all elements equal or not equal.  But you can't define a sensible
>>>>>> ordering (that HW implements) for other compare operators and you
>>>>>> obviously need a single boolean result, not a vector of element 
>>>>>> comparison
>>>>>> results.
>>>>>
>>>>> Right.  EQ/NE only as others just don't have any real meaning.
>>>>>
>>>>>
>>>>>> I've already replied that I'm fine allowing ==/!= whole-vector compares.
>>>>>> But one needs to check whether expansion does anything sensible
>>>>>> with them (either expand to integer subreg compares or add optabs
>>>>>> for the compares).
>>>>>
>>>>> Agreed, EQ/NE for whole vector compares only would be fine for me too 
>>>>> under
>>>>> the same conditions.
>>>>
>>>> Btw, you can already do this on GIMPLE by doing
>>>>
>>>>   TImode vec_as_int = VIEW_CONVERT_EXPR <TImode> (vec_2);
>>>>   if (vec_as_int == 0)
>>>> ...
>>>>
>>>> which is what the RTL will look like in the end.  So not sure if making 
>>>> this
>>>> higher-level in GIMPLE is good or required.
>>>>
>>>> Richard.
>>>>
>>>>> jeff


Re: [RFC] Combine vectorized loops with its scalar remainder.

2015-11-03 Thread Yuri Rumyantsev
This is an expected failure since this patch is not in sync with the
latest patches related to masking support for AVX512.
I am waiting for the masked load/store support, which is not yet
integrated into trunk. To get a workable version of the compiler, use a
revision before r229128.

2015-11-03 13:08 GMT+03:00 Richard Henderson :
> On 10/28/2015 11:45 AM, Yuri Rumyantsev wrote:
>>
>> Hi All,
>>
>> Here is a preliminary patch to combine vectorized loop with its scalar
>> remainder, draft of which was proposed by Kirill Yukhin month ago:
>> https://gcc.gnu.org/ml/gcc-patches/2015-09/msg01435.html
>> It was tested wwith '-mavx2' option to run on Haswell processor.
>> The main goal of it is to improve performance of vectorized loops for
>> AVX512.
>
>
> Ought this really be enabled for avx2?  While it's nice for testing to be
> able to use normal vcond patterns to be able to test with current hardware,
> I have trouble imagining that it's an improvement without the real masked
> operations.
>
> I tried to have a look myself at what kind of output we'd be getting, but
> the very first test I tried produced an ICE:
>
> void foo(float *a, float *b, int n)
> {
>   int i;
>   for (i = 0; i < n; ++i)
> a[i] += b[i];
> }
>
> $ ./cc1 -O3 -mavx2 z.c
>  foo
> Analyzing compilation unit
> Performing interprocedural optimizations
>  <*free_lang_data>   
>  
> 
> Assembling functions:
>   foo
> z.c: In function ‘foo’:
> z.c:1:6: error: bogus comparison result type
>  void foo(float *a, float *b, int n)
>   ^
> vector(8) signed int
> vect_vec_mask_.24_116 = vect_vec_iv_.22_112 < vect_cst_.23_115;
> z.c:1:6: internal compiler error: verify_gimple failed
> 0xb20d17 verify_gimple_in_cfg(function*, bool)
> ../../git-master/gcc/tree-cfg.c:5082
> 0xa16d77 execute_function_todo
> ../../git-master/gcc/passes.c:1940
> 0xa1769b execute_todo
> ../../git-master/gcc/passes.c:1995
> Please submit a full bug report,
>
>
> r~


Re: [RFC] Combine vectorized loops with its scalar remainder.

2015-11-03 Thread Yuri Rumyantsev
Richard,

It looks like a misunderstanding - we assume that for GCCv6 the simple
scheme for the remainder will be used, through introducing a new IV:
https://gcc.gnu.org/ml/gcc-patches/2015-09/msg01435.html

Is that true, or did we miss something?
Now we are testing vectorization of loops with a small non-constant trip count.
Yuri.

2015-11-03 14:47 GMT+03:00 Richard Biener :
> On Wed, Oct 28, 2015 at 11:45 AM, Yuri Rumyantsev  wrote:
>> Hi All,
>>
>> Here is a preliminary patch to combine vectorized loop with its scalar
>> remainder, draft of which was proposed by Kirill Yukhin month ago:
>> https://gcc.gnu.org/ml/gcc-patches/2015-09/msg01435.html
>> It was tested wwith '-mavx2' option to run on Haswell processor.
>> The main goal of it is to improve performance of vectorized loops for AVX512.
>> Note that only loads/stores and simple reductions with binary operations are
>> converted to masked form, e.g. load --> masked load and reduction like
>> r1 = f <op> r2 --> t = f <op> r2; r1 = m ? t : r2. Masking is performed 
>> through
>> creation of a new vector induction variable initialized with consequent 
>> values
>> from 0.. VF-1, new const vector upper bound which contains number of 
>> iterations
>> and the result of comparison which is considered as mask vector.
>> This implementation has several restrictions:
>>
>> 1. Multiple types are not supported.
>> 2. SLP is not supported.
>> 3. Gather/Scatter's are also not supported.
>> 4. Vectorization of the loops with low trip count is not implemented yet 
>> since
>>it requires additional design and tuning.
>>
>> We are planning to eleminate all these restrictions in GCCv7.
>>
>> This patch will be extended to include cost model to reject unprofutable
>> transformations, e.g. new vector body cost will be evaluated through new
>> target hook which estimates cast of masking different vector statements. New
>> threshold parameter will be introduced which determines permissible cost
>> increasing which will be tuned on an AVX512 machine.
>> This patch is not in sync with changes of Ilya Enkovich for AVX512 masked
>> load/store support since only part of them is in trunk compiler.
>>
>> Any comments will be appreciated.
>
> As stated in the previous discussion I don't think the extra mask IV
> is a good idea
> and we instead should have a masked final iteration for the epilogue
> (yes, that's
> not really "combined" then).  This is because in the end we'd not only
> want AVX512
> to benefit from this work but also other ISAs that can do unaligned or masked
> operations (we can overlap the epilogue work with the vectorized work or use
> masked loads/stores available with AVX).  Note that the same applies to
> the alignment prologue if present, I can't see how you can handle that with 
> the
> in-loop approach.
>
> Richard.


Re: [PATCH] Simple optimization for MASK_STORE.

2015-11-05 Thread Yuri Rumyantsev
Hi All!

I prepared another patch which inserts an additional check for a zero
mask before masked stores, guarded by the parameter
PARAM_ZERO_TEST_FOR_MASK_STORE (the test is emitted only for a non-zero
parameter value). My attempt to use the approach proposed by Richard -
the simpler alternative of using a scalar type for the 256-bit
comparison - was not successful, and I returned to a vector
comparison with a scalar Boolean result.
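
For a single masked store the inserted guard looks roughly like this
(hand-written GIMPLE-style sketch with made-up names):

  if (mask_vec != { 0, 0, 0, 0, 0, 0, 0, 0 })
    MASK_STORE (ptr, 0B, mask_vec, value);

i.e. the store (and the semi-hammock it belongs to) is skipped when no
lane is active.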

ChangeLog:
2015-11-05  Yuri Rumyantsev  

* config/i386/i386.c: Add conditional initialization of
PARAM_ZERO_TEST_FOR_MASK_STORE.
(ix86_expand_branch): Implement vector comparison with boolean result.
* config/i386/i386.h: New macro TARGET_OPTIMIZE_MASK_STORE.
* config/i386/sse.md (define_expand "cbranch<mode>4"): Add define-expand
for vector comparison with eq/ne only.
* config/i386/x86-tune.def: New macro X86_TUNE_OPTIMIZE_MASK_STORE.
* fold-const.c (fold_relational_const): Add handling of vector
comparison with boolean result.
* params.def (PARAM_ZERO_TEST_FOR_MASK_STORE): New DEFPARAM.
* params.h (ENABLE_ZERO_TEST_FOR_MASK_STORE): New macro.
* tree-cfg.c (verify_gimple_comparison): Add argument CODE, allow
comparison of vector operands with boolean result for EQ/NE only.
(verify_gimple_assign_binary): Adjust call for verify_gimple_comparison.
(verify_gimple_cond): Likewise.
* tree-ssa-forwprop.c (forward_propagate_into_comparison_1): Do not
combine vector comparison with boolean result and VEC_COND_EXPR that
has vector result.
* tree-vect-loop.c (is_valid_sink): New function.
(optimize_mask_stores): Likewise.
* tree-vect-stmts.c (vectorizable_mask_load_store): Initialize
has_mask_store field of vect_info.
* tree-vectorizer.c (vectorize_loops): Invoke optimize_mask_stores for
vectorized loops having masked stores.
* tree-vectorizer.h (loop_vec_info): Add new has_mask_store field and
correspondent macros.
(optimize_mask_stores): Add prototype.

gcc/testsuite/ChangeLog:
* gcc.target/i386/avx2-vect-mask-store-move1.c: New test.
* gcc.target/i386/avx2-vect-mask-store-move2.c: Likewise.

2015-11-02 18:24 GMT+03:00 Yuri Rumyantsev :
> Hi Richard,
>
> I've come back to this optimization and try to implement your proposal
> for comparison:
>> Btw, you didn't try the simpler alternative of
>>
>> tree type = type_for_mode (int_mode_for_mode (TYPE_MODE (vectype)));
>> build2 (EQ_EXPR, boolean_type_node,
>>  build1 (VIEW_CONVERT, type, op0), build1 (VIEW_CONVERT, type, op1));
>>
>> ?  That is, use the GIMPLE level equivalent of
>>  (cmp (subreg:TI reg:V4SI) (subreg:TI reg:V4SI))
>
> using the following code:
>
>   vectype = TREE_TYPE (mask);
>   ext_mode = mode_for_size (GET_MODE_BITSIZE (TYPE_MODE (vectype)),
> MODE_INT, 0);
>   ext_type = lang_hooks.types.type_for_mode (ext_mode , 1);
>
> but I've got zero type for it. Should I miss something?
>
> Any help will be appreciated.
> Yuri.
>
>
> 2015-08-13 14:40 GMT+03:00 Richard Biener :
>> On Thu, Aug 13, 2015 at 1:32 PM, Yuri Rumyantsev  wrote:
>>> Hi Richard,
>>>
>>> Did you have a chance to look at updated patch?
>>
>> Having a quick look now.  Btw, you didn't try the simpler alternative of
>>
>>  tree type = type_for_mode (int_mode_for_mode (TYPE_MODE (vectype)));
>>  build2 (EQ_EXPR, boolean_type_node,
>>build1 (VIEW_CONVERT, type, op0), build1 (VIEW_CONVERT, type, op1));
>>
>> ?  That is, use the GIMPLE level equivalent of
>>
>>  (cmp (subreg:TI reg:V4SI) (subreg:TI reg:V4SI))
>>
>> ?  That should be supported by the expander already, though again not sure if
>> the target(s) have compares that match this.
>>
>> Btw, the tree-cfg.c hook wasn't what was agreed on - the restriction
>> on EQ/NE_EXPR
>> is missing.  Operand type equality is tested anyway.
>>
>> Why do you need to restrict forward_propagate_into_comparison_1?
>>
>> Otherwise this looks better, but can you try with the VIEW_CONVERT as well?
>>
>> Thanks,
>> Richard.
>>
>>
>>> Thanks.
>>> Yuri.
>>>
>>> 2015-08-06 14:07 GMT+03:00 Yuri Rumyantsev :
>>>> HI All,
>>>>
>>>> Here is updated patch which implements Richard proposal to use vector
>>>> comparison with boolean result instead of target hook. Support for it
>>>> was added to ix86_expand_branch.
>>>>
>>>> Any comments will be appreciated.
>>>>
>>>> Bootstrap and regression testing did not show any new failures.
>>>>
>>>> ChangeLog:
>>>> 2015-08-06  Yuri Rumyantsev  
>>>>
>>>> * config/i386/i386.c (ix86_expand_branch): Implement vector
>>>> comparison with boolean result.
>>>> * config/i386/sse.md

Re: [PATCH] Simple optimization for MASK_STORE.

2015-11-06 Thread Yuri Rumyantsev
Richard,

I tried it but 256-bit precision integer type is not yet supported.

Yuri.


2015-11-06 15:56 GMT+03:00 Richard Biener :
> On Mon, Nov 2, 2015 at 4:24 PM, Yuri Rumyantsev  wrote:
>> Hi Richard,
>>
>> I've come back to this optimization and try to implement your proposal
>> for comparison:
>>> Btw, you didn't try the simpler alternative of
>>>
>>> tree type = type_for_mode (int_mode_for_mode (TYPE_MODE (vectype)));
>>> build2 (EQ_EXPR, boolean_type_node,
>>>  build1 (VIEW_CONVERT, type, op0), build1 (VIEW_CONVERT, type, op1));
>>>
>>> ?  That is, use the GIMPLE level equivalent of
>>>  (cmp (subreg:TI reg:V4SI) (subreg:TI reg:V4SI))
>>
>> using the following code:
>>
>>   vectype = TREE_TYPE (mask);
>>   ext_mode = mode_for_size (GET_MODE_BITSIZE (TYPE_MODE (vectype)),
>> MODE_INT, 0);
>>   ext_type = lang_hooks.types.type_for_mode (ext_mode , 1);
>>
>> but I've got zero type for it. Should I miss something?
>
> Use ext_type = build_nonstandard_integer_type (GET_MODE_PRECISION
> (ext_mode), 1);
>
> Richard.
>
>> Any help will be appreciated.
>> Yuri.
>>
>>
>> 2015-08-13 14:40 GMT+03:00 Richard Biener :
>>> On Thu, Aug 13, 2015 at 1:32 PM, Yuri Rumyantsev  wrote:
>>>> Hi Richard,
>>>>
>>>> Did you have a chance to look at updated patch?
>>>
>>> Having a quick look now.  Btw, you didn't try the simpler alternative of
>>>
>>>  tree type = type_for_mode (int_mode_for_mode (TYPE_MODE (vectype)));
>>>  build2 (EQ_EXPR, boolean_type_node,
>>>build1 (VIEW_CONVERT, type, op0), build1 (VIEW_CONVERT, type, op1));
>>>
>>> ?  That is, use the GIMPLE level equivalent of
>>>
>>>  (cmp (subreg:TI reg:V4SI) (subreg:TI reg:V4SI))
>>>
>>> ?  That should be supported by the expander already, though again not sure 
>>> if
>>> the target(s) have compares that match this.
>>>
>>> Btw, the tree-cfg.c hook wasn't what was agreed on - the restriction
>>> on EQ/NE_EXPR
>>> is missing.  Operand type equality is tested anyway.
>>>
>>> Why do you need to restrict forward_propagate_into_comparison_1?
>>>
>>> Otherwise this looks better, but can you try with the VIEW_CONVERT as well?
>>>
>>> Thanks,
>>> Richard.
>>>
>>>
>>>> Thanks.
>>>> Yuri.
>>>>
>>>> 2015-08-06 14:07 GMT+03:00 Yuri Rumyantsev :
>>>>> HI All,
>>>>>
>>>>> Here is updated patch which implements Richard proposal to use vector
>>>>> comparison with boolean result instead of target hook. Support for it
>>>>> was added to ix86_expand_branch.
>>>>>
>>>>> Any comments will be appreciated.
>>>>>
>>>>> Bootstrap and regression testing did not show any new failures.
>>>>>
>>>>> ChangeLog:
>>>>> 2015-08-06  Yuri Rumyantsev  
>>>>>
>>>>> * config/i386/i386.c (ix86_expand_branch): Implement vector
>>>>> comparison with boolean result.
>>>>> * config/i386/sse.md (define_expand "cbranch<mode>4"): Add define
>>>>> for vector comparion.
>>>>> * fold-const.c (fold_relational_const): Add handling of vector
>>>>> comparison with boolean result.
>>>>> * params.def (PARAM_ZERO_TEST_FOR_STORE_MASK): New DEFPARAM.
>>>>> * params.h (ENABLE_ZERO_TEST_FOR_STORE_MASK): new macros.
>>>>> * tree-cfg.c (verify_gimple_comparison): Add test for vector
>>>>> comparion with boolean result.
>>>>> * tree-ssa-forwprop.c (forward_propagate_into_comparison_1): Do not
>>>>> propagate vector comparion with boolean result.
>>>>> * tree-vect-stmts.c (vectorizable_mask_load_store): Initialize
>>>>> has_mask_store field of vect_info.
>>>>> * tree-vectorizer.c: Include files ssa.h, cfghooks.h and params.h.
>>>>> (is_valid_sink): New function.
>>>>> (optimize_mask_stores): New function.
>>>>> (vectorize_loops): Invoke optimaze_mask_stores for loops having masked
>>>>> stores.
>>>>> * tree-vectorizer.h (loop_vec_info): Add new has_mask_store field and
>>>>> correspondent macros.
>>>>>
>>>>> gcc/testsuite/ChangeLog:
>>>>> * gcc.target/i386/avx2-vect-mask-store-move1.c: New test.
>>>>

Re: [PATCH] Simple optimization for MASK_STORE.

2015-11-11 Thread Yuri Rumyantsev
Richard,

What should we do to cope with this problem (the structure size increase)?
Should we return to the vector comparison version?

Thanks.
Yuri.

2015-11-11 12:18 GMT+03:00 Richard Biener :
> On Tue, Nov 10, 2015 at 3:56 PM, Ilya Enkovich  wrote:
>> 2015-11-10 17:46 GMT+03:00 Richard Biener :
>>> On Tue, Nov 10, 2015 at 1:48 PM, Ilya Enkovich  
>>> wrote:
>>>> 2015-11-10 15:33 GMT+03:00 Richard Biener :
>>>>> On Fri, Nov 6, 2015 at 2:28 PM, Yuri Rumyantsev  
>>>>> wrote:
>>>>>> Richard,
>>>>>>
>>>>>> I tried it but 256-bit precision integer type is not yet supported.
>>>>>
>>>>> What's the symptom?  The compare cannot be expanded?  Just add a pattern 
>>>>> then.
>>>>> After all we have modes up to XImode.
>>>>
>>>> I suppose problem may be in:
>>>>
>>>> gcc/config/i386/i386-modes.def:#define MAX_BITSIZE_MODE_ANY_INT (128)
>>>>
>>>> which doesn't allow to create constants of bigger size.  Changing it
>>>> to maximum vector size (512) would mean we increase wide_int structure
>>>> size significantly. New patterns are probably also needed.
>>>
>>> Yes, new patterns are needed but wide-int should be fine (we only need to 
>>> create
>>> a literal zero AFACS).  The "new pattern" would be equality/inequality
>>> against zero
>>> compares only.
>>
>> Currently 256bit integer creation fails because wide_int for max and
>> min values cannot be created.
>
> Hmm, indeed:
>
> #1  0x0072dab5 in wi::extended_tree<192>::extended_tree (
> this=0x7fffd950, t=0x76a000b0)
> at /space/rguenther/src/svn/trunk/gcc/tree.h:5125
> 5125  gcc_checking_assert (TYPE_PRECISION (TREE_TYPE (t)) <= N);
>
> but that's not that the constants fail to be created but
>
> #5  0x010d8828 in build_nonstandard_integer_type (precision=512,
> unsignedp=65) at /space/rguenther/src/svn/trunk/gcc/tree.c:8051
> 8051  if (tree_fits_uhwi_p (TYPE_MAX_VALUE (itype)))
> (gdb) l
> 8046fixup_unsigned_type (itype);
> 8047  else
> 8048fixup_signed_type (itype);
> 8049
> 8050  ret = itype;
> 8051  if (tree_fits_uhwi_p (TYPE_MAX_VALUE (itype)))
> 8052ret = type_hash_canon (tree_to_uhwi (TYPE_MAX_VALUE
> (itype)), itype);
>
> thus the integer type hashing being "interesting".  tree_fits_uhwi_p
> fails because
> it does
>
> 7289bool
> 7290tree_fits_uhwi_p (const_tree t)
> 7291{
> 7292  return (t != NULL_TREE
> 7293  && TREE_CODE (t) == INTEGER_CST
> 7294  && wi::fits_uhwi_p (wi::to_widest (t)));
> 7295}
>
> and wi::to_widest () fails with doing
>
> 5121template <int N>
> 5122inline wi::extended_tree <N>::extended_tree (const_tree t)
> 5123  : m_t (t)
> 5124{
> 5125  gcc_checking_assert (TYPE_PRECISION (TREE_TYPE (t)) <= N);
> 5126}
>
> fixing the hashing then runs into type_cache_hasher::equal doing
> tree_int_cst_equal
> which again uses to_widest (it should be easier and cheaper to do the compare 
> on
> the actual tree representation, but well, seems to be just the first
> of various issues
> we'd run into).
>
> We eventually could fix the assert above (but then need to hope we assert
> when a computation overflows the narrower precision of widest_int) or use
> a special really_widest_int (ugh).
>
>> It is fixed by increasing MAX_BITSIZE_MODE_ANY_INT, but it increases
>> WIDE_INT_MAX_ELTS
>> and thus increases wide_int structure. If we use 512 for
>> MAX_BITSIZE_MODE_ANY_INT then
>> wide_int structure would grow by 48 bytes (16 bytes if use 256 for
>> MAX_BITSIZE_MODE_ANY_INT).
>> Is it OK for such narrow usage?
>
> widest_int is used in some long-living structures (which is the reason for
> MAX_BITSIZE_MODE_ANY_INT in the first place).  So I don't think so.
>
> Richard.
>
>> Ilya
>>
>>>
>>> Richard.
>>>
>>>> Ilya
>>>>
>>>>>
>>>>> Richard.
>>>>>
>>>>>> Yuri.
>>>>>>
>>>>>>


Re: [RFC] Combine vectorized loops with its scalar remainder.

2015-11-13 Thread Yuri Rumyantsev
Hi Richard,

Here is an updated version of the patch which (1) is in sync with the trunk
compiler and (2) contains a simple cost model to estimate the profitability
of scalar epilogue elimination. The part related to vectorization of
loops with a small trip count is still under development. Note that the
implemented cost model was not tuned well for HASWELL and KNL, but we
got ~6% speed-up on 436.cactusADM from the spec2006 suite on HASWELL.
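
To show the shape of the profitability test, a hand-written sketch (all
names here are hypothetical, not taken from the patch): combining is
accepted only when the masking overhead added to the vector body is
outweighed by the scalar epilogue it removes.

  static bool
  combining_profitable_p (unsigned body_cost, unsigned masked_body_cost,
                          unsigned scalar_epilogue_cost, unsigned threshold)
  {
    /* Assumes masked_body_cost >= body_cost; THRESHOLD is a percentage
       of the saved epilogue cost.  */
    unsigned masking_overhead = masked_body_cost - body_cost;
    return masking_overhead * 100 <= scalar_epilogue_cost * threshold;
  }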

2015-11-10 17:52 GMT+03:00 Richard Biener :
> On Tue, Nov 10, 2015 at 2:02 PM, Ilya Enkovich  wrote:
>> 2015-11-10 15:30 GMT+03:00 Richard Biener :
>>> On Tue, Nov 3, 2015 at 1:08 PM, Yuri Rumyantsev  wrote:
>>>> Richard,
>>>>
>>>> It looks like misunderstanding - we assume that for GCCv6 the simple
>>>> scheme of remainder will be used through introducing new IV :
>>>> https://gcc.gnu.org/ml/gcc-patches/2015-09/msg01435.html
>>>>
>>>> Is it true or we missed something?
>>>
>>> 
>>>> > Do you have an idea how "masking" is better be organized to be usable
>>>> > for both 4b and 4c?
>>>>
>>>> Do 2a ...
>>> Okay.
>>> 
>>
>> 2a was 'transform already vectorized loop as a separate
>> post-processing'. Isn't it what this prototype patch implements?
>> Current version only masks loop body which is in practice applicable
>> for AVX-512 only in the most cases.  With AVX-512 it's easier to see
>> how profitable masking might be and it is a main target for the first
>> masking version.  Extending it to prologues/epilogues and thus making
>> it more profitable for other targets is the next step and is out of
>> the scope of this patch.
>
> Ok, technically the prototype transforms the already vectorized loop.
> Of course I meant the vectorized loop be copied, masked and that
> result used as epilogue...
>
> I'll queue a more detailed look into the patch for this week.
>
> Did you perform any measurements with this patch like # of
> masked epilogues in SPEC 2006 FP (and any speedup?)
>
> Thanks,
> Richard.
>
>> Thanks,
>> Ilya
>>
>>>
>>> Richard.
>>>


remainder.patch.2
Description: Binary data


Re: [PATCH] Simple optimization for MASK_STORE.

2015-11-19 Thread Yuri Rumyantsev
Hi Richard,

I'm sending you an updated version of the patch which contains the fixes
you mentioned and an additional early exit in
register_edge_assert_for() for a gcond with a vector comparison - it tries
to produce an assert for
  if (vect != {0,0,0,0}) but cannot create such a constant. This is not
essential since it is applied only in a very specialized context.
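
The early exit itself is a one-liner of roughly this shape (a sketch of
the idea, not the literal patch hunk):

  /* VRP does not track ranges of vector values, so do not try to
     register assertions for vector-typed names.  */
  if (TREE_CODE (TREE_TYPE (name)) == VECTOR_TYPE)
    return;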

My answers are below.

2015-11-12 16:58 GMT+03:00 Richard Biener :
> On Wed, Nov 11, 2015 at 2:13 PM, Yuri Rumyantsev  wrote:
>> Richard,
>>
>> What we should do to cope with this problem (structure size increasing)?
>> Should we return to vector comparison version?
>
> Ok, given this constraint I think the cleanest approach is to allow
> integer(!) vector equality(!) compares with scalar result.  This should then
> expand via cmp_optab and not via vec_cmp_optab.

  In fact it is expanded through cbranch_optab, since the only use of
such a comparison is for masked store motion.
>
> On gimple you can then have
>
>  if (mask_vec_1 != {0, 0, ...})
> ...
>
> Note that a fallback expansion (for optabs.c to try) would be
> the suggested view-conversion (aka, subreg) variant using
> a same-sized integer mode.
>
> Target maintainers can then choose what is a better fit for
> their target (and instruction set as register set constraints may apply).
>
> The patch you posted seems to do this but not restrict the compares
> to integer ones (please do that).
>
>if (TREE_CODE (op0_type) == VECTOR_TYPE
>   || TREE_CODE (op1_type) == VECTOR_TYPE)
>  {
> -  error ("vector comparison returning a boolean");
> -  debug_generic_expr (op0_type);
> -  debug_generic_expr (op1_type);
> -  return true;
> + /* Allow vector comparison returning boolean if operand types
> +are equal and CODE is EQ/NE.  */
> + if ((code != EQ_EXPR && code != NE_EXPR)
> + || TREE_CODE (op0_type) != TREE_CODE (op1_type)
> + || TYPE_VECTOR_SUBPARTS (op0_type)
> +!= TYPE_VECTOR_SUBPARTS (op1_type)
> + || GET_MODE_SIZE (TYPE_MODE (TREE_TYPE (op0_type)))
> +!= GET_MODE_SIZE (TYPE_MODE (TREE_TYPE (op1_type
>
> These are all checked with the useless_type_conversion_p checks done earlier.
>
> As said I'd like to see a
>
> || ! VECTOR_INTEGER_TYPE_P (op0_type)

  I added a check on VECTOR_BOOLEAN_TYPE_P (op0_type) instead, since the
type of the mask was changed.
>
> check added so we and targets do not need to worry about using EQ/NE vs. CMP
> and worry about signed zeros and friends.
>
> +   {
> + error ("type mismatch for vector comparison returning a 
> boolean");
> + debug_generic_expr (op0_type);
> + debug_generic_expr (op1_type);
> + return true;
> +   }
>
>
>
> --- a/gcc/tree-ssa-forwprop.c
> +++ b/gcc/tree-ssa-forwprop.c
> @@ -422,6 +422,15 @@ forward_propagate_into_comparison_1 (gimple *stmt,
>   enum tree_code def_code = gimple_assign_rhs_code (def_stmt);
>   bool invariant_only_p = !single_use0_p;
>
> + /* Can't combine vector comparison with scalar boolean type of
> +the result and VEC_COND_EXPR having vector type of comparison.  
> */
> + if (TREE_CODE (TREE_TYPE (op0)) == VECTOR_TYPE
> + && INTEGRAL_TYPE_P (type)
> + && (TREE_CODE (type) == BOOLEAN_TYPE
> + || TYPE_PRECISION (type) == 1)
> + && def_code == VEC_COND_EXPR)
> +   return NULL_TREE;
>
> this hints at larger fallout you paper over here.  So this effectively
> means we're trying combining (vec1 != vec2) != 0 for example
> and that fails miserably?  If so then the solution is to fix whatever
> does not expect this (valid) GENERIC tree.

  I changed it to the following check in combine_cond_expr_cond:
  /* Do not perform combining if types are not compatible.  */
  if (TREE_CODE (TREE_TYPE (op0)) == VECTOR_TYPE
      && !tree_int_cst_equal (TYPE_SIZE (type), TYPE_SIZE (TREE_TYPE (op0))))
    return NULL_TREE;
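
The shape that used to trip the folder looks roughly like this
(hand-written GIMPLE-style sketch with made-up SSA names): forwprop
forwards the VEC_COND_EXPR definition into the scalar-result comparison,
and fold_convert_loc then tries to convert the vector value to a scalar
boolean.

  mask_vec_42 = VEC_COND_EXPR <a_1 < b_2, { -1, -1, -1, -1 }, { 0, 0, 0, 0 }>;
  if (mask_vec_42 != { 0, 0, 0, 0 })   /* vector EQ/NE, boolean result */
    ...
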
>
> +  if (ENABLE_ZERO_TEST_FOR_MASK_STORE == 0)
> +return;
>
> not sure if I like a param more than a target hook ... :/
  I introduced the param instead of a target hook to make this
transformation user-visible, i.e. so that it can be switched on/off
for different targets.
>
> +  /* Create vector comparison with boolean result.  */
> +  vectype = TREE_TYPE (mask);
> +  zero = build_zero_cst (TREE_TYPE (vectype));
> +  zero = build_vector_from_val (vectype, zero);
>
> build_zero_cst (vectype);

Done.
>
> +  stmt = gimple_build_cond (EQ_EXPR, mask

Re: [RFC] Combine vectorized loops with its scalar remainder.

2015-11-23 Thread Yuri Rumyantsev
Hi Richard,

Did you have a chance to look at this?

Thanks.
Yuri.

2015-11-13 13:35 GMT+03:00 Yuri Rumyantsev :
> Hi Richard,
>
> Here is updated version of the patch which (1) is in sync with trunk
> compiler and (2) contains simple cost model to estimate profitability
> of scalar epilogue elimination. The part related to vectorization of
> loops with small trip count is in process of developing. Note that
> implemented cost model was not tuned  well for HASWELL and KNL but we
> got  ~6% speed-up on 436.cactusADM from spec2006 suite for HASWELL.
>
> 2015-11-10 17:52 GMT+03:00 Richard Biener :
>> On Tue, Nov 10, 2015 at 2:02 PM, Ilya Enkovich  
>> wrote:
>>> 2015-11-10 15:30 GMT+03:00 Richard Biener :
>>>> On Tue, Nov 3, 2015 at 1:08 PM, Yuri Rumyantsev  wrote:
>>>>> Richard,
>>>>>
>>>>> It looks like misunderstanding - we assume that for GCCv6 the simple
>>>>> scheme of remainder will be used through introducing new IV :
>>>>> https://gcc.gnu.org/ml/gcc-patches/2015-09/msg01435.html
>>>>>
>>>>> Is it true or we missed something?
>>>>
>>>> 
>>>>> > Do you have an idea how "masking" is better be organized to be usable
>>>>> > for both 4b and 4c?
>>>>>
>>>>> Do 2a ...
>>>> Okay.
>>>> 
>>>
>>> 2a was 'transform already vectorized loop as a separate
>>> post-processing'. Isn't it what this prototype patch implements?
>>> Current version only masks loop body which is in practice applicable
>>> for AVX-512 only in the most cases.  With AVX-512 it's easier to see
>>> how profitable masking might be and it is a main target for the first
>>> masking version.  Extending it to prologues/epilogues and thus making
>>> it more profitable for other targets is the next step and is out of
>>> the scope of this patch.
>>
>> Ok, technically the prototype transforms the already vectorized loop.
>> Of course I meant the vectorized loop be copied, masked and that
>> result used as epilogue...
>>
>> I'll queue a more detailed look into the patch for this week.
>>
>> Did you perform any measurements with this patch like # of
>> masked epilogues in SPEC 2006 FP (and any speedup?)
>>
>> Thanks,
>> Richard.
>>
>>> Thanks,
>>> Ilya
>>>
>>>>
>>>> Richard.
>>>>


[PATCH PR68542]

2015-11-30 Thread Yuri Rumyantsev
Hi All,

Here is a patch for the 481.wrf performance regression for avx2; it is a
slightly modified mask store optimization. This transformation allows
performing unpredication for a semi-hammock containing masked stores; in
other words, if we have a loop like

  for (i=0; i<n; i++)
    if (c[i]) {
      p1[i] += 1;
      p2[i] = p3[i] + 2;
    }

then it will be transformed to

   if (!mask__ifc__42.18_165 == { 0, 0, 0, 0, 0, 0, 0, 0 }) {
     vect__11.19_170 = MASK_LOAD (vectp_p1.20_168, 0B, mask__ifc__42.18_165);
     vect__12.22_172 = vect__11.19_170 + vect_cst__171;
     MASK_STORE (vectp_p1.23_175, 0B, mask__ifc__42.18_165, vect__12.22_172);
     vect__18.25_182 = MASK_LOAD (vectp_p3.26_180, 0B, mask__ifc__42.18_165);
     vect__19.28_184 = vect__18.25_182 + vect_cst__183;
     MASK_STORE (vectp_p2.29_187, 0B, mask__ifc__42.18_165, vect__19.28_184);
   }

i.e. it will put all computations related to masked stores into the
semi-hammock.

Bootstrapping and regression testing did not show any new failures.

ChangeLog:
PR middle-end/68542
* config/i386/i386.c (ix86_expand_branch): Implement integral vector
comparison with boolean result.
* config/i386/sse.md (define_expand "cbranch<mode>4"): Add define-expand
for vector comparison with eq/ne only.
* fold-const.c (fold_relational_const): Add handling of vector
comparison with boolean result.
* tree-cfg.c (verify_gimple_comparison): Add argument CODE, allow
comparison of vector operands with boolean result for EQ/NE only.
(verify_gimple_assign_binary): Adjust call for verify_gimple_comparison.
(verify_gimple_cond): Likewise.
* tree-ssa-forwprop.c (combine_cond_expr_cond): Do not perform
combining for non-compatible vector types.
* tree-vect-loop.c (is_valid_sink): New function.
(optimize_mask_stores): Likewise.
* tree-vect-stmts.c (vectorizable_mask_load_store): Initialize
has_mask_store field of vect_info.
* tree-vectorizer.c (vectorize_loops): Invoke optimize_mask_stores for
vectorized loops having masked stores.
* tree-vectorizer.h (loop_vec_info): Add new has_mask_store field and
correspondent macros.
(optimize_mask_stores): Add prototype.
* tree-vrp.c (register_edge_assert_for): Do not handle NAME with vector
type.

gcc/testsuite/ChangeLog:
* gcc.target/i386/avx2-vect-mask-store-move1.c: New test.


PR68542.patch
Description: Binary data


Re: [RFC] Combine vectorized loops with its scalar remainder.

2015-11-30 Thread Yuri Rumyantsev
Richard,

Thanks a lot for your detailed comments!

A few words about the 436.cactusADM gain. The loop which was transformed
for avx2 is very large, and it is the last innermost loop in the routine
Bench_StaggeredLeapfrog2 (StaggeredLeapfrog2.F #366). If you don't
have the sources, let me know.

Yuri.

2015-11-27 16:45 GMT+03:00 Richard Biener :
> On Fri, Nov 13, 2015 at 11:35 AM, Yuri Rumyantsev  wrote:
>> Hi Richard,
>>
>> Here is updated version of the patch which (1) is in sync with trunk
>> compiler and (2) contains simple cost model to estimate profitability
>> of scalar epilogue elimination. The part related to vectorization of
>> loops with small trip count is in process of developing. Note that
>> implemented cost model was not tuned  well for HASWELL and KNL but we
>> got  ~6% speed-up on 436.cactusADM from spec2006 suite for HASWELL.
>
> Ok, so I don't know where to start with this.
>
> First of all while I wanted to have the actual stmt processing to be
> as post-processing
> on the vectorized loop body I didn't want to have this competely separated 
> from
> vectorizing.
>
> So, do combine_vect_loop_remainder () from vect_transform_loop, not by 
> iterating
> over all (vectorized) loops at the end.
>
> Second, all the adjustments of the number of iterations for the vector
> loop should
> be integrated into the main vectorization scheme as should determining the
> cost of the predication.  So you'll end up adding a
> LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE flag, determined during
> cost analysis and during code generation adjust vector iteration computation
> accordingly and _not_ generate the epilogue loop (or wire it up correctly in
> the first place).
>
> The actual stmt processing should then still happen in a similar way as you 
> do.
>
> So I'm going to comment on that part only as I expect the rest will look a lot
> different.
>
> +/* Generate induction_vector which will be used to mask evaluation.  */
> +
> +static tree
> +gen_vec_induction (loop_vec_info loop_vinfo, unsigned elem_size, unsigned 
> size)
> +{
>
> please make use of create_iv.  Add more comments.  I reverse-engineered
> that you add a { { 0, ..., vf }, +, {vf, ... vf } } IV which you use
> in gen_mask_for_remainder
> by comparing it against { niter, ..., niter }.
>
> +  gsi = gsi_after_labels (loop->header);
> +  niters = LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
> +  ? LOOP_VINFO_NITERS (loop_vinfo)
> +  : LOOP_VINFO_NITERS_UNCHANGED (loop_vinfo);
>
> that's either wrong or unnecessary.  if ! peeling for alignment
> loop-vinfo-niters
> is equal to loop-vinfo-niters-unchanged.
>
> +  ptr = build_int_cst (reference_alias_ptr_type (ref), 0);
> +  if (!SSA_NAME_PTR_INFO (addr))
> +   copy_ref_info (build2 (MEM_REF, TREE_TYPE (ref), addr, ptr), ref);
>
> vect_duplicate_ssa_name_ptr_info.
>
> +
> +static void
> +fix_mask_for_masked_ld_st (vec *masked_stmt, tree mask)
> +{
> +  gimple *stmt, *new_stmt;
> +  tree old, lhs, vectype, var, n_lhs;
>
> no comment?  what's this for.
>
> +/* Convert vectorized reductions to VEC_COND statements to preserve
> +   reduction semantic:
> +   s1 = x + s2 --> t = x + s2; s1 = (mask)? t : s2.  */
> +
> +static void
> +convert_reductions (loop_vec_info loop_vinfo, tree mask)
> +{
>
> for reductions it looks like preserving the last iteration x plus the mask
> could avoid predicating it this way and compensate in the reduction
> epilogue by "subtracting" x & mask?  With true predication support
> that'll likely be more expensive of course.
>
> +  /* Generate new VEC_COND expr.  */
> +  vec_cond_expr = build3 (VEC_COND_EXPR, vectype, mask, new_lhs, rhs);
> +  new_stmt = gimple_build_assign (lhs, vec_cond_expr);
>
> gimple_build_assign (lhs, VEC_COND_EXPR, vectype, mask, new_lhs, rhs);
>
> +/* Return true if MEM_REF is incremented by vector size and false
> otherwise.  */
> +
> +static bool
> +mem_ref_is_vec_size_incremented (loop_vec_info loop_vinfo, tree lhs)
> +{
> +  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
>
> what?!  Just look at DR_STEP of the store?
>
>
> +void
> +combine_vect_loop_remainder (loop_vec_info loop_vinfo)
> +{
> +  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> +  auto_vec loads;
> +  auto_vec stores;
>
> so you need to re-structure this in a way that it computes
>
>   a) wheter it can perform the operation - and you need to do that
>   reliably before the operation has taken place
>   b) its cost
>
> instead of looking at def types or gimple_assign_load/store_p predicates
> please look at S

Re: [PATCH PR68542]

2015-12-04 Thread Yuri Rumyantsev
Hi Richard.

Thanks a lot for your review.
Below are my answers.

You asked why I inserted an additional check into
++ b/gcc/tree-ssa-forwprop.c
@@ -373,6 +373,11 @@ combine_cond_expr_cond (gimple *stmt, enum
tree_code code, tree type,

   gcc_assert (TREE_CODE_CLASS (code) == tcc_comparison);

+  /* Do not perform combining if types are not compatible.  */
+  if (TREE_CODE (TREE_TYPE (op0)) == VECTOR_TYPE
+      && !tree_int_cst_equal (TYPE_SIZE (type), TYPE_SIZE (TREE_TYPE (op0))))
+    return NULL_TREE;
+

again, how does this happen?

This is because without it I hit an assert in fold_convert_loc
  gcc_assert (TREE_CODE (orig) == VECTOR_TYPE
              && tree_int_cst_equal (TYPE_SIZE (type), TYPE_SIZE (orig)));

since it tries to convert a vector of bool to a scalar bool.
Here is the essential part of the call stack:

#0  internal_error (gmsgid=0x1e48397 "in %s, at %s:%d")
at ../../gcc/diagnostic.c:1259
#1  0x01743ada in fancy_abort (
file=0x1847fc3 "../../gcc/fold-const.c", line=2217,
function=0x184b9d0  "fold_convert_loc") at
../../gcc/diagnostic.c:1332
#2  0x009c8330 in fold_convert_loc (loc=0, type=0x718a9d20,
arg=0x71a7f488) at ../../gcc/fold-const.c:2216
#3  0x009f003f in fold_ternary_loc (loc=0, code=VEC_COND_EXPR,
type=0x718a9d20, op0=0x71a7f460, op1=0x718c2000,
op2=0x718c2030) at ../../gcc/fold-const.c:11453
#4  0x009f2f94 in fold_build3_stat_loc (loc=0, code=VEC_COND_EXPR,
type=0x718a9d20, op0=0x71a7f460, op1=0x718c2000,
op2=0x718c2030) at ../../gcc/fold-const.c:12394
#5  0x009d870c in fold_binary_op_with_conditional_arg (loc=0,
code=EQ_EXPR, type=0x718a9d20, op0=0x71a7f460,
op1=0x71a48780, cond=0x71a7f460, arg=0x71a48780,
cond_first_p=1) at ../../gcc/fold-const.c:6465
#6  0x009e3407 in fold_binary_loc (loc=0, code=EQ_EXPR,
type=0x718a9d20, op0=0x71a7f460, op1=0x71a48780)
at ../../gcc/fold-const.c:9211
#7  0x00ecb8fa in combine_cond_expr_cond (stmt=0x71a487d0,
code=EQ_EXPR, type=0x718a9d20, op0=0x71a7f460,
op1=0x71a48780, invariant_only=true)
at ../../gcc/tree-ssa-forwprop.c:382
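
To make the failure mode concrete, the statement being combined looks
roughly like this (a reduced sketch with made-up types and SSA names,
not the exact 481.wrf IR):

  vector(8) int mask_1;
  _Bool _2;
  _2 = mask_1 != { 0, 0, 0, 0, 0, 0, 0, 0 };  /* vector operands, scalar bool result */

Folding such a comparison into a dominated condition makes
fold_convert_loc try to convert the vector operand to the scalar bool
type, whose TYPE_SIZE differs - hence the assert and the size check
added above.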


Secondly, I did not catch your idea of implementing the GCC Vector
Extension for vector comparison with a boolean result, since such an
extension completely depends on the comparison context: e.g. in your
example the result type of the comparison depends on its use - for an
if-comparison it is scalar, but for c = (a == b) the result type is a
vector. I don't think this is reasonable for the current release.
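
A tiny illustration of the two contexts (illustrative only - the
whole-vector 'if' test is the hypothetical extension under discussion,
not something the extension currently defines):

  typedef int v4si __attribute__ ((vector_size (16)));
  v4si a, b;
  v4si c = (a == b);   /* element-wise compare: vector of 0 / -1 */
  if (a == b)          /* hypothetical whole-vector test: scalar bool */
    g ();              /* g is a made-up function */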

And finally, about AMD performance: I checked that this transformation
works with the "-march=bdver4" option, so the regression for 481.wrf
should disappear as well.

Thanks.
Yuri.

2015-12-04 15:18 GMT+03:00 Richard Biener :
> On Mon, Nov 30, 2015 at 2:11 PM, Yuri Rumyantsev  wrote:
>> Hi All,
>>
>> Here is a patch for the 481.wrf performance regression for avx2 which is
>> a slightly modified mask store optimization. This transformation allows
>> performing un-predication for a semi-hammock containing masked stores; in
>> other words, if we have a loop like
>> for (i=0; i<N; i++)
>>   if (c[i]) {
>> p1[i] += 1;
>> p2[i] = p3[i] +2;
>>   }
>>
>> then it will be transformed to
>>    if (mask__ifc__42.18_165 != { 0, 0, 0, 0, 0, 0, 0, 0 }) {
>>  vect__11.19_170 = MASK_LOAD (vectp_p1.20_168, 0B, mask__ifc__42.18_165);
>>  vect__12.22_172 = vect__11.19_170 + vect_cst__171;
>>  MASK_STORE (vectp_p1.23_175, 0B, mask__ifc__42.18_165, vect__12.22_172);
>>  vect__18.25_182 = MASK_LOAD (vectp_p3.26_180, 0B, mask__ifc__42.18_165);
>>  vect__19.28_184 = vect__18.25_182 + vect_cst__183;
>>  MASK_STORE (vectp_p2.29_187, 0B, mask__ifc__42.18_165, vect__19.28_184);
>>}
>> i.e. it will put all computations related to the masked stores into the semi-hammock.
>>
>> Bootstrapping and regression testing did not show any new failures.
>
> Can you please split out the middle-end support for vector equality compares?
>
> @@ -3448,10 +3448,17 @@ verify_gimple_comparison (tree type, tree op0, tree 
> op1)
>if (TREE_CODE (op0_type) == VECTOR_TYPE
>   || TREE_CODE (op1_type) == VECTOR_TYPE)
>  {
> -  error ("vector comparison returning a boolean");
> -  debug_generic_expr (op0_type);
> -  debug_generic_expr (op1_type);
> -  return true;
> + /* Allow vector comparison returning boolean if operand types
> +are equal and CODE is EQ/NE.  */
> + if ((code != EQ_EXPR && code != NE_EXPR)
> + || !(VECTOR_BOOLEAN_TYPE_P (op0_type)
> +  || VECTOR_INTEGER_TYPE_P (op0_type)))
> +   {
> + error ("

Re: [PATCH PR68542]

2015-12-07 Thread Yuri Rumyantsev
Richard!

Here is middle-end part of patch with changes proposed by you.

Is it OK for trunk?

Thanks.
Yuri.

ChangeLog:
2015-12-07  Yuri Rumyantsev  

PR middle-end/68542
* fold-const.c (fold_relational_const): Add handling of vector
comparison with boolean result.
* tree-cfg.c (verify_gimple_comparison): Add argument CODE, allow
comparison of vector operands with boolean result for EQ/NE only.
(verify_gimple_assign_binary): Adjust call for verify_gimple_comparison.
(verify_gimple_cond): Likewise.
* tree-ssa-forwprop.c (combine_cond_expr_cond): Do not perform
combining for non-compatible vector types.
* tree-vrp.c (register_edge_assert_for): VRP does not track ranges for
vector types.



2015-12-04 18:07 GMT+03:00 Yuri Rumyantsev :
> Hi Richard.
>
> Thanks a lot for your review.
> Below are my answers.
>
> You asked why I inserted additional check to
> ++ b/gcc/tree-ssa-forwprop.c
> @@ -373,6 +373,11 @@ combine_cond_expr_cond (gimple *stmt, enum
> tree_code code, tree type,
>
>gcc_assert (TREE_CODE_CLASS (code) == tcc_comparison);
>
> +  /* Do not perform combining if types are not compatible.  */
> +  if (TREE_CODE (TREE_TYPE (op0)) == VECTOR_TYPE
> +      && !tree_int_cst_equal (TYPE_SIZE (type), TYPE_SIZE (TREE_TYPE (op0))))
> +    return NULL_TREE;
> +
>
> again, how does this happen?
>
> This is because without it I got an assert in fold_convert_loc:
>   gcc_assert (TREE_CODE (orig) == VECTOR_TYPE
>  && tree_int_cst_equal (TYPE_SIZE (type), TYPE_SIZE (orig)));
>
> since it tries to convert a vector of booleans to a scalar bool.
> Here is essential part of call-stack:
>
> #0  internal_error (gmsgid=0x1e48397 "in %s, at %s:%d")
> at ../../gcc/diagnostic.c:1259
> #1  0x01743ada in fancy_abort (
> file=0x1847fc3 "../../gcc/fold-const.c", line=2217,
> function=0x184b9d0 <fold_convert_loc(unsigned int, tree_node*, tree_node*)::__FUNCTION__> "fold_convert_loc") at
> ../../gcc/diagnostic.c:1332
> #2  0x009c8330 in fold_convert_loc (loc=0, type=0x718a9d20,
> arg=0x71a7f488) at ../../gcc/fold-const.c:2216
> #3  0x009f003f in fold_ternary_loc (loc=0, code=VEC_COND_EXPR,
> type=0x718a9d20, op0=0x71a7f460, op1=0x718c2000,
> op2=0x718c2030) at ../../gcc/fold-const.c:11453
> #4  0x009f2f94 in fold_build3_stat_loc (loc=0, code=VEC_COND_EXPR,
> type=0x718a9d20, op0=0x71a7f460, op1=0x718c2000,
> op2=0x718c2030) at ../../gcc/fold-const.c:12394
> #5  0x009d870c in fold_binary_op_with_conditional_arg (loc=0,
> code=EQ_EXPR, type=0x718a9d20, op0=0x71a7f460,
> op1=0x71a48780, cond=0x71a7f460, arg=0x71a48780,
> cond_first_p=1) at ../../gcc/fold-const.c:6465
> #6  0x009e3407 in fold_binary_loc (loc=0, code=EQ_EXPR,
> type=0x718a9d20, op0=0x71a7f460, op1=0x71a48780)
> at ../../gcc/fold-const.c:9211
> #7  0x00ecb8fa in combine_cond_expr_cond (stmt=0x71a487d0,
> code=EQ_EXPR, type=0x718a9d20, op0=0x71a7f460,
> op1=0x71a48780, invariant_only=true)
> at ../../gcc/tree-ssa-forwprop.c:382
>
>
> Secondly, I did not catch your idea of implementing the GCC Vector
> Extension for vector comparison with a boolean result, since such an
> extension completely depends on the comparison context: e.g. in your
> example the result type of the comparison depends on its use - for an
> if-comparison it is scalar, but for c = (a == b) the result type is a
> vector. I don't think this is reasonable for the current release.
>
> And finally, about AMD performance: I checked that this transformation
> works with the "-march=bdver4" option, so the regression for 481.wrf
> should disappear as well.
>
> Thanks.
> Yuri.
>
> 2015-12-04 15:18 GMT+03:00 Richard Biener :
>> On Mon, Nov 30, 2015 at 2:11 PM, Yuri Rumyantsev  wrote:
>>> Hi All,
>>>
>>> Here is a patch for the 481.wrf performance regression for avx2 which is
>>> a slightly modified mask store optimization. This transformation allows
>>> performing un-predication for a semi-hammock containing masked stores; in
>>> other words, if we have a loop like
>>> for (i=0; i<N; i++)
>>>   if (c[i]) {
>>> p1[i] += 1;
>>> p2[i] = p3[i] +2;
>>>   }
>>>
>>> then it will be transformed to
>>>    if (mask__ifc__42.18_165 != { 0, 0, 0, 0, 0, 0, 0, 0 }) {
>>>  vect__11.19_170 = MASK_LOAD (vectp_p1.20_168, 0B, 
>>> mask__ifc__42.18_165);
>>>  vect__12.22_172 = vect__11.19_170 + vect_cst__171;
>>>  MASK_STORE (vectp_p1.23_175, 0B, mask__ifc__42.18_165, 
>>> vect__12.22_172);
>>>  vect__18.25_182 = MASK_LOAD (vectp_p3.26_180, 0B, 
>>> mask__ifc__

Re: [PATCH PR68542]

2015-12-08 Thread Yuri Rumyantsev
Hi Richard,

Here is the second part of patch.

Is it OK for trunk?

I assume that it should also fix the huge degradation on 481.wrf for -march=bdver4.
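
(For the x86 side, the point is to make a branch on a whole-vector
equality test expandable: schematically, the guard

  if (mask_1 != { 0, 0, 0, 0, 0, 0, 0, 0 }) goto semi_hammock;

becomes a vector test against zero - e.g. a ptest-style instruction -
followed by a conditional jump. The concrete instruction named here is
illustrative, not a claim about what the patch emits.)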

ChangeLog:
2015-12-08  Yuri Rumyantsev  

PR middle-end/68542
* config/i386/i386.c (ix86_expand_branch): Implement integral vector
comparison with boolean result.
* config/i386/sse.md (define_expand "cbranch<mode>4"): Add define_expand
for vector comparison with eq/ne only.
* tree-vect-loop.c (is_valid_sink): New function.
(optimize_mask_stores): Likewise.
* tree-vect-stmts.c (vectorizable_mask_load_store): Initialize
has_mask_store field of vect_info.
* tree-vectorizer.c (vectorize_loops): Invoke optimize_mask_stores for
vectorized loops having masked stores.
* tree-vectorizer.h (loop_vec_info): Add new has_mask_store field and
correspondent macros.
(optimize_mask_stores): Add prototype.

gcc/testsuite/ChangeLog:
* gcc.target/i386/avx2-vect-mask-store-move1.c: New test.

2015-12-07 13:57 GMT+03:00 Yuri Rumyantsev :
> Richard!
>
> Here is middle-end part of patch with changes proposed by you.
>
> Is it OK for trunk?
>
> Thanks.
> Yuri.
>
> ChangeLog:
> 2015-12-07  Yuri Rumyantsev  
>
> PR middle-end/68542
> * fold-const.c (fold_relational_const): Add handling of vector
> comparison with boolean result.
> * tree-cfg.c (verify_gimple_comparison): Add argument CODE, allow
> comparison of vector operands with boolean result for EQ/NE only.
> (verify_gimple_assign_binary): Adjust call for verify_gimple_comparison.
> (verify_gimple_cond): Likewise.
> * tree-ssa-forwprop.c (combine_cond_expr_cond): Do not perform
> combining for non-compatible vector types.
> * tree-vrp.c (register_edge_assert_for): VRP does not track ranges for
> vector types.
>
>
>
> 2015-12-04 18:07 GMT+03:00 Yuri Rumyantsev :
>> Hi Richard.
>>
>> Thanks a lot for your review.
>> Below are my answers.
>>
>> You asked why I inserted additional check to
>> ++ b/gcc/tree-ssa-forwprop.c
>> @@ -373,6 +373,11 @@ combine_cond_expr_cond (gimple *stmt, enum
>> tree_code code, tree type,
>>
>>gcc_assert (TREE_CODE_CLASS (code) == tcc_comparison);
>>
>> +  /* Do not perform combining if types are not compatible.  */
>> +  if (TREE_CODE (TREE_TYPE (op0)) == VECTOR_TYPE
>> +      && !tree_int_cst_equal (TYPE_SIZE (type), TYPE_SIZE (TREE_TYPE (op0))))
>> +    return NULL_TREE;
>> +
>>
>> again, how does this happen?
>>
>> This is because without it I got an assert in fold_convert_loc:
>>   gcc_assert (TREE_CODE (orig) == VECTOR_TYPE
>>  && tree_int_cst_equal (TYPE_SIZE (type), TYPE_SIZE (orig)));
>>
>> since it tries to convert a vector of booleans to a scalar bool.
>> Here is essential part of call-stack:
>>
>> #0  internal_error (gmsgid=0x1e48397 "in %s, at %s:%d")
>> at ../../gcc/diagnostic.c:1259
>> #1  0x01743ada in fancy_abort (
>> file=0x1847fc3 "../../gcc/fold-const.c", line=2217,
>> function=0x184b9d0 <fold_convert_loc(unsigned int, tree_node*, tree_node*)::__FUNCTION__> "fold_convert_loc") at
>> ../../gcc/diagnostic.c:1332
>> #2  0x009c8330 in fold_convert_loc (loc=0, type=0x718a9d20,
>> arg=0x71a7f488) at ../../gcc/fold-const.c:2216
>> #3  0x009f003f in fold_ternary_loc (loc=0, code=VEC_COND_EXPR,
>> type=0x718a9d20, op0=0x71a7f460, op1=0x718c2000,
>> op2=0x718c2030) at ../../gcc/fold-const.c:11453
>> #4  0x009f2f94 in fold_build3_stat_loc (loc=0, code=VEC_COND_EXPR,
>> type=0x718a9d20, op0=0x71a7f460, op1=0x718c2000,
>> op2=0x718c2030) at ../../gcc/fold-const.c:12394
>> #5  0x009d870c in fold_binary_op_with_conditional_arg (loc=0,
>> code=EQ_EXPR, type=0x718a9d20, op0=0x71a7f460,
>> op1=0x71a48780, cond=0x71a7f460, arg=0x71a48780,
>> cond_first_p=1) at ../../gcc/fold-const.c:6465
>> #6  0x009e3407 in fold_binary_loc (loc=0, code=EQ_EXPR,
>> type=0x718a9d20, op0=0x71a7f460, op1=0x71a48780)
>> at ../../gcc/fold-const.c:9211
>> #7  0x00ecb8fa in combine_cond_expr_cond (stmt=0x71a487d0,
>> code=EQ_EXPR, type=0x718a9d20, op0=0x71a7f460,
>> op1=0x71a48780, invariant_only=true)
>> at ../../gcc/tree-ssa-forwprop.c:382
>>
>>
>> Secondly, I did not catch your idea of implementing the GCC Vector
>> Extension for vector comparison with a boolean result, since such an
>> extension completely depends on the comparison context: e.g. in your
>> example the result type of the comparison depends on its use - for an
>> if-comparison it is scalar, but for c = (a == b) the result type is
>>

[PATCH PR62011]

2014-08-14 Thread Yuri Rumyantsev
Hi All,

Here is a fix for PR 62011 - remove the false dependency for unary
bit-manipulation instructions on the latest Big Core chips (Sandybridge
and Haswell) by outputting in the assembly file a zeroing of the
destination register before the bmi instruction. I checked that
performance is restored for the popcnt, lzcnt and tzcnt instructions.
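
For reference, the false dependency and its fix look roughly like this
(an illustrative sequence; the registers and surrounding code are made
up):

    popcnt  (%rdi), %rax    # HW also treats %rax as an input, creating a
                            # false dependency on its previous writer
    ...
    xor     %eax, %eax      # zeroing idiom breaks the dependency chain
    popcnt  (%rdi), %rax    # now independent of older %rax producers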

Bootstrap and regression testing did not show any new failures.

Is it OK for trunk?

gcc/ChangeLog
2014-08-14  Yuri Rumyantsev  

PR target/62011
* config/i386/i386-protos.h (ix86_avoid_false_dep_for_bm): New function
 prototype.
* config/i386/i386.c (ix86_avoid_false_dep_for_bm): New function.
* config/i386/i386.h (TARGET_AVOID_FALSE_DEP_FOR_BM) New macros.
* config/i386/i386.md (ctz<mode>2, clz<mode>2_lzcnt, popcount<mode>2,
 *popcount<mode>2_cmp, *popcountsi2_cmp_zext): Output zeroing
 destination register for unary bit-manipulation instructions
 if required.
* config/i386/x86-tune.def (X86_TUNE_AVOID_FALSE_DEP_FOR_BM): New.


patch
Description: Binary data


Re: [PATCH PR62011]

2014-08-14 Thread Yuri Rumyantsev
For example, for the first loop of the attached test case we do not
prepend xor to a popcnt that already uses its destination register
(here %rax and %rcx appear in the memory addresses):

.L23:
leal 1(%rdx), %ecx
popcntq (%rbx,%rax,8), %rax
leal 2(%rdx), %r8d
popcntq (%rbx,%rcx,8), %rcx
addq %rax, %rcx
leal 3(%rdx), %esi
xorq %rax, %rax
popcntq (%rbx,%r8,8), %rax
addq %rax, %rcx
xorq %rax, %rax
popcntq (%rbx,%rsi,8), %rax
addq %rax, %rcx
leal 4(%rdx), %eax
addq %rcx, %r14
movq %rax, %rdx
cmpq %rax, %r12
ja .L23

2014-08-14 18:06 GMT+04:00 Ilya Enkovich :
> 2014-08-14 18:00 GMT+04:00 Alexander Monakov :
>>
>>
>> On Thu, 14 Aug 2014, Yuri Rumyantsev wrote:
>>
>>> Hi All,
>>>
>>> Here is a fix for PR 62011 - remove false dependency for unary
>>> bit-manipulation instructions for latest BigCore chips (Sandybridge
>>> and Haswell) by outputting in assembly file zeroing destination
>>> register before bmi instruction. I checked that performance restored
>>> for popcnt, lzcnt and tzcnt instructions.
>>
>> I am not an x86 reviewer, but one thing looks a bit superfluous to me:
>>
>>> +/* Retirn true if we need to insert before bit-manipulation instruction
>> note typo^
>>
>>> +   zeroing of its destination register.  */
>>> +bool
>>> +ix86_avoid_false_dep_for_bm (rtx insn, rtx operands[])
>>> +{
>>> +  unsigned int regno0;
>>> +  df_ref use;
>>> +  if (!TARGET_AVOID_FALSE_DEP_FOR_BM || optimize_function_for_size_p (cfun))
>>> +    return false;
>>> +  regno0 = true_regnum (operands[0]);
>>> +  /* Check if insn does not use REGNO0.  */
>>> +  FOR_EACH_INSN_USE (use, insn)
>>> +if (regno0 == DF_REF_REGNO (use))
>>> +  return false;
>>> +  return true;
>>> +}
>>
>> The loop is to prevent adding the xor when the dest operand is also the 
>> source
>> operand.  Looks like a simpler "reg_or_subregno (operands[0]) ==
>> reg_or_subregno (operands[1])" could be used here, as long as the assumption
>> that this is called only for two-operand instruction holds?
>
> This wouldn't cover memory operand case.
>
> Ilya
>
>>
>> Alexander


Re: [PATCH PR62011]

2014-08-14 Thread Yuri Rumyantsev
It does not help Silvermont, i.e. only Haswell and SandyBridge are affected.
I don't use a splitter since (1) it deletes the zeroing of the dest reg; (2)
the scheduler can hoist it up. I will try the r16/r32 variants and tell you
later.

2014-08-14 19:18 GMT+04:00 H.J. Lu :
> On Thu, Aug 14, 2014 at 4:50 AM, Yuri Rumyantsev  wrote:
>> Hi All,
>>
>> Here is a fix for PR 62011 - remove false dependency for unary
>> bit-manipulation instructions for latest BigCore chips (Sandybridge
>> and Haswell) by outputting in assembly file zeroing destination
>> register before bmi instruction. I checked that performance restored
>> for popcnt, lzcnt and tzcnt instructions.
>>
>> Bootstrap and regression testing did not show any new failures.
>>
>> Is it OK for trunk?
>>
>> gcc/ChangeLog
>> 2014-08-14  Yuri Rumyantsev  
>>
>> PR target/62011
>> * config/i386/i386-protos.h (ix86_avoid_false_dep_for_bm): New function
>>  prototype.
>> * config/i386/i386.c (ix86_avoid_false_dep_for_bm): New function.
>> * config/i386/i386.h (TARGET_AVOID_FALSE_DEP_FOR_BM) New macros.
>> * config/i386/i386.md (ctz<mode>2, clz<mode>2_lzcnt, popcount<mode>2,
>>  *popcount<mode>2_cmp, *popcountsi2_cmp_zext): Output zeroing
>>  destination register for unary bit-manipulation instructions
>>  if required.
>
> Why don't you use a splitter to generate the XOR?
>
>> * config/i386/x86-tune.def (X86_TUNE_AVOID_FALSE_DEP_FOR_BM): New.
>
> Is this needed for r16 and r32?  The original report says that only
> r64 is affected:
>
> http://stackoverflow.com/questions/25078285/replacing-a-32-bit-loop-count-variable-with-64-bit-introduces-crazy-performance
>
> Have you tried this on Silvermont?  Does it help Silvermont?
>
> --
> H.J.


Re: [PATCH PR62011]

2014-08-15 Thread Yuri Rumyantsev
I checked that zeroing the destination operand of a unary bit-manipulation
instruction is helpful in 64- and 32-bit modes only, so the patch was
changed accordingly.

Is it OK for trunk?

gcc/ChangeLog
2014-08-15  Yuri Rumyantsev  

PR target/62011
* config/i386/i386-protos.h (ix86_avoid_false_dep_for_bm): New function
 prototype.
* config/i386/i386.c (ix86_avoid_false_dep_for_bm): New function.
* config/i386/i386.h (TARGET_AVOID_FALSE_DEP_FOR_BM) New macros.
* config/i386/i386.md (ctz<mode>2, clz<mode>2_lzcnt, popcount<mode>2,
 *popcount<mode>2_cmp, *popcountsi2_cmp_zext): Output zeroing
 destination register for unary bit-manipulation instructions
 if required.
* config/i386/x86-tune.def (X86_TUNE_AVOID_FALSE_DEP_FOR_BM): New.

2014-08-14 19:39 GMT+04:00 Yuri Rumyantsev :
> It does not help Silvermont, i.e. only Haswell and SandyBridge are affected.
> I don't use a splitter since (1) it deletes the zeroing of the dest reg; (2)
> the scheduler can hoist it up. I will try the r16/r32 variants and tell you
> later.
>
> 2014-08-14 19:18 GMT+04:00 H.J. Lu :
>> On Thu, Aug 14, 2014 at 4:50 AM, Yuri Rumyantsev  wrote:
>>> Hi All,
>>>
>>> Here is a fix for PR 62011 - remove false dependency for unary
>>> bit-manipulation instructions for latest BigCore chips (Sandybridge
>>> and Haswell) by outputting in assembly file zeroing destination
>>> register before bmi instruction. I checked that performance restored
>>> for popcnt, lzcnt and tzcnt instructions.
>>>
>>> Bootstrap and regression testing did not show any new failures.
>>>
>>> Is it OK for trunk?
>>>
>>> gcc/ChangeLog
>>> 2014-08-14  Yuri Rumyantsev  
>>>
>>> PR target/62011
>>> * config/i386/i386-protos.h (ix86_avoid_false_dep_for_bm): New function
>>>  prototype.
>>> * config/i386/i386.c (ix86_avoid_false_dep_for_bm): New function.
>>> * config/i386/i386.h (TARGET_AVOID_FALSE_DEP_FOR_BM) New macros.
>>> * config/i386/i386.md (ctz<mode>2, clz<mode>2_lzcnt, popcount<mode>2,
>>>  *popcount<mode>2_cmp, *popcountsi2_cmp_zext): Output zeroing
>>>  destination register for unary bit-manipulation instructions
>>>  if required.
>>
>> Why don't you use a splitter to generate the XOR?
>>
>>> * config/i386/x86-tune.def (X86_TUNE_AVOID_FALSE_DEP_FOR_BM): New.
>>
>> Is this needed for r16 and r32?  The original report says that only
>> r64 is affected:
>>
>> http://stackoverflow.com/questions/25078285/replacing-a-32-bit-loop-count-variable-with-64-bit-introduces-crazy-performance
>>
>> Have you tried this on Silvermont?  Does it help Silvermont?
>>
>> --
>> H.J.


patch1
Description: Binary data


Re: [PATCH] Extended if-conversion for loops marked with pragma omp simd.

2014-08-15 Thread Yuri Rumyantsev
Richard!
Here is an updated patch with the following changes:

1. All restrictions on phi-functions were eliminated for extended conversion.
2. The predicate for critical edges is put into the edge's 'aux' field,
i.e. negate_predicate was deleted.
3. Splitting of critical edges was deleted, i.e. both outgoing edges can
be critical.
4. The notion of cd-equivalence is used to set up the predicate for join
basic blocks, to simplify it.
5. I decided not to design a pre-pass, since it would lead to generating a
chain of cond expressions for phi-node if-conversion, whereas for a phi
of the kind
  x = PHI <1(2), 1(3), 2(4)>
only one cond expression is required; this is considered a simple
optimization for an arbitrary phi-function. More precisely, if a
phi-function has only two different arguments and one of them has a
single occurrence, if-conversion is performed as if the phi had only 2
arguments. For an arbitrary phi-function a chain of cond expressions is
produced (see the sketch below).
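
A sketch of the two cases (the bb numbers come from the PHI above; the
edge predicates c3 and c4 are illustrative names):

  x = PHI <1(2), 1(3), 2(4)>
    -->  x = c4 ? 2 : 1;        /* one COND_EXPR: argument 2 occurs once */

  x = PHI <a(2), b(3), c(4)>
    -->  t = c3 ? b : a;
         x = c4 ? c : t;        /* general case: chain of COND_EXPRs */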

Updated patch is attached.

Any comments will be appreciated.

2014-08-15  Yuri Rumyantsev  

* tree-if-conv.c (cgraph.h): Add include file to detect function clone.
(flag_force_vectorize): New variable.
(edge_predicate): New function.
(set_edge_predicate): New function.
(add_stmt_to_bb_predicate_gimplified_stmts): New function.
(init_bb_predicate): Add initialization of negate_predicate field.
(reset_bb_predicate): Reset negate_predicate to NULL_TREE.
(convert_name_to_cmp): New function.
(get_type_for_cond): New function.
(convert_bool_predicate): New function.
(predicate_disjunction): New function.
(predicate_conjunction): New function.
(add_to_predicate_list): Add convert_bool argument.
Use predicate of cd-equivalent block if convert_bool is true and
such bb exists; save it in static variable for further possible use.
Add call of predicate_disjunction if convert_bool argument is true.
(add_to_dst_predicate_list): Add convert_bool argument.
Add early function exit if edge target block is always executed.
Add call of predicate_conjunction if convert_bool argument is true.
Pass convert_bool argument for add_to_predicate_list.
Set-up predicate for critical edge if convert_bool is true.
(equal_phi_args): New function.
(phi_has_two_different_args): New function.
(if_convertible_phi_p): Accept phi nodes with more than two args
if flag_force_vectorize was set-up.
(ifcvt_can_use_mask_load_store): Add test on flag_force_vectorize.
(if_convertible_stmt_p): Allow calls of function clones if
flag_force_vectorize was set-up.
(all_edges_are_critical): New function.
(if_convertible_bb_p): Allow bb to have more than two predecessors if
flag_force_vectorize was set-up. Use call of all_edges_are_critical
to reject block if-conversion with incoming critical edges only if
flag_force_vectorize was not set-up.
(walk_cond_tree): New function.
(vect_bool_pattern_is_applicable): New function.
(predicate_bbs): Add convert_bool argument which is used to transform
comparison expressions of boolean type into conditional expressions
with integral operands. If convert_bool argument was set-up and
vect bool pattern can be applied perform the following transformation:
(bool) x != 0  --> y = (int) x; y != 0;
Add check that if fold_build2 produces bool conversion if convert_bool
was set-up, recompute predicate using build2_loc. Additional argument
'convert_bool" is passed to add_to_dst_predicate_list and
add_to_predicate_list.
(if_convertible_loop_p_1): Recompute POST_DOMINATOR tree if
flag_force_vectorize was set-up to calculate cd equivalent bb's.
Call predicate_bbs with additional argument equal to false.
(find_phi_replacement_condition): Extend function interface:
it returns NULL if given phi node must be handled by means of
extended phi node predication. If the number of predecessors of phi-block
is equal to 2 and at least one incoming edge is not critical, the original
algorithm is used.
(is_cond_scalar_reduction): Add 'extended' argument which signals that
phi arguments must be evaluated through phi_has_two_different_args.
(predicate_scalar_phi): Add invocation of convert_name_to_cmp if cond
is SSA_NAME. Add 'false' argument to call of is_cond_scalar_reduction.
(get_predicate_for_edge): New function.
(find_insertion_point): New function.
(predicate_arbitrary_phi): New function.
(predicate_extended_scalar_phi): New function.
(predicate_all_scalar_phis): Add code to set-up gimple statement
iterator for predication of extended scalar phi's for insertion.
(insert_gimplified_predicates): Add test for non-predicated basic
blocks that there are no gimplified statements to insert. Insert
predicates at the block beginning for extended if-conversion.
(predicate_mem_writes): Invoke convert_name_to_cmp for extended
predication to build mask.
(combine_blocks): Pass flag_force_vectorize to predicate_bbs.
(tree_if_conversion): Initialize flag_force_vectorize from current
loop or outer loop (to support pragma omp declare). Do loop versioning
for innermost loop marked with pragma omp simd.

2014-08-01 1

Re: [PATCH PR62011]

2014-08-15 Thread Yuri Rumyantsev
Jakub,

Is it important to have a correct value of the length attribute for Big Cores?
As far as I know, this attribute is used for code layout alignment.

2014-08-15 15:54 GMT+04:00 Jakub Jelinek :
> On Fri, Aug 15, 2014 at 03:45:33PM +0400, Yuri Rumyantsev wrote:
>> gcc/ChangeLog
>> 2014-08-15  Yuri Rumyantsev  
>>
>> PR target/62011
>> * config/i386/i386-protos.h (ix86_avoid_false_dep_for_bm): New function
>>  prototype.
>> * config/i386/i386.c (ix86_avoid_false_dep_for_bm): New function.
>> * config/i386/i386.h (TARGET_AVOID_FALSE_DEP_FOR_BM) New macros.
>> * config/i386/i386.md (ctz<mode>2, clz<mode>2_lzcnt, popcount<mode>2,
>>  *popcount<mode>2_cmp, *popcountsi2_cmp_zext): Output zeroing
>>  destination register for unary bit-manipulation instructions
>>  if required.
>> * config/i386/x86-tune.def (X86_TUNE_AVOID_FALSE_DEP_FOR_BM): New.
>
> --- config/i386/i386.md (revision 213842)
> +++ config/i386/i386.md (working copy)
> @@ -12111,7 +12111,13 @@
>""
>  {
>if (TARGET_BMI)
> -return "tzcnt{}\t{%1, %0|%0, %1}";
> +{
> +  if (ix86_avoid_false_dep_for_bm (insn, operands))
> +   return "xor{}\t%0, %0\n\t"
> +   "tzcnt{}\t{%1, %0|%0, %1}";
> +  else
> +return "tzcnt{}\t{%1, %0|%0, %1}";
> +}
>else if (optimize_function_for_size_p (cfun))
>  ;
>else if (TARGET_GENERIC)
>
> etc., this will make the length attribute incorrect though.
>
> Jakub


Re: [PATCH PR62011]

2014-08-18 Thread Yuri Rumyantsev
Uros,

Your patch is not complete since you missed two other bit-manipulation
instructions - lzcnt and tzcnt - which have the same deficiency.
Moreover, I don't think that we need to do it for the INTEL target
since it is related to the SILVERMONT target rather than to BIG COREs.
Should I modify your patch or will you do it yourself?

Thanks.
Yuri.

2014-08-15 18:06 GMT+04:00 Uros Bizjak :
> On Fri, Aug 15, 2014 at 2:26 PM, Yuri Rumyantsev  wrote:
>
>> Is it important to have correct value for length attribute for Big Cores?
>> As far as I know, this attribute is used for code layout alignment.
>>
>> 2014-08-15 15:54 GMT+04:00 Jakub Jelinek :
>>> On Fri, Aug 15, 2014 at 03:45:33PM +0400, Yuri Rumyantsev wrote:
>>>> gcc/ChangeLog
>>>> 2014-08-15  Yuri Rumyantsev  
>>>>
>>>> PR target/62011
>>>> * config/i386/i386-protos.h (ix86_avoid_false_dep_for_bm): New function
>>>>  prototype.
>>>> * config/i386/i386.c (ix86_avoid_false_dep_for_bm): New function.
>>>> * config/i386/i386.h (TARGET_AVOID_FALSE_DEP_FOR_BM) New macros.
>>>> * config/i386/i386.md (ctz<mode>2, clz<mode>2_lzcnt, popcount<mode>2,
>>>>  *popcount<mode>2_cmp, *popcountsi2_cmp_zext): Output zeroing
>>>>  destination register for unary bit-manipulation instructions
>>>>  if required.
>>>> * config/i386/x86-tune.def (X86_TUNE_AVOID_FALSE_DEP_FOR_BM): New.
>
> I am testing a different approach, outlined in the attached patch. In
> the patch, insn is split after reload to separate insns.
>
> As far as popcnt is concerned, we don't need _cmp pattern, the generic
> code is clever enough to substitute "if (popcnt (a))" with "if (a)".
>
> Uros.


Re: [PATCH] Extended if-conversion for loops marked with pragma omp simd.

2014-09-08 Thread Yuri Rumyantsev
Richard,

Did you have a chance to look at this?

Thanks.

2014-08-15 16:02 GMT+04:00 Yuri Rumyantsev :
> Richard!
> Here is an updated patch with the following changes:
>
> 1. All restrictions on phi-functions were eliminated for extended conversion.
> 2. The predicate for critical edges is put into the edge's 'aux' field,
> i.e. negate_predicate was deleted.
> 3. Splitting of critical edges was deleted, i.e. both outgoing edges can
> be critical.
> 4. The notion of cd-equivalence is used to set up the predicate for join
> basic blocks, to simplify it.
> 5. I decided not to design a pre-pass, since it would lead to generating a
> chain of cond expressions for phi-node if-conversion, whereas for a phi
> of the kind
>   x = PHI <1(2), 1(3), 2(4)>
> only one cond expression is required; this is considered a simple
> optimization for an arbitrary phi-function. More precisely, if a
> phi-function has only two different arguments and one of them has a
> single occurrence, if-conversion is performed as if the phi had only 2
> arguments. For an arbitrary phi-function a chain of cond expressions is
> produced.
>
> Updated patch is attached.
>
> Any comments will be appreciated.
>
> 2014-08-15  Yuri Rumyantsev  
>
> * tree-if-conv.c (cgraph.h): Add include file to detect function clone.
> (flag_force_vectorize): New variable.
> (edge_predicate): New function.
> (set_edge_predicate): New function.
> (add_stmt_to_bb_predicate_gimplified_stmts): New function.
> (init_bb_predicate): Add initialization of negate_predicate field.
> (reset_bb_predicate): Reset negate_predicate to NULL_TREE.
> (convert_name_to_cmp): New function.
> (get_type_for_cond): New function.
> (convert_bool_predicate): New function.
> (predicate_disjunction): New function.
> (predicate_conjunction): New function.
> (add_to_predicate_list): Add convert_bool argument.
> Use predicate of cd-equivalent block if convert_bool is true and
> such bb exists; save it in static variable for further possible use.
> Add call of predicate_disjunction if convert_bool argument is true.
> (add_to_dst_predicate_list): Add convert_bool argument.
> Add early function exit if edge target block is always executed.
> Add call of predicate_conjunction if convert_bool argument is true.
> Pass convert_bool argument for add_to_predicate_list.
> Set-up predicate for critical edge if convert_bool is true.
> (equal_phi_args): New function.
> (phi_has_two_different_args): New function.
> (if_convertible_phi_p): Accept phi nodes with more than two args
> if flag_force_vectorize was set-up.
> (ifcvt_can_use_mask_load_store): Add test on flag_force_vectorize.
> (if_convertible_stmt_p): Allow calls of function clones if
> flag_force_vectorize was set-up.
> (all_edges_are_critical): New function.
> (if_convertible_bb_p): Allow bb to have more than two predecessors if
> flag_force_vectorize was set-up. Use call of all_edges_are_critical
> to reject block if-conversion with incoming critical edges only if
> flag_force_vectorize was not set-up.
> (walk_cond_tree): New function.
> (vect_bool_pattern_is_applicable): New function.
> (predicate_bbs): Add convert_bool argument which is used to transform
> comparison expressions of boolean type into conditional expressions
> with integral operands. If convert_bool argument was set-up and
> vect bool pattern can be applied perform the following transformation:
> (bool) x != 0  --> y = (int) x; y != 0;
> Add check that if fold_build2 produces bool conversion if convert_bool
> was set-up, recompute predicate using build2_loc. Additional argument
> 'convert_bool" is passed to add_to_dst_predicate_list and
> add_to_predicate_list.
> (if_convertible_loop_p_1): Recompute POST_DOMINATOR tree if
> flag_force_vectorize was set-up to calculate cd equivalent bb's.
> Call predicate_bbs with additional argument equal to false.
> (find_phi_replacement_condition): Extend function interface:
> it returns NULL if given phi node must be handled by means of
> extended phi node predication. If the number of predecessors of phi-block
> is equal to 2 and at least one incoming edge is not critical, the original
> algorithm is used.
> (is_cond_scalar_reduction): Add 'extended' argument which signals that
> phi arguments must be evaluated through phi_has_two_different_args.
> (predicate_scalar_phi): Add invocation of convert_name_to_cmp if cond
> is SSA_NAME. Add 'false' argument to call of is_cond_scalar_reduction.
> (get_predicate_for_edge): New function.
> (find_insertion_point): New function.
> (predicate_arbitrary_phi): New function.
> (predicate_extended_scalar_phi): New function.
> (predicate_all_scalar_phis): Add code to set-up gimple statement
> iterator for predication of extended scalar phi's for in

[x86,PATCH] Additional fix for 57756.

2013-11-19 Thread Yuri Rumyantsev
Hi All,

We found out that a compiler configured with the '-fpmath=sse' option
does not generate scalar floating-point instructions present in the SSE
instruction set for generic32, which leads to performance degradation
for Fortran benchmarks using library functions in 32-bit mode.

This simple fix was designed to cure this issue - the definition of the
TARGET_FPMATH_DEFAULT_P macro was missing in i386/ssemath.h. One more
missing fix was also applied.

Bootstrapping and regression testing were successful.

Is it OK for trunk?

2013-11-20  Yuri Rumyantsev  

* config/i386/i386.c (ix86_option_override_internal): Add missing
argument prefix for 'ix86_fpmath'.
* config/i386/ssemath.h: Add missing definition of
TARGET_FPMATH_DEFAULT_P macro.


57756.patch
Description: Binary data


Re: [x86,PATCH] Additional fix for 57756.

2013-11-19 Thread Yuri Rumyantsev
Resend modified ChangeLog:

2013-11-20  Yuri Rumyantsev  
PR target/57756
* config/i386/i386.c (ix86_option_override_internal): Add missing
argument prefix for 'ix86_fpmath'.
* config/i386/ssemath.h: Add missing definition of
TARGET_FPMATH_DEFAULT_P macro.

2013/11/19 H.J. Lu :
> On Tue, Nov 19, 2013 at 5:31 AM, Yuri Rumyantsev  wrote:
>> Hi All,
>>
>> We found out that a compiler configured with the '-fpmath=sse' option
>> does not generate scalar floating-point instructions present in the SSE
>> instruction set for generic32, which leads to performance degradation
>> for Fortran benchmarks using library functions in 32-bit mode.
>>
>> This simple fix was designed to cure this issue - the definition of the
>> TARGET_FPMATH_DEFAULT_P macro was missing in i386/ssemath.h. One more
>> missing fix was also applied.
>>
>> Bootstrapping and regression testing were successful.
>>
>> Is it OK for trunk?
>>
>> 2013-11-20  Yuri Rumyantsev  
>>
>> * config/i386/i386.c (ix86_option_override_internal): Add missed
>> argument prefix for 'ix86_fpmath'.
>> * config/i386/ssemath.h: Add missed definition of
>> TARGET_FPMATH_DEFAULT_P macros.
>
> Please add "PR target/57756" in ChangeLog entry.
>
> Thanks.
>
> --
> H.J.


[x86 PATCH] Additional changes Silvermont enabling.

2013-11-22 Thread Yuri Rumyantsev
Hi,

This is an addendum to H.J.'s patch for Silvermont.

Testing is in progress.

Is it OK for the trunk after testing completion?

2013-11-22  Yuri Rumyantsev  

libgcc:

* config/i386/cpuinfo.c (get_intel_cpu): Add Silvermont cases.

gcc:

* config/i386/i386.c (processor_alias_table): Enable PTA_AES,
PTA_PCLMUL and PTA_RDRND for Silvermont.
* config/i386/driver-i386.c (host_detect_local_cpu): Set up cpu
for Silvermont.

* doc/invoke.texi: Mention PTA_AES, PTA_PCLMUL and PTA_RDRND
for Silvermont.


SLM-fix1.patch
Description: Binary data


Re: [PATCH,1/2] Extended if-conversion for loops marked with pragma omp simd.

2014-10-13 Thread Yuri Rumyantsev
Richard,

Here is updated patch (part1) for extended if conversion.

Second part of patch will be sent later.

Changelog.

2014-10-13  Yuri Rumyantsev  

* tree-if-conv.c (cgraph.h): Add include file to detect function clone.
(flag_force_vectorize): New variable.
(edge_predicate): New function.
(set_edge_predicate): New function.
(add_to_predicate_list): Check unconditionally whether bb is always
executed and exit early if so. Use predicate of cd-equivalent block
for join blocks if it exists.
(add_to_dst_predicate_list): Invoke add_to_predicate_list if
destination block of edge is not always executed. Set-up predicate
for critical edge.
(if_convertible_phi_p): Accept phi nodes with more than two args
if FLAG_FORCE_VECTORIZE was set-up.
(ifcvt_can_use_mask_load_store): Use FLAG_FORCE_VECTORIZE.
(if_convertible_stmt_p): Fix up pre-function comments.
(all_edges_are_critical): New function.
(if_convertible_bb_p): Allow bb to have more than two predecessors if
FLAG_FORCE_VECTORIZE was set-up. Use call of all_edges_are_critical
to reject block if-conversion with incoming critical edges only if
FLAG_FORCE_VECTORIZE was not set-up.
(predicate_bbs): Skip loop exit block also. Add check that if
fold_build2 produces bool conversion, recompute predicate using
build2_loc. Add zeroing of edge 'aux' field under FLAG_FORCE_VECTORIZE.
(if_convertible_loop_p_1): Recompute POST_DOMINATOR tree if
FLAG_FORCE_VECTORIZE was set-up to calculate cd equivalent bb's.
(find_phi_replacement_condition): Extend function interface:
it returns NULL if given phi node must be handled by means of
extended phi node predication. If the number of predecessors of phi-block
is equal to 2 and at least one incoming edge is not critical, the original
algorithm is used.
(get_predicate_for_edge): New function.
(find_insertion_point): New function.
(predicate_arbitrary_scalar_phi): New function.
(predicate_all_scalar_phis): Introduce new variable BEFORE.
Invoke find_insertion_point to initialize gsi and
predicate_arbitrary_scalar_phi if TRUE_BB is NULL (it signals
that extended predication must be applied).
(insert_gimplified_predicates): Add test for non-predicated basic
blocks that there are no gimplified statements to insert. Insert
predicates at the block beginning for extended if-conversion.
(tree_if_conversion): Initialize flag_force_vectorize from current
loop or outer loop (to support pragma omp declare). Do loop versioning
for innermost loop marked with pragma omp simd when
FLAG_TREE_LOOP_IF_CONVERT was not set-up. Nullify 'aux' field of edges
for blocks with two successors.




2014-09-22 12:28 GMT+04:00 Yuri Rumyantsev :
> Richard,
>
> here is reduced patch (part.1) which was reduced almost twice.
> Let's me also answer on your comments.
>
> 1. I really use edge field 'aux' to keep predicate for critical edges.
> My previous code was not correct and now it looks like:
>
>   if (EDGE_COUNT (b->succs) == 1 || EDGE_COUNT (e->dest->preds) == 1)
> /* Edge E is not critical,  use predicate of edge source bb. */
> c = bb_predicate (b);
>   else
> /* Edge E is critical and its aux field contains predicate.  */
> c = edge_predicate (e);
>
> 2. I completely delete all code related to creation of conditional
> expressions and completely rely on bool pattern recognition in
> vectorizer. But we need to delete all dead predicate computations
> which are not used since they prevent vectorization. I will add this
> local-dce function in next patch.
> 3. I also did not include in this patch recognition of general
> phi-nodes with two arguments only for which conversion of conditional
> scalar reduction can be applied also.
> Note that all these changes are applied for loop marked with pragma
> omp simd only.
>
> 2014-09-22  Yuri Rumyantsev  
>
> * tree-if-conv.c (cgraph.h): Add include file to detect function clone.
> (flag_force_vectorize): New variable.
> (edge_predicate): New function.
> (set_edge_predicate): New function.
> (convert_name_to_cmp): New function.
> (add_to_predicate_list): Check unconditionally whether bb is always
> executed and exit early if so. Use predicate of cd-equivalent block
> for join blocks if it exists.
> (add_to_dst_predicate_list): Invoke add_to_predicate_list if
> destination block of edge is not always executed. Set-up predicate
> for critical edge.
> (if_convertible_phi_p): Accept phi nodes with more than two args
> if FLAG_FORCE_VECTORIZE was set-up.
> (ifcvt_can_use_mask_load_store): Use FLAG_FORCE_VECTORIZE.
> (if_convertible_stmt_p): Fix up pre-function comments.
> (all_edges_are_critical): New function.
> (if_convertible_bb_p): Allow bb has more than two predecessors if
> FLAG_FORCE_VECTORIZE was set-up. Use call of all_edges_are_critical
> to reject block if-conversion with incoming critical edges only if
> FLAG_FORCE_VECTORIZE was 

[PATCH, 2/2] Extended if-conversion for loops marked with pragma omp simd.

2014-10-13 Thread Yuri Rumyantsev
Richard,

Here is the second part of the patch, which includes:
1. Recognition of one particular kind of phi node - a phi function that
has more than 2 arguments but only two really distinct arguments, one
of which has a single occurrence. This is important for conditional
scalar reduction conversion, e.g. for a test case such as
 if (a[i] != 0 && b[i] != 0)  n++;
(see the sketch after this list).
2. A new sub-pass which repairs bool pattern candidates with multiple
uses - such a situation occurs if the same predicate (not a simple
compare) is used both for phi node conversion and for a load/store
mask. If, for some var participating in the tree traversal, its def
stmt has multiple uses, we create a copy of this definition with a
unique left-hand side and change one use of the original var to the
newly created one. We repeat this process until all multiple uses have
been deleted.
3. Another sub-pass which deletes redundant predicate computations that
are dead locally, i.e. local dead code elimination. Note that such dead
code can prevent loop vectorization.
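
A sketch of case (1) for the test above (the bb numbers and SSA names
are illustrative):

  n_1 = PHI <n_2(3), n_2(4), n_3(5)>

has three arguments but only two distinct values, and n_3 (the
incremented value) occurs once, so it can be predicated as if the phi
had two arguments:

  n_1 = c_5 ? n_3 : n_2;

where c_5 stands for the predicate of the edge coming from bb 5.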

Changelog:

2014-10-13  Yuri Rumyantsev  

* tree-if-conv.c (cgraph.h): Add include file to issue error message.
(phi_has_two_different_args): New function.
(is_cond_scalar_reduction): Add argument EXTENDED to choose access
to phi arguments. Invoke phi_has_two_different_args to get phi
arguments iff EXTENDED is true. Change the check that stmt-block is a
predecessor of phi-block, since the phi may have more than two arguments.
(convert_scalar_cond_reduction): Add argument BEFORE to insert
statement before/after gsi point.
(predicate_scalar_phi): Add argument false to call of
is_cond_scalar_reduction. Add argument true to call of
convert_scalar_cond_reduction.
(predicate_arbitrary_scalar_phi): Change result of function to tree
representing rhs of new phi replacement stmt.
(predicate_extended_scalar_phi): New function.
(predicate_all_scalar_phis): Invoke predicate_extended_scalar_phi
instead of predicate_arbitrary_scalar_phi.
(ifcvt_split_def_stmt): New function.
(ifcvt_walk_pattern_tree): New function.
(stmt_is_root_of_bool_pattern): New function.
(ifcvt_repair_bool_pattern): New function.
(ifcvt_local_dce): New function.
(tree_if_conversion): Invoke ifcvt_local_dce and
ifcvt_repair_bool_pattern under FLAG_FORCE_VECTORIZE.


patch.part-2
Description: Binary data


[PATCH] Simple improvement for predicate computation in if-convert phase.

2014-10-16 Thread Yuri Rumyantsev
Hi All,

Here is a simple enhancement for predicate computation in if-convert phase:

 We use the notion of cd equivalence to get a simpler predicate for a
 join block: e.g. if a join block has 2 predecessors with predicates
 p1 & p2 and p1 & !p2, we'd like to get p1 for it instead of
 p1 & p2 | p1 & !p2.
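
A small sketch of the shape this matches (hypothetical source):

  if (p1)
    {
      if (p2)
        a[i] = x;      /* predicate: p1 & p2  */
      else
        a[i] = y;      /* predicate: p1 & !p2 */
      b[i] = a[i];     /* join block: cd-equivalent to the p2 test,
                          so its predicate simplifies to just p1 */
    }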

Bootstrap and regression testing did not show any new failures.

Is it OK for trunk?

gcc/ChangeLog
2014-10-16  Yuri Rumyantsev  

* tree-if-conv.c (add_to_predicate_list): Check unconditionally
whether bb is always executed and exit early if so. Use predicate of
cd-equivalent block for join blocks if it exists.
(if_convertible_loop_p_1): Recompute POST_DOMINATOR tree.
(tree_if_conversion): Free post-dominance information.


if-conv.patch
Description: Binary data


Re: [PATCH,1/2] Extended if-conversion for loops marked with pragma omp simd.

2014-10-16 Thread Yuri Rumyantsev
Richard,

Here is the reduced patch, as you requested. All your remarks have been fixed.
Could you please look at it? (I have already sent the patch with
changes in add_to_predicate_list for review.)

Thanks.
Yuri.
ChangeLog
2014-10-16  Yuri Rumyantsev  

(flag_force_vectorize): New variable.
(edge_predicate): New function.
(set_edge_predicate): New function.
(add_to_dst_predicate_list): Conditionally invoke add_to_predicate_list
if destination block of edge is not always executed. Set-up predicate
for critical edge.
(if_convertible_phi_p): Accept phi nodes with more than two args
if FLAG_FORCE_VECTORIZE was set-up.
(ifcvt_can_use_mask_load_store): Use FLAG_FORCE_VECTORIZE.
(if_convertible_stmt_p): Fix up pre-function comments.
(all_edges_are_critical): New function.
(if_convertible_bb_p): Allow bb has more than two predecessors if
FLAG_FORCE_VECTORIZE was set-up. Use call of all_edges_are_critical
to reject block if-conversion with incoming critical edges only if
FLAG_FORCE_VECTORIZE was not set-up.
(predicate_bbs): Skip loop exit block also. Invoke build2_loc
to compute predicate instead of fold_build2_loc.
Add zeroing of edge 'aux' field.
(find_phi_replacement_condition): Extend function interface:
it returns NULL if given phi node must be handled by means of
extended phi node predication. If the number of predecessors of phi-block
is equal to 2 and at least one incoming edge is not critical, the original
algorithm is used.
(tree_if_conversion): Temporary set-up FLAG_FORCE_VECTORIZE to false.
Nullify 'aux' field of edges for blocks with two successors.



2014-10-15 13:50 GMT+04:00 Richard Biener :
> On Mon, Oct 13, 2014 at 11:38 AM, Yuri Rumyantsev  wrote:
>> Richard,
>>
>> Here is updated patch (part1) for extended if conversion.
>>
>> Second part of patch will be sent later.
>
> Ok, I'm starting to look at this.  I'd still like you to split things up
> more.
>
>  static inline void
>  add_to_predicate_list (struct loop *loop, basic_block bb, tree nc)
>  {
> ...
>
> +  /* We use notion of cd equivalence to get simpler predicate for
> +join block, e.g. if join block has 2 predecessors with predicates
> +p1 & p2 and p1 & !p2, we'd like to get p1 for it instead of
> +p1 & p2 | p1 & !p2.  */
> +  if (dom_bb != loop->header
> + && get_immediate_dominator (CDI_POST_DOMINATORS, dom_bb) == bb)
> +   {
> + gcc_assert (flow_bb_inside_loop_p (loop, dom_bb));
> + bc = bb_predicate (dom_bb);
> + gcc_assert (!is_true_predicate (bc));
>
> these changes look worthwhile even for !flag_force_vectorize.  So please
> split the change to add_to_predicate_list out and compute post-dominators
> unconditionally.  Note that you should call free_dominance_info
> (CDI_POST_DOMINATORS) at the end of if-conversion.
>
> +  if (!dominated_by_p (CDI_DOMINATORS, loop->latch, e->dest))
> +add_to_predicate_list (loop, e->dest, cond);
> +
> +  /* If edge E is critical save predicate on it.  */
> +  if (EDGE_COUNT (e->dest->preds) >= 2)
> +set_edge_predicate (e, cond);
>
> how do we know the edge is critical by this simple check?  Why not
> simply always save edge predicates (well, you kind of do but omit
> the case where e->src dominates e->dest).
>
> Btw, you can rely on edge->aux being NULL at the start of the
> pass but need to clear it at the end (best use clear_aux_for_edges ()
> for that).  So stuff like
>
> + extract_true_false_edges_from_block (bb, &true_edge, &false_edge);
> + if (flag_force_vectorize)
> +   true_edge->aux = false_edge->aux = NULL;
>
> shouldn't be necessary.
>
> I think the edge predicate handling should also be unconditionally
> and not depend on flag_force_vectorize.
>
> +  /* The loop latch and loop exit block are always executed and
> +have no extra conditions to be processed: skip them.  */
> +  if (bb == loop->latch
> + || bb_with_exit_edge_p (loop, bb))
>
> I don't think the edge stuff is true - given you still only reset the
> loop->latch bb predicate the change looks broken.
>
> + /* Fold_build2 can produce bool conversion which is not
> + supported by vectorizer, so re-build it without folding.
> +For example, such conversion is generated for sequence:
> +   _Bool _7, _8, _9;
> +   _7 = _6 != 13; _8 = _6 != 0; _9 = _8 & _9;
> +   if (_9 != 0)  --> (bool)_9.  */
> +
> + if (CONVERT_EXPR_P (c)
> + && TREE_CODE_CLASS (code) == tcc_comparison)
>
> I think you should simply use canonicalize_cond_expr_cond on the
> folding result.  Or rath

Re: [PATCH] Simple improvement for predicate computation in if-convert phase.

2014-10-17 Thread Yuri Rumyantsev
Jeff,

I prepared another patch that includes test-case as you requested.

Below are answers on your questions.

> First, for the benefit of anyone trying to understand what you're doing,
> defining what "cd equivalent" means would be helpful.

I added the following comment to the function:

   we call basic blocks bb1 and bb2
   cd-equivalent if they are executed under the same condition.


Is it sufficient?

> So, do you have a case where the dominated_by_p test above is true and
> is_predicated(bb) returns true as well?  I think this part of the change is
> largely responsible for the hack you're doing with having the function-scoped
> static variable join_bb.

I don't have such a test case, and I assume that if a bb is always
executed, it is not predicated.

I also deleted "join_bb" in my changes.


Is it OK for trunk now?

Thanks.
Yuri.

2014-10-17  Yuri Rumyantsev  
gcc/ChangeLog

* tree-if-conv.c (add_to_predicate_list): Check unconditionally
whether bb is always executed and exit early if so. Use predicate of
cd-equivalent block for join blocks if it exists.
(if_convertible_loop_p_1): Recompute POST_DOMINATOR tree.
(tree_if_conversion): Free post-dominance information.

gcc/testsuite/ChangeLog

* gcc.dg/tree-ssa/ifc-cd.c: New test.



2014-10-17 1:16 GMT+04:00 Jeff Law :
> On 10/16/14 05:52, Yuri Rumyantsev wrote:
>>
>> Hi All,
>>
>> Here is a simple enhancement for predicate computation in if-convert
>> phase:
>>
>>   We use the notion of cd equivalence to get a simpler predicate for a
>>   join block: e.g. if a join block has 2 predecessors with predicates
>>   p1 & p2 and p1 & !p2, we'd like to get p1 for it instead of
>>   p1 & p2 | p1 & !p2.
>>
>> Bootstrap and regression testing did not show any new failures.
>>
>> Is it OK for trunk?
>>
>> gcc/ChangeLog
>> 2014-10-16  Yuri Rumyantsev
>>
>> * tree-if-conv.c (add_to_predicate_list): Check unconditionally
>> whether bb is always executed and exit early if so. Use predicate of
>> cd-equivalent block for join blocks if it exists.
>> (if_convertible_loop_p_1): Recompute POST_DOMINATOR tree.
>> (tree_if_conversion): Free post-dominance information.
>
> First, for the benefit of anyone trying to understand what you're doing,
> defining what "cd equivalent" means would be helpful.
>
>
>>
>>
>> if-conv.patch
>>
>>
>> Index: tree-if-conv.c
>> ===
>> --- tree-if-conv.c  (revision 216217)
>> +++ tree-if-conv.c  (working copy)
>> @@ -396,25 +396,51 @@
>>   }
>>
>>   /* Add condition NC to the predicate list of basic block BB.  LOOP is
>> -   the loop to be if-converted.  */
>> +   the loop to be if-converted. Use predicate of cd-equivalent block
>> +   for join bb if it exists.  */
>>
>>   static inline void
>>   add_to_predicate_list (struct loop *loop, basic_block bb, tree nc)
>>   {
>> tree bc, *tp;
>> +  basic_block dom_bb;
>> +  static basic_block join_bb = NULL;
>>
>> if (is_true_predicate (nc))
>>   return;
>>
>> -  if (!is_predicated (bb))
>> +  /* If dominance tells us this basic block is always executed,
>> + don't record any predicates for it.  */
>> +  if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
>> +return;
>
> So, do you have a case where the dominated_by_p test above is true and
> is_predicated(bb) returns true as well?  I think this part of the change is
> largely responsible for the hack you're doing with having the function
> scoped static variable join_bb.
>
>
>
>
>> +
>> +  /* If predicate has been already set up for given bb using
>> cd-equivalent
>> + block predicate, simply escape.  */
>> +  if (join_bb == bb)
>> +return;
>
> I *really* dislike the state you're carrying around via join_bb.
>
> ISTM that if you compute that there's an equivalence, then you just set the
> predicate for the equivalent block and the right things would have happened
> if you had not changed the test above.
>
> You also need a testcase.  It doesn't have to be extensive, but at least
> some basic "smoke test" to verify basic operation of this code.  It's
> perfectly fine to scan the debugging dumps for debug output.
>
>
> jeff
>
>


if-conv.patch.new
Description: Binary data


Re: [PATCH,1/2] Extended if-conversion for loops marked with pragma omp simd.

2014-10-17 Thread Yuri Rumyantsev
Richard,

I reworked the patch as you proposed, but I didn't understand what
you meant by:

>So please rework the patch so critical edges are always handled
>correctly.

In current patch flag_force_vectorize is used (1) to reject phi nodes
with more than 2 arguments; (2) to reject basic blocks with only
critical incoming edges since support for extended predication of phi
nodes will be in next patch.

Could you please clarify your statement?

I attached modified patch.

ChangeLog:

2014-10-17  Yuri Rumyantsev  

(flag_force_vectorize): New variable.
(edge_predicate): New function.
(set_edge_predicate): New function.
(add_to_dst_predicate_list): Conditionally invoke add_to_predicate_list
if destination block of edge is not always executed. Set-up predicate
for critical edge.
(if_convertible_phi_p): Accept phi nodes with more than two args
if FLAG_FORCE_VECTORIZE was set-up.
(ifcvt_can_use_mask_load_store): Use FLAG_FORCE_VECTORIZE.
(if_convertible_stmt_p): Fix up pre-function comments.
(all_edges_are_critical): New function.
(if_convertible_bb_p): Use call of all_preds_critical_p
to reject block if-conversion with incoming critical edges only if
FLAG_FORCE_VECTORIZE was not set-up.
(predicate_bbs): Skip loop exit block also. Invoke build2_loc
to compute predicate instead of fold_build2_loc.
Add zeroing of edge 'aux' field.
(find_phi_replacement_condition): Extend function interface:
it returns NULL if given phi node must be handled by means of
extended phi node predication. If the number of predecessors of phi-block
is equal to 2 and at least one incoming edge is not critical, the original
algorithm is used.
(tree_if_conversion): Temporary set-up FLAG_FORCE_VECTORIZE to false.
Nullify 'aux' field of edges for blocks with two successors.




2014-10-17 13:09 GMT+04:00 Richard Biener :
> On Thu, Oct 16, 2014 at 5:42 PM, Yuri Rumyantsev  wrote:
>> Richard,
>>
>> Here is reduced patch as you requested. All your remarks have been fixed.
>> Could you please look at it ( I have already sent the patch with
>> changes in add_to_predicate_list for review).
>
> + if (dump_file && (dump_flags & TDF_DETAILS))
> +   fprintf (dump_file, "More than two phi node args.\n");
> + return false;
> +   }
> +
> +}
>
> Excess vertical space.
>
>
> +/* Assumes that BB has more than 2 predecessors.
>
> More than 1 predecessor?
>
> +   Returns false if at least one successor is not on critical edge
> +   and true otherwise.  */
> +
> +static inline bool
> +all_edges_are_critical (basic_block bb)
> +{
>
> "all_preds_critical_p" would be a better name
>
> +  if (EDGE_COUNT (bb->preds) > 2)
> +{
> +  if (!flag_force_vectorize)
> +   return false;
> +}
>
> as I said in the last review I don't think we should restrict edge
> predicates to flag_force_vectorize.  At least I can't see how
> if-conversion is magically more expensive for that case?
>
> So please rework the patch so critical edges are always handled
> correctly.
>
> Ok with that and the above suggested changes.
>
> Thanks,
> Richard.
>
>
>> Thanks.
>> Yuri.
>> ChangeLog
>> 2014-10-16  Yuri Rumyantsev  
>>
>> (flag_force_vectorize): New variable.
>> (edge_predicate): New function.
>> (set_edge_predicate): New function.
>> (add_to_dst_predicate_list): Conditionally invoke add_to_predicate_list
>> if destination block of edge is not always executed. Set-up predicate
>> for critical edge.
>> (if_convertible_phi_p): Accept phi nodes with more than two args
>> if FLAG_FORCE_VECTORIZE was set-up.
>> (ifcvt_can_use_mask_load_store): Use FLAG_FORCE_VECTORIZE.
>> (if_convertible_stmt_p): Fix up pre-function comments.
>> (all_edges_are_critical): New function.
>> (if_convertible_bb_p): Allow bb to have more than two predecessors if
>> FLAG_FORCE_VECTORIZE was set-up. Use call of all_edges_are_critical
>> to reject block if-conversion with incoming critical edges only if
>> FLAG_FORCE_VECTORIZE was not set-up.
>> (predicate_bbs): Skip loop exit block also. Invoke build2_loc
>> to compute predicate instead of fold_build2_loc.
>> Add zeroing of edge 'aux' field.
>> (find_phi_replacement_condition): Extend function interface:
>> it returns NULL if given phi node must be handled by means of
>> extended phi node predication. If the number of predecessors of phi-block
>> is equal to 2 and at least one incoming edge is not critical, the original
>> algorithm is used.
>> (tree_if_conversion): Temporary set-up FLAG_FORCE_VECTORIZE to false.
>> Nullify 'aux' field of edges for blocks with two successors.
>

Re: [PATCH,1/2] Extended if-conversion for loops marked with pragma omp simd.

2014-10-20 Thread Yuri Rumyantsev
Richard,

Thanks for your answer!

In the current implementation, phi node conversion assumes that the bb
containing a given phi has at least one non-critical incoming edge, and
chooses such an edge for inserting the predicated code. But if we choose
a critical edge, we need to determine the insertion point and insertion
direction (before/after), since otherwise we can get invalid SSA form
(a use before a def); see the sketch below. This is done by my new
function, which is not in the current patch (I will present that patch
later). So I assume that we need to leave this patch as it is, to avoid
introducing new bugs.
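
A sketch of the hazard (illustrative SSA names): suppose bb 5 is reached
only over critical edges, so the predicated statements from its
predecessors end up inside bb 5 itself. Materializing the phi

  x_1 = PHI <a_2(3), b_3(4)>

as a COND_EXPR at the start of the block would give

  bb 5:
    x_1 = c_4 ? a_2 : b_3;   /* invalid if inserted here ...            */
    b_3 = ...;               /* ... because this predicated stmt, moved
                                into bb 5, defines b_3 below its use    */

so the insertion point has to be found after the last definition the
COND_EXPR uses.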

Thanks.
Yuri.

2014-10-20 12:00 GMT+04:00 Richard Biener :
> On Fri, Oct 17, 2014 at 4:09 PM, Yuri Rumyantsev  wrote:
>> Richard,
>>
>> I reworked the patch as you proposed, but I didn't understand what
>> did you mean by:
>>
>>>So please rework the patch so critical edges are always handled
>>>correctly.
>>
>> In current patch flag_force_vectorize is used (1) to reject phi nodes
>> with more than 2 arguments; (2) to reject basic blocks with only
>> critical incoming edges since support for extended predication of phi
>> nodes will be in next patch.
>
> I mean that (2) should not be rejected dependent on flag_force_vectorize.
> It was rejected because if-cvt couldn't handle it correctly before but with
> this patch this is fixed.  I see no reason to still reject this then even
> for !flag_force_vectorize.
>
> Rejecting PHIs with more than two arguments with flag_force_vectorize
> is ok.
>
> Richard.
>
>> Could you please clarify your statement.
>>
>> I attached modified patch.
>>
>> ChangeLog:
>>
>> 2014-10-17  Yuri Rumyantsev  
>>
>> (flag_force_vectorize): New variable.
>> (edge_predicate): New function.
>> (set_edge_predicate): New function.
>> (add_to_dst_predicate_list): Conditionally invoke add_to_predicate_list
>> if destination block of edge is not always executed. Set-up predicate
>> for critical edge.
>> (if_convertible_phi_p): Accept phi nodes with more than two args
>> if FLAG_FORCE_VECTORIZE was set-up.
>> (ifcvt_can_use_mask_load_store): Use FLAG_FORCE_VECTORIZE.
>> (if_convertible_stmt_p): Fix up pre-function comments.
>> (all_edges_are_critical): New function.
>> (if_convertible_bb_p): Use call of all_preds_critical_p
>> to reject block if-conversion with incoming critical edges only if
>> FLAG_FORCE_VECTORIZE was not set-up.
>> (predicate_bbs): Skip loop exit block also.Invoke build2_loc
>> to compute predicate instead of fold_build2_loc.
>> Add zeroing of edge 'aux' field.
>> (find_phi_replacement_condition): Extend function interface:
>> it returns NULL if given phi node must be handled by means of
>> extended phi node predication. If number of predecessors of phi-block
>> is equal 2 and atleast one incoming edge is not critical original
>> algorithm is used.
>> (tree_if_conversion): Temporary set-up FLAG_FORCE_VECTORIZE to false.
>> Nullify 'aux' field of edges for blocks with two successors.
>>
>>
>>
>>
>> 2014-10-17 13:09 GMT+04:00 Richard Biener :
>>> On Thu, Oct 16, 2014 at 5:42 PM, Yuri Rumyantsev  wrote:
>>>> Richard,
>>>>
>>>> Here is reduced patch as you requested. All your remarks have been fixed.
>>>> Could you please look at it ( I have already sent the patch with
>>>> changes in add_to_predicate_list for review).
>>>
>>> + if (dump_file && (dump_flags & TDF_DETAILS))
>>> +   fprintf (dump_file, "More than two phi node args.\n");
>>> + return false;
>>> +   }
>>> +
>>> +}
>>>
>>> Excess vertical space.
>>>
>>>
>>> +/* Assumes that BB has more than 2 predecessors.
>>>
>>> More than 1 predecessor?
>>>
>>> +   Returns false if at least one successor is not on critical edge
>>> +   and true otherwise.  */
>>> +
>>> +static inline bool
>>> +all_edges_are_critical (basic_block bb)
>>> +{
>>>
>>> "all_preds_critical_p" would be a better name
>>>
>>> +  if (EDGE_COUNT (bb->preds) > 2)
>>> +{
>>> +  if (!flag_force_vectorize)
>>> +   return false;
>>> +}
>>>
>>> as I said in the last review I don't think we should restrict edge
>>> predicates to flag_force_vectorize.  At least I can't see how
>>> if-conversion is magically more expensive for that case?
>>>
>>> So please rework 

Re: [PATCH,1/2] Extended if-conversion for loops marked with pragma omp simd.

2014-10-21 Thread Yuri Rumyantsev
Richard,

I made some changes in the patch and ChangeLog to note that support for
if-conversion of blocks with only critical incoming edges will be added
in the future (more precisely, in patch 4).

Could you please review it.

Thanks.

ChangeLog:

2014-10-21  Yuri Rumyantsev  

(flag_force_vectorize): New variable.
(edge_predicate): New function.
(set_edge_predicate): New function.
(add_to_dst_predicate_list): Conditionally invoke add_to_predicate_list
if the destination block of the edge is not always executed. Set up
the predicate for a critical edge.
(if_convertible_phi_p): Accept phi nodes with more than two args
if FLAG_FORCE_VECTORIZE is set.
(ifcvt_can_use_mask_load_store): Use FLAG_FORCE_VECTORIZE.
(if_convertible_stmt_p): Fix up pre-function comments.
(all_preds_critical_p): New function.
(if_convertible_bb_p): Call all_preds_critical_p to temporarily
reject block if-conversion with only critical incoming edges
if FLAG_FORCE_VECTORIZE is not set. This restriction will be removed
after support for extended predication is added.
(predicate_bbs): Skip the loop exit block as well. Invoke build2_loc
to compute the predicate instead of fold_build2_loc.
Zero the 'aux' field of edges.
(find_phi_replacement_condition): Extend the function interface:
it returns NULL if the given phi node must be handled by means of
extended phi node predication. If the phi block has exactly two
predecessors and at least one incoming edge is not critical, the
original algorithm is used.
(tree_if_conversion): Temporarily set FLAG_FORCE_VECTORIZE to false.
Nullify the 'aux' field of edges for blocks with two successors.
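
For reference, here is a minimal sketch of the renamed helper (assuming
GCC's CFG iteration macros; the actual function body is in the attached
patch):

static inline bool
all_preds_critical_p (basic_block bb)
{
  edge e;
  edge_iterator ei;

  FOR_EACH_EDGE (e, ei, bb->preds)
    /* An incoming edge is non-critical iff its source block has a
       single successor.  */
    if (EDGE_COUNT (e->src->succs) == 1)
      return false;
  return true;
}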

2014-10-20 17:55 GMT+04:00 Yuri Rumyantsev :
> Richard,
>
> Thanks for your answer!
>
> In current implementation phi node conversion assume that one of
> incoming edge to bb containing given phi has at least one non-critical
> edge and choose it to insert predicated code. But if we choose
> critical edge we need to determine insert point and insertion
> direction (before/after) since in other case we can get invalid ssa
> form (use before def). This is done by my new function which is not in
> current patch ( I will present this patch later). SO I assume that we
> need to leave this patch as it is to not introduce new bugs.
>
> Thanks.
> Yuri.
>
> 2014-10-20 12:00 GMT+04:00 Richard Biener :
>> On Fri, Oct 17, 2014 at 4:09 PM, Yuri Rumyantsev  wrote:
>>> Richard,
>>>
>>> I reworked the patch as you proposed, but I didn't understand what
>>> did you mean by:
>>>
>>>>So please rework the patch so critical edges are always handled
>>>>correctly.
>>>
>>> In current patch flag_force_vectorize is used (1) to reject phi nodes
>>> with more than 2 arguments; (2) to reject basic blocks with only
>>> critical incoming edges since support for extended predication of phi
>>> nodes will be in next patch.
>>
>> I mean that (2) should not be rejected dependent on flag_force_vectorize.
>> It was rejected because if-cvt couldn't handle it correctly before but with
>> this patch this is fixed.  I see no reason to still reject this then even
>> for !flag_force_vectorize.
>>
>> Rejecting PHIs with more than two arguments with flag_force_vectorize
>> is ok.
>>
>> Richard.
>>
>>> Could you please clarify your statement.
>>>
>>> I attached modified patch.
>>>
>>> ChangeLog:
>>>
>>> 2014-10-17  Yuri Rumyantsev  
>>>
>>> (flag_force_vectorize): New variable.
>>> (edge_predicate): New function.
>>> (set_edge_predicate): New function.
>>> (add_to_dst_predicate_list): Conditionally invoke add_to_predicate_list
>>> if destination block of edge is not always executed. Set-up predicate
>>> for critical edge.
>>> (if_convertible_phi_p): Accept phi nodes with more than two args
>>> if FLAG_FORCE_VECTORIZE was set-up.
>>> (ifcvt_can_use_mask_load_store): Use FLAG_FORCE_VECTORIZE.
>>> (if_convertible_stmt_p): Fix up pre-function comments.
>>> (all_edges_are_critical): New function.
>>> (if_convertible_bb_p): Use call of all_preds_critical_p
>>> to reject block if-conversion with incoming critical edges only if
>>> FLAG_FORCE_VECTORIZE was not set-up.
>>> (predicate_bbs): Skip loop exit block also.Invoke build2_loc
>>> to compute predicate instead of fold_build2_loc.
>>> Add zeroing of edge 'aux' field.
>>> (find_phi_replacement_condition): Extend function interface:
>>> it returns NULL if given phi node must be handled by means of
>>> extended phi node predication. If number of predecessors of phi-block
>>> is equal 2 and atleast one incoming edge is

Re: [PATCH,1/2] Extended if-conversion for loops marked with pragma omp simd.

2014-10-21 Thread Yuri Rumyantsev
Richard,

Yes, this patch does not make sense in isolation, since phi node
predication for a bb with only critical incoming edges is performed by
another function which is absent here (predicate_extended_scalar_phi).

BTW, I see that commit_edge_insertions () is used for RTL instructions
only, but you propose to use it for trees as well.
Did I miss something?

Thanks ahead.


2014-10-21 16:44 GMT+04:00 Richard Biener :
> On Tue, Oct 21, 2014 at 2:25 PM, Yuri Rumyantsev  wrote:
>> Richard,
>>
>> I did some changes in patch and ChangeLog to mark that support for
>> if-convert of blocks with only critical incoming edges will be added
>> in the future (more precise in patch.4).
>
> But the same reasoning applies to this version of the patch when
> flag_force_vectorize is true!?  (insertion point and invalid SSA form)
>
> Which means the patch doesn't make sense in isolation?
>
> Btw, I think for the case you should simply do gsi_insert_on_edge ()
> and commit_edge_insertions () before the call to combine_blocks
> (pushing the edge predicate to the newly created block).
>
> Richard.
>
>> Could you please review it.
>>
>> Thanks.
>>
>> ChangeLog:
>>
>> 2014-10-21  Yuri Rumyantsev  
>>
>> (flag_force_vectorize): New variable.
>> (edge_predicate): New function.
>> (set_edge_predicate): New function.
>> (add_to_dst_predicate_list): Conditionally invoke add_to_predicate_list
>> if destination block of edge is not always executed. Set-up predicate
>> for critical edge.
>> (if_convertible_phi_p): Accept phi nodes with more than two args
>> if FLAG_FORCE_VECTORIZE was set-up.
>> (ifcvt_can_use_mask_load_store): Use FLAG_FORCE_VECTORIZE.
>> (if_convertible_stmt_p): Fix up pre-function comments.
>> (all_preds_critical_p): New function.
>> (if_convertible_bb_p): Use call of all_preds_critical_p
>> to reject temporarily block if-conversion with incoming critical edges
>> if FLAG_FORCE_VECTORIZE was not set-up. This restriction will be deleted
>> after adding support for extended predication.
>> (predicate_bbs): Skip loop exit block also.Invoke build2_loc
>> to compute predicate instead of fold_build2_loc.
>> Add zeroing of edge 'aux' field.
>> (find_phi_replacement_condition): Extend function interface:
>> it returns NULL if given phi node must be handled by means of
>> extended phi node predication. If number of predecessors of phi-block
>> is equal 2 and at least one incoming edge is not critical original
>> algorithm is used.
>> (tree_if_conversion): Temporarily set-up FLAG_FORCE_VECTORIZE to false.
>> Nullify 'aux' field of edges for blocks with two successors.
>>
>> 2014-10-20 17:55 GMT+04:00 Yuri Rumyantsev :
>>> Richard,
>>>
>>> Thanks for your answer!
>>>
>>> In current implementation phi node conversion assume that one of
>>> incoming edge to bb containing given phi has at least one non-critical
>>> edge and choose it to insert predicated code. But if we choose
>>> critical edge we need to determine insert point and insertion
>>> direction (before/after) since in other case we can get invalid ssa
>>> form (use before def). This is done by my new function which is not in
>>> current patch ( I will present this patch later). SO I assume that we
>>> need to leave this patch as it is to not introduce new bugs.
>>>
>>> Thanks.
>>> Yuri.
>>>
>>> 2014-10-20 12:00 GMT+04:00 Richard Biener :
>>>> On Fri, Oct 17, 2014 at 4:09 PM, Yuri Rumyantsev  
>>>> wrote:
>>>>> Richard,
>>>>>
>>>>> I reworked the patch as you proposed, but I didn't understand what
>>>>> did you mean by:
>>>>>
>>>>>>So please rework the patch so critical edges are always handled
>>>>>>correctly.
>>>>>
>>>>> In current patch flag_force_vectorize is used (1) to reject phi nodes
>>>>> with more than 2 arguments; (2) to reject basic blocks with only
>>>>> critical incoming edges since support for extended predication of phi
>>>>> nodes will be in next patch.
>>>>
>>>> I mean that (2) should not be rejected dependent on flag_force_vectorize.
>>>> It was rejected because if-cvt couldn't handle it correctly before but with
>>>> this patch this is fixed.  I see no reason to still reject this then even
>>>> for !flag_force_vectorize.
>>>>
>>>> Rejecting PHIs with more than two arguments with flag_force_vectorize
>>>> i

Re: [PATCH,1/2] Extended if-conversion for loops marked with pragma omp simd.

2014-10-21 Thread Yuri Rumyantsev
Richard,

I looked at the sources of these functions, but I can't understand why
I should use something else. Note that all predicate computations are
located in basic blocks (by design of if-conv), and there is a special
function that puts these computations into a bb
(insert_gimplified_predicates). An edge carries only its predicate, not
the predicate's computations. The new function, find_insertion_point (),
does a very simple search: it finds the latest def-stmt (in the current
bb) of an operand of the predicates taken from all incoming edges.
In the original algorithm, the predicate of a non-critical edge is
taken to perform phi-node predication, since for a critical edge this
does not work properly.
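
To illustrate, here is a rough sketch of that search (the helper
pred_uses_def_p is hypothetical, not the actual patch code):

static gimple_stmt_iterator
find_insertion_point (basic_block bb)
{
  gimple_stmt_iterator gsi, ip = gsi_after_labels (bb);
  edge e;
  edge_iterator ei;

  for (gsi = gsi_after_labels (bb); !gsi_end_p (gsi); gsi_next (&gsi))
    FOR_EACH_EDGE (e, ei, bb->preds)
      /* pred_uses_def_p (hypothetical): does the predicate of edge E
	 use the SSA name defined by this statement?  */
      if (pred_uses_def_p (e, gsi_stmt (gsi)))
	{
	  ip = gsi;
	  gsi_next (&ip);	/* Phi predication code goes after the def.  */
	}
  return ip;
}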

My question is: do your comments mean that I should re-design my extensions?

Thanks.
Yuri.

BTW, Jeff did an initial review of my changes related to predicate
computation for join blocks. I sent him an updated patch with a test
case and some minor changes, but I still did not get any feedback on
it. Could you please take a look at it as well?


2014-10-21 17:38 GMT+04:00 Richard Biener :
> On Tue, Oct 21, 2014 at 3:20 PM, Yuri Rumyantsev  wrote:
>> Richard,
>>
>> Yes, This patch does not make sense since phi node predication for bb
>> with critical incoming edges only performs another function which is
>> absent (predicate_extended_scalar_phi).
>>
>> BTW I see that commit_edge_insertions() is used for rtx instructions
>> only but you propose to use it for tree also.
>> Did I miss something?
>
> Ah, it's gsi_commit_edge_inserts () (or gsi_commit_one_edge_insert
> if you want easy access to the newly created basic block to push
> the predicate to - see gsi_commit_edge_inserts implementation).
>
> Richard.
>
>> Thanks ahead.
>>
>>
>> 2014-10-21 16:44 GMT+04:00 Richard Biener :
>>> On Tue, Oct 21, 2014 at 2:25 PM, Yuri Rumyantsev  wrote:
>>>> Richard,
>>>>
>>>> I did some changes in patch and ChangeLog to mark that support for
>>>> if-convert of blocks with only critical incoming edges will be added
>>>> in the future (more precise in patch.4).
>>>
>>> But the same reasoning applies to this version of the patch when
>>> flag_force_vectorize is true!?  (insertion point and invalid SSA form)
>>>
>>> Which means the patch doesn't make sense in isolation?
>>>
>>> Btw, I think for the case you should simply do gsi_insert_on_edge ()
>>> and commit_edge_insertions () before the call to combine_blocks
>>> (pushing the edge predicate to the newly created block).
>>>
>>> Richard.
>>>
>>>> Could you please review it.
>>>>
>>>> Thanks.
>>>>
>>>> ChangeLog:
>>>>
>>>> 2014-10-21  Yuri Rumyantsev  
>>>>
>>>> (flag_force_vectorize): New variable.
>>>> (edge_predicate): New function.
>>>> (set_edge_predicate): New function.
>>>> (add_to_dst_predicate_list): Conditionally invoke add_to_predicate_list
>>>> if destination block of edge is not always executed. Set-up predicate
>>>> for critical edge.
>>>> (if_convertible_phi_p): Accept phi nodes with more than two args
>>>> if FLAG_FORCE_VECTORIZE was set-up.
>>>> (ifcvt_can_use_mask_load_store): Use FLAG_FORCE_VECTORIZE.
>>>> (if_convertible_stmt_p): Fix up pre-function comments.
>>>> (all_preds_critical_p): New function.
>>>> (if_convertible_bb_p): Use call of all_preds_critical_p
>>>> to reject temporarily block if-conversion with incoming critical edges
>>>> if FLAG_FORCE_VECTORIZE was not set-up. This restriction will be deleted
>>>> after adding support for extended predication.
>>>> (predicate_bbs): Skip loop exit block also.Invoke build2_loc
>>>> to compute predicate instead of fold_build2_loc.
>>>> Add zeroing of edge 'aux' field.
>>>> (find_phi_replacement_condition): Extend function interface:
>>>> it returns NULL if given phi node must be handled by means of
>>>> extended phi node predication. If number of predecessors of phi-block
>>>> is equal 2 and at least one incoming edge is not critical original
>>>> algorithm is used.
>>>> (tree_if_conversion): Temporarily set-up FLAG_FORCE_VECTORIZE to false.
>>>> Nullify 'aux' field of edges for blocks with two successors.
>>>>
>>>> 2014-10-20 17:55 GMT+04:00 Yuri Rumyantsev :
>>>>> Richard,
>>>>>
>>>>> Thanks for your answer!
>>>>>
>>>>> In current implementation phi node conversion a

Re: [PATCH,1/2] Extended if-conversion for loops marked with pragma omp simd.

2014-10-21 Thread Yuri Rumyantsev
Richard,

In my initial design I did such splitting before starting the real
if-conversion, but I decided not to perform it since the code size of
the if-converted loop grows (the number of phi nodes increases). It is
also worth noting that for a phi with more than 2 arguments we need all
the predicates (except one) to do phi predication, which means that a
block containing such a phi can have only 1 critical edge.

Thanks.
Yuri.

2014-10-21 18:19 GMT+04:00 Richard Biener :
> On Tue, Oct 21, 2014 at 4:09 PM, Richard Biener
>  wrote:
>> On Tue, Oct 21, 2014 at 3:58 PM, Yuri Rumyantsev  wrote:
>>> Richard,
>>>
>>> I saw the sources of these functions, but I can't understand why I
>>> should use something else? Note that all predicate computations are
>>> located in basic blocks ( by design of if-conv) and there is special
>>> function that put these computations in bb
>>> (insert_gimplified_predicates). Edge contains only predicate not its
>>> computations. New function - find_insertion_point() does very simple
>>> search - it finds out the latest (in current bb) operand def-stmt of
>>> predicates taken from all incoming edges.
>>> In original algorithm the predicate of non-critical edge is taken to
>>> perform phi-node predication since for critical edge it does not work
>>> properly.
>>>
>>> My question is: does your comments mean that I should re-design my 
>>> extensions?
>>
>> Well, we have infrastructure for inserting code on edges and you've
>> made critical edges predicated correctly.  So why re-invent the wheel?
>> I realize this is very similar to my initial suggestion to simply split
>> critical edges in loops you want to if-convert but delays splitting
>> until it turns out to be necessary (which might be good for the
>> !force_vect case).
>>
>> For edge predicates you simply can emit their computation on the
>> edge, no?
>>
>> Btw, I very originally suggested to rework if-conversion to only
>> record edge predicates - having both block and edge predicates
>> somewhat complicates the code and makes it harder to
>> maintain (thus also the suggestion to simply split critical edges
>> if necessary to make BB predicates work always).
>>
>> Your patches add a lot of code and to me it seems we can avoid
>> doing so much special casing.
>
> For example attacking the critical edge issue by a simple
>
> Index: tree-if-conv.c
> ===
> --- tree-if-conv.c  (revision 216508)
> +++ tree-if-conv.c  (working copy)
> @@ -980,11 +980,7 @@ if_convertible_bb_p (struct loop *loop,
> if (EDGE_COUNT (e->src->succs) == 1)
>   found = true;
>if (!found)
> -   {
> - if (dump_file && (dump_flags & TDF_DETAILS))
> -   fprintf (dump_file, "only critical predecessors\n");
> - return false;
> -   }
> +   split_edge (EDGE_PRED (bb, 0));
>  }
>
>return true;
>
> it changes the number of blocks in the loop, so
> get_loop_body_in_if_conv_order should probably be re-done with the
> above eventually signalling that it created a new block.  Or the above
> should populate a vector of edges to split and do that after the
> loop calling if_convertible_bb_p.
>
> Richard.
>
>> Richard.
>>
>>> Thanks.
>>> Yuri.
>>>
>>> BTW Jeff did initial review of my changes related to predicate
>>> computation for join blocks. I presented him updated patch with
>>> test-case and some minor changes in patch. But still did not get any
>>> feedback on it. Could you please take a look also on it?
>>>
>>>
>>> 2014-10-21 17:38 GMT+04:00 Richard Biener :
>>>> On Tue, Oct 21, 2014 at 3:20 PM, Yuri Rumyantsev  
>>>> wrote:
>>>>> Richard,
>>>>>
>>>>> Yes, This patch does not make sense since phi node predication for bb
>>>>> with critical incoming edges only performs another function which is
>>>>> absent (predicate_extended_scalar_phi).
>>>>>
>>>>> BTW I see that commit_edge_insertions() is used for rtx instructions
>>>>> only but you propose to use it for tree also.
>>>>> Did I miss something?
>>>>
>>>> Ah, it's gsi_commit_edge_inserts () (or gsi_commit_one_edge_insert
>>>> if you want easy access to the newly created basic block to push
>>>> the predicate to - see gsi_commit_edge_inserts implementation).
>>>>
>>>> 

Re: [PATCH,1/2] Extended if-conversion for loops marked with pragma omp simd.

2014-10-24 Thread Yuri Rumyantsev
Richard,

The patch containing the new code related to extended predication is
attached. Here are a few comments which explain the main points of the
design.

1. I don't want to insert any critical edge splitting, since it may
lead to less efficient binaries (I remember a performance issue of this
kind from when we designed the lazy code motion algorithm in a SPARC
compiler).
2. One special case of extended PHI node predication was introduced for
the case when the number of arguments is more than 2 but only two
argument values are different and one of them has a single occurrence.
For such a PHI, conditional scalar reduction is applied.
This corresponds to the following:
if (q1 && q2 && q3) var++
A new function, phi_has_two_different_args, was introduced to detect
such phis; a source-level sketch follows.
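
A hedged source-level illustration of this special case (the names are
mine, not from the patch):

int
count_sketch (float *a, int *c, int n)
{
  int res = 0;
  for (int i = 0; i < n; i++)
    /* The PHI for res can have more than two arguments, but only two
       distinct values: res and res + 1.  */
    if (a[i] > 0.0f && a[i] < 1.0e+17f && c[i] != 0)
      res++;
  /* After conditional scalar reduction this becomes, roughly,
     res += (q1 & q2 & q3) ? 1 : 0, with a single combined predicate.  */
  return res;
}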
3. The original algorithm for PHI predication assumed that at least one
incoming edge of a block containing a PHI is not critical - this
guarantees that all computations related to the predicate of that
normal edge are already inserted above the block, so the code for PHI
predication can be inserted at the beginning of the block. But this is
not true for critical edges, whose predicate computations live in the
very block where the code for phi predication must be inserted. So a
new function, find_insertion_point, is introduced which simply finds
the last statement in the block defining the predicates corresponding
to the incoming edges and inserts the phi predication code after it
(with some minor exceptions).

If you need more comments or if something is unclear, let me know.

Thanks.
Yuri.

ChangeLog:

2014-10-24  Yuri Rumyantsev  

* tree-if-conv.c (ifcvt_can_use_mask_load_store): Use
FLAG_FORCE_VECTORIZE instead of loop flag.
(if_convertible_bb_p): Allow bb has more than 2 predecessors if
FLAG_FORCE_VECTORIZE is true.
(if_convertible_bb_p): Delete check that bb has at least one
non-critical incoming edge.
(phi_has_two_different_args): New function.
(is_cond_scalar_reduction): Add argument EXTENDED to choose access
to phi arguments. Invoke phi_has_two_different_args to get phi
arguments if EXTENDED is true. Change check that block
containing reduction statement candidate is predecessor
of phi-block since phi may have more than two arguments.
(convert_scalar_cond_reduction): Add argument BEFORE to insert
statement before/after gsi point.
(predicate_scalar_phi): Add argument false (which means non-extended
predication) to call of is_cond_scalar_reduction. Add argument
true (which corresponds to argument BEFORE) to call of
convert_scalar_cond_reduction.
(get_predicate_for_edge): New function.
(predicate_arbitrary_scalar_phi): New function.
(predicate_extended_scalar_phi): New function.
(find_insertion_point): New function.
(predicate_all_scalar_phis): Add two boolean variables EXTENDED and
BEFORE. Initialize EXTENDED to true if BB containing phi has more
than 2 predecessors or both incoming edges are critical. Invoke
find_phi_replacement_condition and predicate_scalar_phi or
find_insertion_point and predicate_extended_scalar_phi depending on
EXTENDED value.
(insert_gimplified_predicates): Add check that non-predicated block
may have statements to insert. Insert predicate of BB just after label
if FLAG_FORCE_VECTORIZE is true.
(tree_if_conversion): Add initialization of FLAG_FORCE_VECTORIZE which
is copy of inner or outer loop field force_vectorize.


2014-10-24 13:12 GMT+04:00 Richard Biener :
> On Tue, Oct 21, 2014 at 4:34 PM, Yuri Rumyantsev  wrote:
>> Richard,
>>
>> In my initial design I did such splitting but before start real
>> if-conversion but I decided to not perform it since code size for
>> if-converted loop is growing (number of phi nodes is increased). It is
>> worth noting also that for phi with #nodes > 2 we need to get all
>> predicates (except one) to do phi-predication and it means that block
>> containing such phi can have only 1 critical edge.
>
> Can you point me to the patch with the special insertion code then?
> I definitely want to avoid the mess we ran into with the reassoc
> code "clever" insertion code.
>
> Richard.
>
>> Thanks.
>> Yuri.
>>
>> 2014-10-21 18:19 GMT+04:00 Richard Biener :
>>> On Tue, Oct 21, 2014 at 4:09 PM, Richard Biener
>>>  wrote:
>>>> On Tue, Oct 21, 2014 at 3:58 PM, Yuri Rumyantsev  
>>>> wrote:
>>>>> Richard,
>>>>>
>>>>> I saw the sources of these functions, but I can't understand why I
>>>>> should use something else? Note that all predicate computations are
>>>>> located in basic blocks ( by design of if-conv) and there is special
>>>>> function that put these computations in bb
>>>>> (insert_gimplified_predicates). Edge contains only predicate not its
>>>>> computations. New function - find_insertion_point() does very simple
>>>>> search - it finds out the latest (in 

[PATCH, PR 61391]

2014-06-04 Thread Yuri Rumyantsev
Hi All,

Here is a simple fix for PR 61391 - a check was missing that the
statement's basic block is inside the loop. With this fix the test case
from the bug compiles successfully.
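
The essence of the fix is a guard of roughly this shape in
is_cond_scalar_reduction (a sketch, not the exact patch text):

  /* Reject statements whose basic block lies outside the loop.  */
  if (!flow_bb_inside_loop_p (loop, gimple_bb (stmt)))
    return false;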

Bootstrap and regression testing did not show any new failures.

Is it OK for trunk?

ChangeLog:

2014-06-04  Yuri Rumyantsev  

PR tree-optimization/61319
* tree-if-conv.c (is_cond_scalar_reduction): Add missed check that
stmt belongs to loop.


patch
Description: Binary data


Re: [PATCH, PR 61391]

2014-06-04 Thread Yuri Rumyantsev
Richard.

Here is the updated patch with a test case and a new ChangeLog.

gcc/ChangeLog
2014-06-04  Yuri Rumyantsev  

PR tree-optimization/61319
* tree-if-conv.c (is_cond_scalar_reduction): Add missed check that
stmt belongs to loop.

gcc/testsuite/ChangeLog
* gcc.dg/torture/pr61319.c: New test.


2014-06-04 17:37 GMT+04:00 Richard Biener :
> On Wed, Jun 4, 2014 at 3:37 PM, Richard Biener
>  wrote:
>> On Wed, Jun 4, 2014 at 3:17 PM, Yuri Rumyantsev  wrote:
>>> Hi All,
>>>
>>> Here is a simple fix for 61391 - missed a check that statement basic
>>> block is inside loop. With this fix test-case from bug is compiled
>>> successfully.
>>>
>>> Bootstrap and regression testing did not show any new failures.
>>>
>>> Is it OK for trunk?
>>
>> Ok.
>
> Please also add the testcase.
>
>> Thanks,
>> Richard.
>>
>>> ChangeLog:
>>>
>>> 2014-06-04  Yuri Rumyantsev  
>>>
>>> PR tree-optimization/61319
>>> * tree-if-conv.c (is_cond_scalar_reduction): Add missed check that
>>> stmt belongs to loop.


patch1
Description: Binary data


Re: [PATCH, PR 61391]

2014-06-04 Thread Yuri Rumyantsev
I converted the test case to Unix format; the new patch is attached.

2014-06-04 19:14 GMT+04:00 Jakub Jelinek :
> On Wed, Jun 04, 2014 at 07:11:26PM +0400, Yuri Rumyantsev wrote:
>> Here is update patch with test-case and new ChangeLog.
>
> If approved, please avoid the DOS style line endings in the testcase.
>
> Jakub


patch2
Description: Binary data


Re: [PATCH, PR 61391]

2014-06-04 Thread Yuri Rumyantsev
Sorry, I sent you a bad patch; resending it.

2014-06-04 19:19 GMT+04:00 Yuri Rumyantsev :
> I converted test-case to Unix format and new patch is attached.
>
> 2014-06-04 19:14 GMT+04:00 Jakub Jelinek :
>> On Wed, Jun 04, 2014 at 07:11:26PM +0400, Yuri Rumyantsev wrote:
>>> Here is update patch with test-case and new ChangeLog.
>>
>> If approved, please avoid the DOS style line endings in the testcase.
>>
>> Jakub


patch2
Description: Binary data


[PATCH PR61518]

2014-06-18 Thread Yuri Rumyantsev
Hi All,

Here is a fix for PR 61518 - an additional check was added to reject
the transformation if the reduction variable is used anywhere other
than the reduction statement or the phi function, since such a
reduction will not be vectorized.
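
A hedged sketch of the added check (variable names approximate; the
exact code is in the attached patch):

  use_operand_p use_p;
  imm_use_iterator imm_iter;

  /* Every use of the reduction variable must be the reduction
     statement itself or the phi, otherwise the reduction will not be
     vectorized.  */
  FOR_EACH_IMM_USE_FAST (use_p, imm_iter, reduc_var)
    {
      gimple use_stmt = USE_STMT (use_p);
      if (use_stmt != stmt && use_stmt != phi)
	return false;
    }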

Bootstrap and regression testing did not show any new failures.

Is it OK for trunk?

 gcc/ChangeLog
2014-06-18  Yuri Rumyantsev  

PR tree-optimization/61518
* tree-if-conv.c (is_cond_scalar_reduction): Add missed check that
reduction var is used in reduction stmt or phi-function only.

gcc/testsuite/ChangeLog
* gcc.dg/torture/pr61518.c: New test.


patch
Description: Binary data


[PATCH PR61576]

2014-06-24 Thread Yuri Rumyantsev
Hi All,

Here is a fix for PR 61576 - an additional check was added that the
block containing the reduction statement is a predecessor of the block
containing the phi, so that the correct condition is chosen.
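
The added check amounts to something like the following sketch (not
the literal patch text):

  /* The block of the reduction statement must feed the phi block
     directly, i.e. be one of its predecessors.  */
  if (!find_edge (gimple_bb (stmt), gimple_bb (phi)))
    return false;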

Bootstrap and regression testing did not show any new failures.

Is it OK for trunk?

gcc/ChangeLog
2014-06-24  Yuri Rumyantsev  

PR tree-optimization/61576
* tree-if-conv.c (is_cond_scalar_reduction): Add check that
basic block containing reduction statement is predecessor
of phi basic block.

gcc/testsuite/ChangeLog
* gcc.dg/torture/pr61576.c: New test.


patch
Description: Binary data


Re: [PATCH PR61576]

2014-07-03 Thread Yuri Rumyantsev
Ping!

2014-06-24 13:37 GMT+04:00 Yuri Rumyantsev :
> Hi All,
>
> Here is a fix for PR 61576 - additional test was added that block
> containing reduction statement is predecessor of block containing phi
> to choose the correct condition.
>
> Bootstrap and regression testing did not show any new failures.
>
> Is it OK for trunk?
>
> gcc/ChangeLog
> 2014-06-24  Yuri Rumyantsev  
>
> PR tree-optimization/61576
> * tree-if-conv.c (is_cond_scalar_reduction): Add check that
> basic block containing reduction statement is predecessor
> of phi basi block.
>
> gcc/testsuite/ChangeLog
> * gcc.dg/torture/pr61576.c: New test.


Re: [PATCH] Simple optimization for MASK_STORE.

2015-05-20 Thread Yuri Rumyantsev
Hi All,

Here is the updated patch to optimize mask stores. Its main goal is to
avoid executing a mask store if its mask is the zero vector, since
loads that follow it can be blocked.
The following changes were made:
  1. A test on sink legality was added - it simply prohibits moving
across statements with a non-null vdef or vuse.
  2. A new phi node is created in the join block for moved MASK_STORE
statements.
  3. The test was changed to check that 2 MASK_STORE statements are not
moved to the same block.
  4. A new field was added to the loop_vec_info structure to mark loops
having MASK_STOREs.
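
At the intrinsics level the guarded store amounts to something like the
following AVX2 sketch (my illustration, not code from the patch):

#include <immintrin.h>

void
masked_store_sketch (int *out, __m256i mask, __m256i val)
{
  /* Skip the masked store entirely when the mask is the zero vector,
     so the loads that follow are not blocked by the maskmove hazard.  */
  if (!_mm256_testz_si256 (mask, mask))
    _mm256_maskstore_epi32 (out, mask, val);
}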

Any comments will be appreciated.
Yuri.

2015-05-20  Yuri Rumyantsev  

* config/i386/i386.c: Include files stringpool.h and tree-ssanames.h.
(ix86_vectorize_is_zero_vector): New function.
(TARGET_VECTORIZE_IS_ZERO_VECTOR): New target macro
* doc/tm.texi.in: Add @hook TARGET_VECTORIZE_IS_ZERO_VECTOR.
* doc/tm.texi: Updated.
* target.def (is_zero_vector): New DEFHOOK.
* tree-vect-stmts.c : Include tree-into-ssa.h.
(vectorizable_mask_load_store): Initialize has_mask_store field.
(is_valid_sink): New function.
(optimize_mask_stores): New function.
* tree-vectorizer.c (vectorize_loops): Invoke optimize_mask_stores for
loops having masked stores.
* tree-vectorizer.h (loop_vec_info): Add new has_mask_store field and
corresponding macros.
(optimize_mask_stores): Update prototype.

gcc/testsuite/ChangeLog:
* gcc.target/i386/avx2-vect-mask-store-move1.c: New test.


patch.2
Description: Binary data


[PATCH] Simple fix to enhance outer-loop vectorization.

2015-05-28 Thread Yuri Rumyantsev
Hi All,

Here is a simple patch which removes a restriction on outer-loop
vectorization - references in the inner loop with zero step are now
allowed. This case was found in one important benchmark.
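
A hedged example of the now-accepted shape (my illustration; the actual
testcase is in the patch):

void
zero_step_sketch (float *restrict out, float *restrict in,
		  float *restrict coeff, int n, int m)
{
  for (int i = 0; i < n; i++)		/* outer loop, vectorizable */
    {
      float s = 0.0f;
      for (int j = 0; j < m; j++)
	s += in[i] * coeff[j];		/* in[i]: zero step in the inner loop */
      out[i] = s;
    }
}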

Bootstrap and regression testing did not show any new failures.
Is it OK for trunk?

ChangeLog:
2015-05-28  Yuri Rumyantsev  

* tree-vect-data-refs.c (vect_analyze_data_ref_access): Allow
consecutive accesses within outer-loop vectorization for references
with zero step in inner-loop.

gcc/testsuite/ChangeLog:
* gcc.dg/vect/fast-math-vect-outer-1.c: New test.


patch
Description: Binary data


Re: [PATCH] Simple fix to enhance outer-loop vectorization.

2015-05-28 Thread Yuri Rumyantsev
Richard,

First of all, I don't think it is possible to write a test for
outer-loop vectorization with a zero-step reference, because of
possible loop-carried dependencies, and run-time aliasing checks are
not supported for outer loops. If there are no such dependencies, PRE
or (partial) DSE does the hoisting (sinking) of such invariant
references. So I added a check to accept zero-step references only for
outer loops marked with the force-vectorize flag, which guarantees the
absence of loop-carried dependencies between inner-loop iterations.
I included a run-time test that checks vectorization correctness.

The updated patch is attached.
Yuri.

2015-05-28 14:39 GMT+03:00 Richard Biener :
> On Thu, May 28, 2015 at 1:00 PM, Yuri Rumyantsev  wrote:
>> Hi All,
>>
>> Here is a simple patch which removes restriction on outer-loop
>> vectorization -  allow references in inner-loop with zero step. This
>> case was found in one important benchmark.
>>
>> Bootstrap and regression testing did not show any new failures.
>> Is it OK for trunk.
>>
>> ChangeLog:
>> 2015-05-28  Yuri Rumyantsev  
>>
>> * tree-vect-data-refs.c (vect_analyze_data_ref_access): Allow
>> consecutive accesses within outer-loop vectorization for references
>> with zero step in inner-loop.
>>
>> gcc/testsuite/ChangeLog:
>> * gcc.dg/vect/fast-math-vect-outer-1.c: New test.
>
> Can you please add a non-omp-simd testcase that triggers this as well and that
> is a runtime testcase verifying the transform is correct?
>
> Also please don't add to the strange testcase-name machinery but just
> use { dg-additional-options "-ffast-math" }
>
> Index: tree-vect-data-refs.c
> ===
> --- tree-vect-data-refs.c   (revision 223653)
> +++ tree-vect-data-refs.c   (working copy)
> @@ -2261,7 +2261,6 @@
>return true;
>  }
>
> -
>  /* Analyze the access pattern of the data-reference DR.
> In case of non-consecutive accesses call vect_analyze_group_access() to
> analyze groups of accesses.  */
>
> spurious white-space change
>
>
> @@ -2291,14 +2290,8 @@
>if (loop_vinfo && integer_zerop (step))
>
> Surely the comment before this needs updating now.
>
>  {
>GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt)) = NULL;
> -  if (nested_in_vect_loop_p (loop, stmt))
> -   {
> - if (dump_enabled_p ())
> -   dump_printf_loc (MSG_NOTE, vect_location,
> -"zero step in inner loop of nest\n");
> - return false;
> -   }
> -  return DR_IS_READ (dr);
> +  if (!nested_in_vect_loop_p (loop, stmt))
> +   return DR_IS_READ (dr);
>  }
>
>if (loop && nested_in_vect_loop_p (loop, stmt))
>
> so what happens after the patch?  It would be nice to have a comment
> explaining what happens in the nested_in_vect_loop_p case for
> the case when the outer-loop step is zero and when it is not zero.
>
> In particular as you don't need any code generation changes - this hints
> at that you may miss something ;)
>
> Otherwise of course the patch is ok - lifting restrictions is good.
>
> Thanks,
> Richard.


patch.1
Description: Binary data


Re: [PATCH] Simple optimization for MASK_STORE.

2015-05-29 Thread Yuri Rumyantsev
Hi Richard,

Did you have a chance to look at my updated patch?

Any comments will be appreciated.
Yuri.

2015-05-20 17:00 GMT+03:00 Yuri Rumyantsev :
> Hi All,
>
> Here is updated patch to optimize mask stores. The main goal of it is
> to avoid execution of mask store if its mask is zero vector since
> loads that follow it can be blocked.
> The following changes were done:
>   1.  A test on sink legality was added - it simply prohibits to cross
> statements with non-null vdef or vuse.
>   2. New phi node is created in join block for moved MASK_STORE statements.
>   3. Test was changed to check that 2 MASK_STORE statements are not
> moved to the same block.
>   4. New field was added to loop_vec_info structure to mark loops
> having MASK_STORE's.
>
> Any comments will be appreciated.
> Yuri.
>
> 2015-05-20  Yuri Rumyantsev  
>
> * config/i386/i386.c: Include files stringpool.h and tree-ssanames.h.
> (ix86_vectorize_is_zero_vector): New function.
> (TARGET_VECTORIZE_IS_ZERO_VECTOR): New target macro
> * doc/tm.texi.in: Add @hook TARGET_VECTORIZE_IS_ZERO_VECTOR.
> * doc/tm.texi: Updated.
> * target.def (is_zero_vector): New DEFHOOK.
> * tree-vect-stmts.c : Include tree-into-ssa.h.
> (vectorizable_mask_load_store): Initialize has_mask_store field.
> (is_valid_sink): New function.
> (optimize_mask_stores): New function.
> * tree-vectorizer.c (vectorize_loops): Invoke optimaze_mask_stores for
> loops having masked stores.
> * tree-vectorizer.h (loop_vec_info): Add new has_mask_store field and
> correspondent macros.
> (optimize_mask_stores): Update prototype.
>
> gcc/testsuite/ChangeLog:
> * gcc.target/i386/avx2-vect-mask-store-move1.c: New test.


[PATCH] Yet another simple fix to enhance outer-loop vectorization.

2015-06-08 Thread Yuri Rumyantsev
Hi All,

Here is a simple fix which allows duplication of outer loops in order
to perform peeling for the number of iterations when the outer loop is
marked with pragma omp simd.
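
A hedged example of the enabled shape (my illustration; the new
testcase vect-outer-simd-2.c differs in detail):

#define M 64

void
outer_simd_sketch (float *out, float *in, int n)
{
#pragma omp simd
  for (int i = 0; i < n; i++)		/* outer loop, peeled for niters */
    {
      float s = 0.0f;
      for (int j = 0; j < M; j++)
	s += in[i + j];
      out[i] = s;
    }
}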

Bootstrap and regression testing did not show any new failures.
Is it OK for trunk?

ChangeLog:

2015-06-08  Yuri Rumyantsev  

* tree-vect-loop-manip.c (rename_variables_in_bb): Add argument
to allow renaming of PHI arguments on edges incoming from outer
loop header, add corresponding check before start PHI iterator.
(slpeel_tree_duplicate_loop_to_edge_cfg): Introduce new bool
variable DUPLICATE_OUTER_LOOP and set it to true for outer loops
with true force_vectorize.  Set-up dominator for outer loop too.
Pass DUPLICATE_OUTER_LOOP as argument to rename_variables_in_bb.
(slpeel_can_duplicate_loop_p): Allow duplicate of outer loop if it
was marked with force_vectorize and has restricted cfg.
* tree-vect-loop.c (vect_analyze_loop_2): Prohibit alignment peeling
for outer loops.

gcc/testsuite/ChangeLog:
* gcc.dg/vect/vect-outer-simd-2.c: New test.


patch.1
Description: Binary data


[PATCH] Simple optimization for MASK_STORE.

2015-05-06 Thread Yuri Rumyantsev
Hi All,

Here is a patch which gives us a significant speed-up on HASWELL for a
test containing masked stores. The main goal of the patch is to avoid
the HW hazard for maskmove instructions by inserting an additional
check for a zero mask and putting all masked store statements into a
separate block on the false edge. All MASK_STORE statements having the
same mask are put into one block. Any comments will be appreciated.

ChangeLog:
2015-05-06  Yuri Rumyantsev  

* cfgloop.h (has_mask_store): Add new field to struct loop.
* config/i386/i386.c: Include files stringpool.h and tree-ssanames.h.
(ix86_vectorize_zero_vector): New function.
(TARGET_VECTORIZE_ZERO_VECTOR): New target macro
* doc/tm.texi.in: Add @hook TARGET_VECTORIZE_ZERO_VECTOR.
* doc/tm.texi: Updated.
* target.def (zero_vector): New DEFHOOK.
* tree-if-conv.c (predicate_mem_writes): Set has_mask_store for loop.
* tree-vect-stmts.c : Include tree-into-ssa.h.
(optimize_mask_stores): New function.
* tree-vectorizer.c (vectorize_loops): Zero has_mask_store field for
non-vectorized loops and invoke optimize_mask_stores function.

gcc/testsuite/ChangeLog:
* gcc.target/i386/avx2-vect-mask-store-move1.c: New test.


patch.1
Description: Binary data


[PATCH PR63941]

2014-11-28 Thread Yuri Rumyantsev
Hi All,

Here is a simple fix for PR 63941 - a wrong assertion that the
dominating bb cannot have a true predicate, which in fact it may. I
deleted the assertion and instead set the predicate for the join bb
only when it is non-true. With this fix the test case from the bug
compiles successfully.
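
The change in add_to_predicate_list is roughly the following (a sketch;
identifier names approximate):

  /* Was: gcc_assert (!is_true_predicate (bc)); set_bb_predicate (bb, bc);
     A dominating bb may legitimately have a true predicate, so only
     record the predicate when it is non-true.  */
  if (!is_true_predicate (bc))
    set_bb_predicate (bb, bc);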

Bootstrap and regression testing did not show any new failures.

Is it OK for trunk?

gcc/ChangeLog
2014-11-28  Yuri Rumyantsev  

PR tree-optimization/63941
* tree-if-conv.c (add_to_predicate_list): Delete wrong assertion that
DOM_BB has non-true predicate, conditionally set non-true predicate
for BB.

gcc/testsuite/ChangeLog
* gcc.dg/torture/pr63941.c: New test.


patch
Description: Binary data


Re: [PATCH 2/3] Extended if-conversion

2014-12-01 Thread Yuri Rumyantsev
Hi Richard,

I am resending patch1 and patch2 with minor changes:
1. I renamed flag_force_vectorize to aggressive_if_conv.
2. A static cast is used for the first argument of gimple_phi_arg_edge.
I am also very sorry that I sent you a bad patch.

Now let me answer your questions related to the second patch.
1. Why do we need both predicate_extended_scalar_phi and
predicate_arbitrary_scalar_phi?

Let's consider the following simple test-case:

  #pragma omp simd safelen(8)
  for (i=0; i<512; i++)
  {
float t = a[i];
if (t > 0.0f & t < 1.0e+17f)
  if (c[i] != 0)  /* c is integer array. */
res += 1;
  }

we can see the following phi node corresponding to res:

# res_1 = PHI 

It is clear that we can optimize it to a phi node with only 2
arguments, and only one check can be used for phi predication (for the
reduction in our case), namely the predicate of bb_5. In the general
case we can't do this even if we sort all phi argument values, since we
still have to produce a chain of cond expressions to perform phi
predication (see the comments for predicate_arbitrary_scalar_phi).
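
For illustration, the general lowering of a three-argument PHI is a
chain of COND_EXPRs, one per extra distinct argument (a hedged sketch
in source form):

static int
phi_chain_sketch (int pred_e1, int pred_e2, int a1, int a2, int a3)
{
  int t = pred_e2 ? a2 : a3;	/* innermost COND_EXPR */
  return pred_e1 ? a1 : t;	/* outermost COND_EXPR */
}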
2. Why do we need to introduce find_insertion_point?
Let's consider another test case extracted from 175.vpr (t5.c is
attached): bb_7 and bb_9, which contain phi nodes, have only critical
incoming edges, and both contain code computing edge predicates, e.g.

:
# xmax_edge_30 = PHI 
_46 = xmax_17 == xmax_37;
_47 = xmax_17 == xmax_27;
_48 = _46 & _47;
_53 = xmax_17 == xmax_37;
_54 = ~_53;
_55 = xmax_17 == xmax_27;
_56 = _54 & _55;
_57 = _48 | _56;
xmax_edge_19 = xmax_edge_39 + 1;
goto ;

It is evident that we cannot put the phi predication at the beginning
of the block but need to put it after the predicate computations.
Note also that if there are no critical edges for the phi arguments,
the insertion point will be "after labels". Note further that the phi
result can have a use in this block too, so we can't put the
predication code at the end of the block.

Let me know if you still have any questions.

Best regards.
Yuri.




2014-11-28 15:43 GMT+03:00 Richard Biener :
> On Wed, Nov 12, 2014 at 2:35 PM, Yuri Rumyantsev  wrote:
>> Hi All,
>>
>> Here is the second patch related to extended predication.
>> Few comments which explain a main goal of design.
>>
>> 1. I don't want to insert any critical edge splitting since it may
>> lead to less efficient binaries.
>> 2. One special case of extended PHI node predication was introduced
>> when #arguments is more than 2 but only two arguments are different
>> and one argument has the only occurrence. For such PHI conditional
>> scalar reduction is applied.
>> This is correspondent to the following statement:
>> if (q1 && q2 && q3) var++
>>  New function phi_has_two_different_args was introduced to detect such phi.
>> 3. Original algorithm for PHI predication used assumption that at
>> least one incoming edge for blocks containing PHI is not critical - it
>> guarantees that all computations related to predicate of normal edge
>> are already inserted above this block and
>> code related to PHI predication can be inserted at the beginning of
>> block. But this is not true for critical edges for which predicate
>> computations are  in the block where code for phi predication must be
>> inserted. So new function find_insertion_point is introduced which is
>> simply found out the last statement in block defining predicates
>> correspondent to all incoming edges and insert phi predication code
>> after it (with some minor exceptions).
>
> Unfortunately the patch doesn't apply for me - I get
>
> patch:  malformed patch at line 505: @@ -1720,6 +2075,8 @@
> predicate_all_scalar_phis (struct loop *loop)
>
> a few remarks nevertheless.  I don't see how we need both
> predicate_extended_scalar_phi and predicate_arbitrary_scalar_phi.
> Couldn't we simply sort an array of (edge, value) pairs after value
> and handle equal values specially in predicate_extended_scalar_phi?
> That would even make PHI  more optimal.
>
> I don't understand the need for find_insertion_point.  All SSA names
> required for the predicates are defined upward - and the complex CFG
> is squashed to a single basic-block, thus the defs will dominate the
> inserted code if you insert after labels just like for the other case.
> Or what am I missing?  ("flattening" of the basic-blocks of course needs
> to happen in dominator order - but I guess that happens already?)
>
> I'd like the extended PHI handling to be enablable by a flag even
> for !force-vectorization - I've seen cases with 3 PHI args multiple
> times that would have been nice to vectorize.  I suggest to
> add -ftree-loop-if-convert-aggressive for this.  We can do this as
> followup, but please rename 

Re: [PATCH 2/3] Extended if-conversion

2014-12-02 Thread Yuri Rumyantsev
Thanks Richard for your quick reply!

1. I agree that we can combine predicate_extended_ and
predicate_arbitrary_ into one function as you proposed.
2. What is your opinion on using a simpler rule for the insertion
point - if the bb has a use of the phi result, insert the phi
predication before it, and at the end of the bb otherwise? I assume
that critical edge splitting is not a good choice.

Best regards.
Yuri.

2014-12-02 16:28 GMT+03:00 Richard Biener :
> On Mon, Dec 1, 2014 at 4:53 PM, Yuri Rumyantsev  wrote:
>> Hi Richard,
>>
>> I resend you patch1 and patch2 with minor changes:
>> 1. I renamed flag_force_vectorize to aggressive_if_conv.
>> 2. Use static cast for the first argument of gimple_phi_arg_edge.
>> I also very sorry that I sent you bad patch.
>>
>> Now let me answer on your questions related to second patch.
>> 1. Why we need both predicate_extended_scalar_phi and
>> predicate_arbitrary_scalar_phi?
>>
>> Let's consider the following simple test-case:
>>
>>   #pragma omp simd safelen(8)
>>   for (i=0; i<512; i++)
>>   {
>> float t = a[i];
>> if (t > 0.0f & t < 1.0e+17f)
>>   if (c[i] != 0)  /* c is integer array. */
>> res += 1;
>>   }
>>
>> we can see the following phi node correspondent to res:
>>
>> # res_1 = PHI 
>>
>> It is clear that we can optimize it to phi node with 2 arguments only
>> and only one check can be used for phi predication (for reduction in
>> our case), namely predicate of bb_5. In general case we can't do it
>> even if we sort all phi argument values since we still have to produce
>> a chain of cond expressions to perform phi predication (see comments
>> for predicate_arbitrary_scalar_phi).
>
> How so?  We can always use !(condition) for the "last" value, thus
> treat it as an 'else' case.  That even works for
>
> # res_1 = PHI 
>
> where the condition for edges 5 and 7 can be computed as
> ! (condition for 3 || condition for 4).
>
> Of course it is worthwhile to also sort single-occurances first
> so your case gets just the condiiton for edge 5 and its inversion
> used for edges 3 and 4 combined.
>
>> 2. Why we need to introduce find_insertion_point?
>>  Let's consider another test-case extracted from 175.vpr ( t5.c is
>> attached) and we can see that bb_7 and bb_9 containig phi nodes has
>> only critical incoming edges and both contain code computing edge
>> predicates, e.g.
>>
>> :
>> # xmax_edge_30 = PHI 
>> _46 = xmax_17 == xmax_37;
>> _47 = xmax_17 == xmax_27;
>> _48 = _46 & _47;
>> _53 = xmax_17 == xmax_37;
>> _54 = ~_53;
>> _55 = xmax_17 == xmax_27;
>> _56 = _54 & _55;
>> _57 = _48 | _56;
>> xmax_edge_19 = xmax_edge_39 + 1;
>> goto ;
>>
>> It is evident that we can not put phi predication at the block
>> beginning but need to put it after predicate computations.
>> Note also that if there are no critical edges for phi arguments
>> insertion point will be "after labels" Note also that phi result can
>> have use in this block too, so we can't put predication code to the
>> block end.
>
> So the issue is that predicate insertion for edge predicates does
> not happen on the edge but somewhere else (generally impossible
> for critical edges unless you split them).
>
> I think I've told you before that I prefer simple solutions to such issues,
> like splitting the edge!  Certainly not involving a function walking
> GENERIC expressions.
>
> Thanks,
> Richard.
>
>> Let me know if you still have any questions.
>>
>> Best regards.
>> Yuri.
>>
>>
>>
>>
>> 2014-11-28 15:43 GMT+03:00 Richard Biener :
>>> On Wed, Nov 12, 2014 at 2:35 PM, Yuri Rumyantsev  wrote:
>>>> Hi All,
>>>>
>>>> Here is the second patch related to extended predication.
>>>> Few comments which explain a main goal of design.
>>>>
>>>> 1. I don't want to insert any critical edge splitting since it may
>>>> lead to less efficient binaries.
>>>> 2. One special case of extended PHI node predication was introduced
>>>> when #arguments is more than 2 but only two arguments are different
>>>> and one argument has the only occurrence. For such PHI conditional
>>>> scalar reduction is applied.
>>>> This is correspondent to the following statement:
>>>> if (q1 && q2 && q3) var++
>>>>  New function phi_has_two_different_args was introduced to detect such phi.

[PATCH PR65161]

2015-02-24 Thread Yuri Rumyantsev
Hi All!

Here is a simple patch to skip instruction reordering under selective
scheduling, since the reordering code uses the list-scheduling
interface defined in "sched-int.h".

Bootstrap and regression testing did not show any new failures.

Is it OK for trunk?

ChangeLog:
2015-02-24  Yuri Rumyantsev  

PR target/65161
* config/i386/i386.c (ix86_sched_reorder): Skip instruction reordering
for selective scheduling.


patch
Description: Binary data


Re: [PATCH PR65161]

2015-02-25 Thread Yuri Rumyantsev
Hi All,

I prepared a new patch which includes a test case.

I can't agree with the patch proposed by Alexander, since other
functions doing ready-list reordering also use the HID interface, so I
put the escape check in ix86_sched_reorder.
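
The escape check is essentially the following fragment at the top of
ix86_sched_reorder (a sketch; the early-return value is my assumption):

  int n_ready = *pn_ready;

  /* The reordering heuristics below inspect h_i_d, which is not
     populated during selective scheduling, so bail out early.  */
  if (sel_sched_p ())
    return ix86_issue_rate ();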

Is it OK for trunk?

2015-02-25  Yuri Rumyantsev  

PR target/65161
* config/i386/i386.c (ix86_sched_reorder): Skip instruction reordering
for selective scheduling.

gcc/testsuite/ChangeLog
* gcc.target/i386/pr65161.c: New test.


2015-02-25 2:54 GMT+03:00 Andrey Belevantsev :
> On 24.02.2015 21:16, Alexander Monakov wrote:
>>
>>
>>
>> On Tue, 24 Feb 2015, Yuri Rumyantsev wrote:
>>
>>> Hi All!
>>>
>>> Here is a simple patch to not perform instruction reordering for
>>> selective scheduling since it uses interface of list scheduling
>>> defined in "sched-int.h".
>>
>>
>> As I see, the exact problem is that swap_top_of_ready_list accesses HID,
>> so
>> please use the more specialized patch below instead.
>
>
> You have missed a space before call parentheses in the patch, otherwise it
> looks fine.
>
> Andrey
>
>
>
>>
>> Thanks.
>> Alexander
>>
>> diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
>> index 7f5796a..6eccd54 100644
>> --- a/gcc/config/i386/i386.c
>> +++ b/gcc/config/i386/i386.c
>> @@ -26615,6 +26615,12 @@ swap_top_of_ready_list (rtx_insn **ready, int
>> n_ready)
>> dep_t dep;
>> int clock1 = -1;
>> int clock2 = -1;
>> +
>> +  /* The following heuristic inspects h_i_d, but it is not extended for
>> insns
>> + created when doing selective scheduling.  */
>> +  if (sel_sched_p())
>> +return false;
>> +
>> #define INSN_TICK(INSN) (HID (INSN)->tick)
>>
>> if (!TARGET_SILVERMONT && !TARGET_INTEL)
>>
>


patch.1
Description: Binary data


Re: [PATCH PR65161]

2015-02-25 Thread Yuri Rumyantsev
Here is the updated patch, revised according to Alexander's comments.

BTW, another function using the HID interface is do_reorder_for_imul,
which is called from ix86_sched_reorder.

Is it OK for trunk?

2015-02-25 13:26 GMT+03:00 Alexander Monakov :
>
>
> On Wed, 25 Feb 2015, Yuri Rumyantsev wrote:
>
>> Hi All,
>>
>> I prepared new patch which includes test-case.
>>
>> I can't agree with patch proposed by Alexander since other functions
>> doing ready list reordering also use HID interface, so I put escape
>> check in ix86_sched_reorder.
>
> I don't see how that is the case.  Can you point me to specific lines of code
> in ix86_sched_reorder or its callees that access HID?
>
> In any case please use sel_sched_p () rather than flag_selective_scheduling2.
>
> Thanks.
>
> Alexander


patch.2
Description: Binary data


Re: [PATCH PR65161]

2015-02-25 Thread Yuri Rumyantsev
I modified the patch according to Alexander's comments.

Is it OK for trunk?

2015-02-25 15:38 GMT+03:00 Alexander Monakov :
>
>
> On Wed, 25 Feb 2015, Yuri Rumyantsev wrote:
>
>> Here is updated patch accordingly to Alexander comments.
>>
>> BTW another function using HID interface is do_reorder_for_imul and it
>> is called from ix86_sched_reorder.
>
> do_reorder_for_imul uses dependency list iteration macros, which use HDID, not
> HID, and HDID is populated during selective scheduling.  How exactly does
> do_reorder_for_imul access HID?
>
> Alexander


patch.3
Description: Binary data


Re: [PATCH PR65161]

2015-02-25 Thread Yuri Rumyantsev
Hi All,

Here is the updated patch, which fixes the ICE.

Is it OK for trunk?

2015-02-25  Yuri Rumyantsev  

PR target/65161
* config/i386/i386.c (ix86_sched_reorder): Skip instruction reordering
for selective scheduling.

gcc/testsuite/ChangeLog
* gcc.target/i386/pr65161.c: New test.

2015-02-25 17:04 GMT+03:00 Alexander Monakov :
>
>
> On Wed, 25 Feb 2015, Yuri Rumyantsev wrote:
>
>> I modified patch accordingly to Alexander comments.
>>
>> Is it OK for trunk?
>
> If possible, please add a short comment explaining why a shortcut is
> necessary, for example "HID is not populated during selective scheduling".
>
> OK for trunk from selective scheduler maintainer's perspective.
>
> Thanks.
> Alexander


patch.3
Description: Binary data


Re: [PATCH] Yet another simple fix to enhance outer-loop vectorization.

2015-06-16 Thread Yuri Rumyantsev
Thanks a lot, Richard, for your review.

I am presenting an updated patch which is not gated by force_vectorize.
I added a test on the outer loop in vect_enhance_data_refs_alignment,
and it returns false there because we cannot improve dr alignment
through outer-loop peeling in general. So I assume that only versioning
for alignment can be applied on targets that do not support unaligned
memory accesses.

I did not change the tests for outer loops in
slpeel_can_duplicate_loop_p as you proposed, since it is not called
outside of vectorization.

I also noticed one unresolved issue with outer-loop peeling - we don't
consider the remainder for possible vectorization of its inner loop, as
we can see in the following example:

  for (i = 0; i < n; i++) {
diff = 0;
for (j = 0; j < M; j++) {
  diff += in[j+i]*coeff[j];
}
out[i] = diff;
  }

Is it worth fixing?

2015-06-16  Yuri Rumyantsev  

* tree-vect-loop-manip.c (rename_variables_in_bb): Add argument
to allow renaming of PHI arguments on edges incoming from outer
loop header, add corresponding check before start PHI iterator.
(slpeel_tree_duplicate_loop_to_edge_cfg): Introduce new bool
variable DUPLICATE_OUTER_LOOP and set it to true for outer loops
with true force_vectorize.  Set-up dominator for outer loop too.
Pass DUPLICATE_OUTER_LOOP as argument to rename_variables_in_bb.
(slpeel_can_duplicate_loop_p): Allow duplicate of outer loop if it
was marked with force_vectorize and has restricted cfg.
* tree-vect-data-refs.c (vector_alignment_reachable_p): Alignment can
not be reachable for outer loops.
(vect_enhance_data_refs_alignment): Add test on true value of
do_peeling.

gcc/testsuite/ChangeLog:
* gcc.dg/vect/vect-outer-simd-2.c: New test.

2015-06-09 16:26 GMT+03:00 Richard Biener :
> On Mon, Jun 8, 2015 at 12:27 PM, Yuri Rumyantsev  wrote:
>> Hi All,
>>
>> Here is a simple fix which allows duplication of outer loops to
>> perform peeling for number of iterations if outer loop is marked with
>> pragma omp simd.
>>
>> Bootstrap and regression testing did not show any new failures.
>> Is it OK for trunk?
>
> Hmm, I don't remember needing to adjust rename_variables_in_bb
> when teaching loop distibution to call slpeel_tree_duplicate_to_edge_cfg
> on non-innermost loops...  (I just copied, I never called
> slpeel_can_duplicate_loop_p though).
>
> So - you should just remove the loop->inner condition from
> slpeel_can_duplicate_loop_p as it is used by non-vectorizer
> code as well (yeah, I never merged the nested loop support
> for loop distribution...).
>
> Index: tree-vect-loop.c
> ===
> --- tree-vect-loop.c(revision 224100)
> +++ tree-vect-loop.c(working copy)
> @@ -1879,6 +1879,10 @@
>return false;
>  }
>
> +  /* Peeling for alignment is not supported for outer-loop vectorization.  */
> +  if (LOOP_VINFO_LOOP (loop_vinfo)->inner)
> +LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo) = 0;
>
> I think you can't simply do this - if vect_enhance_data_refs_alignment
> decided to peel for alignment then it has adjusted the DRs alignment
> info already.  So instead of the above simply disallow peeling for
> alignment in vect_enhance_data_refs_alignment?  Thus add
> || ->inner to
>
>   /* Check if we can possibly peel the loop.  */
>   if (!vect_can_advance_ivs_p (loop_vinfo)
>   || !slpeel_can_duplicate_loop_p (loop, single_exit (loop)))
> do_peeling = false;
>
> ?
>
> I also can't see why the improvement has to be gated on force_vect,
> it surely looks profitable to enable more outer loop vectorization in
> general, no?
>
> How do the cost model calculations end up with peeling the outer loop
> for niter?
>
> On targets which don't support unaligned accesses we're left with
> versioning for alignment.  Isn't peeling for alignment better there?
> Thus only disallow peeling for alignment if there is no unhandled
> alignment?
>
> Thanks,
> Richard.
>
>> ChangeLog:
>>
>> 2015-06-08  Yuri Rumyantsev  
>>
>> * tree-vect-loop-manip.c (rename_variables_in_bb): Add argument
>> to allow renaming of PHI arguments on edges incoming from outer
>> loop header, add corresponding check before starting the PHI iterator.
>> (slpeel_tree_duplicate_loop_to_edge_cfg): Introduce new bool
>> variable DUPLICATE_OUTER_LOOP and set it to true for outer loops
>> with true force_vectorize.  Set up dominator for outer loop too.
>> Pass DUPLICATE_OUTER_LOOP as argument to rename_variables_in_bb.
>> (slpeel_can_duplicate_loop_p): Allow duplicate of outer loop if it
>> was marked with force_vectorize and has restricted cfg.
>> * tree-vect-loop.c (vect_analyze_loop_2): Prohibit alignment peeling
>> for outer loops.
>>
>> gcc/testsuite/ChangeLog:
>> * gcc.dg/vect/vect-outer-simd-2.c: New test.


patch.1.2
Description: Binary data


Re: [PATCH] Yet another simple fix to enhance outer-loop vectorization.

2015-06-17 Thread Yuri Rumyantsev
Richard,

I attached updated patch.
You asked me to explain why I did changes for renaming.
If we do not change the PHI arguments for the inner-loop header we can get the
following IR:
source outer loop:
  : outer-loop header
  # i_45 = PHI <0(4), i_18(9)>
  # .MEM_17 = PHI <.MEM_4(D)(4), .MEM_44(9)>

  :inner-loop header
  # .MEM_46 = PHI <.MEM_44(7), .MEM_17(5)>

after duplication we have (without argument correction):
  :
  # i_74 = PHI 
  # .MEM_73 = PHI <.MEM_97(13), .MEM_4(D)(17)>

  :
  # .MEM_63 = PHI <.MEM_17(12), .MEM_97(16)>

and later we get verifier error:
test1.c:20:6: error: definition in block 6 does not dominate use in block 10
 void foo (int n)
  ^
for SSA_NAME: .MEM_17 in statement:
.MEM_63 = PHI <.MEM_17(10), .MEM_97(14)>

and you can see that we need to rename the MEM_17 argument on the
outgoing edge to MEM_73, since MEM_17 was renamed to MEM_73 in the
outer-loop header.
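
For illustration, with the fix applied the inner-loop header PHI from
the example above would read

  # .MEM_63 = PHI <.MEM_73(12), .MEM_97(16)>

i.e. the argument on the edge from the outer-loop header now uses the
renamed definition.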

This explains my simple fix in rename_variables_in_bb.
Note also that loop distribution is not performed for outer loops.

I also did a change in slpeel_can_duplicate_loop_p to simplify the check.

Any comments will be appreciated.

Yuri.

2015-06-17 15:24 GMT+03:00 Richard Biener :
> On Tue, Jun 16, 2015 at 4:12 PM, Yuri Rumyantsev  wrote:
>> Thanks a lot Richard for your review.
>>
>> I presented an updated patch which is not gated by force_vectorize.
>> I added a test on the outer loop in vect_enhance_data_refs_alignment
>> and it returns false for it because we cannot improve DR alignment
>> through outer-loop peeling in general. So I assume that only
>> versioning for alignment can be applied for targets that do not
>> support unaligned memory access.
>
> @@ -998,7 +998,12 @@
>   gimple stmt = DR_STMT (dr);
>   stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
>   tree vectype = STMT_VINFO_VECTYPE (stmt_info);
> +  loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
>
> +  /* Peeling of outer loops can't improve alignment.  */
> +  if (loop_vinfo && LOOP_VINFO_LOOP (loop_vinfo)->inner)
> +    return false;
> +
>
> but this looks wrong.  It depends on the context (DRs in the outer loop
> can improve alignment by peeling the outer loop and we can still
> peel the inner loop for alignment).
>
> So IMHO the correct place to amend is vect_enhance_data_refs_alignment
> (which it seems doesn't consider peeling the inner loop).
>
> I'd say you should simply add
>
>  || loop->inner)
>
> to the
>
>   /* Check if we can possibly peel the loop.  */
>   if (!vect_can_advance_ivs_p (loop_vinfo)
>       || !slpeel_can_duplicate_loop_p (loop, single_exit (loop)))
>     do_peeling = false;
>
> check.
>
>> I did not change the tests for outer loops in slpeel_can_duplicate_loop_p
>> as you proposed, since it is not called outside the vectorizer.
>
> There is still no reason for this complex condition, so please remove it.
>
> _Please_ also generate diffs with -p, it is very tedious to see patch hunks
> without a function name context.
>
> You didn't explain why you needed the renaming changes as I don't
> remember needing it when using the code from loop distribution.
>
>> I also noticed one unresolved issue with outer-loop peeling - we don't
>> consider the remainder for possible vectorization of the inner loop, as
>> we can see in the following example:
>>
>>   for (i = 0; i < n; i++) {
>> diff = 0;
>> for (j = 0; j < M; j++) {
>>   diff += in[j+i]*coeff[j];
>> }
>> out[i] = diff;
>>   }
>>
>> Is it worth fixing?
>
> You mean vectorizing the inner loop in the niter peel epilogue loop
> of the outer loop?  Possibly yes.
>
> Thanks,
> Richard.
>
>> 2015-06-16  Yuri Rumyantsev  
>>
>> * tree-vect-loop-manip.c (rename_variables_in_bb): Add argument
>> to allow renaming of PHI arguments on edges incoming from outer
>> loop header, add corresponding check before starting the PHI iterator.
>> (slpeel_tree_duplicate_loop_to_edge_cfg): Introduce new bool
>> variable DUPLICATE_OUTER_LOOP and set it to true for outer loops
>> with true force_vectorize.  Set up dominator for outer loop too.
>> Pass DUPLICATE_OUTER_LOOP as argument to rename_variables_in_bb.
>> (slpeel_can_duplicate_loop_p): Allow duplicate of outer loop if it
>> was marked with force_vectorize and has restricted cfg.
>> * tree-vect-data-refs.c (vector_alignment_reachable_p): Alignment can
>> not be reachable for outer loops.
>> (vect_enhance_data_refs_alignment): Add test on true value of
>> do_peeling.
>>
>> gcc/testsuite/ChangeLog:
>> * gcc.dg/vect/vect-outer-simd-2.c: New test.
>>
>> 2015-06-09

Re: [PATCH] Simple optimization for MASK_STORE.

2015-06-18 Thread Yuri Rumyantsev
Richard,

Here is an updated patch which does not include your proposal to
delete the target hook.
You wrote:
> I still don't understand why you need the new target hook.  If we have a 
> masked
> load/store then the mask is computed by an assignment with a VEC_COND_EXPR
> (in your example) and thus a test for a zero mask is expressible as just
>
 > if (vect__ifc__41.17_167 == { 0, 0, 0, 0... })
>
> or am I missing something?
Such a vector compare produces a vector and does not set up the CC
flags required for a conditional branch (GIMPLE_COND).
If we use such a comparison in a GIMPLE_COND we get an error message,
so I used a target hook which does set up the CC flags (via the ptest
instruction), and I kept this part of the patch.
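
To make the distinction concrete, here is a user-level sketch of the
zero-mask test on x86 using AVX intrinsics (illustrative only - the
hook emits the equivalent compare-and-set-flags RTL, not intrinsics):

  #include <immintrin.h>

  /* vptest sets the CC flags directly from vector operands, so the
     result can feed a conditional branch; a plain vector compare only
     produces another vector.  */
  int
  mask_is_zero (__m256i mask)
  {
    /* ZF is set iff (mask & mask) is all zeros.  */
    return _mm256_testz_si256 (mask, mask);
  }
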
I moved the new routines to the tree-vectorizer.c file and made both
of them static.
I rewrote the is_valid_sink function to use the def-use chain, as you
proposed.
I also added a new parameter to control this transformation.
A few redundant tests have also been deleted.
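
A minimal sketch of what a def-use based sink check can look like,
assuming GCC's immediate-use iterators from ssa-iterators.h
(illustrative only; the rewritten function in the patch may differ):

  static bool
  is_valid_sink (gimple stmt)
  {
    tree vdef = gimple_vdef (stmt);
    imm_use_iterator imm_iter;
    use_operand_p use_p;

    /* STMT must not sink past a statement in its own block that
       depends on STMT's virtual definition.  */
    if (vdef)
      FOR_EACH_IMM_USE_FAST (use_p, imm_iter, vdef)
        if (gimple_bb (USE_STMT (use_p)) == gimple_bb (stmt))
          return false;
    return true;
  }
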
Any comments will be appreciated.

Thanks.
Yuri.

2015-06-18  Yuri Rumyantsev  

* config/i386/i386.c: Include files stringpool.h and tree-ssanames.h.
(ix86_vectorize_build_zero_vector_test): New function.
(TARGET_VECTORIZE_BUILD_ZERO_VECTOR_TEST): New target macro.
* doc/tm.texi.in: Add @hook TARGET_VECTORIZE_BUILD_ZERO_VECTOR_TEST.
* doc/tm.texi: Updated.
* params.def (PARAM_ZERO_TEST_FOR_STORE_MASK): New DEFPARAM.
* params.h (ENABLE_ZERO_TEST_FOR_STORE_MASK): New macro.
* target.def (build_zero_vector_test): New DEFHOOK.
* tree-vect-stmts.c (vectorizable_mask_load_store): Initialize
has_mask_store field of vect_info.
* tree-vectorizer.c: Include files stringpool.h and tree-ssanames.h.
(is_valid_sink): New function.
(optimize_mask_stores): New function.
(vectorize_loops): Invoke optimize_mask_stores for loops having masked
stores.
* tree-vectorizer.h (loop_vec_info): Add new has_mask_store field and
correspondent macros.

gcc/testsuite/ChangeLog:
* gcc.target/i386/avx2-vect-mask-store-move1.c: New test.



2015-06-09 15:13 GMT+03:00 Richard Biener :
> On Wed, May 20, 2015 at 4:00 PM, Yuri Rumyantsev  wrote:
>> Hi All,
>>
>> Here is updated patch to optimize mask stores. The main goal of it is
>> to avoid execution of mask store if its mask is zero vector since
>> loads that follow it can be blocked.
>> The following changes were done:
>>   1.  A test on sink legality was added - it simply prohibits to cross
>> statements with non-null vdef or vuse.
>>   2. New phi node is created in join block for moved MASK_STORE statements.
>>   3. Test was changed to check that 2 MASK_STORE statements are not
>> moved to the same block.
>>   4. New field was added to loop_vec_info structure to mark loops
>> having MASK_STORE's.
>>
>> Any comments will be appreciated.
>> Yuri.
>
> I still don't understand why you need the new target hook.  If we have a 
> masked
> load/store then the mask is computed by an assignment with a VEC_COND_EXPR
> (in your example) and thus a test for a zero mask is expressible as just
>
>  if (vect__ifc__41.17_167 == { 0, 0, 0, 0... })
>
> or am I missing something?  As we don't want this transform on all targets
> (or always) we can control it by either a --param or a new flag which default
> targets can adjust.  [Is the hazard still present with Skylake or other
> AVX512 implementations for example?]
>
> +/* Helper for optimize_mask_stores: returns true if STMT sinking to end
> +   of BB is valid and false otherwise.  */
> +
> +static bool
> +is_valid_sink (gimple stmt)
> +{
>
> so STMT is either a masked store or a masked load?  You shouldn't
> walk all stmts here but instead rely on the factored use-def def-use
> chains of virtual operands.
>
> +void
> +optimize_mask_stores (struct loop *loop)
> +{
> +  basic_block bb = loop->header;
> +  gimple_stmt_iterator gsi;
> +  gimple stmt;
> +  auto_vec<gimple> worklist;
> +
> +  if (loop->dont_vectorize
> +      || loop->num_nodes > 2)
> +    return;
>
> looks like no longer needed given the place this is called from now
> (or does it guard against outer loop vectorization as well?)
> Also put it into tree-vect-loop.c and make it static.
>
> +  /* Loop was not vectorized if mask does not have vector type.  */
> +  if (!VECTOR_TYPE_P (TREE_TYPE (mask)))
> +   return;
>
> this should be always false
>
> +  store_bb->frequency = bb->frequency / 2;
> +  efalse->probability = REG_BR_PROB_BASE / 2;
>
> I think the == 0 case should end up as unlikely.
>
> +  if (dom_info_available_p (CDI_POST_DOMINATORS))
> +   set_immediate_dominator (CDI_POST_DOMINATORS, bb, store_bb);
>
> post dominators are not available in the vectorizer.
>
> + /* Put

Re: [PATCH] Yet another simple fix to enhance outer-loop vectorization.

2015-06-29 Thread Yuri Rumyantsev
Hi All,

Here is an updated patch containing the missing change in
slpeel_tree_peel_loop_to_edge which prevents renaming of exit PHI uses
in the inner loop.

ChangeLog:
2015-06-29  Yuri Rumyantsev  

* tree-vect-loop-manip.c (rename_variables_in_bb): Add argument
to allow renaming of PHI arguments on edges incoming from outer
loop header, add corresponding check before starting the PHI iterator.
(slpeel_tree_duplicate_loop_to_edge_cfg): Introduce new bool
variable DUPLICATE_OUTER_LOOP and set it to true for outer loops
with true force_vectorize.  Set up dominator for outer loop too.
Pass DUPLICATE_OUTER_LOOP as argument to rename_variables_in_bb.
(slpeel_can_duplicate_loop_p): Allow duplicate of outer loop if it
was marked with force_vectorize and has restricted cfg.
(slpeel_tree_peel_loop_to_edge): Do not rename exit PHI uses in
inner loop.
* tree-vect-data-refs.c (vect_enhance_data_refs_alignment): Do not
do peeling for outer loops.

gcc/testsuite/ChangeLog:
* gcc.dg/vect/vect-outer-simd-2.c: New test.

2015-06-17 19:35 GMT+03:00 Yuri Rumyantsev :
> Richard,
>
> I attached updated patch.
> You asked me to explain why I did changes for renaming.
> If we do not change the PHI arguments for the inner-loop header we can get the
> following IR:
> source outer loop:
>   : outer-loop header
>   # i_45 = PHI <0(4), i_18(9)>
>   # .MEM_17 = PHI <.MEM_4(D)(4), .MEM_44(9)>
>
>   :inner-loop header
>   # .MEM_46 = PHI <.MEM_44(7), .MEM_17(5)>
>
> after duplication we have (without argument correction):
>   :
>   # i_74 = PHI 
>   # .MEM_73 = PHI <.MEM_97(13), .MEM_4(D)(17)>
>
>   :
>   # .MEM_63 = PHI <.MEM_17(12), .MEM_97(16)>
>
> and later we get verifier error:
> test1.c:20:6: error: definition in block 6 does not dominate use in block 10
>  void foo (int n)
>   ^
> for SSA_NAME: .MEM_17 in statement:
> .MEM_63 = PHI <.MEM_17(10), .MEM_97(14)>
>
> and you can see that we need to rename the MEM_17 argument on the
> outgoing edge to MEM_73, since MEM_17 was renamed to MEM_73 in the
> outer-loop header.
>
> This explains my simple fix in rename_variables_in_bb.
> Note also that loop distribution is not performed for outer loops.
>
> I also did a change in slpeel_can_duplicate_loop_p to simplify the check.
>
> Any comments will be appreciated.
>
> Yuri.
>
> 2015-06-17 15:24 GMT+03:00 Richard Biener :
>> On Tue, Jun 16, 2015 at 4:12 PM, Yuri Rumyantsev  wrote:
>>> Thanks a lot Richard for your review.
>>>
>>> I presented an updated patch which is not gated by force_vectorize.
>>> I added a test on the outer loop in vect_enhance_data_refs_alignment
>>> and it returns false for it because we cannot improve DR alignment
>>> through outer-loop peeling in general. So I assume that only
>>> versioning for alignment can be applied for targets that do not
>>> support unaligned memory access.
>>
>> @@ -998,7 +998,12 @@
>>   gimple stmt = DR_STMT (dr);
>>   stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
>>   tree vectype = STMT_VINFO_VECTYPE (stmt_info);
>> +  loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
>>
>> +  /* Peeling of outer loops can't improve alignment.  */
>> +  if (loop_vinfo && LOOP_VINFO_LOOP (loop_vinfo)->inner)
>> +    return false;
>> +
>>
>> but this looks wrong.  It depends on the context (DRs in the outer loop
>> can improve alignment by peeling the outer loop and we can still
>> peel the inner loop for alignment).
>>
>> So IMHO the correct place to amend is vect_enhance_data_refs_alignment
>> (which it seems doesn't consider peeling the inner loop).
>>
>> I'd say you should simply add
>>
>>  || loop->inner)
>>
>> to the
>>
>>   /* Check if we can possibly peel the loop.  */
>>   if (!vect_can_advance_ivs_p (loop_vinfo)
>>       || !slpeel_can_duplicate_loop_p (loop, single_exit (loop)))
>>     do_peeling = false;
>>
>> check.
>>
>>> I did not change the tests for outer loops in slpeel_can_duplicate_loop_p
>>> as you proposed, since it is not called outside the vectorizer.
>>
>> There is still no reason for this complex condition, so please remove it.
>>
>> _Please_ also generate diffs with -p, it is very tedious to see patch hunks
>> without a function name context.
>>
>> You didn't explain why you needed the renaming changes as I don't
>> remember needing it when using the code from loop distribution.
>>
>>> I also noticed one unresolved issue with outer-loop peeling - we don't
>>> consider the remainder for possible vectorization of the inner loop, as
>>> we can see in th

Re: [PATCH] Simple optimization for MASK_STORE.

2015-07-07 Thread Yuri Rumyantsev
Hi Richard,

Did you have a chance to look at this updated patch?

Thanks.
Yuri.

2015-06-18 17:32 GMT+03:00 Yuri Rumyantsev :
> Richard,
>
> Here is an updated patch which does not include your proposal to
> delete the target hook.
> You wrote:
>> I still don't understand why you need the new target hook.  If we have a 
>> masked
>> load/store then the mask is computed by an assignment with a VEC_COND_EXPR
>> (in your example) and thus a test for a zero mask is expressible as just
>>
>  > if (vect__ifc__41.17_167 == { 0, 0, 0, 0... })
>>
>> or am I missing something?
> Such a vector compare produces a vector and does not set up the CC
> flags required for a conditional branch (GIMPLE_COND).
> If we use such a comparison in a GIMPLE_COND we get an error message,
> so I used a target hook which does set up the CC flags (via the ptest
> instruction), and I kept this part of the patch.
> I moved the new routines to the tree-vectorizer.c file and made both
> of them static.
> I rewrote the is_valid_sink function to use the def-use chain, as you
> proposed.
> I also added a new parameter to control this transformation.
> A few redundant tests have also been deleted.
> Any comments will be appreciated.
>
> Thanks.
> Yuri.
>
> 2015-06-18  Yuri Rumyantsev  
>
> * config/i386/i386.c: Include files stringpool.h and tree-ssanames.h.
> (ix86_vectorize_build_zero_vector_test): New function.
> (TARGET_VECTORIZE_BUILD_ZERO_VECTOR_TEST): New target macro.
> * doc/tm.texi.in: Add @hook TARGET_VECTORIZE_BUILD_ZERO_VECTOR_TEST.
> * doc/tm.texi: Updated.
> * params.def (PARAM_ZERO_TEST_FOR_STORE_MASK): New DEFPARAM.
> * params.h (ENABLE_ZERO_TEST_FOR_STORE_MASK): New macro.
> * target.def (build_zero_vector_test): New DEFHOOK.
> * tree-vect-stmts.c (vectorizable_mask_load_store): Initialize
> has_mask_store field of vect_info.
> * tree-vectorizer.c: Include files stringpool.h and tree-ssanames.h.
> (is_valid_sink): New function.
> (optimize_mask_stores): New function.
> (vectorize_loops): Invoke optimize_mask_stores for loops having masked
> stores.
> * tree-vectorizer.h (loop_vec_info): Add new has_mask_store field and
> correspondent macros.
>
> gcc/testsuite/ChangeLog:
> * gcc.target/i386/avx2-vect-mask-store-move1.c: New test.
>
>
>
> 2015-06-09 15:13 GMT+03:00 Richard Biener :
>> On Wed, May 20, 2015 at 4:00 PM, Yuri Rumyantsev  wrote:
>>> Hi All,
>>>
>>> Here is updated patch to optimize mask stores. The main goal of it is
>>> to avoid execution of mask store if its mask is zero vector since
>>> loads that follow it can be blocked.
>>> The following changes were done:
>>>   1.  A test on sink legality was added - it simply prohibits to cross
>>> statements with non-null vdef or vuse.
>>>   2. New phi node is created in join block for moved MASK_STORE statements.
>>>   3. Test was changed to check that 2 MASK_STORE statements are not
>>> moved to the same block.
>>>   4. New field was added to loop_vec_info structure to mark loops
>>> having MASK_STORE's.
>>>
>>> Any comments will be appreciated.
>>> Yuri.
>>
>> I still don't understand why you need the new target hook.  If we have a 
>> masked
>> load/store then the mask is computed by an assignment with a VEC_COND_EXPR
>> (in your example) and thus a test for a zero mask is expressible as just
>>
>>  if (vect__ifc__41.17_167 == { 0, 0, 0, 0... })
>>
>> or am I missing something?  As we don't want this transform on all targets
>> (or always) we can control it by either a --param or a new flag which default
>> targets can adjust.  [Is the hazard still present with Skylake or other
>> AVX512 implementations for example?]
>>
>> +/* Helper for optimize_mask_stores: returns true if STMT sinking to end
>> +   of BB is valid and false otherwise.  */
>> +
>> +static bool
>> +is_valid_sink (gimple stmt)
>> +{
>>
>> so STMT is either a masked store or a masked load?  You shouldn't
>> walk all stmts here but instead rely on the factored use-def def-use
>> chains of virtual operands.
>>
>> +void
>> +optimize_mask_stores (struct loop *loop)
>> +{
>> +  basic_block bb = loop->header;
>> +  gimple_stmt_iterator gsi;
>> +  gimple stmt;
>> +  auto_vec<gimple> worklist;
>> +
>> +  if (loop->dont_vectorize
>> +      || loop->num_nodes > 2)
>> +    return;
>>
>> looks like no longer needed given the place this is called from now
>> (or does it guard against outer loop vectorization as well?)
>> Also put it into tree-vect-loop.c and make it static.
>

Re: [PATCH PR64434]

2015-01-15 Thread Yuri Rumyantsev
Hi All,

I made the change proposed by Richard - unconditionally allocate from the heap.

Bootstrap and regression testing did not show any new failures.

Is it OK for trunk?

ChangeLog

2015-01-15  Yuri Rumyantsev  

PR tree-optimization/64434
* cfgexpand.c (reorder_operands): New function.
(expand_gimple_basic_block): Insert call of reorder_operands if
optimized is true.

gcc/testsuite/ChangeLog
* gcc.dg/torture/pr64434.c: New test.
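
For reference, a standalone C analog of the allocation scheme Jakub
suggested in the quoted exchange below (made-up names; the committed
fix goes further and always allocates from the heap):

  #include <alloca.h>
  #include <stdlib.h>

  void
  process_block (unsigned n)
  {
    unsigned int *lattice;
    int on_heap = n >= 10;

    /* Small arrays stay on the stack; large ones go to the heap so
       a huge basic block cannot overflow the stack.  */
    if (on_heap)
      lattice = (unsigned int *) malloc (n * sizeof *lattice);
    else
      lattice = (unsigned int *) alloca (n * sizeof *lattice);

    /* ... use lattice ... */

    if (on_heap)
      free (lattice);
  }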

2015-01-14 17:07 GMT+03:00 Yuri Rumyantsev :
> Jakub,
>
> I did all changes requested by you.
>
> Here is updated patch.
>
> BTW, I thought that GCC splits basic blocks of huge size.
>
>
> 2015-01-14 16:33 GMT+03:00 Jakub Jelinek :
>> On Wed, Jan 14, 2015 at 04:28:42PM +0300, Yuri Rumyantsev wrote:
>>> Hi All,
>>>
>>> I did all changes proposed by Richard and delete check on def in the
>>> same block as Jakub proposed.
>>> I also moved check  on optimization to call site..
>>>
>>> I also checked that bootstrap and regression testing did not show any
>>> new failures.
>>>
>>> Is it OK for trunk?
>>
>> The  | SSA_OP_VUSE is still in there, the testcase is still executable,
>> still doesn't end with newline, and I really think you should replace
>>   lattice = XALLOCAVEC (unsigned int, n);
>> with something like:
>>   if (n >= 10)
>>     lattice = XNEWVEC (unsigned int, n);
>>   else
>>     lattice = XALLOCAVEC (unsigned int, n);
>> ...
>>   if (n >= 10)
>>     XDELETE (lattice);
>> or similar.
>>
>> Jakub


patch2
Description: Binary data


[PATCH PR64746]

2015-01-23 Thread Yuri Rumyantsev
Hi All,

In the process of testing omp-simd we found out that a couple of
important benchmarks did not show any speed-up with aggressive
if-conversion, since the hottest loops were not vectorized.
Here is a simple patch that resolves a vectorization issue related to
aggressive if-conversion:
(1) Use a single mask for loads/stores of a given size;
(2) Use an iterative algorithm to remove all multiple predicate uses in
the pattern tree - splitting a predicate definition may introduce a new
variable with multiple uses.

A reproducer is included in the patch.
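
To give a feel for point (1), a hypothetical kernel of the shape the
fix targets (illustrative only, not the reproducer from the patch):
both conditional stores have the same size and the same predicate, so
if-conversion should create one mask and reuse it.

  void
  foo (int *a, int *b, int *c, int n)
  {
    for (int i = 0; i < n; i++)
      if (c[i] > 0)
        {
          a[i] = c[i] + 1;  /* same access size, same predicate ...  */
          b[i] = c[i] - 1;  /* ... so both stores share one mask.  */
        }
  }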

Bootstrap and regression testing did not show any new failures.

Is it OK for trunk?

2015-01-23  Yuri Rumyantsev  

PR tree-optimization/64746
* tree-if-conv.c (mask_exists): New function.
(predicate_mem_writes): Save created mask with given size for further
use.
(stmt_is_root_of_bool_pattern): Remove argument VAR and store to it.
(ifcvt_repair_bool_pattern): Collect all statements that are root
of bool pattern and use iterative algorithm to remove multiple uses
of predicates, display number of required iterations.

gcc/testsuite/ChangeLog

* gcc.target/i386/avx2-vect-aggressive-1.c: New test.


64746.patch
Description: Binary data

