Re: [RFC] Induction variable candidates not sufficiently general

2018-07-13 Thread Bin.Cheng
On Fri, Jul 13, 2018 at 6:04 AM, Kelvin Nilsen  wrote:
> A somewhat old "issue report" pointed me to the code generated for a 4-fold 
> manually unrolled version of the following loop:
>
>>   while (++len != len_limit) /* this is loop */
>>   if (pb[len] != cur[len])
>>   break;
>
> As unrolled, the loop appears as:
>
>> while (++len != len_limit) /* this is loop */ {
>>   if (pb[len] != cur[len])
>> break;
>>   if (++len == len_limit)  /* unrolled 2nd iteration */
>> break;
>>   if (pb[len] != cur[len])
>> break;
>>   if (++len == len_limit)  /* unrolled 3rd iteration */
>> break;
>>   if (pb[len] != cur[len])
>> break;
>>   if (++len == len_limit)  /* unrolled 4th iteration */
>> break;
>>   if (pb[len] != cur[len])
>> break;
>> }
>
> In examining the behavior of tree-ssa-loop-ivopts.c, I've discovered the only 
> induction variable candidates that are being considered are all forms of the 
> len variable.  We are not considering any induction variables to represent 
> the address expressions &pb[len] and &cur[len].
>
> I rewrote the source code for this loop to make the addressing expressions 
> more explicit, as in the following:
>
>>   cur++;
>>   while (++pb != last_pb) /* this is loop */ {
>>   if (*pb != *cur)
>> break;
>>   ++cur;
>>   if (++pb == last_pb)  /* unrolled 2nd iteration */
>> break;
>>   if (*pb != *cur)
>> break;
>>   ++cur;
>>   if (++pb == last_pb)  /* unrolled 3rd iteration */
>> break;
>>   if (*pb != *cur)
>> break;
>>   ++cur;
>>   if (++pb == last_pb)  /* unrolled 4th iteration */
>> break;
>>   if (*pb != *cur)
>> break;
>>   ++cur;
>>   }
>
> Now, gcc does a better job of identifying the "address expression induction 
> variables".  This version of the loop runs about 10% faster than the original 
> on my target architecture.
>
> This would seem to be a textbook pattern for the induction variable analysis. 
>  Does anyone have any thoughts on the best way to add these candidates to the 
> set of induction variables that are considered by tree-ssa-loop-ivopts.c?
>
> Thanks in advance for any suggestions.
>
Hi,
Could you please file a bug with your original slow test code
attached?  I tried to construct a meaningful test case from your code
snippet but was not successful.  There is a difference in the generated
assembly, but it's not that fundamental.  So a bug with a preprocessed
test would be highly appreciated.
I think there are two potential issues in the cost computation for such
cases: invariant expressions, and IV uses outside of the loop being
handled as inside uses.

Thanks,
bin


Re: [PATCH] When using -fprofile-generate=/some/path mangle absolute path of file (PR lto/85759).

2018-07-19 Thread Bin.Cheng
On Fri, Jun 29, 2018 at 9:54 PM, Martin Liška  wrote:
> On 06/22/2018 10:35 PM, Jeff Law wrote:
>> On 05/16/2018 05:53 AM, Martin Liška wrote:
>>> On 12/21/2017 10:13 AM, Martin Liška wrote:
 On 12/20/2017 06:45 PM, Jakub Jelinek wrote:
> Another thing is that the "/" in there is wrong, so
>   const char dir_separator_str[] = { DIR_SEPARATOR, '\0' };
>   char *b = concat (profile_data_prefix, dir_separator_str, pwd, NULL);
> needs to be used instead.
 This looks much nicer, I forgot about DIR_SEPARATOR.

> Does profile_data_prefix have any dir separators stripped from the end?
 That's easy to achieve..

> Is pwd guaranteed to be relative in this case?
 .. however this is an absolute path, which would be problematic on a
 DOS-based FS.
 Maybe we should do the same path mangling as we do for purpose of gcov:

 https://github.com/gcc-mirror/gcc/blob/master/gcc/gcov.c#L2424
>>> Hi.
>>>
>>> I decided to implement that. Which means for:
>>>
>>> $ gcc -fprofile-generate=/tmp/myfolder empty.c -O2 && ./a.out
>>>
>>> we get following file:
>>> /tmp/myfolder/#home#marxin#Programming#testcases#tmp#empty.gcda
>>>
>>> That guarantees we have a unique file path. As seen in the PR it
>>> can produce a funny ICE.
>>>
>>> I've been testing the patch.
>>> Ready after it finishes tests?
>>>
>>> Martin
>>>
 What do you think about it?
 Regarding the string manipulation: I'm not an expert, but working with
 strings in C is always a pain for me :)

 Martin

>>>
>>> 0001-When-using-fprofile-generate-some-path-mangle-absolu.patch
>>>
>>>
>>> From 386a4561a4d1501e8959871791289e95f6a89af5 Mon Sep 17 00:00:00 2001
>>> From: marxin 
>>> Date: Wed, 16 Aug 2017 10:22:57 +0200
>>> Subject: [PATCH] When using -fprofile-generate=/some/path mangle absolute 
>>> path
>>>  of file (PR lto/85759).
>>>
>>> gcc/ChangeLog:
>>>
>>> 2018-05-16  Martin Liska  
>>>
>>>  PR lto/85759
>>>  * coverage.c (coverage_init): Mangle full path name.
>>>  * doc/invoke.texi: Document the change.
>>>  * gcov-io.c (mangle_path): New.
>>>  * gcov-io.h (mangle_path): Likewise.
>>>  * gcov.c (mangle_name): Use mangle_path for path mangling.
>> ISTM you can self-approve this now if you want it to move forward :-)
>>
>> jeff
>>
>
> Sure, let me install the patch then.
Hi,
I am a bit confused after the path mangling change.
Now, with the command lines below:
$ ./gcc -O2 -fprofile-use=./ sort.c -o sort.c
or
$ ./gcc -O2 -fprofile-use=./sort.gcda sort.c -o sort.c

the da_file_name and the final name used in gcov_open are:

$ p name
$11 = 0x2e63050
./#home#chengbin.cb#work#gcc-patches#trunk-orig#target.build#bin#sort.gcda
or
p da_file_name
$1 = 0x2e63050 
"sort.gcda/#home#chengbin.cb#work#gcc-patches#trunk-orig#target.build#bin#sort.gcda"


These are not valid paths?  Or how should I modify the command line options?

Thanks,
bin
>
> Martin
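
A minimal sketch of the mangling scheme being discussed (illustrative only,
not the gcov-io.c implementation; DIR_SEPARATOR is the macro mentioned
earlier in the thread):

  /* Turn an absolute path into a single file-name component by replacing
     each directory separator with '#', e.g.
     "/home/user/wd/empty.gcda" -> "#home#user#wd#empty.gcda"; the result
     is then appended to the -fprofile-generate= prefix.  */
  static void
  mangle_abs_path (char *path)
  {
    for (char *p = path; *p; p++)
      if (*p == DIR_SEPARATOR)
        *p = '#';
  }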


Re: [RFC] Induction variable candidates not sufficiently general

2018-07-20 Thread Bin.Cheng
On Tue, Jul 17, 2018 at 2:08 AM, Kelvin Nilsen  wrote:
> Thanks for looking at this for me.  In simplifying the test case for a bug 
> report, I've narrowed the "problem" to integer overflow considerations.  My 
> len variable is declared int, and the target has 64-bit pointers.  I'm 
> gathering that the "manual transformation" I quoted below is not considered 
> "equivalent" to the original source code due to different integer overflow 
> behaviors.  If I redeclare len to be unsigned long long, then I automatically 
> get the optimizations that I was originally expecting.
>
> I suppose this is really NOT a bug?
As your test case demonstrates, it is caused by the wrapping of the unsigned 32-bit type.
>
> Is there a compiler optimization flag that allows the optimizer to ignore 
> array index integer overflow in considering legal optimizations?
I am not aware of one for unsigned integers, and I guess one won't be
introduced in the future either.

Thanks,
bin
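
A minimal sketch of the two variants discussed above (illustrative names,
not the original PR testcase):

  /* 32-bit index: as discussed, the 32-bit index arithmetic may wrap while
     the 64-bit pointer may not, so the address forms &pb[len] and &cur[len]
     are not picked up as induction variable candidates.  */
  int
  match_int (const unsigned char *pb, const unsigned char *cur,
             int len, int len_limit)
  {
    while (++len != len_limit)
      if (pb[len] != cur[len])
        break;
    return len;
  }

  /* Pointer-width index: per the thread, redeclaring len as
     unsigned long long makes the expected address IV candidates appear.  */
  unsigned long long
  match_ull (const unsigned char *pb, const unsigned char *cur,
             unsigned long long len, unsigned long long len_limit)
  {
    while (++len != len_limit)
      if (pb[len] != cur[len])
        break;
    return len;
  }
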
>
>
>
> On 7/13/18 9:14 PM, Bin.Cheng wrote:
>> On Fri, Jul 13, 2018 at 6:04 AM, Kelvin Nilsen  
>> wrote:
>>> A somewhat old "issue report" pointed me to the code generated for a 4-fold 
>>> manually unrolled version of the following loop:
>>>
>>>>   while (++len != len_limit) /* this is loop */
>>>>   if (pb[len] != cur[len])
>>>>   break;
>>>
>>> As unrolled, the loop appears as:
>>>
>>>> while (++len != len_limit) /* this is loop */ {
>>>>   if (pb[len] != cur[len])
>>>> break;
>>>>   if (++len == len_limit)  /* unrolled 2nd iteration */
>>>> break;
>>>>   if (pb[len] != cur[len])
>>>> break;
>>>>   if (++len == len_limit)  /* unrolled 3rd iteration */
>>>> break;
>>>>   if (pb[len] != cur[len])
>>>> break;
>>>>   if (++len == len_limit)  /* unrolled 4th iteration */
>>>> break;
>>>>   if (pb[len] != cur[len])
>>>> break;
>>>> }
>>>
>>> In examining the behavior of tree-ssa-loop-ivopts.c, I've discovered the 
>>> only induction variable candidates that are being considered are all forms 
>>> of the len variable.  We are not considering any induction variables to 
>>> represent the address expressions &pb[len] and &cur[len].
>>>
>>> I rewrote the source code for this loop to make the addressing expressions 
>>> more explicit, as in the following:
>>>
>>>>   cur++;
>>>>   while (++pb != last_pb) /* this is loop */ {
>>>>   if (*pb != *cur)
>>>> break;
>>>>   ++cur;
>>>>   if (++pb == last_pb)  /* unrolled 2nd iteration */
>>>> break;
>>>>   if (*pb != *cur)
>>>> break;
>>>>   ++cur;
>>>>   if (++pb == last_pb)  /* unrolled 3rd iteration */
>>>> break;
>>>>   if (*pb != *cur)
>>>> break;
>>>>   ++cur;
>>>>   if (++pb == last_pb)  /* unrolled 4th iteration */
>>>> break;
>>>>   if (*pb != *cur)
>>>> break;
>>>>   ++cur;
>>>>   }
>>>
>>> Now, gcc does a better job of identifying the "address expression induction 
>>> variables".  This version of the loop runs about 10% faster than the 
>>> original on my target architecture.
>>>
>>> This would seem to be a textbook pattern for the induction variable 
>>> analysis.  Does anyone have any thoughts on the best way to add these 
>>> candidates to the set of induction variables that are considered by 
>>> tree-ssa-loop-ivopts.c?
>>>
>>> Thanks in advance for any suggestions.
>>>
>> Hi,
>> Could you please file a bug with your original slow test code
>> attached?  I tried to construct a meaningful test case from your code
>> snippet but was not successful.  There is a difference in the generated
>> assembly, but it's not that fundamental.  So a bug with a preprocessed
>> test would be highly appreciated.
>> I think there are two potential issues in the cost computation for such
>> cases: invariant expressions, and IV uses outside of the loop being
>> handled as inside uses.
>>
>> Thanks,
>> bin
>>
>>


Re: [PATCH] Improve debug info in ivopts optimized loops (PR debug/90231)

2019-10-23 Thread Bin.Cheng
On Tue, Oct 22, 2019 at 3:32 PM Jakub Jelinek  wrote:
>
> On Mon, Oct 21, 2019 at 01:24:30PM +0200, Jakub Jelinek wrote:
> > So I wonder if for correctness I don't need to add:
> >
> >   if (!use->iv->no_overflow
> >   && !cand->iv->no_overflow
> >   && !integer_pow2p (cstep))
> > return NULL_TREE;
> >
> > with some of the above as comment explaining why.
> >
> > On the other side, if cand->iv->no_overflow, couldn't we bypass the extra
> > precision test?
>
> Here are these two in patch form.
>
> 2019-10-22  Jakub Jelinek  
>
> PR debug/90231
> * tree-ssa-loop-ivopts.c (get_debug_computation_at): New function.
> (remove_unused_ivs): Use it instead of get_computation_at.  When
> choosing best candidate, only consider candidates where
> get_debug_computation_at actually returns non-NULL.
>
> --- gcc/tree-ssa-loop-ivopts.c.jj   2019-10-21 14:17:57.598198162 +0200
> +++ gcc/tree-ssa-loop-ivopts.c  2019-10-22 09:30:09.782238157 +0200
> @@ -4089,6 +4089,94 @@ get_computation_at (class loop *loop, gi
>return fold_convert (type, aff_combination_to_tree (&aff));
>  }
>
> +/* Like get_computation_at, but try harder, even if the computation
> +   is more expensive.  Intended for debug stmts.  */
> +
> +static tree
> +get_debug_computation_at (class loop *loop, gimple *at,
> + struct iv_use *use, struct iv_cand *cand)
> +{
> +  if (tree ret = get_computation_at (loop, at, use, cand))
> +return ret;
> +
> +  tree ubase = use->iv->base, ustep = use->iv->step;
> +  tree cbase = cand->iv->base, cstep = cand->iv->step;
> +  tree var;
> +  tree utype = TREE_TYPE (ubase), ctype = TREE_TYPE (cbase);
> +  widest_int rat;
> +
> +  /* We must have a precision to express the values of use.  */
> +  if (TYPE_PRECISION (utype) >= TYPE_PRECISION (ctype))
> +return NULL_TREE;
> +
> +  /* Try to handle the case that get_computation_at doesn't,
> + try to express
> + use = ubase + (var - cbase) / ratio.  */
> +  if (!constant_multiple_of (cstep, fold_convert (TREE_TYPE (cstep), ustep),
> +&rat))
> +return NULL_TREE;
> +
> +  bool neg_p = false;
> +  if (wi::neg_p (rat))
> +{
> +  if (TYPE_UNSIGNED (ctype))
> +   return NULL_TREE;
> +  neg_p = true;
> +  rat = wi::neg (rat);
> +}
> +
> +  /* If both IVs can wrap around and CAND doesn't have a power of two step,
> + it is unsafe.  Consider uint16_t CAND with step 9, when wrapping around,
> + the values will be ... 0xfff0, 0xfff9, 2, 11 ... and when use is say
> + uint8_t with step 3, those values divided by 3 cast to uint8_t will be
> + ... 0x50, 0x53, 0, 3 ... rather than expected 0x50, 0x53, 0x56, 0x59.  */
Interesting, so we can still get correct debug info for iter in such
special cases.

> +  if (!use->iv->no_overflow
> +  && !cand->iv->no_overflow
> +  && !integer_pow2p (cstep))
> +return NULL_TREE;
> +
> +  int bits = wi::exact_log2 (rat);
> +  if (bits == -1)
> +bits = wi::floor_log2 (rat) + 1;
> +  if (!cand->iv->no_overflow
> +  && TYPE_PRECISION (utype) + bits > TYPE_PRECISION (ctype))
> +return NULL_TREE;
The patch is fine with me.

Just for the record, I guess we may try to find (by recording
information in an early phase) the correct/bijective candidate for
computing the IV in debug info in the future; then those checks would
be unnecessary.

Thanks,
bin
> +
> +  var = var_at_stmt (loop, cand, at);
> +
> +  if (POINTER_TYPE_P (ctype))
> +{
> +  ctype = unsigned_type_for (ctype);
> +  cbase = fold_convert (ctype, cbase);
> +  cstep = fold_convert (ctype, cstep);
> +  var = fold_convert (ctype, var);
> +}
> +
> +  ubase = unshare_expr (ubase);
> +  cbase = unshare_expr (cbase);
> +  if (stmt_after_increment (loop, cand, at))
> +var = fold_build2 (MINUS_EXPR, TREE_TYPE (var), var,
> +  unshare_expr (cstep));
> +
> +  var = fold_build2 (MINUS_EXPR, TREE_TYPE (var), var, cbase);
> +  var = fold_build2 (EXACT_DIV_EXPR, TREE_TYPE (var), var,
> +wide_int_to_tree (TREE_TYPE (var), rat));
> +  if (POINTER_TYPE_P (utype))
> +{
> +  var = fold_convert (sizetype, var);
> +  if (neg_p)
> +   var = fold_build1 (NEGATE_EXPR, sizetype, var);
> +  var = fold_build2 (POINTER_PLUS_EXPR, utype, ubase, var);
> +}
> +  else
> +{
> +  var = fold_convert (utype, var);
> +  var = fold_build2 (neg_p ? MINUS_EXPR : PLUS_EXPR, utype,
> +ubase, var);
> +}
> +  return var;
> +}
> +
>  /* Adjust the cost COST for being in loop setup rather than loop body.
> If we're optimizing for space, the loop setup overhead is constant;
> if we're optimizing for speed, amortize it over the per-iteration cost.
> @@ -7523,6 +7611,7 @@ remove_unused_ivs (struct ivopts_data *d
>   struct iv_use dummy_use;
>   struct iv_cand *best_cand = NULL, *cand;
>   unsigned

Re: [PATCH GCC11]Improve uninitialized warning with value range info

2020-01-12 Thread bin.cheng
--
Sender:Richard Biener 
Sent At:2020 Jan. 9 (Thu.) 20:01
Recipient:Bin.Cheng 
Cc:bin.cheng ; GCC Patches 

Subject:Re: [PATCH GCC11]Improve uninitialized warning with value range info


On Thu, Jan 9, 2020 at 11:17 AM Bin.Cheng  wrote:
> > I do not quite follow here. Do you mean we collect three cases "i <
> > j", "i < min(j)", "max(i) < j" and then
> > call prune_uninit_phi_opnds for all three conditions?
> 
> No, I've meant to somehow arrange that the 'preds' passed to
> use_pred_not_overlap_with_undef_path_pred contain all three predicates
> rather than just i < j, thus "expand" fully symbolic predicates.

Seems this would require non-trivial refactoring of the original code.

> > This is another question, because now we simply break out of the for
> > loop after finding such a condition:
> > 
> > -  if ((gimple_code (flag_def) == GIMPLE_PHI)
> > - && (gimple_bb (flag_def) == gimple_bb (phi))
> > - && find_matching_predicate_in_rest_chains (the_pred, preds,
> > -num_preds))
> > -   break;
> > 
> > It's always possible that this flag_def can't prune use predicates
> > against undefined path predicates, while a later one can prune but is
> > skipped?
> 
> I don't follow but I also don't want to understand the code too much ;)
> 
> I'm fine with the idea and if the patch cannot introduce extra bogus warnings
> let's go with it.  Can you amend the comment before the two find_var_cmp_const
> calls?  I wonder whether eliding the second sweep when the first didn't find
> any predicate it skipped is worth the trouble.

Thanks for the comments, I updated the patch as attached.

Thanks,
bin

2020-01-08  Bin Cheng  

* tree-ssa-uninit.c (find_var_cmp_const): New function.
(use_pred_not_overlap_with_undef_path_pred): Call above.
(find_matching_predicate_in_rest_chains): Remove param.

0001-Fix-false-uninitialized-warning-message.patch
Description: Binary data


Re: [PATCH PR92926]Fix wrong code caused by ctor node translation unit wide sharing

2020-01-14 Thread Bin.Cheng
On Wed, Jan 15, 2020 at 5:04 AM Jeff Law  wrote:
>
> On Thu, 2020-01-09 at 14:20 +0800, Bin.Cheng wrote:
> > On Fri, Dec 20, 2019 at 3:10 PM Richard Biener
> >  wrote:
> > > On December 20, 2019 2:13:47 AM GMT+01:00, "Bin.Cheng" 
> > >  wrote:
> > > > On Fri, Dec 13, 2019 at 11:26 AM bin.cheng
> > > >  wrote:
> > > > > Hi,
> > > > >
> > > > > As reported in PR92926, constant ctor is shared translation unit wide
> > > > because of constexpr_call_table,
> > > > > however, during gimplify, the shared ctor could be modified.  This
> > > > patch fixes the issue by unsharing
> > > > > it before modification in gimplify.  A test is reduced from cppcoro
> > > > library and added.
> > > > > Bootstrap and test ongoing.  Not sure if this is the correct fix
> > > > though, any comments?
> > > > Ping.  Any comment?
> > >
> > > Looks reasonable to me.
> > Given PR92926 is marked as duplicate of PR93143, I updated test case
> > of the patch.
> >
> > Thanks,
> > bin
> >
> > 2019-12-13  Bin Cheng  
> >
> > PR tree-optimization/93143
> > * gimplify.c (gimplify_init_constructor): Unshare ctor node before
> > clearing.
> >
> > gcc/testsuite
> > 2019-12-13  Bin Cheng  
> >
> > PR tree-optimization/93143
> > * g++.dg/pr93143.C: New test.
> Is this still needed?  I think Jason fixed the outstanding sharing
> problems a couple days ago.
Yes, the issue is fixed and this can be discarded.

Thanks,
bin
>
> jeff
>


Re: [C++ coroutines 4/6] Middle end expanders and transforms.

2020-01-15 Thread Bin.Cheng
> > +   gassign *get_res = gimple_build_assign (lhs, done);
> > +   gsi_replace (gsi, get_res, true);
> > +   *handled_ops_p = true;
> > + }
> > + break;
> > +   }
> > +}
> > +  return NULL_TREE;
> > +}
> > +
> > +/* Main entry point for lowering coroutine FE builtins.  */
> > +
> > +static unsigned int
> > +execute_lower_coro_builtins (void)
> > +{
> > +  struct walk_stmt_info wi;
> > +  gimple_seq body;
> > +
> > +  body = gimple_body (current_function_decl);
> > +  memset (&wi, 0, sizeof (wi));
> > +  walk_gimple_seq_mod (&body, lower_coro_builtin, NULL, &wi);
> > +  gimple_set_body (current_function_decl, body);
>
> it would be nice to elide the function walk for functions not
> containing any CO* stuff (you noted that below for other parts).
> We can spend a bit in struct function noting functions with
> coroutine code inside and set the bit from frontends or from
> the gimplifier for example.  Since it's behind the flag_coroutines
> paywall this can be addressed as followup.

Yes, this bit flag is necessary for subsequent optimization passes; I
wonder which bit you would suggest?

Thanks,
bin


[PATCH Coroutines]Fix false warning message about missing return

2020-01-19 Thread bin.cheng
Hi,
The patch sets the current_function_returns_value flag in templates for all
co_return/co_yield/co_await cases, as well as for the ramp function.  This
fixes the false warning message for cases like the one added, and for
various cases in cppcoro.

Bootstrap and test on X86_64, is it OK?

Thanks,
bin

gcc/cp
2020-01-20  Bin Cheng  

* coroutines.cc (finish_co_await_expr): Set return value flag.
(finish_co_yield_expr, morph_fn_to_coro): Ditto.

gcc/testsuite
2020-01-20  Bin Cheng  

* g++.dg/coroutines/co-return-warning-1.C: New test.

0001-Fix-false-warning-messages-about-missing-return-in-c.patch
Description: Binary data


[PATCH Coroutines]Fix ICE when co_awaiting on void type

2020-01-19 Thread bin.cheng
Hi,

This simple patch skips calling complete_type_or_else for void type, which 
fixes the
corresponding ICE.

Thanks,
bin

gcc/cp
2020-01-20  Bin Cheng  

* coroutines.cc (build_co_await): Skip getting complete type for void.

gcc/testsuite
2020-01-20  Bin Cheng  

* g++.dg/coroutines/co-await-void_type.C: New test.

0001-Fix-ICE-when-co_awaiting-on-void-type.patch
Description: Binary data


Re: [PATCH Coroutines]Fix ICE when co_awaiting on void type

2020-01-20 Thread Bin.Cheng
On Mon, Jan 20, 2020 at 5:28 PM Jakub Jelinek  wrote:
>
> On Mon, Jan 20, 2020 at 08:59:20AM +, Iain Sandoe wrote:
> > Hi Bin,
> >
> > bin.cheng  wrote:
> >
> > > gcc/cp
> > > 2020-01-20  Bin Cheng  
> > >
> > >* coroutines.cc (build_co_await): Skip getting complete type for 
> > > void.
> > >
> > > gcc/testsuite
> > > 2020-01-20  Bin Cheng  
> > >
> > >* g++.dg/coroutines/co-await-void_type.C: New 
> > > test.<0001-Fix-ICE-when-co_awaiting-on-void-type.patch>
> >
> > This LGTM, (borderline obvious, in fact) but you will need to wait for a C++
> > maintainer to approve,
>
> The patch actually looks wrong to me, there is nothing magic about void
> type here, it will ICE on any other incomplete type, because
> complete_type_or_else returns NULL if the type is incomplete.
>
> So, I think you want instead:
>tree o_type = complete_type_or_else (TREE_TYPE (o), o);
> +  if (o_type == NULL_TREE)
> +return error_mark_node;
>if (TREE_CODE (o_type) != RECORD_TYPE)
>
> or similar (the diagnostics should be emitted already, so I don't see
> further point to emit it once again).

Thanks very much for pointing that out.  I was trying to bypass the generic
void error message to make it more related to coroutines, like:
e1.cc: In function ‘resumable foo(int)’:
e1.cc:45:3: error: awaitable type ‘void’ is not a structure
   45 |   co_yield n + x + y;
  |   ^~~~

vs.

e1.cc: In function ‘resumable foo(int)’:
e1.cc:45:20: error: invalid use of ‘void’
   45 |   co_yield n + x + y;
  |^

Anyway I will update the patch.

Thanks,
bin
>
> But it also means that other uses of complete_type_or_else look broken:
>
>   /* Complete this, we're going to use it.  */
>   coro_info->handle_type = complete_type_or_else (handle_type, fndecl);
>   /* Diagnostic would be emitted by complete_type_or_else.  */
>   if (coro_info->handle_type == error_mark_node)
> return false;
>
> or
>
>   if (!COMPLETE_TYPE_P (actual_type))
> actual_type = complete_type_or_else (actual_type, *stmt);
>
>   if (TREE_CODE (actual_type) == REFERENCE_TYPE)
> actual_type = build_pointer_type (TREE_TYPE (actual_type));
>
> where I think the first one should check for handle_type being NULL_TREE
> rather than error_mark_node, and the latter should handle the case of
> it returning NULL_TREE.
>
> Right below the last hunk, I've noticed:
>
>   size_t namsize = sizeof ("__parm.") + IDENTIFIER_LENGTH (pname) + 1;
>   char *buf = (char *) alloca (namsize);
>   snprintf (buf, namsize, "__parm.%s", IDENTIFIER_POINTER (pname));
>
> Do you really need one byte extra?  I mean, sizeof ("__parm.") already
> includes +1 for the terminating '\0', so unless you append something to
> the buffer later, I don't see the point for the + 1.
> And the middle line could be written as
>   char *buf = XALLOCAVEC (char, namsize);
>
> Jakub
>


Re: [PATCH Coroutines]Fix ICE when co_awaiting on void type

2020-01-20 Thread Bin.Cheng
On Mon, Jan 20, 2020 at 6:31 PM Iain Sandoe  wrote:
>
> Hi Bin,
>
> Bin.Cheng  wrote:
>
> > On Mon, Jan 20, 2020 at 5:28 PM Jakub Jelinek  wrote:
> >> On Mon, Jan 20, 2020 at 08:59:20AM +, Iain Sandoe wrote:
> >>> Hi Bin,
> >>>
> >>> bin.cheng  wrote:
> >>>
> >>>> gcc/cp
> >>>> 2020-01-20  Bin Cheng  
> >>>>
> >>>>   * coroutines.cc (build_co_await): Skip getting complete type for 
> >>>> void.
> >>>>
> >>>> gcc/testsuite
> >>>> 2020-01-20  Bin Cheng  
> >>>>
> >>>>   * g++.dg/coroutines/co-await-void_type.C: New 
> >>>> test.<0001-Fix-ICE-when-co_awaiting-on-void-type.patch>
> >>>
> >>> This LGTM, (borderline obvious, in fact) but you will need to wait for
> >>> a C++
> >>> maintainer to approve,
> >>
> >> The patch actually looks wrong to me, there is nothing magic about void
> >> type here, it will ICE on any other incomplete type, because
> >> complete_type_or_else returns NULL if the type is incomplete.
> >>
> >> So, I think you want instead:
> >>   tree o_type = complete_type_or_else (TREE_TYPE (o), o);
> >> +  if (o_type == NULL_TREE)
> >> +return error_mark_node;
> >>   if (TREE_CODE (o_type) != RECORD_TYPE)
> >>
> >> or similar (the diagnostics should be emitted already, so I don't see
> >> further point to emit it once again).
> >
> > Thanks very much for pointing that out.  I was trying to bypass the generic
> > void error message to make it more related to coroutines, like:
> > e1.cc: In function ‘resumable foo(int)’:
> > e1.cc:45:3: error: awaitable type ‘void’ is not a structure
> >   45 |   co_yield n + x + y;
> >  |   ^~~~
> >
> > vs.
> >
> > e1.cc: In function ‘resumable foo(int)’:
> > e1.cc:45:20: error: invalid use of ‘void’
> >   45 |   co_yield n + x + y;
> >  |^
> >
> > Anyway I will update the patch.
>
> ...I had already made a start...
Thanks very much!

>
> The patch below gives the improved diagnostic while checking for NULL
> returns fom complete_type_or_else.
>
> OK?
> Iain
>
>
> gcc/cp
>
> 2020-01-20  Bin Cheng  
You can remove me from the ChangeLog or keep it as "bin.cheng"  :-)
>   Iain Sandoe  
>
> * coroutines.cc (coro_promise_type_found_p): Check for NULL return 
> from
> complete_type_or_else.
> (register_param_uses): Likewise.
> (build_co_await): Do not try to use complete_type_or_else for void 
> types,
> otherwise for incomplete types, check for NULL return from 
> complete_type_or_else.
>
> gcc/testsuite
> 2020-01-20  Bin Cheng  
>
>* g++.dg/coroutines/co-await-void_type.C: New 
> test.<0001-Fix-ICE-when-co_awaiting-on-void-type.patch>
>
>
> diff --git a/gcc/cp/coroutines.cc b/gcc/cp/coroutines.cc
> index f0febfe0c8a..4e1ba81fb46 100644
> --- a/gcc/cp/coroutines.cc
> +++ b/gcc/cp/coroutines.cc
> @@ -428,8 +428,9 @@ coro_promise_type_found_p (tree fndecl, location_t loc)
>
> /* Complete this, we're going to use it.  */
> coro_info->handle_type = complete_type_or_else (handle_type, fndecl);
> +
> /* Diagnostic would be emitted by complete_type_or_else.  */
> -  if (coro_info->handle_type == error_mark_node)
> +  if (!coro_info->handle_type)
> return false;
>
> /* Build a proxy for a handle to "self" as the param to
> @@ -651,8 +652,11 @@ build_co_await (location_t loc, tree a,
> suspend_point_kind suspend_kind)
>   o = a; /* This is most likely about to fail anyway.  */
>
> tree o_type = TREE_TYPE (o);
> -  if (!VOID_TYPE_P (o_type))
> -o_type = complete_type_or_else (TREE_TYPE (o), o);
> +  if (o_type && !VOID_TYPE_P (o_type) && !COMPLETE_TYPE_P (o_type))
IIUC, Jakub doesn't like checking void type specially here?

> +o_type = complete_type_or_else (o_type, o);
> +
> +  if (!o_type)
> +return error_mark_node;
>
> if (TREE_CODE (o_type) != RECORD_TYPE)
>   {
> @@ -2746,6 +2750,10 @@ register_param_uses (tree *stmt, int *do_subtree
> ATTRIBUTE_UNUSED, void *d)
> if (!COMPLETE_TYPE_P (actual_type))
> actual_type = complete_type_or_else (actual_type, *stmt);
>
> +  if (actual_type == NULL_TREE)
> +   /* Diagnostic emitted by complete_type_or_else.  */
> +   actual_type = error_mark_node;
> +
> if (TREE_CODE (actual_type) == REFERENCE_TYPE)
> actual_type = build_pointer_type (TREE_TYPE (actual_type));
>
>

Thanks,
bin


[PATCH Coroutines]Access promise via actor function's frame pointer argument

2020-01-20 Thread bin.cheng
Hi,

By the standard, the coroutine body should be encapsulated in a try-catch
block as follows:
  try {
// coroutine body
  } catch(...) {
promise.unhandled_exception();
  }
Given that the above try-catch block is implemented in the coroutine actor
function, which is called by the coroutine ramp function, the promise should
be accessed via the actor function's coroutine frame pointer argument,
rather than the ramp function's coroutine frame variable.
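
A rough sketch of the intended lowering (type and member names are
illustrative only; the real frame layout is generated by the compiler):

  struct promise_t { void unhandled_exception () {} };
  struct coro_frame { promise_t promise; /* suspend state, locals, ...  */ };

  static void
  actor (coro_frame *frame_ptr)  /* called from the ramp function */
  {
    try
      {
        /* ... lowered coroutine body ...  */
      }
    catch (...)
      {
        /* Reach the promise through the actor's own frame pointer
           argument, not through the ramp function's local frame.  */
        frame_ptr->promise.unhandled_exception ();
      }
  }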

This patch also refactors code to make the fix easier; unfortunately,
I failed to reduce a test case from cppcoro.  Bootstrap and test ongoing.  Is
it OK?

Thanks,
bin

gcc/cp
2020-01-20  Bin Cheng  

* coroutines.cc (act_des_fn, wrap_coroutine_body): New.
(morph_fn_to_coro): Call act_des_fn to build actor/destroy decls, as
well as access promise via actor function's frame pointer argument.
Refactor code into above functions.
(build_actor_fn, build_destroy_fn): Use frame pointer argument.

0001-Use-promise-in-coroutine-frame-in-actor-function.patch
Description: Binary data


Re: [PATCH Coroutines]Access promise via actor function's frame pointer argument

2020-01-20 Thread bin.cheng
On Mon, Jan 20, 2020 at 10:59 PM Iain Sandoe  wrote:
>
> Hi Bin,
>
> bin.cheng  wrote:
>
> > By the standard, the coroutine body should be encapsulated in a try-catch
> > block as follows:
> >  try {
> >    // coroutine body
> >  } catch(...) {
> >    promise.unhandled_exception();
> >  }
> > Given that the above try-catch block is implemented in the coroutine actor
> > function, which is called by the coroutine ramp function, the promise should
> > be accessed via the actor function's coroutine frame pointer argument,
> > rather than the ramp function's coroutine frame variable.
>
> thanks, good catch!
> it has not triggered for me (including on some more complex test-cases that 
> are not useable
> in the testsuite).
>
> > This patch also refactors code to make the fix easier; unfortunately,
>
> see below,
>
> > I failed to reduce a test case from cppcoro.
>
> it would be good if we could try to find a reproducer.  I’m happy to try and 
> throw
> creduce at a preprocessed file, if you have one.
>
> > gcc/cp
> > 2020-01-20  Bin Cheng  
> >
> >        * coroutines.cc (act_des_fn, wrap_coroutine_body): New.
> >        (morph_fn_to_coro): Call act_des_fn to build actor/destroy decls, as
> >        well as access promise via actor function's frame pointer argument.
> >        Refactor code into above functions.
> >        (build_actor_fn, build_destroy_fn): Use frame pointer argument
> +      /* We still try to look for the promise method and warn if it's not
> +        present.  */
> +      tree ueh_meth
> +       = lookup_promise_method (orig, coro_unhandled_exception_identifier,
> +                                loc, /*musthave=*/ false);
> +      if (!ueh_meth || ueh_meth == error_mark_node)
> +       warning_at (loc, 0, "no member named %qE in %qT",
> +                   coro_unhandled_exception_identifier,
> +                   get_coroutine_promise_type (orig));
> +    }
> +  /* Else we don't check and don't care if the method is missing.  */
> +
> +  return fnbody;
> +}
>
> IMO this ^^^ obfuscates the fix, and I don’t think it should be done at the 
> same time.
Sure, given we are at this stage, I split the patch and leave the refactoring
for the future.  IMHO, the function is too long, so self-contained operations
are worth outlining into a function even if it's called only once.

Patch updated as attached.

Thanks,
bin

gcc/cp
2020-01-20  Bin Cheng  
        * coroutines.cc (act_des_fn): New.
        (morph_fn_to_coro): Call act_des_fn to build actor/destroy decls.
        Access promise via actor function's frame pointer argument.
        (build_actor_fn, build_destroy_fn): Use frame pointer argument.

0001-Use-promise-in-coroutine-frame-in-actor-function.patch
Description: Binary data


[PATCH Coroutines]Fix an ICE case in expanding co_await expression

2020-01-22 Thread bin.cheng
Hi,

Though function co_await_expander may need to be further revised, this simple
patch fixes an ICE case in co_await_expander.

Handle CO_AWAIT_EXPR in conversion in co_await_expander.

Function co_await_expander expands CO_AWAIT_EXPR and inserts the expanded
code before the result of co_await is used; however, it doesn't cover the
type conversion case, which leads to a gimplify ICE.  This patch fixes it.
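
A minimal illustration of the pattern involved (a sketch against the
standard <coroutine> header, not the committed co-await-syntax-09-convert.C
test): the value produced by co_await is immediately subject to an implicit
conversion.

  #include <coroutine>

  struct task {
    struct promise_type {
      task get_return_object () { return {}; }
      std::suspend_never initial_suspend () noexcept { return {}; }
      std::suspend_never final_suspend () noexcept { return {}; }
      void return_void () {}
      void unhandled_exception () {}
    };
  };

  struct int_awaiter {
    bool await_ready () const noexcept { return true; }
    void await_suspend (std::coroutine_handle<>) const noexcept {}
    int await_resume () const noexcept { return 42; }
  };

  task f ()
  {
    /* The int produced by co_await is converted to long before use; the
       expansion has to be inserted ahead of that conversion.  */
    long v = co_await int_awaiter{};
    (void) v;
    co_return;
  }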

Bootstrap and test on x86_64.  Is it OK?

Thanks,
bin

gcc/cp
2020-01-22  Bin Cheng  

* coroutines.cc (co_await_expander): Handle type conversion case.

gcc/testsuite
2020-01-22  Bin Cheng  

* g++.dg/coroutines/co-await-syntax-09-convert.C: New test.

0001-Handle-CO_AWAIT_EXPR-in-conversion-in-co_await_expan.patch
Description: Binary data


[PATCH PR93334][GCC11]Refine data dependence of two refs storing the same constant with the same bytes

2020-01-29 Thread bin.cheng
Hi,

As discussed in the PR, this simple patch refines the data dependence of two
write references storing the same constant with the same bytes.  It simply
detects the case, with some restrictions, and treats it as no dependence.
For now the added interface in tree-data-ref.c is only used in loop
distribution; it might be generalized in the future.

Bootstrap and test on x86_64.  Any comments?
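
A hypothetical illustration of the kind of case this targets (not the
testcase from the PR):

  /* Both statements store the same all-zero constant, so the overlap
     between a[i + 1] in one iteration and a[i] in the next need not be
     treated as a dependence; loop distribution can then split the two
     statements into separate memset-like loops.  */
  void
  f (char *a, int n)
  {
    for (int i = 0; i < n; i++)
      {
        a[i] = 0;
        a[i + 1] = 0;
      }
  }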

Thanks,
bin

2020-01-29  Bin Cheng  

* tree-data-ref.c (refine_affine_dependence): New function.
* tree-data-ref.h (refine_affine_dependence): New declaration.
* tree.h (const_with_all_bytes_same): External declaration.
* tree.c (const_with_all_bytes_same): Moved from...
* tree-loop-distribution.c (const_with_all_bytes_same): ...here.  Call
refine_affine_dependence

gcc/testsuite
2020-01-29  Bin Cheng  

* gcc/testsuite/gcc.dg/tree-ssa/pr93334.c: New test.

0001-pr93334.patch
Description: Binary data


Re: [PATCH, ivopts] Fix fast-math-pr55281.c ICE

2020-01-30 Thread Bin.Cheng
On Thu, Jan 30, 2020 at 8:53 PM Andrew Stubbs  wrote:
>
> On 29/01/2020 08:24, Richard Biener wrote:
> > On Tue, Jan 28, 2020 at 5:53 PM Andrew Stubbs  wrote:
> >>
> >> This patch fixes an ICE compiling fast-math-pr55281.c for amdgcn.
> >>
> >> The problem is that an "iv" is created in which both base and step are
> >> pointer types,
> >
> > How did you get a POINTER_TYPE step?  That's where the issue lies
> > I think.
>
> It can come from "find_inv_vars_cb":
>
>set_iv (idata, op, op, build_int_cst (TREE_TYPE (op), 0), true);

This is recording an invariant with a zero step.  It seems we are using the
wrong type when building the zero step.  How about detecting that op has a
pointer type and using an integer type here?
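
A sketch of that suggestion (untested; sizetype is just one plausible choice
for the integer type):

  /* In find_inv_vars_cb: build the zero step in an integer type when OP is
     a pointer, so base and step do not both end up with POINTER_TYPE.  */
  tree step_type
    = POINTER_TYPE_P (TREE_TYPE (op)) ? sizetype : TREE_TYPE (op);
  set_iv (idata, op, op, build_int_cst (step_type, 0), true);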

Thanks,
bin
>
> whenever "op" has a pointer type.
>
> Similarly for "get_iv":
>
>set_iv (data, var, var, build_int_cst (type, 0), true);
>
> whenever "var" has a pointer type.
>
> In this particular case, I traced the origin back to the second one, I
> think (but it's somewhat hard to unpick).
>
> I could change one or both of those, but I don't understand enough about
> the consequences of that to be sure it's the right thing to do. I can
> confirm that converting the step at this point does appear to have the
> desired effect in this instance.
>
> At least at the point of writing it to gimple I can determine what is
> definitely malformed.
>
> Andrew


[PATCH Coroutines]Insert the default return_void call at correct position

2020-02-02 Thread bin.cheng
Hi,

Exceptions in coroutines are not correctly handled because the default
return_void call is currently inserted before the final suspend point,
rather than at the end of the original coroutine body.  This patch
fixes the issue by generating the following code:
  co_await promise.initial_suspend();
  try {
// The original coroutine body

promise.return_void(); // The default return_void call.
  } catch (...) {
promise.unhandled_exception();
  }
  final_suspend:
  // ...

Bootstrap and test on x86_64.  Is it OK?

Thanks,
bin

gcc/cp
2020-02-03  Bin Cheng  

* coroutines.cc (build_actor_fn): Factor out code inserting the
default return_void call to...
(morph_fn_to_coro): ...here, also hoist local var declarations.

gcc/testsuite
2020-02-03  Bin Cheng  

* g++.dg/coroutines/torture/co-ret-15-default-return_void.C: New.

0001-Insert-default-return_void-at-the-end-of-coroutine-b.patch
Description: Binary data


[PATCH Coroutines]Pickup more CO_AWAIT_EXPR expanding cases

2020-02-04 Thread bin.cheng
Hi,
This simple patch is actually a missing part of the previous one at
https://gcc.gnu.org/ml/gcc-patches/2020-01/msg01451.html
It picks up more CO_AWAIT_EXPR expanding cases.  A test is also added.

Bootstrap and test on x86_64. Is it OK?

Thanks,
bin

gcc/cp
2020-02-05  Bin Cheng  

* coroutines.cc (co_await_expander): Handle more CO_AWAIT_EXPR
expanding cases.

gcc/testsuite
2020-02-05  Bin Cheng  

* g++.dg/coroutines/co-await-syntax-10.C: New test.

0001-Pickup-more-CO_AWAIT_EXPR-expanding-cases.patch
Description: Binary data


[PATCH Coroutines][OBVIOUS]Fix typo of temporary variable name

2020-02-05 Thread bin.cheng
Hi,
This is an obvious patch fixing a typo in maybe_promote_captured_temps;
previously, the same vnum (== 0) was used for all captured temps...

Will commit later.

Thanks,
bin

gcc/cp
2020-02-05  Bin Cheng  

* coroutines.cc (maybe_promote_captured_temps): Increase the index
number for temporary variables' name.

obvious-vnum.patch
Description: Binary data


Re: [PATCH coroutines] Change lowering behavior of alias variable from copy to substitute

2020-02-06 Thread Bin.Cheng
On Thu, Feb 6, 2020 at 5:12 PM Iain Sandoe  wrote:
>
> Hi JunMa,
>
> JunMa  wrote:
>
> > 在 2020/2/4 下午8:17, JunMa 写道:
> >> Hi
> >> When testing coroutines with lambda function, I find we copy each captured
> >> variable to frame. However, according to gimplify pass, for each
> >> declaration
> >> that is an alias for another expression(DECL_VALUE_EXPR), we can
> >> substitute them directly.
> >>
> >> Since lambda captured variables is one of this kind. It is better to
> >> replace them
> >> rather than copy them, This can reduce frame size (all of the captured
> >> variables
> >> are field of closure class) and avoid extra copy behavior as well.
> >>
> >> This patch remove all of the code related to copy captured variable.
> >> Instead, we first rewrite DECL_VALUE_EXPR with frame field, then we check
> >> every variable whether it has DECL_VALUE_EXPR, and then substitute it,
> >> this
> >> patch does not create frame field for such variables.
> >>
> >> Bootstrap and test on X86_64, is it OK?
>
> > minor update: only handle var_decl when iterate BIND_EXPR_VARS
> > in register_local_var_uses.
>
> Do you have any other local patches applied along with this?
>
> Testing locally (on Darwin), I see regressions with optimisation  O2/O3/Os
> e.g:
>
> class-05-lambda-capture-copy-local.C   -O2  (internal compiler error)
> class-06-lambda-capture-ref.C   -O2  (internal compiler error)
> lambda-05-capture-copy-local.C   -O2  (internal compiler error)
> lambda-06-multi-capture.C   -O2  (internal compiler error)
> lambda-07-multi-capture.C   -O2  (internal compiler error)
> lambda-08-co-ret-parm-ref.C   -O3 -g  (internal compiler error)
>
> I have applied this to master, and on top of the patches posted by you and
> Bin, but the results are the same.
Hi Iain,

Thanks for helping.
Yes, there will be another patch fixing the O2/O3 issues.  Will send
it out for review soon.

Thanks,
bin
>
> thanks
> Iain
>
> >> gcc/cp
> >> 2020-02-04  Jun Ma 
> >>
> >> * coroutines.cc (morph_fn_to_coro): Remove code related to
> >> copy captured variable.
> >> (register_local_var_uses):  Ditto.
> >> (register_param_uses):  Collect use of parameters inside
> >> DECL_VALUE_EXPR.
> >> (transform_local_var_uses): Substitute the alias variable
> >> with DECL_VALUE_EXPR if it has one.
> >>
> >>
> >> gcc/testsuite
> >> 2020-02-04  Jun Ma 
> >>
> >> * g++.dg/coroutines/lambda-07-multi-capture.C: New test.
> >
> >
> > <0001-fix-alias-variable.patch>
>
>


Re: [PATCH Coroutines]Pickup more CO_AWAIT_EXPR expanding cases

2020-02-10 Thread bin.cheng
Hi,

We found more ICEs because of unexpanded CO_AWAIT_EXPRs; it turned out we
can fix these issues with more simplification in function co_await_expander.
Here is the patch with a new test.

Bootstrap and test on x86_64.  Is it OK?

Thanks,
bin

gcc/cp
2020-02-10  Bin Cheng  

* coroutines.cc (co_await_expander): Simplify.

gcc/testsuite
2020-02-10  Bin Cheng  

* g++.dg/coroutines/co-await-syntax-10.C: New test.
* g++.dg/coroutines/co-await-syntax-11.C: New test.


--
Sender:bin.cheng 
Sent At:2020 Feb. 5 (Wed.) 09:34
Recipient:GCC Patches 
Cc:Iain Sandoe ; Nathan Sidwell 
Subject:[PATCH Coroutines]Pickup more CO_AWAIT_EXPR expanding cases


Hi,
This simple patch is actually a missing part of the previous one at
https://gcc.gnu.org/ml/gcc-patches/2020-01/msg01451.html
It picks up more CO_AWAIT_EXPR expanding cases.  A test is also added.

Bootstrap and test on x86_64. Is it OK?

Thanks,
bin

0002-Simplify-co_await_expander.patch
Description: Binary data


Re: [PATCH Coroutines]Insert the default return_void call at correct position

2020-02-10 Thread Bin.Cheng
Ping.

Thanks,
bin

On Mon, Feb 3, 2020 at 1:55 PM bin.cheng  wrote:
>
> Hi,
>
> Exceptions in coroutines are not correctly handled because the default
> return_void call is currently inserted before the final suspend point,
> rather than at the end of the original coroutine body.  This patch
> fixes the issue by generating the following code:
>   co_await promise.initial_suspend();
>   try {
> // The original coroutine body
>
> promise.return_void(); // The default return_void call.
>   } catch (...) {
> promise.unhandled_exception();
>   }
>   final_suspend:
>   // ...
>
> Bootstrap and test on x86_64.  Is it OK?
>
> Thanks,
> bin
>
> gcc/cp
> 2020-02-03  Bin Cheng  
>
> * coroutines.cc (build_actor_fn): Factor out code inserting the
> default return_void call to...
> (morph_fn_to_coro): ...here, also hoist local var declarations.
>
> gcc/testsuite
> 2020-02-03  Bin Cheng  
>
> * g++.dg/coroutines/torture/co-ret-15-default-return_void.C: New.


[PATCH PR92926]Fix wrong code caused by ctor node translation unit wide sharing

2019-12-12 Thread bin.cheng
Hi,

As reported in PR92926, a constant ctor is shared translation-unit wide
because of constexpr_call_table; however, during gimplification, the shared
ctor could be modified.  This patch fixes the issue by unsharing it before
modification in gimplify.  A test is reduced from the cppcoro library and
added.

Bootstrap and test ongoing.  Not sure if this is the correct fix though; any
comments?

Thanks,
bin

2019-12-13  Bin Cheng  

PR tree-optimization/92926
* gimplify.c (gimplify_init_constructor): Unshare ctor node before
clearing.

gcc/testsuite
2019-12-13  Bin Cheng  

PR tree-optimization/92926
* g++.dg/pr92926.C: New test.

0001-pr92926.patch
Description: Binary data


Re: [PATCH PR92926]Fix wrong code caused by ctor node translation unit wide sharing

2019-12-19 Thread Bin.Cheng
On Fri, Dec 13, 2019 at 11:26 AM bin.cheng  wrote:
>
> Hi,
>
> As reported in PR92926, constant ctor is shared translation unit wide because 
> of constexpr_call_table,
> however, during gimplify, the shared ctor could be modified.  This patch 
> fixes the issue by unsharing
> it before modification in gimplify.  A test is reduced from cppcoro library 
> and added.
>
> Bootstrap and test ongoing.  Not sure if this is the correct fix though, any 
> comments?
Ping.  Any comment?

Thanks,
bin
>
> Thanks,
> bin
>
> 2019-12-13  Bin Cheng  
>
> PR tree-optimization/92926
> * gimplify.c (gimplify_init_constructor): Unshare ctor node before
> clearing.
>
> gcc/testsuite
> 2019-12-13  Bin Cheng  
>
> PR tree-optimization/92926
> * g++.dg/pr92926.C: New test.


[PATCH GCC11]Improve uninitialized warning with value range info

2020-01-07 Thread bin.cheng
Hi,

Function use_pred_not_overlap_with_undef_path_pred of
pass_late_warn_uninitialized checks whether the predicate of a variable use
overlaps with the predicate of an undefined control-flow path.
For now, it only handles an ssa_var compared against a constant; this can be
improved to also handle an ssa_var compared against another ssa_var with
value range info, as described in the comment:

+ /* Check value range info of rhs, do following transforms:
+  flag_var < [min, max]  ->  flag_var < max
+  flag_var > [min, max]  ->  flag_var > min
+
+We can also transform LE_EXPR/GE_EXPR to LT_EXPR/GT_EXPR:
+  flag_var <= [min, max] ->  flag_var < [min, max+1]
+  flag_var >= [min, max] ->  flag_var > [min-1, max]
+if no overflow/wrap.  */

This change can avoid some false warnings.  Bootstrap and test on x86_64; any
comments?

Thanks,
bin

2020-01-08  Bin Cheng  

* tree-ssa-uninit.c (find_var_cmp_const): New function.
(use_pred_not_overlap_with_undef_path_pred): Call above.

Re: [PATCH GCC11]Improve uninitialized warning with value range info

2020-01-07 Thread bin.cheng
Sorry, here is the patch.
--
Sender:bin.cheng 
Sent At:2020 Jan. 8 (Wed.) 12:58
Recipient:GCC Patches 
Subject:[PATCH GCC11]Improve uninitialized warning with value range info


Hi,

Function use_pred_not_overlap_with_undef_path_pred of
pass_late_warn_uninitialized checks whether the predicate of a variable use
overlaps with the predicate of an undefined control-flow path.
For now, it only handles an ssa_var compared against a constant; this can be
improved to also handle an ssa_var compared against another ssa_var with
value range info, as described in the comment:

+ /* Check value range info of rhs, do following transforms:
+  flag_var < [min, max]  ->  flag_var < max
+  flag_var > [min, max]  ->  flag_var > min
+
+We can also transform LE_EXPR/GE_EXPR to LT_EXPR/GT_EXPR:
+  flag_var <= [min, max] ->  flag_var < [min, max+1]
+  flag_var >= [min, max] ->  flag_var > [min-1, max]
+if no overflow/wrap.  */

This change can avoid some false warnings.  Bootstrap and test on x86_64; any
comments?

Thanks,
bin

2020-01-08  Bin Cheng  

 * tree-ssa-uninit.c (find_var_cmp_const): New function.
 (use_pred_not_overlap_with_undef_path_pred): Call above.

check-uninit_warning-with-vrinfo-20200108.txt
Description: Binary data


Re: [RFC] IVOPTs select cand with preferred D-form access

2020-01-08 Thread Bin.Cheng
On Tue, Jan 7, 2020 at 6:48 PM Kewen.Lin  wrote:
>
> on 2020/1/7 下午5:14, Richard Biener wrote:
> > On Mon, 6 Jan 2020, Kewen.Lin wrote:
> >
> >> We are thinking whether it can be handled in IVOPTs instead of one RTL 
> >> pass.
> >>
> >> During IVOPTs selecting IV cands, it doesn't know the loop will be 
> >> unrolled so
> >> it doesn't count the possible step cost in with X-form.  If we can teach 
> >> it to
> >> consider the case, the IV cands which plays with D-form can be preferred.
> >> Currently unrolling (incomplete) happens in RTL, it looks we have to 
> >> predict
> >> the loop whether unroll in IVOPTs.  Since there is some parameter checks 
> >> on RTL
> >> insn counts and target hooks, it seems not easy to get that.  Besides, we 
> >> need
> >> to check the step is valid to put into D-form field (eg: DQ-form requires 
> >> divide
> >> 16 exactly), to ensure no extra ADDIs needed.
> >>
> >> I'm not sure whether it's a good idea to implement in IVOPTs, but I did 
> >> some
> >> changes in IVOPTs to prove it's doable to get expected codes, the patch is 
> >> attached.
> >>
> >> Any comments/suggestions are highly appreciated!
> >
> > Is the unrolled code better than the not unrolled code (assuming
> > optimal IV choice)?  Then IMHO IVOPTs should drive the unrolling,
> > either by actually doing it or by forcing it via the loop->unroll
> > setting.  I don't think second-guessing the RTL unroller at this
> > point is going to work.  Alternatively turn X-form into D-form during
> > RTL unrolling?
> >
>
> Hi Richard,
>
> Thanks for the comments!
>
> Yes, unrolled version is better on Power9 for both forms, but D-form unrolled 
> is better
> than X-form unrolled.  If we drive unrolling in IVOPTs, not sure it will be a 
> concern
> that IVOPTs becomes too heavy? or too rude with forced UF if imprecise? Do we 
> still
> have the plan to introduce one middle-end unroll pass, does it help if yes?
I am a bit worried that would make IVOPTs too heavy as well; it might be
possible to compute heuristics on whether the loop should be unrolled as a
post-IVOPTs transformation.  Of course the transformation needs to do
more work than simply unrolling in order to take advantage of the
aforementioned addressing mode.
BTW, won't the unrolled loop perform less well than on ppc if the target
doesn't support the [base + register + offset] addressing mode?

Another point: in case multiple passes do unrolling, the
"already unrolled" information may need to be recorded as a flag in the
loop properties.

Thanks,
bin
> The quoted RTL patch is to propose one RTL pass after RTL loop passes, it 
> also sounds
> good to check whether RTL unrolling is a good place!
>
>
> BR,
> Kewen
>


Re: [PATCH GCC11]Improve uninitialized warning with value range info

2020-01-08 Thread Bin.Cheng
On Wed, Jan 8, 2020 at 6:31 PM Richard Biener
 wrote:
>
> On Wed, Jan 8, 2020 at 6:01 AM bin.cheng  wrote:
> >
> > Sorry, here is the patch.
> > --
> > Sender:bin.cheng 
> > Sent At:2020 Jan. 8 (Wed.) 12:58
> > Recipient:GCC Patches 
> > Subject:[PATCH GCC11]Improve uninitialized warning with value range info
> >
> >
> > Hi,
> >
> > Function use_pred_not_overlap_with_undef_path_pred of 
> > pass_late_warn_uninitialized
> > checks if predicate of variable use overlaps with predicate of undefined 
> > control flow path.
> > For now, it only checks ssa_var comparing against constant, this can be 
> > improved where
> > ssa_var compares against another ssa_var with value range info, as 
> > described in comment:
> >
> > + /* Check value range info of rhs, do following transforms:
> > +  flag_var < [min, max]  ->  flag_var < max
> > +  flag_var > [min, max]  ->  flag_var > min
> > +
> > +We can also transform LE_EXPR/GE_EXPR to LT_EXPR/GT_EXPR:
> > +  flag_var <= [min, max] ->  flag_var < [min, max+1]
> > +  flag_var >= [min, max] ->  flag_var > [min-1, max]
> > +if no overflow/wrap.  */
> >
> > This change can avoid some false warning.  Bootstrap and test on x86_64, 
> > any comment?
>
> Definitely a good idea - the refactoring makes the patch hard to
> follow though.  The
> original code seems to pick any (the "first") compare against a constant.  You
> return the "first" but maybe from range info that might also be
> [-INF,+INF].  It seems
> that we'd want to pick the best so eventually sort the predicate chain
> so that compares
> against constants come first at least?  Not sure if it really makes a
> difference but
I don't know either, but I simply tried to not break existing code in
the patch.
Function prune_uninit_phi_opnds is called for the first compare against a
constant; actually it should be called for each comparison, but I guess it's
just avoiding O(M*N) complexity here.


> even currently we could have i < 5, i < 1 so the "better" one later?
> It might also make
> sense to simply push three predicates for i < j, namely i < j (if ever
> useful), i < min(j)
> and max(i) < j to avoid repeatedly doing the range computations.
IIUC, with the current implementation, it's not useful to check value range
info for both sides of the comparison because prune_uninit_phi_opnds
requires the flag_var to be defined by a PHI node in the same basic block
as the PHI parameter.

Thanks,
bin
>
> Thanks,
> Richard.
>
> > Thanks,
> > bin
> >
> > 2020-01-08  Bin Cheng  
> >
> >  * tree-ssa-uninit.c (find_var_cmp_const): New function.
> >  (use_pred_not_overlap_with_undef_path_pred): Call above.


Re: [PATCH PR92926]Fix wrong code caused by ctor node translation unit wide sharing

2020-01-08 Thread Bin.Cheng
On Fri, Dec 20, 2019 at 3:10 PM Richard Biener
 wrote:
>
> On December 20, 2019 2:13:47 AM GMT+01:00, "Bin.Cheng" 
>  wrote:
> >On Fri, Dec 13, 2019 at 11:26 AM bin.cheng
> > wrote:
> >>
> >> Hi,
> >>
> >> As reported in PR92926, constant ctor is shared translation unit wide
> >because of constexpr_call_table,
> >> however, during gimplify, the shared ctor could be modified.  This
> >patch fixes the issue by unsharing
> >> it before modification in gimplify.  A test is reduced from cppcoro
> >library and added.
> >>
> >> Bootstrap and test ongoing.  Not sure if this is the correct fix
> >though, any comments?
> >Ping.  Any comment?
>
> Looks reasonable to me.
Given PR92926 is marked as duplicate of PR93143, I updated test case
of the patch.

Thanks,
bin

2019-12-13  Bin Cheng  

PR tree-optimization/93143
* gimplify.c (gimplify_init_constructor): Unshare ctor node before
clearing.

gcc/testsuite
2019-12-13  Bin Cheng  

PR tree-optimization/93143
* g++.dg/pr93143.C: New test.
From 77252c3bb41887af1daa9e83615a8aa32dc330f9 Mon Sep 17 00:00:00 2001
From: "bin.cheng" 
Date: Thu, 9 Jan 2020 14:13:08 +0800
Subject: [PATCH] Fix pr93143.

---
 gcc/gimplify.c |  2 ++
 gcc/testsuite/g++.dg/pr93143.C | 73 ++
 2 files changed, 75 insertions(+)
 create mode 100644 gcc/testsuite/g++.dg/pr93143.C

diff --git a/gcc/gimplify.c b/gcc/gimplify.c
index 73fb2e7..55d7a93 100644
--- a/gcc/gimplify.c
+++ b/gcc/gimplify.c
@@ -5083,6 +5083,8 @@ gimplify_init_constructor (tree *expr_p, gimple_seq *pre_p, gimple_seq *post_p,
 	/* Zap the CONSTRUCTOR element list, which simplifies this case.
 	   Note that we still have to gimplify, in order to handle the
 	   case of variable sized types.  Avoid shared tree structures.  */
+	ctor = unshare_expr (ctor);
+	TREE_OPERAND (*expr_p, 1) = ctor;
 	CONSTRUCTOR_ELTS (ctor) = NULL;
 	TREE_SIDE_EFFECTS (ctor) = 0;
 	object = unshare_expr (object);
diff --git a/gcc/testsuite/g++.dg/pr93143.C b/gcc/testsuite/g++.dg/pr93143.C
new file mode 100644
index 000..40710cf
--- /dev/null
+++ b/gcc/testsuite/g++.dg/pr93143.C
@@ -0,0 +1,73 @@
+// { dg-do run }
+// { dg-options "-O3 -std=c++14" }
+
+struct array
+{
+constexpr unsigned char operator[](int i) const noexcept
+{
+return arr[i];
+}
+
+unsigned char arr[16];
+};
+
+
+class v6 {
+public:
+using bytes_type = array;
+constexpr v6(bytes_type const & bytes);
+constexpr bool is_loopback() const noexcept;
+static constexpr v6 loopback() noexcept
+{
+return v6(v6::bytes_type{0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1});
+}
+private:
+bytes_type bytes_;
+};
+
+
+
+constexpr v6::v6(bytes_type const & bytes)
+: bytes_(bytes)
+{}
+
+constexpr
+bool v6::is_loopback() const noexcept
+{
+return bytes_[0] == 0 &&
+bytes_[1] == 0 &&
+bytes_[2] == 0 &&
+bytes_[3] == 0 &&
+bytes_[4] == 0 &&
+bytes_[5] == 0 &&
+bytes_[6] == 0 &&
+bytes_[7] == 0 &&
+bytes_[8] == 0 &&
+bytes_[9] == 0 &&
+bytes_[10] == 0 &&
+bytes_[11] == 0 &&
+bytes_[12] == 0 &&
+bytes_[13] == 0 &&
+bytes_[14] == 0 &&
+bytes_[15] == 1;
+}
+
+void v6_type()
+{
+[[maybe_unused]] constexpr auto loopback = v6::loopback();
+}
+
+int main()
+{
+v6_type();
+
+constexpr auto a = v6::loopback();
+if (!a.is_loopback())
+__builtin_abort();
+
+auto b = v6::loopback();
+if (!b.is_loopback())
+__builtin_abort();
+
+return 0;
+}
-- 
1.8.3.1



Re: [PATCH GCC11]Improve uninitialized warning with value range info

2020-01-09 Thread Bin.Cheng
On Wed, Jan 8, 2020 at 7:50 PM Richard Biener
 wrote:
>
> On Wed, Jan 8, 2020 at 12:30 PM Bin.Cheng  wrote:
> >
> > On Wed, Jan 8, 2020 at 6:31 PM Richard Biener
> >  wrote:
> > >
> > > On Wed, Jan 8, 2020 at 6:01 AM bin.cheng  
> > > wrote:
> > > >
> > > > Sorry, here is the patch.
> > > > --
> > > > Sender:bin.cheng 
> > > > Sent At:2020 Jan. 8 (Wed.) 12:58
> > > > Recipient:GCC Patches 
> > > > Subject:[PATCH GCC11]Improve uninitialized warning with value range info
> > > >
> > > >
> > > > Hi,
> > > >
> > > > Function use_pred_not_overlap_with_undef_path_pred of 
> > > > pass_late_warn_uninitialized
> > > > checks if predicate of variable use overlaps with predicate of 
> > > > undefined control flow path.
> > > > For now, it only checks ssa_var comparing against constant, this can be 
> > > > improved where
> > > > ssa_var compares against another ssa_var with value range info, as 
> > > > described in comment:
> > > >
> > > > + /* Check value range info of rhs, do following transforms:
> > > > +  flag_var < [min, max]  ->  flag_var < max
> > > > +  flag_var > [min, max]  ->  flag_var > min
> > > > +
> > > > +We can also transform LE_EXPR/GE_EXPR to LT_EXPR/GT_EXPR:
> > > > +  flag_var <= [min, max] ->  flag_var < [min, max+1]
> > > > +  flag_var >= [min, max] ->  flag_var > [min-1, max]
> > > > +if no overflow/wrap.  */
> > > >
> > > > This change can avoid some false warning.  Bootstrap and test on 
> > > > x86_64, any comment?
> > >
> > > Definitely a good idea - the refactoring makes the patch hard to
> > > follow though.  The
> > > original code seems to pick any (the "first") compare against a constant. 
> > >  You
> > > return the "first" but maybe from range info that might also be
> > > [-INF,+INF].  It seems
> > > that we'd want to pick the best so eventually sort the predicate chain
> > > so that compares
> > > against constants come first at least?  Not sure if it really makes a
> > > difference but
> > I don't know either, but I simply tried to not break existing code in
> > the patch.
> > Function prune_uninit_phi_opnds is called for the first compare against a
> > constant; actually it should be called for each comparison, but I guess it's
> > just avoiding O(M*N) complexity here.
>
> Yeah.  I'm just worried finding a "bad" value-range predicate cuts the search
> in a way that causes extra bogus warnings?
hmm, the code is now as:
+  cmp_code = find_var_cmp_const (preds, phi, &flag_def, &boundary_cst, false);
+  if (cmp_code == ERROR_MARK)
+cmp_code = find_var_cmp_const (preds, phi, &flag_def, &boundary_cst, true);
+  if (cmp_code == ERROR_MARK)
 return false;

The first call to find_var_cmp_const preserves the original behavior,
while the second call
only finds the value range case if no comparison against a constant is found
by the first call.
So there should be no extra bogus warnings?

>
> >
> > > even currently we could have i < 5, i < 1 so the "better" one later?
> > > It might also make
> > > sense to simply push three predicates for i < j, namely i < j (if ever
> > > useful), i < min(j)
> > > and max(i) < j to avoid repeatedly doing the range computations.
> > IIUC, with the current implementation, it's not useful to check value range
> > info for both sides of the comparison because prune_uninit_phi_opnds
> > requires the flag_var to be defined by a PHI node in the same basic block
> > as the PHI parameter.
>
> Yes, but without remembering the code very well my suggestion allows
> "new" predicates to be gathered during collecting phase while your patch
> adjusts the query phase?
I don't quite follow here.  Do you mean we collect the three predicates "i <
j", "i < min(j)", "max(i) < j" and then
call prune_uninit_phi_opnds for all three conditions?
That raises another question, because now we simply break out of the for loop
once we find such a condition:

-  if ((gimple_code (flag_def) == GIMPLE_PHI)
- && (gimple_bb (flag_def) == gimple_bb (phi))
- && find_matching_predicate_in_rest_chains (the_pred, preds,
-num_preds))
-   break;

It's always possible that this flag_def can't prune the use predicates
against the undefined-path predicates, while a later one could prune them
but is skipped?

Thanks,
bin
>
> Richard.
>
> > Thanks,
> > bin
> > >
> > > Thanks,
> > > Richard.
> > >
> > > > Thanks,
> > > > bin
> > > >
> > > > 2020-01-08  Bin Cheng  
> > > >
> > > >  * tree-ssa-uninit.c (find_var_cmp_const): New function.
> > > >  (use_pred_not_overlap_with_undef_path_pred): Call above.


RE: [PING][PATCH ARM]Extend thumb1_reorg to save more comparison instructions

2013-08-23 Thread bin.cheng
Ping^2

Thanks.
bin

-Original Message-
From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org]
On Behalf Of bin.cheng
Sent: Thursday, August 08, 2013 4:51 PM
To: gcc-patches@gcc.gnu.org
Cc: Richard Earnshaw
Subject: [PING][PATCH ARM]Extend thumb1_reorg to save more comparison
instructions

Ping this patch, http://gcc.gnu.org/ml/gcc-patches/2013-04/msg01057.html

Thanks.
bin









[PATCH ARM]Refine scaled address expression on ARM

2013-08-28 Thread bin.cheng
Hi,

This patch refines scaled address expression on ARM.  It supports
"base+index*scale" in arm_legitimate_address_outer_p.  It also tries to
legitimize "base + index * scale + offset" with "reg <- base + offset;  reg
+ index * scale" by introducing thumb2_legitimize_address.  For now function
thumb2_legitimize_address is a kind of placeholder and just does the
mentioned transformation by calling try_multiplier_address.  Hoping we
can improve it in the future.
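
As a rough illustration of the transformation at the source level (a made-up
example, not from the patch):

/* Address "base + index*4 + 100" is legitimized as
   "reg <- base + 100;  reg + index*4".  */
int
get (int *base, int index)
{
  int *reg = base + 25;   /* reg <- base + offset (25 ints == 100 bytes)  */
  return reg[index];      /* reg + index * scale                          */
}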

With this patch:
1) "base+index*scale" is recognized.
2) PR57540 is fixed.

Bootstrapped and Tested on A15.  Is it OK?

Thanks.
Bin

2013-08-28  Bin Cheng  

* config/arm/arm.c (arm_legitimate_address_outer_p):
Support addressing mode like "base + index * scale".
(try_multiplier_address): New function.
(arm_legitimize_address): Call try_multiplier_address.
Index: gcc/config/arm/arm.c
===
--- gcc/config/arm/arm.c(revision 200774)
+++ gcc/config/arm/arm.c(working copy)
@@ -5931,7 +5931,9 @@ arm_legitimate_address_outer_p (enum machine_mode
&& arm_legitimate_index_p (mode, xop1, outer, strict_p))
   || (!strict_p && will_be_in_index_register (xop1
  || (arm_address_register_rtx_p (xop1, strict_p)
- && arm_legitimate_index_p (mode, xop0, outer, strict_p)));
+ && arm_legitimate_index_p (mode, xop0, outer, strict_p))
+ || (arm_address_register_rtx_p (xop0, strict_p)
+ && arm_legitimate_index_p (mode, xop1, outer, strict_p)));
 }
 
 #if 0
@@ -6652,6 +6654,106 @@ legitimize_tls_address (rtx x, rtx reg)
 }
 }
 
+/* Try to find address expression like base + index * scale + offset
+   in X.  If we find one, force base + offset into register and
+   construct new expression reg + index * scale; return the new
+   address expression if it's valid.  Otherwise return X.  */
+static rtx
+try_multiplier_address (rtx x, enum machine_mode mode ATTRIBUTE_UNUSED)
+{
+  rtx tmp, base_reg, new_rtx;
+  rtx base = NULL_RTX, index = NULL_RTX, scale = NULL_RTX, offset = NULL_RTX;
+
+  gcc_assert (GET_CODE (x) == PLUS);
+
+  /* Try to find and record base/index/scale/offset in X. */
+  if (GET_CODE (XEXP (x, 1)) == MULT)
+{
+  tmp = XEXP (x, 0);
+  index = XEXP (XEXP (x, 1), 0);
+  scale = XEXP (XEXP (x, 1), 1);
+  if (GET_CODE (tmp) != PLUS)
+   return x;
+
+  base = XEXP (tmp, 0);
+  offset = XEXP (tmp, 1);
+}
+  else
+{
+  tmp = XEXP (x, 0);
+  offset = XEXP (x, 1);
+  if (GET_CODE (tmp) != PLUS)
+   return x;
+
+  base = XEXP (tmp, 0);
+  scale = XEXP (tmp, 1);
+  if (GET_CODE (base) == MULT)
+   {
+ tmp = base;
+ base = scale;
+ scale = tmp;
+   }
+  if (GET_CODE (scale) != MULT)
+   return x;
+
+  index = XEXP (scale, 0);
+  scale = XEXP (scale, 1);
+}
+
+  if (CONST_INT_P (base))
+{
+  tmp = base;
+  base = offset;
+  offset = tmp;
+}
+
+  if (CONST_INT_P (index))
+{
+  tmp = index;
+  index = scale;
+  scale = tmp;
+}
+
+  /* ARM only supports constant scale in address.  */
+  if (!CONST_INT_P (scale))
+return x;
+
+  if (GET_MODE (base) != SImode || GET_MODE (index) != SImode)
+return x;
+
+  /* Only register/constant are allowed in each part.  */
+  if (!symbol_mentioned_p (base)
+  && !symbol_mentioned_p (offset)
+  && !symbol_mentioned_p (index)
+  && !symbol_mentioned_p (scale))
+{
+  /* Force "base+offset" into register and construct
+"register+index*scale".  Return the new expression
+only if it's valid.  */
+  tmp = gen_rtx_PLUS (SImode, base, offset);
+  base_reg = force_reg (SImode, tmp);
+  tmp = gen_rtx_fmt_ee (MULT, SImode, index, scale);
+  new_rtx = gen_rtx_PLUS (SImode, base_reg, tmp);
+  return new_rtx;
+}
+
+  return x;
+}
+
+/* Try machine-dependent ways of modifying an illegitimate Thumb2 address
+   to be legitimate.  If we find one, return the new address.
+
+   TODO: legitimize_address for Thumb2.  */
+static rtx
+thumb2_legitimize_address (rtx x, rtx orig_x ATTRIBUTE_UNUSED,
+  enum machine_mode mode)
+{
+  if (GET_CODE (x) == PLUS)
+return try_multiplier_address (x, mode);
+
+  return x;
+}
+
 /* Try machine-dependent ways of modifying an illegitimate address
to be legitimate.  If we find one, return the new, valid address.  */
 rtx
@@ -6659,9 +6761,9 @@ arm_legitimize_address (rtx x, rtx orig_x, enum ma
 {
   if (!TARGET_ARM)
 {
-  /* TODO: legitimize_address for Thumb2.  */
   if (TARGET_THUMB2)
-return x;
+   return thumb2_legitimize_address (x, orig_x, mode);
+
   return thumb_legitimize_address (x, orig_x, mode);
 }
 
@@ -6673,6 +6775,10 @@ arm_legitimize_address (rtx x, rtx orig_x, enum ma
   rtx xop0 = XEXP (x, 0);

[PATCH ARM/Embedded-4_8-branch]disable rtl loop invariant when optimizing for size

2013-08-28 Thread bin.cheng
Hi,

The attached patch disables the RTL loop-invariant motion pass when optimizing for code
size.  Committed to ARM/Embedded-4_8-branch as r202067.

Thanks.
bin

2013-08-29  Zhenqiang Chen  

* config/arm/arm.c (arm_option_override): Disable loop2_invariant
pass when optimize_size and ira-loop-pressure is not enabled.

Index: gcc/ChangeLog.arm
===
--- gcc/ChangeLog.arm   (revision 202066)
+++ gcc/ChangeLog.arm   (revision 202067)
@@ -1,3 +1,8 @@
+2013-08-29  Zhenqiang Chen  
+
+   * config/arm/arm.c (arm_option_override): Disable loop2_invariant
+   pass when optimize_size and ira-loop-pressure is not enabled.
+
 2013-08-05  Terry Guo  
 
Backport from mainline r197956
Index: gcc/config/arm/arm.c
===
--- gcc/config/arm/arm.c(revision 202066)
+++ gcc/config/arm/arm.c(revision 202067)
@@ -2134,6 +2134,13 @@
  global_options.x_param_values,
  global_options_set.x_param_values);
 
+  /* Do not move invariants out of loops since it tends to increase register
+ pressure.  The heuristic to estimate register pressure does not fit for
+ ARM.  -fira-loop-pressure tends to get more precise estimation.  But it
+ still need more tuning.  */
+  if (optimize_function_for_size_p (cfun) && !flag_ira_loop_pressure)
+flag_move_loop_invariants = 0;
+
   /* Register global variables with the garbage collector.  */
   arm_add_gc_roots ();
 }


[PATCH GCC]Catch more MEM_REFs sharing common addressing part in gimple strength reduction

2013-09-01 Thread bin.cheng
Hi,

The gimple-ssa-strength-reduction pass handles CAND_REFs in order to find
different MEM_REFs sharing common part in addressing expression.  If such
MEM_REFs are found, the pass rewrites MEM_REFs, and produces more efficient
addressing expression during the RTL passes.
The pass analyzes addressing expression in each MEM_REF to see if it can be
formalized as follows:
 base:MEM_REF (T1, C1)
 offset:  MULT_EXPR (PLUS_EXPR (T2, C2), C3)
 bitpos:  C4 * BITS_PER_UNIT
Then restructures it into below form:
 MEM_REF (POINTER_PLUS_EXPR (T1, MULT_EXPR (T2, C3)),
  C1 + (C2 * C3) + C4)
At last, rewrite the MEM_REFs if there are two or more sharing common
(non-constant) part.
The problem is it doesn't back trace T2.  If T2 is recorded as a CAND_ADD in
form of "T2' + C5", the MEM_REF should be restructure into:
 MEM_REF (POINTER_PLUS_EXPR (T1, MULT_EXPR (T2', C3)),
  C1 + (C2 * C3) + C4 + (C5 * C3))
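
A hypothetical worked instance of the formulas above (the names p, t and the
constants are illustrative only): take T1 = p, C1 = 8, C2 = 1, C3 = 4, C4 = 0
and T2 = t + 2, so T2' = t and C5 = 2.

void
example (char *p, long t)
{
  /* Without back-tracing T2: base p, offset (t + 2) * 4,
     index C1 + C2*C3 + C4 = 8 + 4 + 0 = 12.  */
  char *addr_old = p + (t + 2) * 4 + 12;

  /* With back-tracing: base p, offset t * 4,
     index 8 + 4 + 0 + C5*C3 = 12 + 8 = 20.  */
  char *addr_new = p + t * 4 + 20;

  /* Both compute the same address; the second form exposes the common
     non-constant part "t * 4" that neighbouring MEM_REFs can share.  */
  (void) addr_old;
  (void) addr_new;
}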

The patch also includes a test case to illustrate the problem.

Bootstrapped and tested on x86/x86_64/arm-a15, is it ok?

Thanks.
bin

2013-09-02  Bin Cheng  

* gimple-ssa-strength-reduction.c (backtrace_base_for_ref): New.
(restructure_reference): Call backtrace_base_for_ref.

gcc/testsuite/ChangeLog
2013-09-02  Bin Cheng  

* gcc.dg/tree-ssa/slsr-39.c: New test.

Index: gcc/testsuite/gcc.dg/tree-ssa/slsr-39.c
===
--- gcc/testsuite/gcc.dg/tree-ssa/slsr-39.c (revision 0)
+++ gcc/testsuite/gcc.dg/tree-ssa/slsr-39.c (revision 0)
@@ -0,0 +1,26 @@
+/* Verify straight-line strength reduction for back-tracing
+   CADN_ADD for T2 in:
+
+*PBASE:T1
+*POFFSET:  MULT_EXPR (T2, C3)
+*PINDEX:   C1 + (C2 * C3) + C4  */
+
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-slsr" } */
+
+typedef int arr_2[50][50];
+
+void foo (arr_2 a2, int v1)
+{
+  int i, j;
+
+  i = v1 + 5;
+  j = i;
+  a2 [i] [j++] = i;
+  a2 [i] [j++] = i;
+  a2 [i] [i-1] += 1;
+  return;
+}
+
+/* { dg-final { scan-tree-dump-times "MEM" 4 "slsr" } } */
+/* { dg-final { cleanup-tree-dump "slsr" } } */
Index: gcc/gimple-ssa-strength-reduction.c
===
--- gcc/gimple-ssa-strength-reduction.c (revision 202067)
+++ gcc/gimple-ssa-strength-reduction.c (working copy)
@@ -750,6 +750,57 @@ slsr_process_phi (gimple phi, bool speed)
   add_cand_for_stmt (phi, c);
 }
 
+/* Given PBASE which is a pointer to tree, loop up the defining
+   statement for it and check whether the candidate is in the
+   form of:
+
+ X = B + (1 * S), S is integer constant
+ X = B + (i * S), S is integer one
+
+   If so, set PBASE to the candiate's base_expr and return double
+   int (i * S).
+   Otherwise, just return double int zero.  */
+
+static double_int
+backtrace_base_for_ref (tree *pbase)
+{
+  tree base_in = *pbase;
+  slsr_cand_t base_cand;
+
+  STRIP_NOPS (base_in);
+  if (TREE_CODE (base_in) != SSA_NAME)
+return tree_to_double_int (integer_zero_node);
+
+  base_cand = base_cand_from_table (base_in);
+
+  while (base_cand && base_cand->kind != CAND_PHI)
+{
+  if (base_cand->kind == CAND_ADD
+ && base_cand->index.is_one ()
+ && TREE_CODE (base_cand->stride) == INTEGER_CST)
+   {
+ /* X = B + (1 * S), S is integer constant.  */
+ *pbase = base_cand->base_expr;
+ return tree_to_double_int (base_cand->stride);
+   }
+  else if (base_cand->kind == CAND_ADD
+  && TREE_CODE (base_cand->stride) == INTEGER_CST
+  && integer_onep (base_cand->stride))
+{
+ /* X = B + (i * S), S is integer one.  */
+ *pbase = base_cand->base_expr;
+ return base_cand->index;
+   }
+
+  if (base_cand->next_interp)
+   base_cand = lookup_cand (base_cand->next_interp);
+  else
+   base_cand = NULL;
+}
+
+  return tree_to_double_int (integer_zero_node);
+}
+
 /* Look for the following pattern:
 
 *PBASE:MEM_REF (T1, C1)
@@ -767,8 +818,15 @@ slsr_process_phi (gimple phi, bool speed)
 
 *PBASE:T1
 *POFFSET:  MULT_EXPR (T2, C3)
-*PINDEX:   C1 + (C2 * C3) + C4  */
+*PINDEX:   C1 + (C2 * C3) + C4
 
+   When T2 is recorded by an CAND_ADD in the form of (T2' + C5), It
+   will be further restructured to:
+
+*PBASE:T1
+*POFFSET:  MULT_EXPR (T2', C3)
+*PINDEX:   C1 + (C2 * C3) + C4 + (C5 * C3)  */
+
 static bool
 restructure_reference (tree *pbase, tree *poffset, double_int *pindex,
   tree *ptype)
@@ -777,7 +835,7 @@ restructure_reference (tree *pbase, tree *poffset,
   double_int index = *pindex;
   double_int bpu = double_int::from_uhwi (BITS_PER_UNIT);
   tree mult_op0, mult_op1, t1, t2, type;
-  double_int c1, c2, c3, c4;
+  double_int c1, c2, c3, c4, c5;
 
   if (!base
   || !offset
@@ -823,11 +881,12 @@ restructure_reference (tree *pbase, tree *poffset,
 }
 
   c4 = index.udiv (bp

[PATCH GCC]Find auto-increment use both before and after candidate's increment in IVOPT

2013-09-02 Thread bin.cheng
Hi,
For now set_autoinc_for_original_candidates only searches auto-inc uses
before candidate's increment, causing pre-increment opportunities missed for
original candidates.  This is a straightforward fix by searching after
candidate's increment too.

The patch also includes a test case to illustrate the problem.  Without the
patch, assembly of the test is:
foo:
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
movw    r3, #:lower16:__ctype_ptr__
ldrb    r2, [r0]        @ zero_extendqisi2
movt    r3, #:upper16:__ctype_ptr__
ldr     r1, [r3]
adds    r3, r1, r2
ldrb    r3, [r3, #1]    @ zero_extendqisi2
lsls    r3, r3, #29
bmi     .L2
adds    r3, r0, #1
.L3:
mov     r0, r3
adds    r3, r3, #1
ldrb    r2, [r0]        @ zero_extendqisi2
add     r2, r2, r1
ldrb    r2, [r2, #1]    @ zero_extendqisi2
lsls    r2, r2, #29
bpl     .L3
.L2:
bx      lr
.size   foo, .-foo

Which can be optimized into below:
foo:
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
movw    r3, #:lower16:__ctype_ptr__
ldrb    r1, [r0]        @ zero_extendqisi2
movt    r3, #:upper16:__ctype_ptr__
ldr     r2, [r3]
adds    r3, r2, r1
ldrb    r3, [r3, #1]    @ zero_extendqisi2
lsls    r1, r3, #29
bmi     .L2
.L3:
ldrb    r3, [r0, #1]!   @ zero_extendqisi2
add     r3, r3, r2
ldrb    r3, [r3, #1]    @ zero_extendqisi2
lsls    r3, r3, #29
bpl     .L3
.L2:
bx      lr
.size   foo, .-foo

Bootstrapped and tested on arm a15, is it OK?

Thanks.
bin

2013-09-02  Bin Cheng  

* tree-ssa-loop-ivopts.c (set_autoinc_for_original_candidates):
Find auto-increment use both before and after candidate.

gcc/testsuite/ChangeLog
2013-09-02  Bin Cheng  

* gcc.target/arm/ivopts-orig_biv-inc.c: New test.

Index: gcc/testsuite/gcc.target/arm/ivopts-orig_biv-inc.c
===
--- gcc/testsuite/gcc.target/arm/ivopts-orig_biv-inc.c  (revision 0)
+++ gcc/testsuite/gcc.target/arm/ivopts-orig_biv-inc.c  (revision 0)
@@ -0,0 +1,19 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-ivopts-details" } */
+/* { dg-skip-if "" { arm_thumb1 } } */
+
+extern char *__ctype_ptr__;
+
+unsigned char * foo(unsigned char *ReadPtr)
+{
+
+ unsigned char c;
+
+ while (!(((__ctype_ptr__+sizeof(""[*ReadPtr]))[(int)(*ReadPtr)])&04) == (!(0)))
+  ReadPtr++;
+
+ return ReadPtr;
+}
+
+/* { dg-final { scan-tree-dump-times "original biv" 2 "ivopts"} } */
+/* { dg-final { cleanup-tree-dump "ivopts" } } */
Index: gcc/tree-ssa-loop-ivopts.c
===
--- gcc/tree-ssa-loop-ivopts.c  (revision 200774)
+++ gcc/tree-ssa-loop-ivopts.c  (working copy)
@@ -4876,22 +4876,36 @@ set_autoinc_for_original_candidates (struct ivopts
   for (i = 0; i < n_iv_cands (data); i++)
 {
   struct iv_cand *cand = iv_cand (data, i);
-  struct iv_use *closest = NULL;
+  struct iv_use *closest_before = NULL;
+  struct iv_use *closest_after = NULL;
   if (cand->pos != IP_ORIGINAL)
continue;
+
   for (j = 0; j < n_iv_uses (data); j++)
{
  struct iv_use *use = iv_use (data, j);
  unsigned uid = gimple_uid (use->stmt);
- if (gimple_bb (use->stmt) != gimple_bb (cand->incremented_at)
- || uid > gimple_uid (cand->incremented_at))
+
+ if (gimple_bb (use->stmt) != gimple_bb (cand->incremented_at))
continue;
- if (closest == NULL || uid > gimple_uid (closest->stmt))
-   closest = use;
+
+ if (uid < gimple_uid (cand->incremented_at)
+ && (closest_before == NULL
+ || uid > gimple_uid (closest_before->stmt)))
+   closest_before = use;
+
+ if (uid > gimple_uid (cand->incremented_at)
+ && (closest_after == NULL
+ || uid < gimple_uid (closest_after->stmt)))
+   closest_after = use;
}
-  if (closest == NULL || !autoinc_possible_for_pair (data, closest, cand))
-   continue;
-  cand->ainc_use = closest;
+
+  if (closest_before != NULL
+ && autoinc_possible_for_pair (data, closest_before, cand))
+   cand->ainc_use = closest_before;
+  else if (closest_after != NULL
+  && autoinc_possible_for_pair (data, closest_after, cand))
+   cand->ainc_use = closest_after;
 }
 }
 


RE: [PATCH ARM]Refine scaled address expression on ARM

2013-09-02 Thread bin.cheng


> -Original Message-
> From: Richard Earnshaw 
> Sent: Thursday, August 29, 2013 9:06 PM
> To: Bin Cheng
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH ARM]Refine scaled address expression on ARM
>
> On 28/08/13 08:00, bin.cheng wrote:
> > Hi,
> > 
> > This patch refines scaled address expression on ARM.  It supports 
> > "base+index*scale" in arm_legitimate_address_outer_p.  It also tries 
> > to legitimize "base + index * scale + offset" with "reg <- base + 
> > offset;  reg
> > + index * scale" by introducing thumb2_legitimize_address.  For now 
> > + function
> > thumb2_legitimize_address is a kind of placeholder and just does the 
> > mentioned transformation by calling to try_multiplier_address.  Hoping 
> > we can improve it in the future.
> > 
> > With this patch:
> > 1) "base+index*scale" is recognized.
>
> That's because (PLUS (REG) (MULT (REG) (CONST))) is not canonical form.
>  So this shouldn't be necessary.  Can you identify where this
non-canonical form is being generated?
>

Oh, for now ivopt constructs "index*scale" to test whether backend supports
scaled addressing mode, which is not valid on ARM, so I was going to
construct "base + index*scale" instead.  Since "base + index * scale" is not
canonical form, I will construct the canonical form and drop this part of
the patch.
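
For reference, a minimal sketch (assumed shape, not part of this patch) of the
canonical probe expression -- the MULT operand goes first inside the PLUS:

static rtx
build_scaled_probe (enum machine_mode address_mode, rtx reg1, rtx reg2,
                    HOST_WIDE_INT scale)
{
  /* index * scale first, then the base register, matching the canonical
     ordering of PLUS operands in RTL.  */
  rtx index = gen_rtx_fmt_ee (MULT, address_mode, reg1,
                              gen_int_mode (scale, address_mode));
  return gen_rtx_fmt_ee (PLUS, address_mode, index, reg2);
}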

Is the rest of this patch OK?

Thanks.
bin





Re: [PATCH][RFC] Move IVOPTs closer to RTL expansion

2013-09-08 Thread Bin.Cheng
On Wed, Sep 4, 2013 at 5:20 PM, Richard Biener  wrote:
>
> The patch below moves IVOPTs out of the GIMPLE loop pipeline more
> closer to RTL expansion.  That's done for multiple reasons.
>
> First, the loop passes that at the moment precede IVOPTs leave
> around IL that is in desperate need of basic re-optimization
> like CSE, constant propagation and DCE.  That puts extra load
> on IVOPTs and its cost model, increasing compile-time and
> possibly confusing it.
>
> Second, IVOPTs introduces lowered memory accesses that it
> expects to stay as is, likewise it produces auto-inc/dec
> sequences that it expects to stay as is until RTL expansion.
> Passes such as DOM can break this expectation and make the
> work done by IVOPTs a waste.
>
> I remember doing this exercise in the GCC 4.3 timeframe where
> benchmarking on x86_64 showed no gains or losses (but x86_64
> isn't very sensitive to IV choices).
>
> Any help with benchmarking this on targets other than x86_64
> is appreciated (I'll re-do x86_64).
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu.
>
> General comments are of course also welcome.
>
> Thanks,
> Richard.
>
> 2013-09-04  Richard Biener  
>
> * passes.def: Move IVOPTs before final DCE pass.
> * tree-ssa-loop.c (tree_ssa_loop_ivopts): Adjust for being
> outside of the loop pipeline.
>
> * gcc.dg/tree-ssa/ivopts-3.c: Scan non-details dump.
> * gcc.dg/tree-ssa/reassoc-19.c: Be more permissive.
>
> Index: gcc/passes.def
> ===
> *** gcc/passes.def.orig 2013-09-04 10:57:33.0 +0200
> --- gcc/passes.def  2013-09-04 11:11:27.535952665 +0200
> *** along with GCC; see the file COPYING3.
> *** 221,227 
>   NEXT_PASS (pass_complete_unroll);
>   NEXT_PASS (pass_slp_vectorize);
>   NEXT_PASS (pass_loop_prefetch);
> - NEXT_PASS (pass_iv_optimize);
>   NEXT_PASS (pass_lim);
>   NEXT_PASS (pass_tree_loop_done);
> POP_INSERT_PASSES ()
> --- 221,226 
> *** along with GCC; see the file COPYING3.
> *** 237,242 
> --- 236,246 
>  opportunities.  */
> NEXT_PASS (pass_phi_only_cprop);
> NEXT_PASS (pass_vrp);
> +   /* IVOPTs lowers memory accesses and exposes auto-inc/dec
> +  opportunities.  Run it after the above passes cleaned up
> +the loop optimized IL but before DCE as IVOPTs generates
> +quite some garbage.  */
> +   NEXT_PASS (pass_iv_optimize);
> NEXT_PASS (pass_cd_dce);
> NEXT_PASS (pass_tracer);
>
> Index: gcc/tree-ssa-loop.c
> ===
> *** gcc/tree-ssa-loop.c.orig2013-09-04 10:57:32.0 +0200
> --- gcc/tree-ssa-loop.c 2013-09-04 11:11:27.536952677 +0200
> *** make_pass_loop_prefetch (gcc::context *c
> *** 906,915 
>   static unsigned int
>   tree_ssa_loop_ivopts (void)
>   {
> !   if (number_of_loops (cfun) <= 1)
> ! return 0;
>
> -   tree_ssa_iv_optimize ();
> return 0;
>   }
>
> --- 906,924 
>   static unsigned int
>   tree_ssa_loop_ivopts (void)
>   {
> !   loop_optimizer_init (LOOPS_NORMAL
> !  | LOOPS_HAVE_RECORDED_EXITS);
> !
> !   if (number_of_loops (cfun) > 1)
> ! {
> !   rewrite_into_loop_closed_ssa (NULL, TODO_update_ssa);
> !   scev_initialize ();
> !   tree_ssa_iv_optimize ();
> !   scev_finalize ();
> ! }
> !
> !   loop_optimizer_finalize ();
>
> return 0;
>   }
>
> Index: gcc/testsuite/gcc.dg/tree-ssa/ivopts-3.c
> ===
> *** gcc/testsuite/gcc.dg/tree-ssa/ivopts-3.c.orig   2013-09-04 
> 10:57:33.0 +0200
> --- gcc/testsuite/gcc.dg/tree-ssa/ivopts-3.c2013-09-04 11:11:27.559952952 
> +0200
> ***
> *** 1,5 
>   /* { dg-do compile } */
> ! /* { dg-options "-O2 -fdump-tree-ivopts-details" } */
>
>   void main (void)
>   {
> --- 1,5 
>   /* { dg-do compile } */
> ! /* { dg-options "-O2 -fdump-tree-ivopts" } */
>
>   void main (void)
>   {
> *** void main (void)
> *** 8,12 
>   f2 ();
>   }
>
> ! /* { dg-final { scan-tree-dump-times "!= 0" 5 "ivopts" } }  */
>   /* { dg-final { cleanup-tree-dump "ivopts" } }  */
> --- 8,12 
>   f2 ();
>   }
>
> ! /* { dg-final { scan-tree-dump-times "!= 0" 1 "ivopts" } }  */
>   /* { dg-final { cleanup-tree-dump "ivopts" } }  */
> Index: gcc/testsuite/gcc.dg/tree-ssa/reassoc-19.c
> ===
> *** gcc/testsuite/gcc.dg/tree-ssa/reassoc-19.c.orig 2012-12-18 
> 14:24:58.0 +0100
> --- gcc/testsuite/gcc.dg/tree-ssa/reassoc-19.c  2013-09-04 11:13:30.895416700 
> +0200
> *** void foo(char* left, char* rite, int ele
> *** 16,22 
> }
>   }
>
> ! /* { dg-final { scan-tree-dump-times "= \\\(sizetype\\\) element" 1 
> "optimized" } } */

Re: [PATCH][RFC] Move IVOPTs closer to RTL expansion

2013-09-08 Thread Bin.Cheng
On Mon, Sep 9, 2013 at 10:01 AM, Bin.Cheng  wrote:
> On Wed, Sep 4, 2013 at 5:20 PM, Richard Biener  wrote:
>>
>> The patch below moves IVOPTs out of the GIMPLE loop pipeline more
>> closer to RTL expansion.  That's done for multiple reasons.
>>
>> First, the loop passes that at the moment precede IVOPTs leave
>> around IL that is in desperate need of basic re-optimization
>> like CSE, constant propagation and DCE.  That puts extra load
>> on IVOPTs and its cost model, increasing compile-time and
>> possibly confusing it.
>>
>> Second, IVOPTs introduces lowered memory accesses that it
>> expects to stay as is, likewise it produces auto-inc/dec
>> sequences that it expects to stay as is until RTL expansion.
>> Passes such as DOM can break this expectation and make the
>> work done by IVOPTs a waste.
>>
>> I remember doing this exercise in the GCC 4.3 timeframe where
>> benchmarking on x86_64 showed no gains or losses (but x86_64
>> isn't very sensitive to IV choices).
>>
>> Any help with benchmarking this on targets other than x86_64
>> is appreciated (I'll re-do x86_64).

I will bootstrap and make check on arm a15.

-- 
Best Regards.


RE: [PATCH GCC]Catch more MEM_REFs sharing common addressing part in gimple strength reduction

2013-09-08 Thread bin.cheng
Thanks for reviewing, I will correct all the stupid spelling problems in the next
version of the patch.

On Mon, Sep 9, 2013 at 8:15 AM, Bill Schmidt  
wrote:
>
>>>+   int (i * S).
>>>+   Otherwise, just return double int zero.  */
>
> This is sufficient, since you are properly checking the next_interp
> chain.  Another possible form would be
>
> X = (B + i) * 1,
>
> but if this is present, then one of the forms you're checking for should
> also be present, so there's no need to check the MULT_CANDs.
I'm not very sure here since I didn't check MULT_CAND in the patch.  Could you 
please explain more about this?

>
>>>+
>>>+static double_int
>>>+backtrace_base_for_ref (tree *pbase)
>>>+{
>>>+  tree base_in = *pbase;
>>>+  slsr_cand_t base_cand;
>>>+
>>>+  STRIP_NOPS (base_in);
>>>+  if (TREE_CODE (base_in) != SSA_NAME)
>>>+return tree_to_double_int (integer_zero_node);
>>>+
>>>+  base_cand = base_cand_from_table (base_in);
>>>+
>>>+  while (base_cand && base_cand->kind != CAND_PHI)
>>>+{
>>>+  if (base_cand->kind == CAND_ADD
>>>+   && base_cand->index.is_one ()
>>>+   && TREE_CODE (base_cand->stride) == INTEGER_CST)
>>>+ {
>>>+   /* X = B + (1 * S), S is integer constant.  */
>>>+   *pbase = base_cand->base_expr;
>>>+   return tree_to_double_int (base_cand->stride);
>>>+ }
>>>+  else if (base_cand->kind == CAND_ADD
>>>+&& TREE_CODE (base_cand->stride) == INTEGER_CST
>>>+&& integer_onep (base_cand->stride))
>>>+{
>>>+   /* X = B + (i * S), S is integer one.  */
>>>+   *pbase = base_cand->base_expr;
>>>+   return base_cand->index;
>>>+ }
>>>+
>>>+  if (base_cand->next_interp)
>>>+ base_cand = lookup_cand (base_cand->next_interp);
>>>+  else
>>>+ base_cand = NULL;
>>>+}
>>>+
>>>+  return tree_to_double_int (integer_zero_node);
>>>+}
>>>+
>>> /* Look for the following pattern:
>>>
>>> *PBASE:MEM_REF (T1, C1)
>>>@@ -767,8 +818,15 @@ slsr_process_phi (gimple phi, bool speed)
>>>
>>> *PBASE:T1
>>> *POFFSET:  MULT_EXPR (T2, C3)
>>>-*PINDEX:   C1 + (C2 * C3) + C4  */
>>>+*PINDEX:   C1 + (C2 * C3) + C4
>>>
>>>+   When T2 is recorded by an CAND_ADD in the form of (T2' + C5), It
>  ^  ^
>  a  it
>
>>>+   will be further restructured to:
>>>+
>>>+*PBASE:T1
>>>+*POFFSET:  MULT_EXPR (T2', C3)
>>>+*PINDEX:   C1 + (C2 * C3) + C4 + (C5 * C3)  */
>>>+
>>> static bool
>>> restructure_reference (tree *pbase, tree *poffset, double_int
> *pindex,
>>> tree *ptype)
>>>@@ -777,7 +835,7 @@ restructure_reference (tree *pbase, tree *poffset,
>>>   double_int index = *pindex;
>>>   double_int bpu = double_int::from_uhwi (BITS_PER_UNIT);
>>>   tree mult_op0, mult_op1, t1, t2, type;
>>>-  double_int c1, c2, c3, c4;
>>>+  double_int c1, c2, c3, c4, c5;
>>>
>>>   if (!base
>>>   || !offset
>>>@@ -823,11 +881,12 @@ restructure_reference (tree *pbase, tree
> *poffset,
>>> }
>>>
>>>   c4 = index.udiv (bpu, FLOOR_DIV_EXPR);
>>>+  c5 = backtrace_base_for_ref (&t2);
>>>
>>>   *pbase = t1;
>>>-  *poffset = fold_build2 (MULT_EXPR, sizetype, t2,
>>>-   double_int_to_tree (sizetype, c3));
>>>-  *pindex = c1 + c2 * c3 + c4;
>>>+  *poffset = size_binop (MULT_EXPR, fold_convert (sizetype, t2),
>>>+  double_int_to_tree (sizetype, c3));
>
> I am not sure why you changed this call.  fold_build2 is a more
> efficient call than size_binop.  size_binop makes several checks that
> will fail in this case, and then calls fold_build2_loc, right?  Not a
> big deal but seems like changing it back would be better.  Perhaps I'm
> missing something (as usual ;).
I rely on size_binop to convert T2 into sizetype, because T2' may be in other 
kind of type.  Otherwise there will be ssa_verify error later.

Thanks.
bin






Re: [PATCH][RFC] Move IVOPTs closer to RTL expansion

2013-09-10 Thread Bin.Cheng
On Tue, Sep 10, 2013 at 3:30 AM, Steven Bosscher  wrote:
> On Mon, Sep 9, 2013 at 10:01 AM, Richard Biener wrote:
>>> >> First, the loop passes that at the moment precede IVOPTs leave
>>> >> around IL that is in desperate need of basic re-optimization
>>> >> like CSE, constant propagation and DCE.  That puts extra load
>>> >> on IVOPTs and its cost model, increasing compile-time and
>>> >> possibly confusing it.
>
> So why not just run DCE and DOM just before IVOPTs?
>
>
>>> >> Second, IVOPTs introduces lowered memory accesses that it
>>> >> expects to stay as is, likewise it produces auto-inc/dec
>>> >> sequences that it expects to stay as is until RTL expansion.
>>> >> Passes such as DOM can break this expectation and make the
>>> >> work done by IVOPTs a waste.
>
> But IVOPTs leaves its own messy code behind (see below).
>
>
>>> >> I remember doing this exercise in the GCC 4.3 timeframe where
>>> >> benchmarking on x86_64 showed no gains or losses (but x86_64
>>> >> isn't very sensitive to IV choices).
>>> >>
>>> >> Any help with benchmarking this on targets other than x86_64
>>> >> is appreciated (I'll re-do x86_64).
>
> Targets like POWER and ARM would be interesting to test on.
Yes, I will benchmark on ARM 32bit.
>
>
>> We already run LIM twice, moving the one that is currently after
>> IVOPTs as well should be easy.  But of course as you note IVOPTs
>> may introduce loop invariant code it also may introduce full
>> redundancies in the way it re-writes IVs.  And for both people may
>> claim that we have both CSE and LIM on the RTL level, too.
>
> I would claim that relying on RTL (G)CSE and RTL LIM is a step in the
> wrong direction. You end up creating a lot of garbage RTL, and many
> transformations that are easy on GIMPLE cannot be performed anymore on
> RTL.
>
> Is it possible to make IVOPTs clean up after itself? It should be easy
> for IVOPTs to notice that it creates loop-invariant code, and position
It would be even better to have some loop-invariant-sensitive
re-association, because IVOPT (by using general tree association)
splits loop-invariant expressions across different gimple statements,
making LIM's life hard.
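
A made-up C example (not from this thread) of how the association order
affects what LIM can hoist:

int
sum (int *base, int inv1, int inv2, int n)
{
  int i, s = 0;
  for (i = 0; i < n; i++)
    {
      /* Split form: every addition depends on i, nothing to hoist.  */
      int idx_split = (i + inv1) + inv2;
      /* Invariant-aware form: "inv1 + inv2" is loop invariant and can
         be computed once in the pre-header.  */
      int idx_hoist = i + (inv1 + inv2);
      s += base[idx_split] + base[idx_hoist];
    }
  return s;
}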

> it on the loop pre-header edge. I suppose full redundancies are
> harder, but I would expect that to happen less frequently (the only
> situation I can think of right now is where a loop is rewritten with
> two IVs where the two IVs share a common sub-expression).
>
> Ciao!
> Steven



-- 
Best Regards.


RE: [PATCH GCC]Catch more MEM_REFs sharing common addressing part in gimple strength reduction

2013-09-10 Thread bin.cheng
On Mon, Sep 9, 2013 at 11:35 PM, Bill Schmidt  
wrote:
>
>> > I rely on size_binop to convert T2 into sizetype, because T2' may be in 
>> > other kind of type.  Otherwise there will be ssa_verify error later.
>>
>> OK, I see now.  I had thought this was handled by fold_build2, but
>> apparently not.  I guess all T2's formerly handled were already sizetype
>> as expected.  Thanks for the explanation!
>
> So, wouldn't it suffice to change t2 to fold_convert (sizetype, t2) in
> the argument list to fold_build2?  It's picking nits, but that would be
> slightly more efficient.

Hi Bill,

This is the 2nd version of patch with your comments incorporated.
Bootstrap and re-test on x86.  Re-test on ARM ongoing.  Is it ok if tests pass?

Thanks.
bin

Index: gcc/testsuite/gcc.dg/tree-ssa/slsr-39.c
===
--- gcc/testsuite/gcc.dg/tree-ssa/slsr-39.c (revision 0)
+++ gcc/testsuite/gcc.dg/tree-ssa/slsr-39.c (revision 0)
@@ -0,0 +1,26 @@
+/* Verify straight-line strength reduction for back-tracing
+   CAND_ADD for T2 in:
+
+*PBASE:T1
+*POFFSET:  MULT_EXPR (T2, C3)
+*PINDEX:   C1 + (C2 * C3) + C4  */
+
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-slsr" } */
+
+typedef int arr_2[50][50];
+
+void foo (arr_2 a2, int v1)
+{
+  int i, j;
+
+  i = v1 + 5;
+  j = i;
+  a2 [i] [j++] = i;
+  a2 [i] [j++] = i;
+  a2 [i] [i-1] += 1;
+  return;
+}
+
+/* { dg-final { scan-tree-dump-times "MEM" 4 "slsr" } } */
+/* { dg-final { cleanup-tree-dump "slsr" } } */
Index: gcc/gimple-ssa-strength-reduction.c
===
--- gcc/gimple-ssa-strength-reduction.c (revision 202067)
+++ gcc/gimple-ssa-strength-reduction.c (working copy)
@@ -750,6 +750,57 @@ slsr_process_phi (gimple phi, bool speed)
   add_cand_for_stmt (phi, c);
 }
 
+/* Given PBASE which is a pointer to tree, look up the defining
+   statement for it and check whether the candidate is in the
+   form of:
+
+ X = B + (1 * S), S is integer constant
+ X = B + (i * S), S is integer one
+
+   If so, set PBASE to the candidate's base_expr and return double
+   int (i * S).
+   Otherwise, just return double int zero.  */
+
+static double_int
+backtrace_base_for_ref (tree *pbase)
+{
+  tree base_in = *pbase;
+  slsr_cand_t base_cand;
+
+  STRIP_NOPS (base_in);
+  if (TREE_CODE (base_in) != SSA_NAME)
+return tree_to_double_int (integer_zero_node);
+
+  base_cand = base_cand_from_table (base_in);
+
+  while (base_cand && base_cand->kind != CAND_PHI)
+{
+  if (base_cand->kind == CAND_ADD
+ && base_cand->index.is_one ()
+ && TREE_CODE (base_cand->stride) == INTEGER_CST)
+   {
+ /* X = B + (1 * S), S is integer constant.  */
+ *pbase = base_cand->base_expr;
+ return tree_to_double_int (base_cand->stride);
+   }
+  else if (base_cand->kind == CAND_ADD
+  && TREE_CODE (base_cand->stride) == INTEGER_CST
+  && integer_onep (base_cand->stride))
+{
+ /* X = B + (i * S), S is integer one.  */
+ *pbase = base_cand->base_expr;
+ return base_cand->index;
+   }
+
+  if (base_cand->next_interp)
+   base_cand = lookup_cand (base_cand->next_interp);
+  else
+   base_cand = NULL;
+}
+
+  return tree_to_double_int (integer_zero_node);
+}
+
 /* Look for the following pattern:
 
 *PBASE:MEM_REF (T1, C1)
@@ -767,8 +818,15 @@ slsr_process_phi (gimple phi, bool speed)
 
 *PBASE:T1
 *POFFSET:  MULT_EXPR (T2, C3)
-*PINDEX:   C1 + (C2 * C3) + C4  */
+*PINDEX:   C1 + (C2 * C3) + C4
 
+   When T2 is recorded by a CAND_ADD in the form of (T2' + C5), it
+   will be further restructured to:
+
+*PBASE:T1
+*POFFSET:  MULT_EXPR (T2', C3)
+*PINDEX:   C1 + (C2 * C3) + C4 + (C5 * C3)  */
+
 static bool
 restructure_reference (tree *pbase, tree *poffset, double_int *pindex,
   tree *ptype)
@@ -777,7 +835,7 @@ restructure_reference (tree *pbase, tree *poffset,
   double_int index = *pindex;
   double_int bpu = double_int::from_uhwi (BITS_PER_UNIT);
   tree mult_op0, mult_op1, t1, t2, type;
-  double_int c1, c2, c3, c4;
+  double_int c1, c2, c3, c4, c5;
 
   if (!base
   || !offset
@@ -823,11 +881,12 @@ restructure_reference (tree *pbase, tree *poffset,
 }
 
   c4 = index.udiv (bpu, FLOOR_DIV_EXPR);
+  c5 = backtrace_base_for_ref (&t2);
 
   *pbase = t1;
-  *poffset = fold_build2 (MULT_EXPR, sizetype, t2,
+  *poffset = fold_build2 (MULT_EXPR, sizetype, fold_convert (sizetype, t2),
  double_int_to_tree (sizetype, c3));
-  *pindex = c1 + c2 * c3 + c4;
+  *pindex = c1 + c2 * c3 + c4 + c5 * c3;
   *ptype = type;
 
   return true;


RE: [PATCH GCC]Catch more MEM_REFs sharing common addressing part in gimple strength reduction

2013-09-11 Thread bin.cheng

On Tue, Sep 10, 2013 at 9:30 PM, Bill Schmidt  
wrote:
>
>
> On Tue, 2013-09-10 at 15:41 +0800, bin.cheng wrote:
>> On Mon, Sep 9, 2013 at 11:35 PM, Bill Schmidt  
>> wrote:
>> >
>> >> > I rely on size_binop to convert T2 into sizetype, because T2' may be in 
>> >> > other kind of type.  Otherwise there will be ssa_verify error later.
>> >>
>> >> OK, I see now.  I had thought this was handled by fold_build2, but
>> >> apparently not.  I guess all T2's formerly handled were already sizetype
>> >> as expected.  Thanks for the explanation!
>> >
>> > So, wouldn't it suffice to change t2 to fold_convert (sizetype, t2) in
>> > the argument list to fold_build2?  It's picking nits, but that would be
>> > slightly more efficient.
>>
>> Hi Bill,
>>
>> This is the 2nd version of patch with your comments incorporated.
>> Bootstrap and re-test on x86.  Re-test on ARM ongoing.  Is it ok if tests 
>> pass?
>
> Looks good to me!  Thanks, Bin.
>

Sorry, I have to hold on to this patch since it causes several test failures on ARM.
Will investigate and get back ASAP.

Thanks.
bin





RE: [PING][PATCH ARM]Extend thumb1_reorg to save more comparison instructions

2013-09-12 Thread bin.cheng
Ping ^ 3


-Original Message-
From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org]
On Behalf Of bin.cheng
Sent: Friday, August 23, 2013 3:23 PM
To: gcc-patches@gcc.gnu.org
Cc: Richard Earnshaw
Subject: RE: [PING][PATCH ARM]Extend thumb1_reorg to save more comparison
instructions

Ping^2

Thanks.
bin

-Original Message-
From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org]
On Behalf Of bin.cheng
Sent: Thursday, August 08, 2013 4:51 PM
To: gcc-patches@gcc.gnu.org
Cc: Richard Earnshaw
Subject: [PING][PATCH ARM]Extend thumb1_reorg to save more comparison
instructions

Ping this patch, http://gcc.gnu.org/ml/gcc-patches/2013-04/msg01057.html

Thanks.
bin













RE: [PATCH ARM]Extend thumb1_reorg to save more comparison instructions

2013-09-16 Thread bin.cheng


> -Original Message-
> From: Richard Earnshaw
> Sent: Thursday, September 12, 2013 11:24 PM
> To: Bin Cheng
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH ARM]Extend thumb1_reorg to save more comparison
> instructions
> 
> On 18/04/13 06:34, Bin Cheng wrote:
> 
> Sorry for the delay, I've been trying to get my head around this one.
> 
> > thumb1_reorg-20130417.txt
> >
> >
> > Index: gcc/config/arm/arm.c
> >
> > ===
> > --- gcc/config/arm/arm.c(revision 197562)
> > +++ gcc/config/arm/arm.c(working copy)
> > @@ -14026,6 +14026,7 @@ thumb1_reorg (void)
> >rtx set, dest, src;
> >rtx pat, op0;
> >rtx prev, insn = BB_END (bb);
> > +  bool insn_clobbered = false;
> >
> >while (insn != BB_HEAD (bb) && DEBUG_INSN_P (insn))
> > insn = PREV_INSN (insn);
> > @@ -14034,12 +14035,29 @@ thumb1_reorg (void)
> >if (INSN_CODE (insn) != CODE_FOR_cbranchsi4_insn)
> > continue;
> >
> > -  /* Find the first non-note insn before INSN in basic block BB.
*/
> > +  /* Get the register with which we are comparing.  */
> > +  pat = PATTERN (insn);
> > +  op0 = XEXP (XEXP (SET_SRC (pat), 0), 0);
> > +
> > +  /* Find the first flag setting insn before INSN in basic block
> > + BB.  */
> >gcc_assert (insn != BB_HEAD (bb));
> > -  prev = PREV_INSN (insn);
> > -  while (prev != BB_HEAD (bb) && (NOTE_P (prev) || DEBUG_INSN_P
> (prev)))
> > -   prev = PREV_INSN (prev);
> > +  for (prev = PREV_INSN (insn);
> > +  (!insn_clobbered
> > +   && prev != BB_HEAD (bb)
> > +   && (NOTE_P (prev)
> > +   || DEBUG_INSN_P (prev)
> > +   || (GET_CODE (prev) == SET
> 
> This can't be right.  prev is an insn of some form, so the test that it is
a SET will
> always fail.
> 
> What you need to do here is to initialize 'set' to null before the loop
and then
> have something like
> 
>   || ((set = single_set (prev)) != NULL
> 
> > +   && get_attr_conds (prev) == CONDS_NOCOND)));
> > +  prev = PREV_INSN (prev))
> > +   {
> > + if (reg_set_p (op0, prev))
> > +   insn_clobbered = true;
> > +   }
> >
> > +  /* Skip if op0 is clobbered by insn other than prev. */
> > +  if (insn_clobbered)
> > +   continue;
> > +
> >set = single_set (prev);
> 
> This now becomes redundant and ...
> 
> >if (!set)
> > continue;
> 
> This will be based on the set you extracted above.
> 

Hi Richard, here is the updated patch according to your comments.  Tested on
thumb1, please review.

Thanks.
bin

Index: gcc/config/arm/arm.c
===
--- gcc/config/arm/arm.c(revision 202599)
+++ gcc/config/arm/arm.c(working copy)
@@ -14265,9 +14265,10 @@ thumb1_reorg (void)
 
   FOR_EACH_BB (bb)
 {
-  rtx set, dest, src;
-  rtx pat, op0;
+  rtx dest, src;
+  rtx pat, op0, set = NULL;
   rtx prev, insn = BB_END (bb);
+  bool insn_clobbered = false;
 
   while (insn != BB_HEAD (bb) && DEBUG_INSN_P (insn))
insn = PREV_INSN (insn);
@@ -14276,13 +14277,29 @@ thumb1_reorg (void)
   if (INSN_CODE (insn) != CODE_FOR_cbranchsi4_insn)
continue;
 
-  /* Find the first non-note insn before INSN in basic block BB.  */
+  /* Get the register with which we are comparing.  */
+  pat = PATTERN (insn);
+  op0 = XEXP (XEXP (SET_SRC (pat), 0), 0);
+
+  /* Find the first flag setting insn before INSN in basic block BB.  */
   gcc_assert (insn != BB_HEAD (bb));
-  prev = PREV_INSN (insn);
-  while (prev != BB_HEAD (bb) && (NOTE_P (prev) || DEBUG_INSN_P (prev)))
-   prev = PREV_INSN (prev);
+  for (prev = PREV_INSN (insn);
+  (!insn_clobbered
+   && prev != BB_HEAD (bb)
+   && (NOTE_P (prev)
+   || DEBUG_INSN_P (prev)
+   || ((set = single_set (prev)) != NULL
+   && get_attr_conds (prev) == CONDS_NOCOND)));
+  prev = PREV_INSN (prev))
+   {
+ if (reg_set_p (op0, prev))
+   insn_clobbered = true;
+   }
 
-  set = single_set (prev);
+  /* Skip if op0 is clobbered by insn other than prev. */
+  if (insn_clobbered)
+   continue;
+
   if (!set)
continue;
 
@@ -14292,12 +14309,9 @@ thumb1_reorg (void)
  || !low_register_operand (src, SImode))
continue;
 
-  pat = PATTERN (insn);
-  op0 = XEXP (XEXP (SET_SRC (pat), 0), 0);
   /* Rewrite move into subtract of 0 if its operand is compared with ZERO
-in INSN. Don't need to check dest since cprop_hardreg pass propagates
-src into INSN.  */
-  if (REGNO (op0) == REGNO (src))
+in INSN.  Both src and dest of the move insn are checked.  */
+  if (REGNO (op0) == REGNO (src) || REGNO (op0) == REGNO (dest))
{
  dest = copy_rtx (dest);
  src = copy_rtx (src)

RE: [PATCH GCC]Catch more MEM_REFs sharing common addressing part in gimple strength reduction

2013-09-16 Thread bin.cheng


> -Original Message-
> From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-
> ow...@gcc.gnu.org] On Behalf Of bin.cheng
> Sent: Thursday, September 12, 2013 2:13 PM
> To: 'Bill Schmidt'; Yufeng Zhang; Yufeng Zhang
> Cc: Richard Biener; GCC Patches
> Subject: RE: [PATCH GCC]Catch more MEM_REFs sharing common addressing
> part in gimple strength reduction
> 
> 
> On Tue, Sep 10, 2013 at 9:30 PM, Bill Schmidt 
> wrote:
> >
> >
> > On Tue, 2013-09-10 at 15:41 +0800, bin.cheng wrote:
> >> On Mon, Sep 9, 2013 at 11:35 PM, Bill Schmidt
>  wrote:
> >> >
> >> >> > I rely on size_binop to convert T2 into sizetype, because T2' may be
> in other kind of type.  Otherwise there will be ssa_verify error later.
> >> >>
> >> >> OK, I see now.  I had thought this was handled by fold_build2, but
> >> >> apparently not.  I guess all T2's formerly handled were already
> >> >> sizetype as expected.  Thanks for the explanation!
> >> >
> >> > So, wouldn't it suffice to change t2 to fold_convert (sizetype, t2)
> >> > in the argument list to fold_build2?  It's picking nits, but that
> >> > would be slightly more efficient.
> >>
> >> Hi Bill,
> >>
> >> This is the 2nd version of patch with your comments incorporated.
> >> Bootstrap and re-test on x86.  Re-test on ARM ongoing.  Is it ok if tests
> pass?
> >
> > Looks good to me!  Thanks, Bin.
> >
> 
> Sorry I have to hold on this patch since it causes several tests failed on 
> ARM.
> Will investigate it and get back ASAP.
> 
The reported failure is a false alarm and happens on trunk too.  I must have
compared the wrong testing results.
Since there is no regression and the patch was approved before, I will apply it
to trunk.

Thanks.
bin






RE: [PATCH GCC]Catch more MEM_REFs sharing common addressing part in gimple strength reduction

2013-09-17 Thread bin.cheng


> -Original Message-
> From: Dominique Dhumieres [mailto:domi...@lps.ens.fr]
> Sent: Wednesday, September 18, 2013 1:47 AM
> To: gcc-patches@gcc.gnu.org
> Cc: hjl.to...@gmail.com; Bin Cheng
> Subject: Re: [PATCH GCC]Catch more MEM_REFs sharing common
> addressing part in gimple strength reduction
> 
> The new test gcc.dg/tree-ssa/slsr-39.c fails in 64 bit mode (see
> http://gcc.gnu.org/ml/gcc-regression/2013-09/msg00455.html ).
> Looking for MEM in the dump returns
> 
>   _12 = MEM[(int[50] *)_17];
>   MEM[(int[50] *)_20] = _13;
> 

Thanks for reporting, I think this can be fixed by this patch:
http://gcc.gnu.org/ml/gcc-patches/2013-09/msg00761.html

Thanks.
bin





RE: [PATCH ARM]Refine scaled address expression on ARM

2013-09-18 Thread bin.cheng


> -Original Message-
> From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-
> ow...@gcc.gnu.org] On Behalf Of bin.cheng
> Sent: Monday, September 02, 2013 3:09 PM
> To: Richard Earnshaw
> Cc: gcc-patches@gcc.gnu.org
> Subject: RE: [PATCH ARM]Refine scaled address expression on ARM
> 
> 
> 
> > -Original Message-
> > From: Richard Earnshaw
> > Sent: Thursday, August 29, 2013 9:06 PM
> > To: Bin Cheng
> > Cc: gcc-patches@gcc.gnu.org
> > Subject: Re: [PATCH ARM]Refine scaled address expression on ARM
> >
> > On 28/08/13 08:00, bin.cheng wrote:
> > > Hi,
> > >
> > > This patch refines scaled address expression on ARM.  It supports
> > > "base+index*scale" in arm_legitimate_address_outer_p.  It also tries
> > > to legitimize "base + index * scale + offset" with "reg <- base +
> > > offset;  reg
> > > + index * scale" by introducing thumb2_legitimize_address.  For now
> > > + function
> > > thumb2_legitimize_address is a kind of placeholder and just does the
> > > mentioned transformation by calling to try_multiplier_address.
> > > Hoping we can improve it in the future.
> > >
> > > With this patch:
> > > 1) "base+index*scale" is recognized.
> >
> > That's because (PLUS (REG) (MULT (REG) (CONST))) is not canonical form.
> >  So this shouldn't be necessary.  Can you identify where this
> non-canonical form is being generated?
> >
> 
> Oh, for now ivopt constructs "index*scale" to test whether backend
> supports scaled addressing mode, which is not valid on ARM, so I was going
> to construct "base + index*scale" instead.  Since "base + index * scale"
is not
> canonical form, I will construct the canonical form and drop this part of
the
> patch.
> 
> Is rest of this patch OK?
> 
Hi Richard, I removed the part you were concerned about and created this
updated patch.

Is it OK?

Thanks.
bin

2013-09-18  Bin Cheng  

* config/arm/arm.c (try_multiplier_address): New function.
(thumb2_legitimize_address): New function.
(arm_legitimize_address): Call try_multiplier_address and
thumb2_legitimize_address.

Index: gcc/config/arm/arm.c
===
--- gcc/config/arm/arm.c(revision 200774)
+++ gcc/config/arm/arm.c(working copy)
@@ -6652,6 +6654,106 @@ legitimize_tls_address (rtx x, rtx reg)
 }
 }
 
+/* Try to find address expression like base + index * scale + offset
+   in X.  If we find one, force base + offset into register and
+   construct new expression reg + index * scale; return the new
+   address expression if it's valid.  Otherwise return X.  */
+static rtx
+try_multiplier_address (rtx x, enum machine_mode mode ATTRIBUTE_UNUSED)
+{
+  rtx tmp, base_reg, new_rtx;
+  rtx base = NULL_RTX, index = NULL_RTX, scale = NULL_RTX, offset = NULL_RTX;
+
+  gcc_assert (GET_CODE (x) == PLUS);
+
+  /* Try to find and record base/index/scale/offset in X. */
+  if (GET_CODE (XEXP (x, 1)) == MULT)
+{
+  tmp = XEXP (x, 0);
+  index = XEXP (XEXP (x, 1), 0);
+  scale = XEXP (XEXP (x, 1), 1);
+  if (GET_CODE (tmp) != PLUS)
+   return x;
+
+  base = XEXP (tmp, 0);
+  offset = XEXP (tmp, 1);
+}
+  else
+{
+  tmp = XEXP (x, 0);
+  offset = XEXP (x, 1);
+  if (GET_CODE (tmp) != PLUS)
+   return x;
+
+  base = XEXP (tmp, 0);
+  scale = XEXP (tmp, 1);
+  if (GET_CODE (base) == MULT)
+   {
+ tmp = base;
+ base = scale;
+ scale = tmp;
+   }
+  if (GET_CODE (scale) != MULT)
+   return x;
+
+  index = XEXP (scale, 0);
+  scale = XEXP (scale, 1);
+}
+
+  if (CONST_INT_P (base))
+{
+  tmp = base;
+  base = offset;
+  offset = tmp;
+}
+
+  if (CONST_INT_P (index))
+{
+  tmp = index;
+  index = scale;
+  scale = tmp;
+}
+
+  /* ARM only supports constant scale in address.  */
+  if (!CONST_INT_P (scale))
+return x;
+
+  if (GET_MODE (base) != SImode || GET_MODE (index) != SImode)
+return x;
+
+  /* Only register/constant are allowed in each part.  */
+  if (!symbol_mentioned_p (base)
+  && !symbol_mentioned_p (offset)
+  && !symbol_mentioned_p (index)
+  && !symbol_mentioned_p (scale))
+{
+  /* Force "base+offset" into register and construct
+"register+index*scale".  Return the new expression
+only if it's valid.  */
+  tmp = gen_rtx_PLUS (SImode, base, offset);
+  base_reg = force_reg (SImode, tmp);
+  tmp = gen_rtx_fmt_ee (MULT, SImode, index, scale);
+  new_rtx = gen_rtx_PLUS (SImode, base_reg, tmp);
+  ret

[PATCH]Construct canonical scaled address expression in IVOPT

2013-09-20 Thread bin.cheng
Hi,
For now IVOPT constructs the scaled address expression in the form of
"scaled*index" and checks whether the backend supports it.  The problem is that
this address expression is invalid on ARM, causing scaled expressions to be
disabled in IVOPT on ARM.  This patch fixes the IVOPT part by constructing an
rtl address expression like "index*scaled+base".

Hi Richard,
I thought about the suggestion of constructing TARGET_MEM[.] and adding a new
target hook to check whether the backend supports such target memory accesses,
but still want to give this patch a try because:
1) The RTL pattern "index*scaled+base" is a kind of canonical form of scaled
address expression and it works fine.
2) Constructing a TARGET_MEM node won't save us any inconvenience; on the
contrary, we would have to add a new target hook checking whether scaled
addressing modes are supported, which in essence is nothing else than the
current implementation.

Also "base+index*scaled" is re-structured to canonical form
"index*scaled+base", I constructed the latter form in patch.
Bootstrapped and tested on x86_64 and arm_a15. Is it OK?

Thanks.
bin


2013-09-20  Bin Cheng  

* tree-ssa-loop-ivopts.c (multiplier_allowed_in_address_p):
Construct canonical scaled rtl address expression.

Index: gcc/tree-ssa-loop-ivopts.c
===
--- gcc/tree-ssa-loop-ivopts.c  (revision 202599)
+++ gcc/tree-ssa-loop-ivopts.c  (working copy)
@@ -3134,15 +3134,22 @@ multiplier_allowed_in_address_p (HOST_WIDE_INT rat
 {
   enum machine_mode address_mode = targetm.addr_space.address_mode (as);
   rtx reg1 = gen_raw_REG (address_mode, LAST_VIRTUAL_REGISTER + 1);
-  rtx addr;
+  rtx addr, index;
   HOST_WIDE_INT i;
 
   valid_mult = sbitmap_alloc (2 * MAX_RATIO + 1);
   bitmap_clear (valid_mult);
-  addr = gen_rtx_fmt_ee (MULT, address_mode, reg1, NULL_RTX);
+  /* Construct address expression in the canonical form of
+"base+index*scale" and call memory_address_addr_space_p
+to see whether it is allowed by target processors.  */
+  index = gen_rtx_fmt_ee (MULT, address_mode, reg1, NULL_RTX);
   for (i = -MAX_RATIO; i <= MAX_RATIO; i++)
{
- XEXP (addr, 1) = gen_int_mode (i, address_mode);
+ if (i == 1)
+   continue;
+
+ XEXP (index, 1) = gen_int_mode (i, address_mode);
+ addr = gen_rtx_fmt_ee (PLUS, address_mode, index, reg1);
  if (memory_address_addr_space_p (mode, addr, as))
bitmap_set_bit (valid_mult, i + MAX_RATIO);
}


RE: [PATCH]Construct canonical scaled address expression in IVOPT

2013-09-23 Thread bin.cheng


> -Original Message-
> From: Richard Biener [mailto:richard.guent...@gmail.com]
> Sent: Monday, September 23, 2013 8:08 PM
> To: Bin Cheng
> Cc: GCC Patches
> Subject: Re: [PATCH]Construct canonical scaled address expression in IVOPT
> 
> On Fri, Sep 20, 2013 at 12:00 PM, bin.cheng  wrote:
> > Hi,
> > For now IVOPT constructs scaled address expression in the form of
> > "scaled*index" and checks whether backend supports it. The problem is
> > the address expression is invalid on ARM, causing scaled expression
> > disabled in IVOPT on ARM.  This patch fixes the IVOPT part by
> > constructing rtl address expression like "index*scaled+base".
> 
> -  addr = gen_rtx_fmt_ee (MULT, address_mode, reg1, NULL_RTX);
> +  /* Construct address expression in the canonical form of
> + "base+index*scale" and call memory_address_addr_space_p to see
> whether
> + it is allowed by target processors.  */
> +  index = gen_rtx_fmt_ee (MULT, address_mode, reg1, NULL_RTX);
>for (i = -MAX_RATIO; i <= MAX_RATIO; i++)
>   {
> -  XEXP (addr, 1) = gen_int_mode (i, address_mode);
> +  if (i == 1)
> +continue;
> +
> +  XEXP (index, 1) = gen_int_mode (i, address_mode);  addr =
> + gen_rtx_fmt_ee (PLUS, address_mode, index, reg1);
>if (memory_address_addr_space_p (mode, addr, as))
>  bitmap_set_bit (valid_mult, i + MAX_RATIO);
> 
> so you now build reg1*scale+reg1 - which checks if offset and scale work
at
> the same time (and with the same value - given this is really
reg1*(scale+1)).
> It might be that this was implicitely assumed (or that no targets allow
scale
> but no offset), but it's a change that requires audit of behavior on more
> targets.
So literally, function multiplier_allowed_in_address_p should check whether a
multiplier is allowed in any kind of addressing mode, like [reg*scale +
offset] and [reg*scale + reg].  Apparently it's infeasible to check every
possibility for each architecture; is it ok if we at least check "index", then
"addr" if "index" fails?  By "any kind of addressing mode", I mean the
modes supported by function get_address_cost, i.e., in the form "[base] +
[off] + [var] + (reg|reg*scale)".

> 
> The above also builds more RTX waste which you can fix by re-using the
PLUS
> by building it up-front similar to the multiplication.  You also miss the
Yes, this can be fixed.

> opportunity to have scale == 1 denote as to whether reg1 + reg2 is valid.
I
> would expect that many targets support reg1 * scale + constant-offset but
> not many reg1 * scale + reg2.
I thought scale==1 is unnecessary because the addressing mode degrades into
"reg" or "reg+reg".  Moreover, calls of multiplier_allowed_in_address_p in
both get_address_cost and get_computation_cost_at have scale other than 1.

> 
> So no, the helper now checks sth completely different.  What's the problem
> with arm supporting reg1 * scale?  Why shouldn't it being able to handle
the
> implicit zero offset?

As Richard clarified, ARM does not support scaled addressing mode without
base register.

Thanks.
bin





[PATCH]Fix computation of offset in ivopt

2013-09-24 Thread bin.cheng
Hi,
This patch fixes two minor bugs when computing the offset in IVOPT.
1) Consider the example below:
#define MAX 100
struct tag
{
  int i;
  int j;
};
struct tag arr[MAX];

int foo (int len)
{
  int i = 0;
  for (; i < len; i++)
  {
    arr[i].j = i;  /* access arr[i].j */
  }
  return 0;
}

Without this patch, the offset computed by strip_offset_1 for the address
arr[i].j is ZERO, which is apparently wrong.
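
Worked numbers for this example (assuming a 32-bit int):

/* For field j of struct tag:
     component_ref_field_offset ()  -> 0   (byte part, all strip_offset_1
                                            used to look at)
     DECL_FIELD_BIT_OFFSET          -> 32  (bit part)
   so the byte offset the patch now adds is 32 / BITS_PER_UNIT = 4,
   i.e. exactly the offset of arr[i].j within arr[i].  */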

2) Consider the example below:
//...
  :
  KeyIndex_66 = KeyIndex_194 + 4294967295;
  if (KeyIndex_66 != 0)
goto ;
  else
goto ;

  :

  :
  # KeyIndex_194 = PHI 
  _62 = KeyIndex_194 + 1073741823;
  _63 = _62 * 4;
  _64 = pretmp_184 + _63;
  _65 = *_64;
  if (_65 == 0)
goto ;
  else
goto ;
//...

There are iv use and candidate like:

use 1
  address
  in statement _65 = *_64;

  at position *_64
  type handletype *
  base pretmp_184 + ((sizetype) KeyIndex_180 + 1073741823) * 4
  step 4294967292
  base object (void *) pretmp_184
  related candidates 

candidate 6
  var_before ivtmp.16
  var_after ivtmp.16
  incremented before use 1
  type unsigned int
  base (unsigned int) (pretmp_184 + (sizetype) KeyIndex_180 * 4)
  step 4294967292
  base object (void *) pretmp_184
Candidate 6 is related to use 1

In function get_computation_cost_at for use 1 using cand 6, ubase and cbase
are:
pretmp_184 + ((sizetype) KeyIndex_180 + 1073741823) * 4
pretmp_184 + (sizetype) KeyIndex_180 * 4

The cstepi computed in HOST_WIDE_INT is :  0xfffc, while offset
computed in TYPE(utype) is : 0xfffc.  Though they both stand for value
"-4" in different precision, statement "offset -= ratio * cstepi" returns
0x1, which is wrong.
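
Worked numbers, assuming a 64-bit HOST_WIDE_INT, a 32-bit utype and
ratio == 1 (the hexadecimal values above look truncated by the archive):

void
example (void)
{
  long long offset = 0xfffffffcLL; /* -4 held in 32-bit utype, zero-extended */
  long long cstepi = -4LL;         /* -4 as HOST_WIDE_INT                    */
  long long ratio  = 1;

  offset -= ratio * cstepi;        /* 0x100000000 instead of the expected 0  */

  /* With the patch, offset is sign-extended to utype's precision first
     (the new sext_hwi call), i.e. becomes -4, and the result is 0.  */
}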

Tested on x86_64 and arm.  Is it OK?

Thanks.
bin


2013-09-24  Bin Cheng  

* tree-ssa-loop-ivopts.c (strip_offset_1): Count
DECL_FIELD_BIT_OFFSET when computing offset for COMPONENT_REF.
(get_computation_cost_at): Sign extend offset.

Index: gcc/tree-ssa-loop-ivopts.c
===
--- gcc/tree-ssa-loop-ivopts.c  (revision 200774)
+++ gcc/tree-ssa-loop-ivopts.c  (working copy)
@@ -2133,19 +2133,28 @@ strip_offset_1 (tree expr, bool inside_addr, bool
   break;
 
 case COMPONENT_REF:
-  if (!inside_addr)
-   return orig_expr;
+  {
+   tree field;
+   HOST_WIDE_INT boffset = 0;
+   if (!inside_addr)
+ return orig_expr;
 
-  tmp = component_ref_field_offset (expr);
-  if (top_compref
- && cst_and_fits_in_hwi (tmp))
-   {
- /* Strip the component reference completely.  */
- op0 = TREE_OPERAND (expr, 0);
- op0 = strip_offset_1 (op0, inside_addr, top_compref, &off0);
- *offset = off0 + int_cst_value (tmp);
- return op0;
-   }
+   field = TREE_OPERAND (expr, 1);
+   if (DECL_FIELD_BIT_OFFSET (field)
+   && cst_and_fits_in_hwi (DECL_FIELD_BIT_OFFSET (field)))
+ boffset = int_cst_value (DECL_FIELD_BIT_OFFSET (field));
+
+   tmp = component_ref_field_offset (expr);
+   if (top_compref
+   && cst_and_fits_in_hwi (tmp))
+ {
+   /* Strip the component reference completely.  */
+   op0 = TREE_OPERAND (expr, 0);
+   op0 = strip_offset_1 (op0, inside_addr, top_compref, &off0);
+   *offset = off0 + int_cst_value (tmp) + boffset / BITS_PER_UNIT;
+   return op0;
+ }
+  }
   break;
 
 case ADDR_EXPR:
@@ -4133,6 +4142,9 @@ get_computation_cost_at (struct ivopts_data *data,
 bitmap_clear (*depends_on);
 }
 
+  /* Sign-extend offset if utype has lower precision than HOST_WIDE_INT.  */
+  offset = sext_hwi (offset, TYPE_PRECISION (utype));
+
   /* If we are after the increment, the value of the candidate is higher by
  one iteration.  */
   if (stmt_is_after_inc)


Re: [PATCH]Construct canonical scaled address expression in IVOPT

2013-09-24 Thread Bin.Cheng
On Tue, Sep 24, 2013 at 6:12 PM, Richard Biener
 wrote:
> On Tue, Sep 24, 2013 at 8:20 AM, bin.cheng  wrote:
>>
>>
>>> -Original Message-
>
> Or even [reg*scale] (not sure about that).  But yes, at least reg*scale + 
> offset
> and reg*scale + reg.
>
>>   Apparently it's infeasible to check every
>> possibility for each architecture, is it ok we at least check "index", then
>> "addr" if "index" is failed?  By "any kind of addressing modes", I mean
>> modes supported by function get_address_cost,  i.e., in form of "[base] +
>> [off] + [var] + (reg|reg*scale)".
>
> I suppose so, but it of course depends on what IVOPTs uses the answer
> for in the end.  Appearantly it doesn't distinguish between the various cases
> even though TARGET_MEM_REF does support all the variants in question
> (reg * scale, reg * scale + reg, reg * scale + const, reg * scale +
> reg + const).
>
> So the better answer may be to teach the costs about the differences?
Ideally, IVOPT should be aware of whether scaling is allowed in each
kind of addressing mode and account for the cost of the multiplier accordingly.
For the current code, there are two scenarios here:
1) If the target supports reg*scale+reg, but not reg*scale, IVOPT
considers the multiplier as not allowed in any addressing mode and
accounts for the multiplier with a high cost.  This is the problem arm is having.
2) If the target supports reg*scale, but not some other kind of addressing mode
(say reg*scale+reg), IVOPT still constructs the various
scaled addressing modes in get_address_cost and depends on address_cost
to compute the correct cost for each addressing expression.  I think this
happens to work even though IVOPT doesn't know "reg*scale+reg" is actually
not supported.

>
>>> The above also builds more RTX waste which you can fix by re-using the
>> PLUS
>>> by building it up-front similar to the multiplication.  You also miss the
>> Yes, this can be fixed.
>>
>>> opportunity to have scale == 1 denote as to whether reg1 + reg2 is valid.
>> I
>>> would expect that many targets support reg1 * scale + constant-offset but
>>> not many reg1 * scale + reg2.
>> I thought scale==1 is unnecessary because the addressing mode degrades into
>> "reg" or "reg+reg".  Moreover, calls of multiplier_allowed_in_address_p in
>> both get_address_cost and get_computation_cost_at have scale other than 1.
>
> Ok.
>
>>>
>>> So no, the helper now checks sth completely different.  What's the problem
>>> with arm supporting reg1 * scale?  Why shouldn't it being able to handle
>> the
>>> implicit zero offset?
>>
>> As Richard clarified, ARM does not support scaled addressing mode without
>> base register.
>
> I see.
>
Also from the newer comments:

> Btw, it should be reasonably possible to compute the whole
> multiplier_allowed_in_address_p table for all primary and secondary archs
> (simply build cross-cc1) and compare the results before / after a patch
> candidate.  Querying both reg * scale and reg + reg * scale if the first
> fails sounds like a good solution to me.
I take this to mean we should make a minimal change by checking reg + reg *
scale if reg * scale fails, right?
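A minimal sketch of that fallback query (the surrounding bookkeeping and
variable names are assumptions; the helpers are the ones
multiplier_allowed_in_address_p already uses):

  rtx reg1 = gen_raw_REG (address_mode, LAST_VIRTUAL_REGISTER + 1);
  rtx reg2 = gen_raw_REG (address_mode, LAST_VIRTUAL_REGISTER + 2);
  rtx scaled = gen_rtx_fmt_ee (MULT, address_mode, reg1,
                               gen_int_mode (ratio, address_mode));
  bool allowed = memory_address_addr_space_p (mode, scaled, as);
  if (!allowed)
    {
      rtx addr = gen_rtx_fmt_ee (PLUS, address_mode, scaled, reg2);
      allowed = memory_address_addr_space_p (mode, addr, as);
    }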

Thanks.
bin
-- 
Best Regards.


RE: [PATCH]Construct canonical scaled address expression in IVOPT

2013-09-26 Thread bin.cheng


> -Original Message-
> From: Richard Biener [mailto:richard.guent...@gmail.com]
> Sent: Tuesday, September 24, 2013 7:58 PM
> To: Bin.Cheng
> Cc: Bin Cheng; GCC Patches; Richard Earnshaw
> Subject: Re: [PATCH]Construct canonical scaled address expression in IVOPT
> 
> On Tue, Sep 24, 2013 at 1:40 PM, Bin.Cheng 
> wrote:
> > On Tue, Sep 24, 2013 at 6:12 PM, Richard Biener
> >  wrote:
> >> On Tue, Sep 24, 2013 at 8:20 AM, bin.cheng  wrote:
> >>>
> >>>
> >>>> -Original Message-
> >>
> >> Or even [reg*scale] (not sure about that).  But yes, at least
> >> reg*scale + offset and reg*scale + reg.
> >>
> >>>   Apparently it's infeasible to check every possibility for each
> >>> architecture, is it ok we at least check "index", then "addr" if
> >>> "index" is failed?  By "any kind of addressing modes", I mean modes
> >>> supported by function get_address_cost,  i.e., in form of "[base] +
> >>> [off] + [var] + (reg|reg*scale)".
> >>
> >> I suppose so, but it of course depends on what IVOPTs uses the answer
> >> for in the end.  Appearantly it doesn't distinguish between the
> >> various cases even though TARGET_MEM_REF does support all the
> >> variants in question (reg * scale, reg * scale + reg, reg * scale +
> >> const, reg * scale + reg + const).
> >>
> >> So the better answer may be to teach the costs about the differences?
> > Ideally, IVOPT should be aware whether scaling is allowed in every
> > kind of addressing modes and account cost of multiplier accordingly.
> > For current code, there are two scenarios here
> > 1) If target supports reg*scale+reg, but not reg*scale, in this case,
> > IVOPT considers multiplier is not allowed in any addressing mode and
> > account multiplier with high cost.  This is the problem arm having.
> > 2) If target supports reg*scale, but not some kind of addressing mode
> > (saying reg*scale+reg), in this case, IVOPT still constructs various
> > scaled addressing mode in get_address_cost and depends on address_cost
> > to compute correct cost for that addressing expression.  I think this
> > happens to work even IVOPT doesn't know "reg*scale+reg" is actually
> > not supported.
> >
> >>
> >>>> The above also builds more RTX waste which you can fix by re-using
> >>>> the
> >>> PLUS
> >>>> by building it up-front similar to the multiplication.  You also
> >>>> miss the
> >>> Yes, this can be fixed.
> >>>
> >>>> opportunity to have scale == 1 denote as to whether reg1 + reg2 is
valid.
> >>> I
> >>>> would expect that many targets support reg1 * scale +
> >>>> constant-offset but not many reg1 * scale + reg2.
> >>> I thought scale==1 is unnecessary because the addressing mode
> >>> degrades into "reg" or "reg+reg".  Moreover, calls of
> >>> multiplier_allowed_in_address_p in both get_address_cost and
> get_computation_cost_at have scale other than 1.
> >>
> >> Ok.
> >>
> >>>>
> >>>> So no, the helper now checks sth completely different.  What's the
> >>>> problem with arm supporting reg1 * scale?  Why shouldn't it being
> >>>> able to handle
> >>> the
> >>>> implicit zero offset?
> >>>
> >>> As Richard clarified, ARM does not support scaled addressing mode
> >>> without base register.
> >>
> >> I see.
> >>
> > Also from the newer comments:
> >
> >> Btw, it should be reasonably possible to compute the whole
> >> multiplier_allowed_in_address_p table for all primary and secondary
> >> archs (simply build cross-cc1) and compare the results before / after
> >> a patch candidate.  Querying both reg * scale and reg + reg * scale
> >> if the first fails sounds like a good solution to me.
> > I take this as we should do minimal change by checking reg + reg *
> > scale if reg * scale is failed,  right?
> 
> Yes, you can share a single RTL expression for all this and I think
querying reg
> + reg * scale first makes sense (then fallback to reg * scale for
compatibility).
> 
I updated the patch as discussed.  Please review.
It bootstraps and passes regression tests on x86/x86_64, but fails
tree-ssa/scev-4.c on arm cortex-a15.
Hi Richard, I investigated the failure and found out it reveals two ot

RE: [PATCH]Construct canonical scaled address expression in IVOPT

2013-09-26 Thread bin.cheng
Sorry for missing the patch.

Thanks.
bin

> -Original Message-
> From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-
> ow...@gcc.gnu.org] On Behalf Of bin.cheng
> Sent: Thursday, September 26, 2013 8:05 PM
> To: 'Richard Biener'; Bin.Cheng
> Cc: GCC Patches; Richard Earnshaw
> Subject: RE: [PATCH]Construct canonical scaled address expression in IVOPT
> 
> 
> 
> > -Original Message-
> > From: Richard Biener [mailto:richard.guent...@gmail.com]
> > Sent: Tuesday, September 24, 2013 7:58 PM
> > To: Bin.Cheng
> > Cc: Bin Cheng; GCC Patches; Richard Earnshaw
> > Subject: Re: [PATCH]Construct canonical scaled address expression in
> > IVOPT
> >
> > On Tue, Sep 24, 2013 at 1:40 PM, Bin.Cheng 
> > wrote:
> > > On Tue, Sep 24, 2013 at 6:12 PM, Richard Biener
> > >  wrote:
> > >> On Tue, Sep 24, 2013 at 8:20 AM, bin.cheng 
> wrote:
> > >>>
> > >>>
> > >>>> -Original Message-
> > >>
> > >> Or even [reg*scale] (not sure about that).  But yes, at least
> > >> reg*scale + offset and reg*scale + reg.
> > >>
> > >>>   Apparently it's infeasible to check every possibility for each
> > >>> architecture, is it ok we at least check "index", then "addr" if
> > >>> "index" is failed?  By "any kind of addressing modes", I mean
> > >>> modes supported by function get_address_cost,  i.e., in form of
> > >>> "[base] + [off] + [var] + (reg|reg*scale)".
> > >>
> > >> I suppose so, but it of course depends on what IVOPTs uses the
> > >> answer for in the end.  Appearantly it doesn't distinguish between
> > >> the various cases even though TARGET_MEM_REF does support all the
> > >> variants in question (reg * scale, reg * scale + reg, reg * scale +
> > >> const, reg * scale + reg + const).
> > >>
> > >> So the better answer may be to teach the costs about the differences?
> > > Ideally, IVOPT should be aware whether scaling is allowed in every
> > > kind of addressing modes and account cost of multiplier accordingly.
> > > For current code, there are two scenarios here
> > > 1) If target supports reg*scale+reg, but not reg*scale, in this
> > > case, IVOPT considers multiplier is not allowed in any addressing
> > > mode and account multiplier with high cost.  This is the problem arm
> having.
> > > 2) If target supports reg*scale, but not some kind of addressing
> > > mode (saying reg*scale+reg), in this case, IVOPT still constructs
> > > various scaled addressing mode in get_address_cost and depends on
> > > address_cost to compute correct cost for that addressing expression.
> > > I think this happens to work even IVOPT doesn't know "reg*scale+reg"
> > > is actually not supported.
> > >
> > >>
> > >>>> The above also builds more RTX waste which you can fix by
> > >>>> re-using the
> > >>> PLUS
> > >>>> by building it up-front similar to the multiplication.  You also
> > >>>> miss the
> > >>> Yes, this can be fixed.
> > >>>
> > >>>> opportunity to have scale == 1 denote as to whether reg1 + reg2
> > >>>> is
> valid.
> > >>> I
> > >>>> would expect that many targets support reg1 * scale +
> > >>>> constant-offset but not many reg1 * scale + reg2.
> > >>> I thought scale==1 is unnecessary because the addressing mode
> > >>> degrades into "reg" or "reg+reg".  Moreover, calls of
> > >>> multiplier_allowed_in_address_p in both get_address_cost and
> > get_computation_cost_at have scale other than 1.
> > >>
> > >> Ok.
> > >>
> > >>>>
> > >>>> So no, the helper now checks sth completely different.  What's
> > >>>> the problem with arm supporting reg1 * scale?  Why shouldn't it
> > >>>> being able to handle
> > >>> the
> > >>>> implicit zero offset?
> > >>>
> > >>> As Richard clarified, ARM does not support scaled addressing mode
> > >>> without base register.
> > >>
> > >> I see.
> > >>
> > > Also from the newer comments:
> > >
> > >> Btw, it should be reasonably possible to compute t

RE: [PATCH]Fix computation of offset in ivopt

2013-09-26 Thread bin.cheng


> -Original Message-
> From: Richard Biener [mailto:richard.guent...@gmail.com]
> Sent: Tuesday, September 24, 2013 6:31 PM
> To: Bin Cheng
> Cc: GCC Patches
> Subject: Re: [PATCH]Fix computation of offset in ivopt
> 
> On Tue, Sep 24, 2013 at 11:13 AM, bin.cheng  wrote:
> 
> +   field = TREE_OPERAND (expr, 1);
> +   if (DECL_FIELD_BIT_OFFSET (field)
> +   && cst_and_fits_in_hwi (DECL_FIELD_BIT_OFFSET (field)))
> + boffset = int_cst_value (DECL_FIELD_BIT_OFFSET (field));
> +
> +   tmp = component_ref_field_offset (expr);
> +   if (top_compref
> +   && cst_and_fits_in_hwi (tmp))
> + {
> +   /* Strip the component reference completely.  */
> +   op0 = TREE_OPERAND (expr, 0);
> +   op0 = strip_offset_1 (op0, inside_addr, top_compref, &off0);
> +   *offset = off0 + int_cst_value (tmp) + boffset /
BITS_PER_UNIT;
> +   return op0;
> + }
> 
> the failure paths seem mangled, that is, if cst_and_fits_in_hwi is false
for
> either offset part you may end up doing half accounting and not stripping.
> 
> Btw, DECL_FIELD_BIT_OFFSET is always non-NULL.  I suggest to rewrite to
> 
>  if (!inside_addr)
>return orig_expr;
> 
>  tmp = component_ref_field_offset (expr);
>  field = TREE_OPERAND (expr, 1);
>  if (top_compref
>  && cst_and_fits_in_hwi (tmp)
>  && cst_and_fits_in_hwi (DECL_FIELD_BIT_OFFSET (field)))
> {
>   ...
> }
Will be refined.

> 
> note that this doesn't really handle overflows correctly as
> 
> +   *offset = off0 + int_cst_value (tmp) + boffset /
> + BITS_PER_UNIT;
> 
> may still overflow.
Since it's "unsigned + signed + signed", according to implicit conversion,
the signed operand will be converted to unsigned, so the overflow would only
happen when off0 is huge number and tmp/boffset is large positive number,
right?  Do I need to check whether off0 is larger than the overflowed
result?  Also there is signed->unsigned problem here, see below.

> 
> @@ -4133,6 +4142,9 @@ get_computation_cost_at (struct ivopts_data *data,
>  bitmap_clear (*depends_on);
>  }
> 
> +  /* Sign-extend offset if utype has lower precision than
> + HOST_WIDE_INT.  */  offset = sext_hwi (offset, TYPE_PRECISION
> + (utype));
> +
> 
> offset is computed elsewhere in difference_cost and the bug to me seems
> that it is unsigned.  sign-extending it here is odd at least (and the
extension
> should probably happen at sizetype precision, not that of utype).
I agree.  The root cause is in strip_offset_1, in which offset is computed.
Every time offset is computed in this function with a signed operand (like
"int_cst_value (tmp)" above), we need to take care of the possible negative
number problem.  Taking this case as an example, we need to make the change below:

  case INTEGER_CST:
  //...
  *offset = int_cst_value (expr);
change to
  case INTEGER_CST:
  //...
  *offset = sext_hwi (int_cst_value (expr), type);

and
  case MULT_EXPR:
  //...
  *offset = off0 * int_cst_value (op1);
to
  case MULT_EXPR:
  //...
  HOST_WIDE_INT xxx = (HOST_WIDE_INT) off0 * int_cst_value (op1);
  *offset = sext_hwi (xxx, type);

Any comments?

Thanks.
bin





RE: [PATCH]Fix computation of offset in ivopt

2013-09-27 Thread bin.cheng


> -Original Message-
> From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-
> ow...@gcc.gnu.org] On Behalf Of bin.cheng
> Sent: Friday, September 27, 2013 1:07 PM
> To: 'Richard Biener'
> Cc: GCC Patches
> Subject: RE: [PATCH]Fix computation of offset in ivopt
> 
> 
> 
> > -Original Message-
> > From: Richard Biener [mailto:richard.guent...@gmail.com]
> > Sent: Tuesday, September 24, 2013 6:31 PM
> > To: Bin Cheng
> > Cc: GCC Patches
> > Subject: Re: [PATCH]Fix computation of offset in ivopt
> >
> > On Tue, Sep 24, 2013 at 11:13 AM, bin.cheng  wrote:
> >
> > +   field = TREE_OPERAND (expr, 1);
> > +   if (DECL_FIELD_BIT_OFFSET (field)
> > +   && cst_and_fits_in_hwi (DECL_FIELD_BIT_OFFSET (field)))
> > + boffset = int_cst_value (DECL_FIELD_BIT_OFFSET (field));
> > +
> > +   tmp = component_ref_field_offset (expr);
> > +   if (top_compref
> > +   && cst_and_fits_in_hwi (tmp))
> > + {
> > +   /* Strip the component reference completely.  */
> > +   op0 = TREE_OPERAND (expr, 0);
> > +   op0 = strip_offset_1 (op0, inside_addr, top_compref, &off0);
> > +   *offset = off0 + int_cst_value (tmp) + boffset /
> BITS_PER_UNIT;
> > +   return op0;
> > + }
> >
> > the failure paths seem mangled, that is, if cst_and_fits_in_hwi is
> > false
> for
> > either offset part you may end up doing half accounting and not
stripping.
> >
> > Btw, DECL_FIELD_BIT_OFFSET is always non-NULL.  I suggest to rewrite
> > to
> >
> >  if (!inside_addr)
> >return orig_expr;
> >
> >  tmp = component_ref_field_offset (expr);
> >  field = TREE_OPERAND (expr, 1);
> >  if (top_compref
> >  && cst_and_fits_in_hwi (tmp)
> >  && cst_and_fits_in_hwi (DECL_FIELD_BIT_OFFSET (field)))
> > {
> >   ...
> > }
> Will be refined.
> 
> >
> > note that this doesn't really handle overflows correctly as
> >
> > +   *offset = off0 + int_cst_value (tmp) + boffset /
> > + BITS_PER_UNIT;
> >
> > may still overflow.
> Since it's "unsigned + signed + signed", according to implicit conversion,
the
> signed operand will be converted to unsigned, so the overflow would only
> happen when off0 is huge number and tmp/boffset is large positive number,
> right?  Do I need to check whether off0 is larger than the overflowed
result?
> Also there is signed->unsigned problem here, see below.
> 
> >
> > @@ -4133,6 +4142,9 @@ get_computation_cost_at (struct ivopts_data
> *data,
> >  bitmap_clear (*depends_on);
> >  }
> >
> > +  /* Sign-extend offset if utype has lower precision than
> > + HOST_WIDE_INT.  */  offset = sext_hwi (offset, TYPE_PRECISION
> > + (utype));
> > +
> >
> > offset is computed elsewhere in difference_cost and the bug to me
> > seems that it is unsigned.  sign-extending it here is odd at least
> > (and the
> extension
> > should probably happen at sizetype precision, not that of utype).
> I agree, The root cause is in split_offset_1, in which offset is computed.
> Every time offset is computed in this function with a signed operand (like
> "int_cst_value (tmp)" above), we need to take care the possible negative
> number problem.   Take this case as an example, we need to do below
> change:
> 
>   case INTEGER_CST:
>   //...
>   *offset = int_cst_value (expr);
> change to
>   case INTEGER_CST:
>   //...
>   *offset = sext_hwi (int_cst_value (expr), type);
> 
> and
>   case MULT_EXPR:
>   //...
>   *offset = sext_hwi (int_cst_value (expr), type); to
>   case MULT_EXPR:
>   //...
>  HOST_WIDE_INT xxx = (HOST_WIDE_INT)off0 * int_cst_value (op1);
>   *offset = sext_hwi (xxx, type);
> 
> Any comments?

Having thought about it twice, I guess we can compute a signed offset in
strip_offset_1 and sign extend it in strip_offset; thus we don't need to change
every computation of offset in that function.

Thanks.
bin





RE: [PATCH]Fix computation of offset in ivopt

2013-09-29 Thread bin.cheng


> -Original Message-
> From: Richard Biener [mailto:richard.guent...@gmail.com]
> Sent: Friday, September 27, 2013 4:30 PM
> To: Bin Cheng
> Cc: GCC Patches
> Subject: Re: [PATCH]Fix computation of offset in ivopt
> 
> On Fri, Sep 27, 2013 at 7:07 AM, bin.cheng  wrote:
> >
> >
> >   case INTEGER_CST:
> >   //...
> >   *offset = int_cst_value (expr);
> > change to
> >   case INTEGER_CST:
> >   //...
> >   *offset = sext_hwi (int_cst_value (expr), type);
> >
> > and
> >   case MULT_EXPR:
> >   //...
> >   *offset = sext_hwi (int_cst_value (expr), type); to
> >   case MULT_EXPR:
> >   //...
> >  HOST_WIDE_INT xxx = (HOST_WIDE_INT)off0 * int_cst_value (op1);
> >   *offset = sext_hwi (xxx, type);
> >
> > Any comments?
> 
> The issue is of course that we end up converting offsets to sizetype at
some
> point which makes them all appear unsigned.  The fix for this is to simply
> interpret them as signed ... but it's really a mess ;)
> 

Hi, this is the updated patch, which calculates a signed offset in strip_offset_1
and then sign extends it in strip_offset.

Bootstrap and test on x86_64/x86/arm. Is it OK?

Thanks.
bin

2013-09-30  Bin Cheng  

* tree-ssa-loop-ivopts.c (strip_offset_1): Change parameter type.
Count DECL_FIELD_BIT_OFFSET when computing offset for COMPONENT_REF.
(strip_offset): Sign extend before return.

Index: gcc/tree-ssa-loop-ivopts.c
===
--- gcc/tree-ssa-loop-ivopts.c  (revision 202599)
+++ gcc/tree-ssa-loop-ivopts.c  (working copy)
@@ -2037,12 +2037,12 @@ find_interesting_uses (struct ivopts_data *data)
 
 static tree
 strip_offset_1 (tree expr, bool inside_addr, bool top_compref,
-   unsigned HOST_WIDE_INT *offset)
+   HOST_WIDE_INT *offset)
 {
   tree op0 = NULL_TREE, op1 = NULL_TREE, tmp, step;
   enum tree_code code;
   tree type, orig_type = TREE_TYPE (expr);
-  unsigned HOST_WIDE_INT off0, off1, st;
+  HOST_WIDE_INT off0, off1, st;
   tree orig_expr = expr;
 
   STRIP_NOPS (expr);
@@ -2133,19 +2133,26 @@ strip_offset_1 (tree expr, bool inside_addr, bool
   break;
 
 case COMPONENT_REF:
-  if (!inside_addr)
-   return orig_expr;
+  {
+   tree field;
+   HOST_WIDE_INT boffset = 0;
+   if (!inside_addr)
+ return orig_expr;
 
-  tmp = component_ref_field_offset (expr);
-  if (top_compref
- && cst_and_fits_in_hwi (tmp))
-   {
- /* Strip the component reference completely.  */
- op0 = TREE_OPERAND (expr, 0);
- op0 = strip_offset_1 (op0, inside_addr, top_compref, &off0);
- *offset = off0 + int_cst_value (tmp);
- return op0;
-   }
+   tmp = component_ref_field_offset (expr);
+   field = TREE_OPERAND (expr, 1);
+   if (top_compref
+   && cst_and_fits_in_hwi (tmp)
+   && cst_and_fits_in_hwi (DECL_FIELD_BIT_OFFSET (field)))
+ {
+   /* Strip the component reference completely.  */
+   op0 = TREE_OPERAND (expr, 0);
+   op0 = strip_offset_1 (op0, inside_addr, top_compref, &off0);
+   boffset = int_cst_value (DECL_FIELD_BIT_OFFSET (field));
+   *offset = off0 + int_cst_value (tmp) + boffset / BITS_PER_UNIT;
+   return op0;
+ }
+  }
   break;
 
 case ADDR_EXPR:
@@ -2196,7 +2203,16 @@ strip_offset_1 (tree expr, bool inside_addr, bool
 static tree
 strip_offset (tree expr, unsigned HOST_WIDE_INT *offset)
 {
-  return strip_offset_1 (expr, false, false, offset);
+  HOST_WIDE_INT off;
+  tree core = strip_offset_1 (expr, false, false, &off);
+
+  /* Sign extend off if expr is in type which has lower precision
+ than HOST_WIDE_INT.  */
+  if (TYPE_PRECISION (TREE_TYPE (expr)) <= HOST_BITS_PER_WIDE_INT)
+off = sext_hwi (off, TYPE_PRECISION (TREE_TYPE (expr)));
+
+  *offset = off;
+  return core;
 }
 
 /* Returns variant of TYPE that can be used as base for different uses.


RE: [PATCH]Fix computation of offset in ivopt

2013-09-29 Thread bin.cheng


> -Original Message-
> From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-
> ow...@gcc.gnu.org] On Behalf Of Oleg Endo
> Sent: Wednesday, September 25, 2013 1:41 AM
> To: Richard Biener
> Cc: Bin Cheng; GCC Patches
> Subject: Re: [PATCH]Fix computation of offset in ivopt
> 
> >
> 
> After reading "overflow" and "ivopt", I was wondering whether
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55190 is somehow related.

No, this patch is unrelated.  But I do have some thoughts on PR55190 and
will follow up in the bug entry.

Thanks.
bin





Re: [PATCH]Fix computation of offset in ivopt

2013-10-01 Thread Bin.Cheng
On Tue, Oct 1, 2013 at 6:50 PM, Richard Biener
 wrote:
> On Mon, Sep 30, 2013 at 7:39 AM, bin.cheng  wrote:
>>
>>
>
> I don't think you need
>
> +  /* Sign extend off if expr is in type which has lower precision
> + than HOST_WIDE_INT.  */
> +  if (TYPE_PRECISION (TREE_TYPE (expr)) <= HOST_BITS_PER_WIDE_INT)
> +off = sext_hwi (off, TYPE_PRECISION (TREE_TYPE (expr)));
>
> at least it would be suspicious if you did ...
There is a problem in the example from the first message.  The iv base is like:
pretmp_184 + ((sizetype) KeyIndex_180 + 1073741823) * 4
I am not sure why, but it seems (-4/0xFFFC) is represented by
(1073741823*4).  For each operand strip_offset_1 returns exactly the
positive number, and the result of the multiplication never gets a chance to be
sign extended.  That's why the sign extension is necessary to fix the
problem.  Or it could be fixed elsewhere by representing the iv base as
"pretmp_184 + ((sizetype) KeyIndex_180 + 4294967295) * 4" in the first place.

>
> The only case that I can think of points to a bug in strip_offset_1
> again, namely if sizetype (the type of all offsets) is smaller than
> a HOST_WIDE_INT in which case
>
> +boffset = int_cst_value (DECL_FIELD_BIT_OFFSET (field));
> +*offset = off0 + int_cst_value (tmp) + boffset / BITS_PER_UNIT;
>
> is wrong as boffset / BITS_PER_UNIT does not do a signed division
> then (for negative boffset which AFAIK does not happen - but it would
> be technically allowed).  Thus, the predicates like
>
> +&& cst_and_fits_in_hwi (tmp)
>
> would need to be amended with a check that the MSB is not set.
So I can handle it like:

+abs_boffset = abs_hwi (boffset);
+x = abs_boffset / BITS_PER_UNIT;
+if (boffset < 0)
+  x = -x;
+*offset = off0 + int_cst_value (tmp) + x;

Right?

>
> Btw, the cst_and_fits_in_hwi implementation is odd:
>
> bool
> cst_and_fits_in_hwi (const_tree x)
> {
>   if (TREE_CODE (x) != INTEGER_CST)
> return false;
>
>   if (TYPE_PRECISION (TREE_TYPE (x)) > HOST_BITS_PER_WIDE_INT)
> return false;
>
>   return (TREE_INT_CST_HIGH (x) == 0
>   || TREE_INT_CST_HIGH (x) == -1);
> }
>
> the precision check seems totally pointless and I wonder what's
> the point of this routine as there is host_integerp () already
> and tree_low_cst instead of int_cst_value - oh, I see, the latter
> forcefully sign-extends  that should make the extension
> not necessary.
See above.

Thanks.
bin


Re: Fix twolf -funroll-loops -O3 miscompilation (a semi-latent web.c bug)

2012-10-18 Thread Bin.Cheng
On Wed, Oct 17, 2012 at 6:54 AM, Steven Bosscher  wrote:
> On Tue, Oct 16, 2012 at 1:38 PM, Paolo Bonzini wrote:
 Bottom line: This is a bug in df_kill_notes.
>>
>> Yep, and it could cause wrong code generation, couldn't it?  Because the
>> new (reg:DI 72) is obviously not equal to (plus:DI (reg:DI 72)
>> (const_int 24)), but it could be propagated to subsequent insns.
>
> Right.
>
>
>> So I think this patch should be backported to all release branches after
>> regstrapping.
>
> Agreed, but let's wait a few days at least to see what happens with
> the patch on the trunk.
>
>
>> Ok after bootstrap, regtest and checking that it actually fixes the bug.
>
> I made sure of that before posting the patch ;-)
>
> Bootstrap & regtest is running on x86_64-unknown-linux-gnu and
> powerpc64-unknown-linux-gnu. I'll commit the patch tomorrow if there
> are no test suite regressions.
>
> Well done to us all for probing until we found the root cause of this bug!
>

Hi Jan,

For the test case:

main()
{
  configure2();
}

Please make main return ZERO by default.

When cross testing with qemu/simulator, the exit wrapper checks whether
the test case passes by verifying its return value.
For ARM targets, R0 is checked, while it may be clobbered (thus non-zero)
if main does not return any value, which causes the test case to fail.
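A minimal sketch of the adjusted test case (the explicit return is the only
intended change; configure2 is assumed to be defined elsewhere in the test):

extern void configure2 (void);

int main (void)
{
  configure2 ();
  return 0;  /* explicit zero so the qemu/simulator exit wrapper sees R0 == 0 */
}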

Thanks very much.

-- 
Best Regards.


Re: [patch] Unify bitmap interface.

2012-10-29 Thread Bin.Cheng
On Tue, Oct 30, 2012 at 8:23 AM, Lawrence Crowl  wrote:
> On 10/29/12, Diego Novillo  wrote:
>> On Oct 29, 2012 Diego Novillo  wrote:
>> > Just to make sure.  Testing on ppc should be fast, for example.
>>
>> And useless.  Your patch does not touch ppc.
>
> I've fixed the #if 0 and the remaining suggestions will happen in
> another patch.  I've committed this one.
>
> ===
>
> This patch implements the unification of the *bitmap interfaces as discussed.
> Essentially, we rename ebitmap and sbitmap functions to use the same names
> as the bitmap functions.  This rename works because we can now overload
> on the bitmap type.  Some macros now become inline functions to enable
> that overloading.
>
> The sbitmap non-bool returning bitwise operations have been merged with
> the bool versions.  Sometimes this merge involved modifying the non-bool
> version to compute the bool value, and sometimes modifying bool version to
> add additional work from the non-bool version.  The redundant routines have
> been removed.
>
> The allocation functions have not been renamed, because we often do not
> have an argument on which to overload.  The cardinality functions have not
> been renamed, because they have different parameters, and are thus not
> interchangable.  The iteration functions have not been renamed, because
> they are functionally different.
>
> Tested on x86_64, contrib/config-list.mk testing passed.
> Index: gcc/ChangeLog
>

Just one question: should we change the names of the functions
"sbitmap_intersection_of_succs/sbitmap_intersection_of_preds/sbitmap_union_of_succs/sbitmap_union_of_preds"
too?  It might be a little confusing that sbitmap_* is used among
bitmap_*.

-- 
Best Regards.


Re: [PATCH]Fix computation of offset in ivopt

2013-10-02 Thread Bin.Cheng
Sorry that I don't understand the tree type system well, so here are two
more questions; could you please explain a little bit more?  Thanks very
much.

On Tue, Oct 1, 2013 at 6:50 PM, Richard Biener
 wrote:
> On Mon, Sep 30, 2013 at 7:39 AM, bin.cheng  wrote:
>
> I don't think you need
>
> +  /* Sign extend off if expr is in type which has lower precision
> + than HOST_WIDE_INT.  */
> +  if (TYPE_PRECISION (TREE_TYPE (expr)) <= HOST_BITS_PER_WIDE_INT)
> +off = sext_hwi (off, TYPE_PRECISION (TREE_TYPE (expr)));
Is it possible for the type of expr to be smaller than the type of the embedded
offset (i.e., sizetype)?  Should the sign extension be against sizetype,
or is it safe with TREE_TYPE (expr)?

>
> at least it would be suspicious if you did ...
>
> The only case that I can think of points to a bug in strip_offset_1
> again, namely if sizetype (the type of all offsets) is smaller than
> a HOST_WIDE_INT in which case
>
> +boffset = int_cst_value (DECL_FIELD_BIT_OFFSET (field));
> +*offset = off0 + int_cst_value (tmp) + boffset / BITS_PER_UNIT;
>
> is wrong as boffset / BITS_PER_UNIT does not do a signed division
> then (for negative boffset which AFAIK does not happen - but it would
> be technically allowed).  Thus, the predicates like
>
> +&& cst_and_fits_in_hwi (tmp)
>
> would need to be amended with a check that the MSB is not set.
I am confused, why "boffset / BITS_PER_UNIT" won't do signed division
and how it relates to "sizetype smaller than HOST_WIDE_INT"?  If
sizetype is smaller, won't int_cst_value sign extends it into
HOST_WIDE_INT?

Thanks.
bin

-- 
Best Regards.


Re: [PATCH]Fix computation of offset in ivopt

2013-10-07 Thread Bin.Cheng
On Tue, Oct 1, 2013 at 6:50 PM, Richard Biener
 wrote:
> On Mon, Sep 30, 2013 at 7:39 AM, bin.cheng  wrote:
>>

>
> I don't think you need
>
> +  /* Sign extend off if expr is in type which has lower precision
> + than HOST_WIDE_INT.  */
> +  if (TYPE_PRECISION (TREE_TYPE (expr)) <= HOST_BITS_PER_WIDE_INT)
> +off = sext_hwi (off, TYPE_PRECISION (TREE_TYPE (expr)));
>
> at least it would be suspicious if you did ...
>
> The only case that I can think of points to a bug in strip_offset_1
> again, namely if sizetype (the type of all offsets) is smaller than
> a HOST_WIDE_INT in which case
>
> +boffset = int_cst_value (DECL_FIELD_BIT_OFFSET (field));
> +*offset = off0 + int_cst_value (tmp) + boffset / BITS_PER_UNIT;
>
> is wrong as boffset / BITS_PER_UNIT does not do a signed division
> then (for negative boffset which AFAIK does not happen - but it would
> be technically allowed).  Thus, the predicates like
>
> +&& cst_and_fits_in_hwi (tmp)
>
> would need to be amended with a check that the MSB is not set.
I thought about it twice and have the understandings below.
1) "boffset / BITS_PER_UNIT" does not do a signed division because
boffset might be negative and BITS_PER_UNIT could be defined as
unsigned for a target (see the sketch below).
2) If sizetype is sufficiently smaller than HOST_WIDE_INT, and if the field
offset is large enough to have the MSB set, then boffset would be computed
as negative by int_cst_value.
3) If sizetype is as large as HOST_WIDE_INT, boffset would be negative
only if the field offset were a very large number with the MSB set, which
would be impossible for any structure.
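A minimal sketch of point 1), assuming BITS_PER_UNIT expands to an unsigned
constant on the target (it is usually a plain 8, so this is an assumption):

#include <stdio.h>

int main (void)
{
  long long boffset = -32;               /* hypothetical negative bit offset */
  unsigned long long bits_per_unit = 8;  /* stands in for an unsigned BITS_PER_UNIT */

  /* The usual arithmetic conversions turn the division into an unsigned one,
     so the result is a huge positive value instead of -4.  */
  printf ("%lld\n", (long long) (boffset / bits_per_unit));
  return 0;
}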

Are these understandings right?  Sorry to bother you with the stupid
questions.  Thanks.

-- 
Best Regards.


RE: [PATCH]Fix computation of offset in ivopt

2013-10-16 Thread bin.cheng
Hi,
As noted in previous messages, GCC forces offsets to unsigned in the middle end.
It also gets the offset value, stores it in a HOST_WIDE_INT, and then uses it in
various computations.  In this scenario, it should use int_cst_value to do the
additional sign extension against the type of the tree node, otherwise we might
lose sign information.  Take the function fold_plusminus_mult_expr as an
example: the sign information of arg01/arg11 might be lost because it uses
TREE_INT_CST_LOW directly.  I think this is the offender for the problem in
this thread.  I think we should fix the offender rather than hacking
strip_offset as in the original patch, so I split the original patch into two as
attached, with one fixing the offset of COMPONENT_REF in strip_offset_1 and the
other fixing the mentioned sign extension problem.
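Below is a hedged standalone illustration of the difference (both accessors are
real GCC internals, but the snippet only mimics their behavior for a made-up
32-bit constant holding -4):

#include <stdio.h>
#include <stdint.h>

int main (void)
{
  uint64_t low  = 0xfffffffcULL;            /* raw low word, as TREE_INT_CST_LOW */
  int64_t  sext = (int64_t) (int32_t) low;  /* sign extended per precision, as int_cst_value */

  printf ("low = %#llx, sext = %lld\n",
          (unsigned long long) low, (long long) sext);  /* low = 0xfffffffc, sext = -4 */
  return 0;
}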

Bootstrap and test on x86/x86_64/arm.  Is it OK?

Thanks.
bin

Patch a:
2013-10-17  Bin Cheng  

* tree-ssa-loop-ivopts.c (strip_offset_1): Change parameter type.
Count DECL_FIELD_BIT_OFFSET when computing offset for COMPONENT_REF.

Patch b: 
2013-10-17  Bin Cheng  

* fold-const.c (fold_plusminus_mult_expr): Use int_cst_value instead
of TREE_INT_CST_LOW.

gcc/testsuite/ChangeLog
2013-10-17  Bin Cheng  

* gcc.dg/tree-ssa/ivopts-outside-loop-use-1.c: New test.

> -Original Message-
> From: Richard Biener [mailto:richard.guent...@gmail.com]
> Sent: Wednesday, October 02, 2013 4:32 PM
> To: Bin.Cheng
> Cc: Bin Cheng; GCC Patches
> Subject: Re: [PATCH]Fix computation of offset in ivopt
> 
> On Tue, Oct 1, 2013 at 6:13 PM, Bin.Cheng  wrote:
> > On Tue, Oct 1, 2013 at 6:50 PM, Richard Biener
> >  wrote:
> >> On Mon, Sep 30, 2013 at 7:39 AM, bin.cheng 
> wrote:
> >>>
> >>>
> >>
> >> I don't think you need
> >>
> >> +  /* Sign extend off if expr is in type which has lower precision
> >> + than HOST_WIDE_INT.  */
> >> +  if (TYPE_PRECISION (TREE_TYPE (expr)) <= HOST_BITS_PER_WIDE_INT)
> >> +off = sext_hwi (off, TYPE_PRECISION (TREE_TYPE (expr)));
> >>
> >> at least it would be suspicious if you did ...
> > There is a problem for example of the first message.  The iv base if
like:
> > pretmp_184 + ((sizetype) KeyIndex_180 + 1073741823) * 4 I am not sure
> > why but it seems (-4/0xFFFC) is represented by (1073741823*4).
> > For each operand strip_offset_1 returns exactly the positive number
> > and result of multiplication never get its chance of sign extend.
> > That's why the sign extend is necessary to fix the problem. Or it
> > should be fixed elsewhere by representing iv base with:
> > "pretmp_184 + ((sizetype) KeyIndex_180 + 4294967295) * 4" in the first
> place.
> 
> Yeah, that's why I said the whole issue with forcing all offsets to be
unsigned
> is a mess ...
> 
> There is really no good answer besides not doing that I fear.
> 
> Yes, in the above case we could fold the whole thing differently
(interpret
> the offset of a POINTER_PLUS_EXPR as signed).  You can try tracking down
> the offender, but it'll get non-trivial easily as you have to consider the
fact
> that GCC will treat signed operations as having undefined behavior on
> overflow.
> 
> So I see why you want to do the extension above (re-interpret the result),
I
> suppose we can live with it but please make sure to add a big fat ???
> comment before it explaining why it is necessary.
> 
> Richard.
> 
> >>
> >> The only case that I can think of points to a bug in strip_offset_1
> >> again, namely if sizetype (the type of all offsets) is smaller than a
> >> HOST_WIDE_INT in which case
> >>
> >> +boffset = int_cst_value (DECL_FIELD_BIT_OFFSET (field));
> >> +*offset = off0 + int_cst_value (tmp) + boffset / BITS_PER_UNIT;
> >>
> >> is wrong as boffset / BITS_PER_UNIT does not do a signed division
> >> then (for negative boffset which AFAIK does not happen - but it would
> >> be technically allowed).  Thus, the predicates like
> >>
> >> +&& cst_and_fits_in_hwi (tmp)
> >>
> >> would need to be amended with a check that the MSB is not set.
> > So I can handle it like:
> >
> > +abs_boffset = abs_hwi (boffset);
> > +x = abs_boffset / BITS_PER_UNIT;
> > +if (boffset < 0)
> > +  x = -x;
> > +*offset = off0 + int_cst_value (tmp) + x;
> >
> > Right?
> >
> >>
> >> Btw, the cst_and_fits_in_hwi implementation is odd:
> >>
> >> bool
> >> cst_and_fits_in_hwi (const_tree x)
> >> {
> >>   if (TREE_CODE (x) != INTEGER_CST)
> >

Re: [PATCH]Fix computation of offset in ivopt

2013-10-18 Thread Bin.Cheng
Hi Richard,
Is the first patch OK?  There is a patch depending on it:
http://gcc.gnu.org/ml/gcc-patches/2013-09/msg01931.html

Thanks.
bin

On Fri, Oct 18, 2013 at 7:18 PM, Richard Biener
 wrote:
> On Thu, Oct 17, 2013 at 7:52 AM, bin.cheng  wrote:
>> Hi,
>> As noted in previous messages, GCC forces offset to unsigned in middle end.
>> It also gets offset value and stores it in HOST_WIDE_INT then uses it in
>> various computation.  In this scenario, It should use int_cst_value to do
>> additional sign extension against the type of tree node, otherwise we might
>> lose sign information.  Take function fold_plusminus_mult_expr as an
>> example, the sign information of arg01/arg11 might be lost because it uses
>> TREE_INT_CST_LOW directly.  I think this is the offender of the problem in
>> this thread.  I think we should fix the offender, rather the hacking
>> strip_offset as in the original patch, so I split original patch into two as
>> attached, with one fixing the offset of COMPONENT_REF in strip_offset_1, the
>> other fixing the mentioned sign extension problem.
>
> Index: gcc/fold-const.c
> ===
> --- gcc/fold-const.c (revision 203267)
> +++ gcc/fold-const.c (working copy)
> @@ -7270,8 +7270,8 @@ fold_plusminus_mult_expr (location_t loc, enum tre
>HOST_WIDE_INT int01, int11, tmp;
>bool swap = false;
>tree maybe_same;
> -  int01 = TREE_INT_CST_LOW (arg01);
> -  int11 = TREE_INT_CST_LOW (arg11);
> +  int01 = int_cst_value (arg01);
> +  int11 = int_cst_value (arg11);
>
> this is not correct - it will mishandle all large unsigned numbers.
>
> The real issue is that we rely on twos-complement arithmetic to work
> when operating on pointer offsets (because we convert them all to
> unsigned sizetype).  That makes interpreting the offsets or expressions
> that compute offsets hard (or impossible in some cases), as you can
> see from the issues in IVOPTs.
>
> The above means that strip_offset_1 cannot reliably look through
> MULT_EXPRs as that can make an offset X * CST that is
> "effectively" signed "surely" unsigned in the process.  Think of
> this looking into X * CST as performing a division - whose result
> is dependent on the sign of X which we lost with our unsigned
> casting.  Now, removing the MULT_EXPR look-through might
> be too pessimizing ... but it may be worth trying to see if it fixes
> your issue?  In this context I also remember a new bug filed
> recently of us not folding x /[ex] 4 * 4 to x.
>
> In the past I was working multiple times on stopping doing that
> (make all offsets unsigned), but I never managed to finish ...
>
> Richard.
>
>> Bootstrap and test on x86/x86_64/arm.  Is it OK?
>>
>> Thanks.
>> bin
>>
>> Patch a:
>> 2013-10-17  Bin Cheng  
>>
>> * tree-ssa-loop-ivopts.c (strip_offset_1): Change parameter type.
>> Count DECL_FIELD_BIT_OFFSET when computing offset for COMPONENT_REF.
>>
>> Patch b:
>> 2013-10-17  Bin Cheng  
>>
>> * fold-const.c (fold_plusminus_mult_expr): Use int_cst_value instead
>> of TREE_INT_CST_LOW.
>>
>> gcc/testsuite/ChangeLog
>> 2013-10-17  Bin Cheng  
>>
>> * gcc.dg/tree-ssa/ivopts-outside-loop-use-1.c: New test.
>>
>>> -Original Message-
>>> From: Richard Biener [mailto:richard.guent...@gmail.com]
>>> Sent: Wednesday, October 02, 2013 4:32 PM
>>> To: Bin.Cheng
>>> Cc: Bin Cheng; GCC Patches
>>> Subject: Re: [PATCH]Fix computation of offset in ivopt
>>>
>>> On Tue, Oct 1, 2013 at 6:13 PM, Bin.Cheng  wrote:
>>> > On Tue, Oct 1, 2013 at 6:50 PM, Richard Biener
>>> >  wrote:
>>> >> On Mon, Sep 30, 2013 at 7:39 AM, bin.cheng 
>>> wrote:
>>> >>>
>>> >>>
>>> >>
>>> >> I don't think you need
>>> >>
>>> >> +  /* Sign extend off if expr is in type which has lower precision
>>> >> + than HOST_WIDE_INT.  */
>>> >> +  if (TYPE_PRECISION (TREE_TYPE (expr)) <= HOST_BITS_PER_WIDE_INT)
>>> >> +off = sext_hwi (off, TYPE_PRECISION (TREE_TYPE (expr)));
>>> >>
>>> >> at least it would be suspicious if you did ...
>>> > There is a problem for example of the first message.  The iv base if
>> like:
>>> > pretmp_184 + ((sizetype) KeyIndex_180 + 1073741823) * 4 I am not sure
>>> > why but it seems (-4/0xFFFC) is represented by (1073741823*4).
>>&

Re: [PATCH]Fix computation of offset in ivopt

2013-10-20 Thread Bin.Cheng
On Fri, Oct 18, 2013 at 7:18 PM, Richard Biener
 wrote:
> On Thu, Oct 17, 2013 at 7:52 AM, bin.cheng  wrote:
>> Hi,
>> As noted in previous messages, GCC forces offset to unsigned in middle end.
>> It also gets offset value and stores it in HOST_WIDE_INT then uses it in
>> various computation.  In this scenario, It should use int_cst_value to do
>> additional sign extension against the type of tree node, otherwise we might
>> lose sign information.  Take function fold_plusminus_mult_expr as an
>> example, the sign information of arg01/arg11 might be lost because it uses
>> TREE_INT_CST_LOW directly.  I think this is the offender of the problem in
>> this thread.  I think we should fix the offender, rather the hacking
>> strip_offset as in the original patch, so I split original patch into two as
>> attached, with one fixing the offset of COMPONENT_REF in strip_offset_1, the
>> other fixing the mentioned sign extension problem.
>
> Index: gcc/fold-const.c
> ===
> --- gcc/fold-const.c (revision 203267)
> +++ gcc/fold-const.c (working copy)
> @@ -7270,8 +7270,8 @@ fold_plusminus_mult_expr (location_t loc, enum tre
>HOST_WIDE_INT int01, int11, tmp;
>bool swap = false;
>tree maybe_same;
> -  int01 = TREE_INT_CST_LOW (arg01);
> -  int11 = TREE_INT_CST_LOW (arg11);
> +  int01 = int_cst_value (arg01);
> +  int11 = int_cst_value (arg11);
>
> this is not correct - it will mishandle all large unsigned numbers.
As far as the patch itself goes, I think the preceding host_integerp
already checks for this case:

int
host_integerp (const_tree t, int pos)
{
  if (t == NULL_TREE)
return 0;

  return (TREE_CODE (t) == INTEGER_CST
  && ((TREE_INT_CST_HIGH (t) == 0
   && (HOST_WIDE_INT) TREE_INT_CST_LOW (t) >= 0)
  || (! pos && TREE_INT_CST_HIGH (t) == -1
  && (HOST_WIDE_INT) TREE_INT_CST_LOW (t) < 0
  && !TYPE_UNSIGNED (TREE_TYPE (t)))
  || (pos && TREE_INT_CST_HIGH (t) == 0)));
}

Since the call is host_integerp (xxx, 0), it returns 0 for large
unsigned numbers, right?

>
> The real issue is that we rely on twos-complement arithmetic to work
> when operating on pointer offsets (because we convert them all to
> unsigned sizetype).  That makes interpreting the offsets or expressions
> that compute offsets hard (or impossible in some cases), as you can
> see from the issues in IVOPTs.
>
> The above means that strip_offset_1 cannot reliably look through
> MULT_EXPRs as that can make an offset X * CST that is
> "effectively" signed "surely" unsigned in the process.  Think of
> this looking into X * CST as performing a division - whose result
> is dependent on the sign of X which we lost with our unsigned
> casting.  Now, removing the MULT_EXPR look-through might
> be too pessimizing ... but it may be worth trying to see if it fixes
> your issue?  In this context I also remember a new bug filed
> recently of us not folding x /[ex] 4 * 4 to x.
>
> In the past I was working multiple times on stopping doing that
> (make all offsets unsigned), but I never managed to finish ...
>
> Richard.
>
>> Bootstrap and test on x86/x86_64/arm.  Is it OK?
>>
>> Thanks.
>> bin
>>
>> Patch a:
>> 2013-10-17  Bin Cheng  
>>
>> * tree-ssa-loop-ivopts.c (strip_offset_1): Change parameter type.
>> Count DECL_FIELD_BIT_OFFSET when computing offset for COMPONENT_REF.
>>
>> Patch b:
>> 2013-10-17  Bin Cheng  
>>
>> * fold-const.c (fold_plusminus_mult_expr): Use int_cst_value instead
>> of TREE_INT_CST_LOW.
>>
>> gcc/testsuite/ChangeLog
>> 2013-10-17  Bin Cheng  
>>
>> * gcc.dg/tree-ssa/ivopts-outside-loop-use-1.c: New test.
>>
>>> -Original Message-
>>> From: Richard Biener [mailto:richard.guent...@gmail.com]
>>> Sent: Wednesday, October 02, 2013 4:32 PM
>>> To: Bin.Cheng
>>> Cc: Bin Cheng; GCC Patches
>>> Subject: Re: [PATCH]Fix computation of offset in ivopt
>>>
>>> On Tue, Oct 1, 2013 at 6:13 PM, Bin.Cheng  wrote:
>>> > On Tue, Oct 1, 2013 at 6:50 PM, Richard Biener
>>> >  wrote:
>>> >> On Mon, Sep 30, 2013 at 7:39 AM, bin.cheng 
>>> wrote:
>>> >>>
>>> >>>
>>> >>
>>> >> I don't think you need
>>> >>
>>> >> +  /* Sign extend off if expr is in type which has lower precision
>>> >> + than HOST_WIDE_INT.  */
>>> >> +  if

Re: [PATCH]Fix computation of offset in ivopt

2013-10-27 Thread Bin.Cheng
On Mon, Oct 21, 2013 at 8:21 PM, Richard Biener
 wrote:
> On Fri, Oct 18, 2013 at 5:48 PM, Bin.Cheng  wrote:
>> Hi Richard,
>> Is the first patch OK?  Since there is a patch depending on it.
>> http://gcc.gnu.org/ml/gcc-patches/2013-09/msg01931.html
>
> Yes.
I committed the patch fixing strip_offset_1 as r204116.
Since the root cause of the second issue is that POINTER_PLUS_EXPR requires
an unsigned offset operand, it can't be fixed as in the second patch,
so I will discard that patch.

Thanks.
bin

>> On Fri, Oct 18, 2013 at 7:18 PM, Richard Biener
>>  wrote:
>>> On Thu, Oct 17, 2013 at 7:52 AM, bin.cheng  wrote:
>>>> Hi,
>>>> As noted in previous messages, GCC forces offset to unsigned in middle end.
>>>> It also gets offset value and stores it in HOST_WIDE_INT then uses it in
>>>> various computation.  In this scenario, It should use int_cst_value to do
>>>> additional sign extension against the type of tree node, otherwise we might
>>>> lose sign information.  Take function fold_plusminus_mult_expr as an
>>>> example, the sign information of arg01/arg11 might be lost because it uses
>>>> TREE_INT_CST_LOW directly.  I think this is the offender of the problem in
>>>> this thread.  I think we should fix the offender, rather the hacking
>>>> strip_offset as in the original patch, so I split original patch into two 
>>>> as
>>>> attached, with one fixing the offset of COMPONENT_REF in strip_offset_1, 
>>>> the
>>>> other fixing the mentioned sign extension problem.
>>>
>>> Index: gcc/fold-const.c
>>> ===
>>> --- gcc/fold-const.c (revision 203267)
>>> +++ gcc/fold-const.c (working copy)
>>> @@ -7270,8 +7270,8 @@ fold_plusminus_mult_expr (location_t loc, enum tre
>>>HOST_WIDE_INT int01, int11, tmp;
>>>bool swap = false;
>>>tree maybe_same;
>>> -  int01 = TREE_INT_CST_LOW (arg01);
>>> -  int11 = TREE_INT_CST_LOW (arg11);
>>> +  int01 = int_cst_value (arg01);
>>> +  int11 = int_cst_value (arg11);
>>>
>>> this is not correct - it will mishandle all large unsigned numbers.
>>>
>>> The real issue is that we rely on twos-complement arithmetic to work
>>> when operating on pointer offsets (because we convert them all to
>>> unsigned sizetype).  That makes interpreting the offsets or expressions
>>> that compute offsets hard (or impossible in some cases), as you can
>>> see from the issues in IVOPTs.
>>>
>>> The above means that strip_offset_1 cannot reliably look through
>>> MULT_EXPRs as that can make an offset X * CST that is
>>> "effectively" signed "surely" unsigned in the process.  Think of
>>> this looking into X * CST as performing a division - whose result
>>> is dependent on the sign of X which we lost with our unsigned
>>> casting.  Now, removing the MULT_EXPR look-through might
>>> be too pessimizing ... but it may be worth trying to see if it fixes
>>> your issue?  In this context I also remember a new bug filed
>>> recently of us not folding x /[ex] 4 * 4 to x.
>>>
>>> In the past I was working multiple times on stopping doing that
>>> (make all offsets unsigned), but I never managed to finish ...
>>>
>>> Richard.
>>>
>>>> Bootstrap and test on x86/x86_64/arm.  Is it OK?
>>>>
>>>> Thanks.
>>>> bin
>>>>
>>>> Patch a:
>>>> 2013-10-17  Bin Cheng  
>>>>
>>>> * tree-ssa-loop-ivopts.c (strip_offset_1): Change parameter type.
>>>> Count DECL_FIELD_BIT_OFFSET when computing offset for 
>>>> COMPONENT_REF.
>>>>
>>>> Patch b:
>>>> 2013-10-17  Bin Cheng  
>>>>
>>>> * fold-const.c (fold_plusminus_mult_expr): Use int_cst_value 
>>>> instead
>>>> of TREE_INT_CST_LOW.
>>>>
>>>> gcc/testsuite/ChangeLog
>>>> 2013-10-17  Bin Cheng  
>>>>
>>>> * gcc.dg/tree-ssa/ivopts-outside-loop-use-1.c: New test.
>>>>
>>>>> -Original Message-
>>>>> From: Richard Biener [mailto:richard.guent...@gmail.com]
>>>>> Sent: Wednesday, October 02, 2013 4:32 PM
>>>>> To: Bin.Cheng
>>>>> Cc: Bin Cheng; GCC Patches
>>>>> Subject: Re: [PATCH]Fix computation of

[PATCH GCC]Simplify address expression in IVOPT

2013-10-29 Thread bin.cheng
Hi,
I noticed that IVOPT generates complex address expressions like the ones below
for iv bases:
&arr_base[0].y
&arr[0]
&MEM[p+o]
It's even worse for targets that support auto-increment addressing modes,
because IVOPT adjusts such base expressions with +/- step and then creates the
following:
&arr_base[0].y +/- step
&arr[0] +/- step
&MEM[p+o] +/- step
This has two disadvantages:
1) Cost computation in IVOPT can't handle complex address expressions and
generally returns spill_cost for them, which is bad since address ivs are
important to IVOPT.
2) IVOPT creates duplicate candidates for ivs which have the same value in
different forms; for example, two candidates are generated, one each for
"&a[0]" and "&a".  Again, it's even worse for auto-increment addressing
modes.

This patch fixes the issue by simplifying the address expression at the entry of
allocating the IV struct.  Maybe the simplification could be put in the various fold*
functions, but I think it is better done this way, because:
1) fold* functions are used from the front end through various tree optimizations,
and the simplified address expressions may not be what each optimizer wants.
Think about parallelism-related passes: they might want the array index
information kept for further analysis.
2) In some ways the simplification conflicts with the current implementation
of the fold* functions.  Take fold_binary_loc as an example: it tries to simplify
"&a[i1] +p c* i2" into "&a[i1+i2]".  Of course we could simplify in this way
for IVOPT too, but that would cause new problems: a) we would have to add code
in IVOPT to cope with complex ARRAY_REFs, which is exactly the thing we want
to avoid; b) the simplification can't always be done because of the
signed/unsigned offset problem (especially for auto-increment addressing
modes).
3) There are many entry points for the fold* functions, so the change would be
non-trivial.
4) The simplification is only done in alloc_iv for true (not duplicate)
iv structs, so the number of such ivs should be moderate.

With these points, I think it might be a win to do the simplification in
IVOPT and create a kind of sandbox for IVOPT to play in.  Any suggestions?
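A hedged sketch of what the lowering at the entry of alloc_iv could look like
(the glue code and variable names are assumptions; the helpers are existing
GCC internals).  &REF becomes &core + byte-offset, so the cost code never sees
the complex reference:

  HOST_WIDE_INT bitsize, bitpos;
  tree toffset;
  enum machine_mode mode;
  int unsignedp, volatilep;

  tree core = get_inner_reference (ref, &bitsize, &bitpos, &toffset,
                                   &mode, &unsignedp, &volatilep, false);
  tree base = build_fold_addr_expr (core);
  if (toffset)
    base = fold_build_pointer_plus (base, toffset);
  if (bitpos)
    base = fold_build_pointer_plus_hwi (base, bitpos / BITS_PER_UNIT);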

Bootstrapped and tested on x86/x86_64/arm.
The patch causes three cases to fail on some targets, but all of them are
false alarms, which can be resolved by refining the test cases to check more
accurate information.

Is it OK?

Thanks.
bin

gcc/testsuite/ChangeLog
2013-10-29  Bin Cheng  

* gcc.dg/tree-ssa/loop-2.c: Refine check condition.
* gcc.dg/tree-ssa/ivopt_infer_2.c: Ditto.
* gcc.dg/tree-ssa/ivopt_mult_3.c: Ditto.

2013-10-29  Bin Cheng  

* tree-ssa-loop-ivopts.c (alloc_iv): Simplify base expression.
(get_shiftadd_cost): Check equality using operand_equal_p.
(force_expr_to_var_cost): Refactor the code.  Handle type
conversion.
(split_address_cost): Call force_expr_to_var_cost.

Index: gcc/testsuite/gcc.dg/tree-ssa/loop-2.c
===
--- gcc/testsuite/gcc.dg/tree-ssa/loop-2.c  (revision 204117)
+++ gcc/testsuite/gcc.dg/tree-ssa/loop-2.c  (working copy)
@@ -27,7 +27,7 @@ void xxx(void)
 
 /* { dg-final { scan-tree-dump-times " \\* \[^\\n\\r\]*=" 0 "optimized" } } */
 /* { dg-final { scan-tree-dump-times "\[^\\n\\r\]*= \\* " 0 "optimized" } } */
-/* { dg-final { scan-tree-dump-times "MEM" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "MEM\\\[base" 1 "optimized" } } */
 
 /* 17 * iter should be strength reduced.  */
 
Index: gcc/testsuite/gcc.dg/tree-ssa/ivopt_infer_2.c
===
--- gcc/testsuite/gcc.dg/tree-ssa/ivopt_infer_2.c   (revision 204117)
+++ gcc/testsuite/gcc.dg/tree-ssa/ivopt_infer_2.c   (working copy)
@@ -7,7 +7,8 @@
 
 extern char a[];
 
-/* Can not infer loop iteration from array -- exit test can not be replaced.  
*/
+/* Can not infer loop iteration from array -- exit test can not be
+   replaced by the array address.  */
 void foo (unsigned int i_width, TYPE dst)
 {
   unsigned long long i = 0;
@@ -21,5 +22,5 @@ void foo (unsigned int i_width, TYPE dst)
 }
 }
 
-/* { dg-final { scan-tree-dump-times "Replacing" 0 "ivopts"} } */
+/* { dg-final { scan-tree-dump-times "\[^:\]*if \\(.*j_\[0-9\]+.*\\)" 1 
"ivopts"} } */
 /* { dg-final { cleanup-tree-dump "ivopts" } } */
Index: gcc/testsuite/gcc.dg/tree-ssa/ivopt_mult_3.c
===
--- gcc/testsuite/gcc.dg/tree-ssa/ivopt_mult_3.c(revision 204117)
+++ gcc/testsuite/gcc.dg/tree-ssa/ivopt_mult_3.c(working copy)
@@ -18,5 +18,5 @@ long foo(long* p, long* p2, int N1, int N2)
   return s;
 }
 
-/* { dg-final { scan-tree-dump-times "Replacing" 1 "ivopts"} } */
+/* { dg-final { scan-tree-dump-times "Replacing exit test: if \\(.*p2.*\\)" 1 
"ivopts"} } */
 /* { dg-final { cleanup-tree-dump "ivopts" } } */
Index: gcc/tree-ssa-loop-ivopts.c
==

[PATCH GCC]Compute, cache and use cost of auto-increment rtx patterns in IVOPT

2013-11-03 Thread bin.cheng
Hi,

IVOPT in GCC has a problem: it does not use the cost of the auto-increment
address expression in its accounting, but instead falls back to the cost of the
plain address expression, as if the auto-increment addressing mode were unavailable.
For example, on the ARM target:
1) the cost of "[reg]" (which is 6) is used for the address expression "[reg],
#off";
2) the cost of "[reg+off]" (which is 2) is used for the address expression
"[reg, #off]!";

This causes:
1) the cost of a non-auto-increment address expression to be used for an
auto-increment address expression;
2) different address costs to be used for pre/post increment address
expressions.
This patch fixes the problem by computing, caching and using the cost of
auto-increment address expressions.

Bootstrap and test on x86/arm.  Is it OK?

2013-11-01  Bin Cheng  

* tree-ssa-loop-ivopts.c (enum ainc_type): New.
(address_cost_data): New field.
(get_address_cost): Compute auto-increment rtx cost in ainc_costs.
Use ainc_costs for auto-increment rtx patterns.
Cleanup TWS.
Index: gcc/tree-ssa-loop-ivopts.c
===
--- gcc/tree-ssa-loop-ivopts.c  (revision 204117)
+++ gcc/tree-ssa-loop-ivopts.c  (working copy)
@@ -3185,10 +3185,21 @@ multiplier_allowed_in_address_p (HOST_WIDE_INT rat
 
TODO -- there must be some better way.  This all is quite crude.  */
 
+enum ainc_type
+{
+  AINC_PRE_INC,/* Pre increment.  */
+  AINC_PRE_DEC,/* Pre decrement.  */
+  AINC_POST_INC,   /* Post increment.  */
+  AINC_POST_DEC,   /* Post decrement.  */
+  AINC_NUM,/* Number of auto increment types.  */
+  AINC_NONE
+};
+
 typedef struct address_cost_data_s
 {
   HOST_WIDE_INT min_offset, max_offset;
   unsigned costs[2][2][2][2];
+  unsigned ainc_costs[AINC_NUM];
 } *address_cost_data;
 
 
@@ -3206,6 +3217,7 @@ get_address_cost (bool symbol_present, bool var_pr
   static bool has_preinc[MAX_MACHINE_MODE], has_postinc[MAX_MACHINE_MODE];
   static bool has_predec[MAX_MACHINE_MODE], has_postdec[MAX_MACHINE_MODE];
   unsigned cost, acost, complexity;
+  enum ainc_type autoinc_type;
   bool offset_p, ratio_p, autoinc;
   HOST_WIDE_INT s_offset, autoinc_offset, msize;
   unsigned HOST_WIDE_INT mask;
@@ -3277,33 +3289,49 @@ get_address_cost (bool symbol_present, bool var_pr
   reg0 = gen_raw_REG (address_mode, LAST_VIRTUAL_REGISTER + 1);
   reg1 = gen_raw_REG (address_mode, LAST_VIRTUAL_REGISTER + 2);
 
-  if (USE_LOAD_PRE_DECREMENT (mem_mode) 
+  if (USE_LOAD_PRE_DECREMENT (mem_mode)
  || USE_STORE_PRE_DECREMENT (mem_mode))
{
  addr = gen_rtx_PRE_DEC (address_mode, reg0);
  has_predec[mem_mode]
= memory_address_addr_space_p (mem_mode, addr, as);
+
+ if (has_predec[mem_mode])
+   data->ainc_costs[AINC_PRE_DEC]
+ = address_cost (addr, mem_mode, as, speed);
}
-  if (USE_LOAD_POST_DECREMENT (mem_mode) 
+  if (USE_LOAD_POST_DECREMENT (mem_mode)
  || USE_STORE_POST_DECREMENT (mem_mode))
{
  addr = gen_rtx_POST_DEC (address_mode, reg0);
  has_postdec[mem_mode]
= memory_address_addr_space_p (mem_mode, addr, as);
+
+ if (has_postdec[mem_mode])
+   data->ainc_costs[AINC_POST_DEC]
+ = address_cost (addr, mem_mode, as, speed);
}
-  if (USE_LOAD_PRE_INCREMENT (mem_mode) 
+  if (USE_LOAD_PRE_INCREMENT (mem_mode)
  || USE_STORE_PRE_DECREMENT (mem_mode))
{
  addr = gen_rtx_PRE_INC (address_mode, reg0);
  has_preinc[mem_mode]
= memory_address_addr_space_p (mem_mode, addr, as);
+
+ if (has_preinc[mem_mode])
+   data->ainc_costs[AINC_PRE_INC]
+ = address_cost (addr, mem_mode, as, speed);
}
-  if (USE_LOAD_POST_INCREMENT (mem_mode) 
+  if (USE_LOAD_POST_INCREMENT (mem_mode)
  || USE_STORE_POST_INCREMENT (mem_mode))
{
  addr = gen_rtx_POST_INC (address_mode, reg0);
  has_postinc[mem_mode]
= memory_address_addr_space_p (mem_mode, addr, as);
+
+ if (has_postinc[mem_mode])
+   data->ainc_costs[AINC_POST_INC]
+ = address_cost (addr, mem_mode, as, speed);
}
   for (i = 0; i < 16; i++)
{
@@ -3429,22 +3457,32 @@ get_address_cost (bool symbol_present, bool var_pr
   s_offset = offset;
 
   autoinc = false;
+  autoinc_type = AINC_NONE;
   msize = GET_MODE_SIZE (mem_mode);
   autoinc_offset = offset;
   if (stmt_after_inc)
 autoinc_offset += ratio * cstep;
   if (symbol_present || var_present || ratio != 1)
 autoinc = false;
-  else if ((has_postinc[mem_mode] && autoinc_offset == 0
+  else
+{
+  if (has_postinc[mem_mode] && autoinc_offset == 0
+ && msize == cstep)
+   autoinc_type = AINC_POST_INC;
+  else if (has_postdec[mem_mode] && autoinc_offset == 0
+  && msize == -cstep)
+   autoinc_type = AINC_

RE: [PATCH GCC]Simplify address expression in IVOPT

2013-11-04 Thread bin.cheng


> -Original Message-
> From: Richard Biener [mailto:richard.guent...@gmail.com]
> Sent: Wednesday, October 30, 2013 10:46 PM
> To: Bin Cheng
> Cc: GCC Patches
> Subject: Re: [PATCH GCC]Simplify address expression in IVOPT
> 
> On Tue, Oct 29, 2013 at 10:18 AM, bin.cheng  wrote:
> 
> Hmm.  I think you want what get_inner_reference_aff does, using the return
> value of get_inner_reference as starting point for determine_base_object.
> And you don't want to restrict yourselves so much on what exprs to
process,
> but only exclude DECL_Ps.
Did you mean I should pass all address expressions to
get_inner_reference_aff except the ones with a declaration operand?
I am a little confused about why DECL_Ps should be handled specially; it
seems they can be handled properly by the get_* functions.  Is there
anything important I missed?

> Just amend get_inner_reference_aff to return the tree base object.
I found it's inappropriate to do this because: functions like
get_inner_reference* only accept reference expressions, while base_object
has to be computed for other kinds of expressions too.  Take gimple
statement "a_1 = *ptr_1" as an example, the base passed to alloc_iv is
ptr_1; the base_object computed by determine_base_object is also ptr_1,
which we can't rely on get_inner_reference* .
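
A hypothetical C fragment (just to illustrate the distinction, not from
the patch or the PR; the names are invented):

/* For s->a[i] there is a containing reference that get_inner_reference*
   can decompose; for *ptr there is none, only the pointer SSA name, so
   the base object has to come from the pointer itself.  */
struct box { int a[16]; };
int
f (struct box *s, int *ptr, int i)
{
  return s->a[i]   /* reference: base object derivable from the access */
         + *ptr;   /* plain dereference: base is just the pointer ptr  */
}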

Also, it seems the comment before determine_base_object might be inaccurate:
" Returns a memory object to that EXPR points." which I think should be
" Returns a pointer pointing to the main memory object to that EXPR points."
Right?
> 
> Note that this isn't really "simplifying" but rather lowering all
addresses.
> 
> The other changes in this patch are unrelated, right?
Right, I should have named this message like "refine cost computation of
expressions in IVOPT" just as the patch file.  

Thanks.
bin






Re: [PATCH GCC]Compute, cache and use cost of auto-increment rtx patterns in IVOPT

2013-11-04 Thread Bin.Cheng
On Mon, Nov 4, 2013 at 7:38 PM, Richard Biener
 wrote:
> On Mon, Nov 4, 2013 at 4:31 AM, bin.cheng  wrote:
>> Hi,
>>
>> The IVOPT in GCC has a problem that it does not use cost of auto-increment
>> address expression in accounting, while it retreats to cost of address
>> expression if auto-increment addressing mode is unavailable.
>> For example, on ARM target:
>> 1) the cost of "[reg]" (which is 6) is used for address expression "[reg],
>> #off";
>> 2) the cost of "[reg+off]" (which is 2) is used for address expression
>> "[reg, #off]!";
>>
>> This causes:
>> 1) cost of non-auto increment address expression is used for auto-increment
>> address expression;
>> 2) different address costs are used for pre/post increment address
>> expressions.
>> This patch fixes the problem by computing, caching and using the cost of
>> auto-increment address expressions.
>>
>> Bootstrap and test on x86/arm.  Is it OK?
>
> But don't you need to adjust
>
> static bool
> determine_use_iv_cost_address (struct ivopts_data *data,
>struct iv_use *use, struct iv_cand *cand)
> {
>   bitmap depends_on;
>   bool can_autoinc;
>   int inv_expr_id = -1;
>   comp_cost cost = get_computation_cost (data, use, cand, true, &depends_on,
>  &can_autoinc, &inv_expr_id);
>
>   if (cand->ainc_use == use)
> {
>   if (can_autoinc)
> cost.cost -= cand->cost_step;
>
> this which seems to try to compensate for your issue?
That's where the problem gets complicated, depending on how the backend
defines address costs.  For back ends that define the cost roughly according
to the true cost of the addressing mode, the address cost of the
auto-increment addressing mode doesn't take the saved stepping instruction
into consideration, so that code is necessary.
Moreover, according to the GCC internals documentation for
TARGET_ADDRESS_COST, RISC machines may define different address costs for
addressing modes which actually execute identically at the
micro-architecture level (like ARM for now).  The problems are:
1) Though address costs are defined in this "discriminated" way, the saved
stepping instruction is unlikely to be taken into account either; the
address cost of an auto-increment address expression shouldn't go that far.
2) We think the "discriminated" address cost model predates the gimple
passes and is outdated.  The true micro-architecture address cost (or the
cost normalized with COSTS_N_INSNS) should be used in GCC nowadays.  RTL
passes like fwprop_addr which use address cost as heuristic information
should be refined... but that's another problem entirely (I am
investigating it).

So overall, I think the stepping cost should be subtracted here.
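
To illustrate the point (a hypothetical example, not from this thread):

/* With post-increment addressing the pointer step is folded into the
   memory access, e.g. "str rX, [rY], #4" instead of a separate add, so
   when the auto-increment candidate is chosen the cost of that stepping
   instruction has to be subtracted somewhere; the question is whether
   the backend's address cost already accounts for it.  */
void
scale (int *p, int *end, int k)
{
  while (p != end)
    *p++ *= k;
}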

>
> Or maybe I don't understand.
>
> CCing Bernd who implemented this IIRC.
Any suggestions appreciated.

Thanks.
bin

-- 
Best Regards.


RE: [PATCH GCC]Simplify address expression in IVOPT

2013-11-05 Thread bin.cheng


> -Original Message-
> From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-
> ow...@gcc.gnu.org] On Behalf Of bin.cheng
> Sent: Monday, November 04, 2013 4:35 PM
> To: 'Richard Biener'
> Cc: GCC Patches
> Subject: RE: [PATCH GCC]Simplify address expression in IVOPT
> 
> 
> 
> > -Original Message-
> > From: Richard Biener [mailto:richard.guent...@gmail.com]
> > Sent: Wednesday, October 30, 2013 10:46 PM
> > To: Bin Cheng
> > Cc: GCC Patches
> > Subject: Re: [PATCH GCC]Simplify address expression in IVOPT
> >
> > On Tue, Oct 29, 2013 at 10:18 AM, bin.cheng  wrote:
> >
> > Hmm.  I think you want what get_inner_reference_aff does, using the
> > return value of get_inner_reference as starting point for
> determine_base_object.
> > And you don't want to restrict yourselves so much on what exprs to
> process,
> > but only exclude DECL_Ps.
> Did you mean I should pass all address expressions to
> get_inner_reference_aff except the one with declaration operand?
> I am a little confused here why DECL_Ps should be handled specially?
Seems
> it can be handled properly by the get_* function.  Anything important I
> missed?
> 
> > Just amend get_inner_reference_aff to return the tree base object.
> I found it's inappropriate to do this because: functions like
> get_inner_reference* only accept reference expressions, while base_object
> has to be computed for other kinds of expressions too.  Take gimple
> statement "a_1 = *ptr_1" as an example, the base passed to alloc_iv is
ptr_1;
> the base_object computed by determine_base_object is also ptr_1, which
> we can't rely on get_inner_reference* .
> 
> Also It seems the comment before determine_base_object might be
> inaccurate:
> " Returns a memory object to that EXPR points." which I think is " Returns
a
> pointer pointing to the main memory object to that EXPR points."
> Right?
> >
> > Note that this isn't really "simplifying" but rather lowering all
> addresses.
> >
> > The other changes in this patch are unrelated, right?
> Right, I should have named this message like "refine cost computation of
> expressions in IVOPT" just as the patch file.

Hi,
I updated the patch according to the review comments.  Now it lowers all
address expressions except the DECL_P ones through get_inner_reference_aff;
it also computes base_object, which is used later to ease
determine_base_object.
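
As a sketch of what the lowering means at the source level (a hypothetical
example, not a test from the patch):

/* The address of a[i].f is lowered to an affine form roughly like
   a_1 + i * sizeof (struct pair) + offsetof (struct pair, f), with a_1
   being the base object used by determine_base_object.  */
struct pair { int f; int g; };
long
sum_f (struct pair *a, int n)
{
  long s = 0;
  for (int i = 0; i < n; i++)
    s += a[i].f;
  return s;
}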

Bootstrap and test on x86/x86_64/arm are ongoing; is it OK if the tests pass?

Thanks.
bin

gcc/testsuite/ChangeLog
2013-11-05  Bin Cheng  

* gcc.dg/tree-ssa/loop-2.c: Refine check condition.
* gcc.dg/tree-ssa/ivopt_infer_2.c: Ditto.
* gcc.dg/tree-ssa/ivopt_mult_3.c: Ditto.

2013-11-05  Bin Cheng  

* tree-ssa-loop-ivopts.c (alloc_iv): Lower address expressions.
(get_shiftadd_cost): Check equality using operand_equal_p.
(force_expr_to_var_cost): Refactor the code.  Handle type
conversion.
(split_address_cost): Call force_expr_to_var_cost.
* tree-affine.c (get_inner_reference_aff): Return base_addr.
* tree-affine.h (get_inner_reference_aff): Change prototype.

Index: gcc/testsuite/gcc.dg/tree-ssa/loop-2.c
===
--- gcc/testsuite/gcc.dg/tree-ssa/loop-2.c  (revision 204117)
+++ gcc/testsuite/gcc.dg/tree-ssa/loop-2.c  (working copy)
@@ -27,7 +27,7 @@ void xxx(void)
 
 /* { dg-final { scan-tree-dump-times " \\* \[^\\n\\r\]*=" 0 "optimized" } } */
 /* { dg-final { scan-tree-dump-times "\[^\\n\\r\]*= \\* " 0 "optimized" } } */
-/* { dg-final { scan-tree-dump-times "MEM" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "MEM\\\[base" 1 "optimized" } } */
 
 /* 17 * iter should be strength reduced.  */
 
Index: gcc/testsuite/gcc.dg/tree-ssa/ivopt_infer_2.c
===
--- gcc/testsuite/gcc.dg/tree-ssa/ivopt_infer_2.c   (revision 204117)
+++ gcc/testsuite/gcc.dg/tree-ssa/ivopt_infer_2.c   (working copy)
@@ -7,7 +7,8 @@
 
 extern char a[];
 
-/* Can not infer loop iteration from array -- exit test can not be replaced.  
*/
+/* Can not infer loop iteration from array -- exit test can not be
+   replaced by the array address.  */
 void foo (unsigned int i_width, TYPE dst)
 {
   unsigned long long i = 0;
@@ -21,5 +22,5 @@ void foo (unsigned int i_width, TYPE dst)
 }
 }
 
-/* { dg-final { scan-tree-dump-times "Replacing" 0 "ivopts"} } */
+/* { dg-final { scan-tree-dump-times "\[^:\]*if \\(.*j_\[0-9\]+.*\\)" 1 
"ivopts"} } */
 /* { dg-final { cleanup-tree-dump "ivopts&

Re: [PATCH GCC]Simplify address expression in IVOPT

2013-11-05 Thread Bin.Cheng
On Tue, Nov 5, 2013 at 7:19 PM, Yufeng Zhang  wrote:
> On 11/05/13 10:13, bin.cheng wrote:
>>
>> Index: gcc/tree-affine.c
>> ===
>> --- gcc/tree-affine.c   (revision 204117)
>> +++ gcc/tree-affine.c   (working copy)
>> @@ -874,10 +874,11 @@ debug_aff (aff_tree *val)
>> fprintf (stderr, "\n");
>>   }
>>
>> -/* Returns address of the reference REF in ADDR.  The size of the
>> accessed
>> -   location is stored to SIZE.  */
>> +/* Computes address of the reference REF in ADDR.  The size of the
>> accessed
>> +   location is stored to SIZE.  Returns pointer to the ultimate
>> containing
>> +   object to which REF refers.  */
>>
>> -void
>> +tree
>>   get_inner_reference_aff (tree ref, aff_tree *addr, double_int *size)
>>   {
>> HOST_WIDE_INT bitsize, bitpos;
>> @@ -904,6 +905,8 @@ get_inner_reference_aff (tree ref, aff_tree *addr,
>> aff_combination_add (addr,&tmp);
>>
>> *size = double_int::from_shwi ((bitsize + BITS_PER_UNIT - 1) /
>> BITS_PER_UNIT);
>> +
>> +  return base_addr;
>>   }
>>
>
> I think what Richard suggests is to return the base object rather the
> address of the base object, i.e.
I am not sure about that.  We have to pass a pointer_type expression to
determine_base_object for address expressions, because there is no way to
tell a pointer from an object once we are inside determine_base_object.
Thanks.
bin
>
>   return base;
>
> This is good in keeping the consistency in the return values between
> get_inner_reference_aff and get_inner_reference.
>
> Yufeng
>



-- 
Best Regards.


Re: [PATCH PR68021]Set ratio to 1 when computing the value of biv cand by itself

2016-02-10 Thread Bin.Cheng
On Wed, Feb 10, 2016 at 1:27 PM, Richard Biener
 wrote:
>
> On Wed, Feb 10, 2016 at 12:34 PM, Bin Cheng  wrote:
> > Hi,
> > This is another way to fix PR68021, and I think it's the least intrusive 
> > way.  The issue is triggered in a special case in which cand is an original 
> > biv, and use denotes the value of the biv itself.  In this case, the use is 
> > added specifically for the original biv; as a result, get_computation_aff 
> > isn't called for the (use, cand) pair before rewriting the use.  It is 
> > possible that constant_multiple_of/operand_equal_p could fail because of 
> > inconsistent fold behavior.  The fold behavior is fully described in 
> > PR68021.
> > This patch fixes the IVOPT part of the issue by setting ratio to 1, because 
> > it is known that the use has the value of the biv cand.
> >
> > Bootstrap and test on x86_64 and aarch64.  Is it OK if no failures?
>
> Ok, but please add a comment why we have this special-casing instead
> of relying on constant_multiple_of.
Done, patch applied at revision 233269.
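
For the record, a minimal sketch (hypothetical, not the PR testcase) of the
situation the special case covers:

/* i is the original biv, and the use below is simply i's own value, so
   the use/cand pair trivially has ratio 1; no constant_multiple_of
   computation is needed.  */
unsigned
count_up (unsigned n)
{
  unsigned i, s = 0;
  for (i = 0; i < n; i++)   /* i is the original biv */
    s += i;                 /* use of the biv's own value */
  return s;
}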

Thanks,
bin
>
>
> Thanks,
> Richard.
>
> > Thanks,
> > bin
> >
> > 2016-02-09  Bin Cheng  
> >
> > PR tree-optimization/68021
> > * tree-ssa-loop-ivopts.c (get_computation_aff): Set ratio to 1
> > when computing the value of biv cand by itself.
> >
> > gcc/testsuite/ChangeLog
> > 2016-02-09  Bin Cheng  
> >
> > PR tree-optimization/68021
> > * gcc.dg/tree-ssa/pr68021.c: New test.


Re: [PATCH PR69052]Check if loop inv can be propagated into mem ref with additional addr expr canonicalization

2016-02-11 Thread Bin.Cheng
On Thu, Feb 11, 2016 at 7:14 AM, Jeff Law  wrote:
> On 02/09/2016 04:08 AM, Bin Cheng wrote:
>>
>> Hi,
>> When counting cost for loop inv, GCC checks if a loop inv can be
>> propagated into its use site (a memory reference).  If it cannot be
>> propagated, we increase its cost so that it's expensive enough to be hoisted
>> out of loop.  Currently we simply replace loop inv register in the use site
>> with its definition expression, then call validate_changes to check if the
>> result insn is valid.  This is weak because validate_changes doesn't take
>> canonicalization into consideration.  Given below example:
>>
>>Loop inv def:
>> 69: r149:SI=r87:SI+const(unspec[`'] 1)
>>REG_DEAD r87:SI
>>Loop inv use:
>> 70: r150:SI=[r90:SI*0x4+r149:SI]
>>REG_DEAD r149:SI
>>
>> The address expression after propagation is "r90 * 0x4 + (r87 +
>> const(unspec[`']))".  Function validate_changes simply returns false to
>> it.  As a matter of fact, the propagation is feasible if we canonicalize
>> address expression into the form like "(r90 * 0x4 + r87) +
>> const(unspec[`'])".
>>
>> This patch fixes the problem by canonicalizing address expression and
>> verifying if the new addr is valid.  The canonicalization follows GCC insn
>> canonicalization rules.  The test case from bugzilla PR is also included.
>> As for the canonicalize_address interface, there is another
>> canonicalize_address in fwprop.c which only changes shift into mult.  I
>> think it would be good to factor out a common RTL interface in GCC, but
>> that's stage1 work.
>
>
> Also note there's bits in combine that will canonicalize appropriate shifts
> into mults.  Clearly there's a need for some generalized routines to take a
> fairly generic address and perform canonicalizations and simplifications on
> it.
>
>> Bootstrap and test on x86_64 and AArch64.  Is it OK?
>>
>> Thanks,
>> bin
>>
>> 2016-02-09  Bin Cheng
>>
>> PR tree-optimization/69052
>> * loop-invariant.c (canonicalize_address): New function.
>> (inv_can_prop_to_addr_use): Check validity of address expression
>> which is canonicalized by above function.
>>
>> gcc/testsuite/ChangeLog
>> 2016-02-09  Bin Cheng
>>
>> PR tree-optimization/69052
>> * gcc.target/i386/pr69052.c: New test.
>>
>>
>> pr69052-20160204.txt
>>
>>
>> diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c
>> index 707f044..157e273 100644
>> --- a/gcc/loop-invariant.c
>> +++ b/gcc/loop-invariant.c
>> @@ -754,6 +754,74 @@ create_new_invariant (struct def *def, rtx_insn
>> *insn, bitmap depends_on,
>> return inv;
>>   }
>>
>> +/* Returns a canonical version address for X.  It identifies
>> +   addr expr in the form of A + B + C.  Following instruction
>> +   canonicalization rules, MULT operand is moved to the front,
>> +   CONST operand is moved to the end; also PLUS operators are
>> +   chained to the left.  */
>> +
>> +static rtx
>> +canonicalize_address (rtx x)
>> +{
>> +  rtx op0, op1, op2;
>> +  machine_mode mode = GET_MODE (x);
>> +  enum rtx_code code = GET_CODE (x);
>> +
>> +  if (code != PLUS)
>> +return x;
>> +
>> +  /* Extract operands from A + B (+ C).  */
>> +  if (GET_CODE (XEXP (x, 0)) == PLUS)
>> +{
>> +  op0 = XEXP (XEXP (x, 0), 0);
>> +  op1 = XEXP (XEXP (x, 0), 1);
>> +  op2 = XEXP (x, 1);
>> +}
>> +  else if (GET_CODE (XEXP (x, 1)) == PLUS)
>> +{
>> +  op0 = XEXP (x, 0);
>> +  op1 = XEXP (XEXP (x, 1), 0);
>> +  op2 = XEXP (XEXP (x, 1), 1);
>> +}
>> +  else
>> +{
>> +  op0 = XEXP (x, 0);
>> +  op1 = XEXP (x, 1);
>> +  op2 = NULL_RTX;
>> +}
>> +
>> +  /* Move MULT operand to the front.  */
>> +  if (!REG_P (op1) && !CONST_INT_P (op1))
>> +std::swap (op0, op1);
>
> This feels a bit hack-ish in the sense that you already know the form of the
> RTL you're expecting and just assume that you'll be given something of that
> form, but no more complex.
>
> ISTM you're better off walking the whole rtx, recording the tidbits as you
> go into a vec.  If you see something unexpected during that walk, you punt
> canonicalization of the whole expression.
>
> You then sort the vec.  You want to move things like MULT to the start and
> all the constants to the end I think.
>
> You then do simplifications, particularly on the constants, but there may be
> something useful to do with MULT terms as well.  You could also arrange to
> rewrite ASHIFTs into MULTs at this stage.
>
> Then you generate a new equivalent expression from the simplified operands
> in the vec.
>
> You might look at tree-ssa-reassoc for ideas on implementation details.
>
> Initially just use it in the LICM code, but I think given that kind of
> structure it'd be generally useful to replace bits of combine and fwprop
>
> If your contention is that only a few forms really matter, then I'd like to
> see those forms spelled out better in the comment and some kind of checking
> that we have reasonable inc

Re: [PATCH PR69052]Check if loop inv can be propagated into mem ref with additional addr expr canonicalization

2016-02-22 Thread Bin.Cheng
On Fri, Feb 19, 2016 at 10:24 PM, Jeff Law  wrote:
> On 02/16/2016 11:43 AM, Bin Cheng wrote:
>>
>> 
>> From: Jeff Law 
>> Sent: 11 February 2016 23:26
>> To: Bin.Cheng
>> Cc: Bin Cheng; gcc-patches@gcc.gnu.org; nd
>> Subject: Re: [PATCH PR69052]Check if loop inv can be propagated into mem
>> ref with additional addr expr canonicalization
>>
>>>> On 02/11/2016 10:59 AM, Bin.Cheng wrote:
>>
>>
>>>> Hi Jeff,
>>>> Thanks for detailed review.  I also think a generic canonical
>>>> interface for RTL is much better.  I will give it a try.  But with
>>>> high chance it's a next stage1 stuff.
>>>
>>> That is, of course, fine.  However, if you do get something ready, I'd
>>> support using it within LICM for gcc-6, then using it in other places
>>> for gcc-7.
>>
>> Hi,
>> This is the updated version patch.  It fixes the problem by introducing a
>> generic address canonicalization interface.  This new interface
>> canonicalizes address expression in following steps:
>>   1) Rewrite ASHIFT into MULT recursively.
>>   2) Divide address into sub expressions with PLUS as the separator.
>>   3) Sort sub expressions according to precedence defined for
>> commutative operations.
>>   4) Simplify CONST_INT_P sub expressions.
>>   5) Create new canonicalized address and return.
>>
>> According to review comments, this interface is now restricted to LICM,
>> and will probably be expanded to other passes like fwprop and combine after
>> entering GCC7.
>> Bootstrap and test on x86_64 and AArch64.  Is it OK?
>>
>> Thanks,
>> bin
>>
>> 2016-02-15  Bin Cheng  
>>
>> PR tree-optimization/69052
>> * loop-invariant.c (canonicalize_address_mult): New function.
>> (MAX_CANON_ADDR_PARTS): New macro.
>> (collect_address_parts): New function.
>> (compare_address_parts, canonicalize_address): New functions.
>> (inv_can_prop_to_addr_use): Check validity of address expression
>> which is canonicalized by above canonicalize_address.
>>
>> gcc/testsuite/ChangeLog
>> 2016-02-15  Bin Cheng  
>>
>> PR tree-optimization/69052
>> * gcc.target/i386/pr69052.c: New test.
>
> This is exactly what I was looking for from a design standpoint.
>
> My only question is why didn't you use FOR_EACH_SUBRTX_VAR from rtl-iter.h
> to walk the RTX expressions in collect_address_parts and
> canonicalize_address_mult?
Hi Jeff,
Nothing special; I just haven't used it before, and
canonicalize_address_mult is basically copied from fwprop.c.  One
question: when rewriting ASHIFT to MULT we need to modify the rtl
expression in place; does the FOR_EACH_SUBRTX iterator support this?  If
yes, what is the behavior for the modified sub-expression?

Thanks,
bin
>
> Jeff
>
>
>>
>>>
>>> Jeff
>>
>>
>


Re: [PATCH PR69052]Check if loop inv can be propagated into mem ref with additional addr expr canonicalization

2016-02-23 Thread Bin.Cheng
On Fri, Feb 19, 2016 at 10:24 PM, Jeff Law  wrote:
> On 02/16/2016 11:43 AM, Bin Cheng wrote:
>>
>> 
>> From: Jeff Law 
>> Sent: 11 February 2016 23:26
>> To: Bin.Cheng
>> Cc: Bin Cheng; gcc-patches@gcc.gnu.org; nd
>> Subject: Re: [PATCH PR69052]Check if loop inv can be propagated into mem
>> ref with additional addr expr canonicalization
>>
>>>> On 02/11/2016 10:59 AM, Bin.Cheng wrote:
>>
>>
>>>> Hi Jeff,
>>>> Thanks for detailed review.  I also think a generic canonical
>>>> interface for RTL is much better.  I will give it a try.  But with
>>>> high chance it's a next stage1 stuff.
>>>
>>> That is, of course, fine.  However, if you do get something ready, I'd
>>> support using it within LICM for gcc-6, then using it in other places
>>> for gcc-7.
>>
>> Hi,
>> This is the updated version patch.  It fixes the problem by introducing a
>> generic address canonicalization interface.  This new interface
>> canonicalizes address expression in following steps:
>>   1) Rewrite ASHIFT into MULT recursively.
>>   2) Divide address into sub expressions with PLUS as the separator.
>>   3) Sort sub expressions according to precedence defined for
>> commutative operations.
>>   4) Simplify CONST_INT_P sub expressions.
>>   5) Create new canonicalized address and return.
>>
>> According to review comments, this interface is now restricted to LICM,
>> and will probably be expanded to other passes like fwprop and combine after
>> entering GCC7.
>> Bootstrap and test on x86_64 and AArch64.  Is it OK?
>>
>> Thanks,
>> bin
>>
>> 2016-02-15  Bin Cheng  
>>
>> PR tree-optimization/69052
>> * loop-invariant.c (canonicalize_address_mult): New function.
>> (MAX_CANON_ADDR_PARTS): New macro.
>> (collect_address_parts): New function.
>> (compare_address_parts, canonicalize_address): New functions.
>> (inv_can_prop_to_addr_use): Check validity of address expression
>> which is canonicalized by above canonicalize_address.
>>
>> gcc/testsuite/ChangeLog
>> 2016-02-15  Bin Cheng  
>>
>> PR tree-optimization/69052
>> * gcc.target/i386/pr69052.c: New test.
>
> This is exactly what I was looking for from a design standpoint.
>
> My only question is why didn't you use FOR_EACH_SUBRTX_VAR from rtl-iter.h
> to walk the RTX expressions in collect_address_parts and
> canonicalize_address_mult?
Hi Jeff,
Here comes the updated patch using FOR_EACH_SUBRTX_VAR in both
functions you mentioned.
Bootstrap and test on x86_64 and AArch64, is it OK?  The ChangeLog
isn't affected.
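
For reference, a hypothetical source fragment (not from the PR) that
produces the "base + index * scale + offset" shape the canonicalization
targets:

/* The store address is s + i*4 + 40; after canonicalization the MULT
   term comes first and the constant last, so the loop-invariant part
   can be checked for propagation into the memory reference.  */
struct rec { int pad[10]; int a[100]; };
void
set_elem (struct rec *s, int i)
{
  s->a[i] = 1;
}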

Thanks,
bin
>
> Jeff
>
>
>>
>>>
>>> Jeff
>>
>>
>
diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c
index 707f044..dcbe932 100644
--- a/gcc/loop-invariant.c
+++ b/gcc/loop-invariant.c
@@ -52,6 +52,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "cfgloop.h"
 #include "expr.h"
 #include "params.h"
+#include "rtl-iter.h"
 #include "dumpfile.h"
 
 /* The data stored for the loop.  */
@@ -754,6 +755,130 @@ create_new_invariant (struct def *def, rtx_insn *insn, 
bitmap depends_on,
   return inv;
 }
 
+/* Return a canonical version of X for the address, from the point of view,
+   that all multiplications are represented as MULT instead of the multiply
+   by a power of 2 being represented as ASHIFT.
+
+   Callers should prepare a copy of X because this function may modify it
+   in place.  */
+
+static void
+canonicalize_address_mult (rtx x)
+{
+  subrtx_var_iterator::array_type array;
+  FOR_EACH_SUBRTX_VAR (iter, array, x, NONCONST)
+{
+  rtx sub = *iter;
+
+  if (GET_CODE (sub) == ASHIFT
+ && CONST_INT_P (XEXP (sub, 1))
+ && INTVAL (XEXP (sub, 1)) < GET_MODE_BITSIZE (GET_MODE (sub))
+ && INTVAL (XEXP (sub, 1)) >= 0)
+   {
+ HOST_WIDE_INT shift = INTVAL (XEXP (sub, 1));
+ PUT_CODE (sub, MULT);
+ XEXP (sub, 1) = gen_int_mode ((HOST_WIDE_INT) 1 << shift,
+   GET_MODE (sub));
+ iter.skip_subrtxes ();
+   }
+}
+}
+
+/* Maximum number of sub expressions in address.  We set it to
+   a small integer since it's unlikely to have a complicated
+   address expression.  */
+
+#define MAX_CANON_ADDR_PARTS (5)
+
+/* Collect sub expressions in address X with PLUS as the seperator.
+   Sub expressions are stored in vector ADDR_PARTS.  */
+
+static void
+collect_address_parts (rtx x, 

Re: [PATCH PR69052]Check if loop inv can be propagated into mem ref with additional addr expr canonicalization

2016-02-25 Thread Bin.Cheng
On Thu, Feb 25, 2016 at 6:39 AM, Jeff Law  wrote:
> On 02/22/2016 02:22 AM, Bin.Cheng wrote:
>
>>> My only question is why didn't you use FOR_EACH_SUBRTX_VAR from
>>> rtl-iter.h
>>> to walk the RTX expressions in collect_address_parts and
>>> canonicalize_address_mult?
>>
>> Hi Jeff,
>> Nothing special, just I haven't used this before, also
>> canonicalize_address_mult is alphabetically copied from fwprop.c.  One
>> question is when rewriting SHIFT to MULT, we need to modify rtl
>> expression in place, does FOR_EACH_SUBRTX iterator support this?  If
>> yes, what is the behavior for modified sub-expression?
>
> Hmm.  The question of semantics when we change the underlying
> sub-expressions is an interesting one.
>
> While I think we're OK in practice using the iterators, I think that's more
> of an accident than by design -- if MULT/ASHIFT had a different underlying
> RTL structure then I'm much less sure using the iterators would be safe.
Hi Jeff,
Yes, I thought about this too and finally decided to skip sub-rtxes in
modified MULT/ASHIFT expressions.  I think this makes the patch with the
FOR_EACH_SUBRTX iterator safe.  Although it doesn't dive into
MULT/ASHIFT while the original one does, I don't think such a case
occurs in practice; after all, we can't handle such complicated address
expressions either.
>
> Let's go with your original patch that didn't use the iterators.  Sorry for
> making you do the additional work/testing to build the iterator version.
Not even a problem.
> But after pondering the issue you raised, I think your original patch is
> safer.
So in conclusion, I think both versions should be safe, and the iterator
one is definitely cleaner.  It is your call which version I should
apply.

Thanks,
bin
>
>
> jeff


Re: [PATCH PR69052]Check if loop inv can be propagated into mem ref with additional addr expr canonicalization

2016-03-01 Thread Bin.Cheng
On Thu, Feb 25, 2016 at 8:47 AM, Bin.Cheng  wrote:
> On Thu, Feb 25, 2016 at 6:39 AM, Jeff Law  wrote:
>> On 02/22/2016 02:22 AM, Bin.Cheng wrote:
>>
>>>> My only question is why didn't you use FOR_EACH_SUBRTX_VAR from
>>>> rtl-iter.h
>>>> to walk the RTX expressions in collect_address_parts and
>>>> canonicalize_address_mult?
>>>
>>> Hi Jeff,
>>> Nothing special, just I haven't used this before, also
>>> canonicalize_address_mult is alphabetically copied from fwprop.c.  One
>>> question is when rewriting SHIFT to MULT, we need to modify rtl
>>> expression in place, does FOR_EACH_SUBRTX iterator support this?  If
>>> yes, what is the behavior for modified sub-expression?
>>
>> Hmm.  The question of semantics when we change the underlying
>> sub-expressions is an interesting one.
>>
>> While I think we're OK in practice using the iterators, I think that's more
>> of an accident than by design -- if MULT/ASHIFT had a different underlying
>> RTL structure then I'm much less sure using the iterators would be safe.
> Hi Jeff,
> Yes, I thought about this too and finally decided to skip sub rtxes in
> modified MULT/ASHIFT expressions.  I think this will make the patch
> with FOR_EACH_SUBRTX iterator safe.  Although it doesn't dive into
> MULT/ASHIFT while the original one does, I don't think there is such
> case in practice, after all we can't handle such complicated address
> expression either.
>>
>> Let's go with your original patch that didn't use the iterators.  Sorry for
>> making you do the additional work/testing to build the iterator version.
> Not even a problem.
>> But after pondering the issue you raised, I think your original patch is
>> safer.
> So in conclusion, I think both versions should be safe, the iterator
> one is definitely cleaner.  It is your call which version I should
> apply.
Hi Jeff,
I am inclined to apply the new patch with FOR_EACH_SUBRTX because it's
clearer.  Does that work for you?  Of course, if you think it's not
safe enough I will fall back to the old one.

Thanks,
bin


Re: [PATCH PR69052]Set higher precedence for CONST_WIDE_INT than CONST_INT when canonicalizing addr expr

2016-03-04 Thread Bin.Cheng
On Fri, Mar 4, 2016 at 11:57 AM, Richard Biener
 wrote:
> On Fri, Mar 4, 2016 at 12:07 PM, Bin Cheng  wrote:
>> Hi,
>> An address canonicalization interface was introduced by my original patch to 
>> PR69052.  The interface sorts sub-parts in an address expression based on 
>> precedences defined by function commutative_operand_precedence.  It also 
>> assumes that all CONST_INT sub-parts are at the end of the vector after 
>> sorting.  But this is not always true, because commutative_operand_precedence 
>> sets the same precedence for both CONST_INT and CONST_WIDE_INT.  The patch 
>> could be broken in cases which have CONST_WIDE_INT sorted after CONST_INT. 
>> Even though I don't think we can really run into such a complicated case, I 
>> worked out a patch forcing CONST_INT to a lower precedence than 
>> CONST_WIDE_INT, so that it is guaranteed to be sorted after all other kinds 
>> of sub-parts.
>>
>> This is an obvious change.  Bootstrap&test on x86_64, bootstrap&test on 
>> AArch64.  Is it OK for this stage?
>
> I believe the obvious change would be to change
> commutative_operand_precedence to pre-CONST_WIDE_INT behavior, namely
> giving CONST_WIDE_INT the same precedence as CONST_DOUBLE.
Yes, but I am not sure what else this change will affect, so I made
the smallest change to the original code.  I am testing this now.  It
would be great if someone could describe it a bit.

Thanks,
bin
>
> Richard.
>
>
>
>> Thanks,
>> bin
>>
>> 2016-03-04  Bin Cheng  
>>
>> PR rtl-optimization/69052
>> * loop-invariant.c (compare_address_parts): Force CONST_INT to lower
>> precedence than CONST_WIDE_INT.


Re: [PING][GCC ARM]Refine scaled address expression on ARM

2013-11-27 Thread Bin.Cheng
Ping^2

Thanks,
bin

On Fri, Nov 22, 2013 at 3:32 PM, bin.cheng  wrote:
> PING.
> Hi, there is a patch at
> http://gcc.gnu.org/ml/gcc-patches/2013-09/msg01353.html which slipped away.
>
> Thanks,
> bin
>
>
>



-- 
Best Regards.


[PATCH ARM/Embedded-4_8-branch]Backport revision 200103 from mainline

2013-11-27 Thread bin.cheng
Hi,
This patch backports revision 200103 from mainline to
ARM/Embedded-4_8-branch.  It is tested on arm cortex-m3 with no regressions.

Thanks,
bin

Index: gcc/tree-ssa-uncprop.c
===
--- gcc/tree-ssa-uncprop.c  (revision 205471)
+++ gcc/tree-ssa-uncprop.c  (revision 205472)
@@ -466,12 +466,11 @@
  struct equiv_hash_elt equiv_hash_elt;
  void **slot;
 
- /* If the argument is not an invariant, and refers to the same
-underlying variable as the PHI result, then there's no
-point in un-propagating the argument.  */
+ /* If the argument is not an invariant and can be potentially
+coalesced with the result, then there's no point in
+un-propagating the argument.  */
  if (!is_gimple_min_invariant (arg)
- && (SSA_NAME_VAR (arg) == SSA_NAME_VAR (res)
- && TREE_TYPE (arg) == TREE_TYPE (res)))
+ && gimple_can_coalesce_p (arg, res))
continue;
 
  /* Lookup this argument's value in the hash table.  */
@@ -485,7 +484,7 @@
  int j;
 
  /* Walk every equivalence with the same value.  If we find
-one with the same underlying variable as the PHI result,
+one that can potentially coalesce with the PHI rsult,
 then replace the value in the argument with its equivalent
 SSA_NAME.  Use the most recent equivalence as hopefully
 that results in shortest lifetimes.  */
@@ -493,8 +492,7 @@
{
  tree equiv = elt->equivalences[j];
 
- if (SSA_NAME_VAR (equiv) == SSA_NAME_VAR (res)
- && TREE_TYPE (equiv) == TREE_TYPE (res))
+ if (gimple_can_coalesce_p (equiv, res))
{
  SET_PHI_ARG_DEF (phi, e->dest_idx, equiv);
  break;
Index: gcc/testsuite/ChangeLog.arm
===
--- gcc/testsuite/ChangeLog.arm (revision 205471)
+++ gcc/testsuite/ChangeLog.arm (revision 205472)
@@ -1,3 +1,10 @@
+2013-11-28  Bin Cheng  
+
+   Backport mainline r200103
+   2013-06-15  Jeff Law  (missing)
+
+   * gcc.dg/tree-ssa/coalesce-1.c: New test.
+
 2013-11-27  Terry Guo  
 
Backport mainline r205391
Index: gcc/testsuite/gcc.dg/tree-ssa/coalesce-1.c
===
--- gcc/testsuite/gcc.dg/tree-ssa/coalesce-1.c  (revision 0)
+++ gcc/testsuite/gcc.dg/tree-ssa/coalesce-1.c  (revision 205472)
@@ -0,0 +1,195 @@
+/* { dg-do compile } */
+
+/* { dg-options "-O2 -fdump-rtl-expand-details" } */
+
+typedef long unsigned int size_t;
+union tree_node;
+typedef union tree_node *tree;
+union gimple_statement_d;
+typedef union gimple_statement_d *gimple;
+typedef const union tree_node *const_tree;
+typedef const union gimple_statement_d *const_gimple;
+struct gimple_seq_d;
+typedef struct gimple_seq_d *gimple_seq;
+struct edge_def;
+typedef struct edge_def *edge;
+struct basic_block_def;
+typedef struct basic_block_def *basic_block;
+typedef const struct basic_block_def *const_basic_block;
+struct tree_exp
+{
+  tree operands[1];
+};
+typedef struct ssa_use_operand_d
+{
+  tree *use;
+} ssa_use_operand_t;
+struct phi_arg_d
+{
+  struct ssa_use_operand_d imm_use;
+};
+union tree_node
+{
+  struct tree_exp exp;
+};
+struct function
+{
+};
+extern struct function *cfun;
+struct edge_def
+{
+  unsigned int dest_idx;
+};
+static __inline__ void
+VEC_edge_must_be_pointer_type (void)
+{
+  (void) ((edge) 1 == (void *) 1);
+} typedef struct VEC_edge_base
+
+{
+  unsigned num;
+  unsigned alloc;
+  edge vec[1];
+} VEC_edge_base;
+typedef struct VEC_edge_none
+{
+  VEC_edge_base base;
+} VEC_edge_none;
+
+static __inline__ edge
+VEC_edge_base_index (const VEC_edge_base * vec_, unsigned ix_,
+const char *file_, unsigned line_, const char *function_)
+{
+  return vec_->vec[ix_];
+}
+
+typedef struct VEC_edge_gc
+{
+  VEC_edge_base base;
+} VEC_edge_gc;
+struct basic_block_def
+{
+  VEC_edge_gc *succs;
+};
+static __inline__ edge
+single_succ_edge (const_basic_block bb)
+{
+  return (VEC_edge_base_index
  ((((bb)->succs) ? &((bb)->succs)->base : 0), (0),
+  "/home/gcc/virgin-gcc/gcc/basic-block.h", 563, __FUNCTION__));
+}
+
+edge find_edge (basic_block, basic_block);
+typedef tree *def_operand_p;
+typedef ssa_use_operand_t *use_operand_p;
+struct gimple_seq_node_d;
+typedef struct gimple_seq_node_d *gimple_seq_node;
+struct gimple_seq_node_d
+{
+  gimple stmt;
+};
+typedef struct
+{
+  gimple_seq_node ptr;
+  gimple_seq seq;
+  basic_block bb;
+} gimple_stmt_iterator;
+struct gimple_statement_phi
+{
+  struct phi_arg_d args[1];
+};
+union gimple_statement_d
+{
+  struct gimple_statement_phi gimple_phi;
+};
+extern size_t const gimple_ops_offset_[];
+static __inline__ tre

Re: [PATCH ARM]Refine scaled address expression on ARM

2013-11-28 Thread Bin.Cheng
On Thu, Nov 28, 2013 at 6:48 PM, Richard Earnshaw  wrote:
> On 18/09/13 10:15, bin.cheng wrote:
>>
>>
>>> -Original Message-
>>> From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-
>>> ow...@gcc.gnu.org] On Behalf Of bin.cheng
>>> Sent: Monday, September 02, 2013 3:09 PM
>>> To: Richard Earnshaw
>>> Cc: gcc-patches@gcc.gnu.org
>>> Subject: RE: [PATCH ARM]Refine scaled address expression on ARM
>>>
>>>
>>>
>>>> -Original Message-
>>>> From: Richard Earnshaw
>>>> Sent: Thursday, August 29, 2013 9:06 PM
>>>> To: Bin Cheng
>>>> Cc: gcc-patches@gcc.gnu.org
>>>> Subject: Re: [PATCH ARM]Refine scaled address expression on ARM
>>>>
>>>> On 28/08/13 08:00, bin.cheng wrote:
>>>>> Hi,
>>>>>
>>>>> This patch refines scaled address expression on ARM.  It supports
>>>>> "base+index*scale" in arm_legitimate_address_outer_p.  It also tries
>>>>> to legitimize "base + index * scale + offset" with "reg <- base +
>>>>> offset;  reg
>>>>> + index * scale" by introducing thumb2_legitimize_address.  For now
>>>>> + function
>>>>> thumb2_legitimize_address is a kind of placeholder and just does the
>>>>> mentioned transformation by calling to try_multiplier_address.
>>>>> Hoping we can improve it in the future.
>>>>>
>>>>> With this patch:
>>>>> 1) "base+index*scale" is recognized.
>>>>
>>>> That's because (PLUS (REG) (MULT (REG) (CONST))) is not canonical form.
>>>>  So this shouldn't be necessary.  Can you identify where this
>>> non-canoncial form is being generated?
>>>>
>>>
>>> Oh, for now ivopt constructs "index*scale" to test whether backend
>>> supports scaled addressing mode, which is not valid on ARM, so I was going
>>> to construct "base + index*scale" instead.  Since "base + index * scale"
>> is not
>>> canonical form, I will construct the canonical form and drop this part of
>> the
>>> patch.
>>>
>>> Is rest of this patch OK?
>>>
>> Hi Richard, I removed the part over which you concerned and created this
>> updated patch.
>>
>> Is it OK?
>>
>> Thanks.
>> bin
>>
>> 2013-09-18  Bin Cheng  
>>
>>   * config/arm/arm.c (try_multiplier_address): New function.
>>   (thumb2_legitimize_address): New function.
>>   (arm_legitimize_address): Call try_multiplier_address and
>>   thumb2_legitimize_address.
>>
>>
>> 6-arm-scaled_address-20130918.txt
>>
>>
>> Index: gcc/config/arm/arm.c
>> ===
>> --- gcc/config/arm/arm.c  (revision 200774)
>> +++ gcc/config/arm/arm.c  (working copy)
>> @@ -6652,6 +6654,106 @@ legitimize_tls_address (rtx x, rtx reg)
>>  }
>>  }
>>
>> +/* Try to find address expression like base + index * scale + offset
>> +   in X.  If we find one, force base + offset into register and
>> +   construct new expression reg + index * scale; return the new
>> +   address expression if it's valid.  Otherwise return X.  */
>> +static rtx
>> +try_multiplier_address (rtx x, enum machine_mode mode ATTRIBUTE_UNUSED)
>> +{
>> +  rtx tmp, base_reg, new_rtx;
>> +  rtx base = NULL_RTX, index = NULL_RTX, scale = NULL_RTX, offset = 
>> NULL_RTX;
>> +
>> +  gcc_assert (GET_CODE (x) == PLUS);
>> +
>> +  /* Try to find and record base/index/scale/offset in X. */
>> +  if (GET_CODE (XEXP (x, 1)) == MULT)
>> +{
>> +  tmp = XEXP (x, 0);
>> +  index = XEXP (XEXP (x, 1), 0);
>> +  scale = XEXP (XEXP (x, 1), 1);
>> +  if (GET_CODE (tmp) != PLUS)
>> + return x;
>> +
>> +  base = XEXP (tmp, 0);
>> +  offset = XEXP (tmp, 1);
>> +}
>> +  else
>> +{
>> +  tmp = XEXP (x, 0);
>> +  offset = XEXP (x, 1);
>> +  if (GET_CODE (tmp) != PLUS)
>> + return x;
>> +
>> +  base = XEXP (tmp, 0);
>> +  scale = XEXP (tmp, 1);
>> +  if (GET_CODE (base) == MULT)
>> + {
>> +   tmp = base;
>> +   base = scale;
>> +   scale = tmp;
>> + }
>> +  if (GET_CODE (scale) != MULT)
>> + return x;
>> +
>> +

Re: [PATCH ARM]Refine scaled address expression on ARM

2013-11-28 Thread Bin.Cheng
On Thu, Nov 28, 2013 at 8:06 PM, Bin.Cheng  wrote:
> On Thu, Nov 28, 2013 at 6:48 PM, Richard Earnshaw  wrote:
>> On 18/09/13 10:15, bin.cheng wrote:
>>>
>>>
>>>> -Original Message-
>>>> From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-
>>>> ow...@gcc.gnu.org] On Behalf Of bin.cheng
>>>> Sent: Monday, September 02, 2013 3:09 PM
>>>> To: Richard Earnshaw
>>>> Cc: gcc-patches@gcc.gnu.org
>>>> Subject: RE: [PATCH ARM]Refine scaled address expression on ARM
>>>>
>>>>
>>>>
>>>>> -Original Message-
>>>>> From: Richard Earnshaw
>>>>> Sent: Thursday, August 29, 2013 9:06 PM
>>>>> To: Bin Cheng
>>>>> Cc: gcc-patches@gcc.gnu.org
>>>>> Subject: Re: [PATCH ARM]Refine scaled address expression on ARM
>>>>>
>>>>> On 28/08/13 08:00, bin.cheng wrote:
>>>>>> Hi,
>>>>>>
>>>>>> This patch refines scaled address expression on ARM.  It supports
>>>>>> "base+index*scale" in arm_legitimate_address_outer_p.  It also tries
>>>>>> to legitimize "base + index * scale + offset" with "reg <- base +
>>>>>> offset;  reg
>>>>>> + index * scale" by introducing thumb2_legitimize_address.  For now
>>>>>> + function
>>>>>> thumb2_legitimize_address is a kind of placeholder and just does the
>>>>>> mentioned transformation by calling to try_multiplier_address.
>>>>>> Hoping we can improve it in the future.
>>>>>>
>>>>>> With this patch:
>>>>>> 1) "base+index*scale" is recognized.
>>>>>
>>>>> That's because (PLUS (REG) (MULT (REG) (CONST))) is not canonical form.
>>>>>  So this shouldn't be necessary.  Can you identify where this
>>>> non-canoncial form is being generated?
>>>>>
>>>>
>>>> Oh, for now ivopt constructs "index*scale" to test whether backend
>>>> supports scaled addressing mode, which is not valid on ARM, so I was going
>>>> to construct "base + index*scale" instead.  Since "base + index * scale"
>>> is not
>>>> canonical form, I will construct the canonical form and drop this part of
>>> the
>>>> patch.
>>>>
>>>> Is rest of this patch OK?
>>>>
>>> Hi Richard, I removed the part over which you concerned and created this
>>> updated patch.
>>>
>>> Is it OK?
>>>
>>> Thanks.
>>> bin
>>>
>>> 2013-09-18  Bin Cheng  
>>>
>>>   * config/arm/arm.c (try_multiplier_address): New function.
>>>   (thumb2_legitimize_address): New function.
>>>   (arm_legitimize_address): Call try_multiplier_address and
>>>   thumb2_legitimize_address.
>>>
>>>
>>> 6-arm-scaled_address-20130918.txt
>>>
>>>
>>> Index: gcc/config/arm/arm.c
>>> ===
>>> --- gcc/config/arm/arm.c  (revision 200774)
>>> +++ gcc/config/arm/arm.c  (working copy)
>>> @@ -6652,6 +6654,106 @@ legitimize_tls_address (rtx x, rtx reg)
>>>  }
>>>  }
>>>
>>> +/* Try to find address expression like base + index * scale + offset
>>> +   in X.  If we find one, force base + offset into register and
>>> +   construct new expression reg + index * scale; return the new
>>> +   address expression if it's valid.  Otherwise return X.  */
>>> +static rtx
>>> +try_multiplier_address (rtx x, enum machine_mode mode ATTRIBUTE_UNUSED)
>>> +{
>>> +  rtx tmp, base_reg, new_rtx;
>>> +  rtx base = NULL_RTX, index = NULL_RTX, scale = NULL_RTX, offset = 
>>> NULL_RTX;
>>> +
>>> +  gcc_assert (GET_CODE (x) == PLUS);
>>> +
>>> +  /* Try to find and record base/index/scale/offset in X. */
>>> +  if (GET_CODE (XEXP (x, 1)) == MULT)
>>> +{
>>> +  tmp = XEXP (x, 0);
>>> +  index = XEXP (XEXP (x, 1), 0);
>>> +  scale = XEXP (XEXP (x, 1), 1);
>>> +  if (GET_CODE (tmp) != PLUS)
>>> + return x;
>>> +
>>> +  base = XEXP (tmp, 0);
>>> +  offset = XEXP (tmp, 1);
>>> +}
>>> +  else
>>> +{
>>>

Re: [PATCH ARM]Refine scaled address expression on ARM

2013-11-29 Thread Bin.Cheng
On Fri, Nov 29, 2013 at 6:44 PM, Richard Biener
 wrote:
> On Fri, Nov 29, 2013 at 8:52 AM, Bin.Cheng  wrote:
>> On Thu, Nov 28, 2013 at 8:06 PM, Bin.Cheng  wrote:
>>> On Thu, Nov 28, 2013 at 6:48 PM, Richard Earnshaw  wrote:
>>>> On 18/09/13 10:15, bin.cheng wrote:
>>>>>
>>>>>
>>>>>> -Original Message-
>>>>>> From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-
>>>>>> ow...@gcc.gnu.org] On Behalf Of bin.cheng
>>>>>> Sent: Monday, September 02, 2013 3:09 PM
>>>>>> To: Richard Earnshaw
>>>>>> Cc: gcc-patches@gcc.gnu.org
>>>>>> Subject: RE: [PATCH ARM]Refine scaled address expression on ARM
>>>>>>
>>>>>>
>>>>>>
>>>>>>> -Original Message-
>>>>>>> From: Richard Earnshaw
>>>>>>> Sent: Thursday, August 29, 2013 9:06 PM
>>>>>>> To: Bin Cheng
>>>>>>> Cc: gcc-patches@gcc.gnu.org
>>>>>>> Subject: Re: [PATCH ARM]Refine scaled address expression on ARM
>>>>>>>
>>>>>>> On 28/08/13 08:00, bin.cheng wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> This patch refines scaled address expression on ARM.  It supports
>>>>>>>> "base+index*scale" in arm_legitimate_address_outer_p.  It also tries
>>>>>>>> to legitimize "base + index * scale + offset" with "reg <- base +
>>>>>>>> offset;  reg
>>>>>>>> + index * scale" by introducing thumb2_legitimize_address.  For now
>>>>>>>> + function
>>>>>>>> thumb2_legitimize_address is a kind of placeholder and just does the
>>>>>>>> mentioned transformation by calling to try_multiplier_address.
>>>>>>>> Hoping we can improve it in the future.
>>>>>>>>
>>>>>>>> With this patch:
>>>>>>>> 1) "base+index*scale" is recognized.
>>>>>>>
>>>>>>> That's because (PLUS (REG) (MULT (REG) (CONST))) is not canonical form.
>>>>>>>  So this shouldn't be necessary.  Can you identify where this
>>>>>> non-canoncial form is being generated?
>>>>>>>
>>>>>>
>>>>>> Oh, for now ivopt constructs "index*scale" to test whether backend
>>>>>> supports scaled addressing mode, which is not valid on ARM, so I was 
>>>>>> going
>>>>>> to construct "base + index*scale" instead.  Since "base + index * scale"
>>>>> is not
>>>>>> canonical form, I will construct the canonical form and drop this part of
>>>>> the
>>>>>> patch.
>>>>>>
>>>>>> Is rest of this patch OK?
>>>>>>
>>>>> Hi Richard, I removed the part over which you concerned and created this
>>>>> updated patch.
>>>>>
>>>>> Is it OK?
>>>>>
>>>>> Thanks.
>>>>> bin
>>>>>
>>>>> 2013-09-18  Bin Cheng  
>>>>>
>>>>>   * config/arm/arm.c (try_multiplier_address): New function.
>>>>>   (thumb2_legitimize_address): New function.
>>>>>   (arm_legitimize_address): Call try_multiplier_address and
>>>>>   thumb2_legitimize_address.
>>>>>
>>>>>
>>>>> 6-arm-scaled_address-20130918.txt
>>>>>
>>>>>
>>>>> Index: gcc/config/arm/arm.c
>>>>> ===
>>>>> --- gcc/config/arm/arm.c  (revision 200774)
>>>>> +++ gcc/config/arm/arm.c  (working copy)
>>>>> @@ -6652,6 +6654,106 @@ legitimize_tls_address (rtx x, rtx reg)
>>>>>  }
>>>>>  }
>>>>>
>>>>> +/* Try to find address expression like base + index * scale + offset
>>>>> +   in X.  If we find one, force base + offset into register and
>>>>> +   construct new expression reg + index * scale; return the new
>>>>> +   address expression if it's valid.  Otherwise return X.  */
>>>>> +static rtx
>>>>> +try_multiplier_address (rtx x, enum m

Re: [PATCH ARM]Refine scaled address expression on ARM

2013-11-29 Thread Bin.Cheng
On Sat, Nov 30, 2013 at 12:34 AM, Richard Earnshaw  wrote:
> On 29/11/13 11:46, Yufeng Zhang wrote:
>> On 11/29/13 07:52, Bin.Cheng wrote:
>>> After thinking twice, I somewhat think we should not re-associate
>>> addresses during expansion, because of the lack of context information.
>>>   Take base + scaled_index + offset as an example in PR57540, we just
>>> don't know if "base+offset" is loop invariant from either backend or
>>> RTL expander.
>>
>> I'm getting less convinced by re-associating base with offset
>> unconditionally.  One counter example is
>>
>> typedef int arr_1[20];
>> void foo (arr_1 a1, int i)
>> {
>>a1[i+10] = 1;
>> }
>>
>> I'm experimenting a patch to get the immediate offset in the above
>> example to be the last addend in the address computing (as mentioned in
>> http://gcc.gnu.org/ml/gcc/2013-11/msg00581.html), aiming to get the
>> following code-gen:
>>
>>  add r1, r0, r1, asl #2
>>  mov r3, #1
>>  str r3, [r1, #40]
>>
>> With your patch applied, the effort will be reverted to
>>
>>  add r0, r0, #40
>>  mov r3, #1
>>  str r3, [r0, r1, asl #2]
>>
>
> And another one is:
>
>
>
> typedef int arr_1[20];
> void foo (arr_1 a1, int i)
> {
>a1[i+10] = 1;
>a1[i+11] = 1;
> }
>
> This should compile to:
>
> add r1, r0, r1, asl #2
> mov r3, #1
> str r3, [r1, #40]
> str r3, [r1, #44]
>
> And which on Thumb2 should then collapse to:
>
> add r1, r0, r1, asl #2
> mov r3, #1
> strd r3, r3, [r1, #40]
>
> With your patch I don't see any chance of being able to get to this
> situation.
>
> (BTW, we currently generate:
>
> mov r3, #1
> add r1, r1, #10
> add r2, r0, r1, asl #2
> str r3, [r0, r1, asl #2]
> str r3, [r2, #4]
>
> which is insane).
The two memory references share common sub-expressions; SLSR is
designed to handle this case, and it should be improved to do so.
The original patch is only meant to pick up cases not handled by SLSR
and IVOPT.  Anyway, as you saw from the previous message, doing the
refactoring during expand is not a good practice: without enough
CSE/invariant information, there will always be caught and missed
opportunities.  That's why I think another lowering besides SLSR/IVOPT
on gimple might be a win.
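
As a hypothetical sketch of what SLSR should ideally make of the
two-store example (hand-written here, not compiler output):

/* Both stores share the base address a1 + (i + 10)*4; computing it once
   lets the second store use a small constant offset from the same
   register.  */
typedef int arr_1[20];
void
foo2 (arr_1 a1, int i)
{
  int *base = &a1[i + 10];   /* common sub-expression */
  base[0] = 1;
  base[1] = 1;
}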

Thanks,
bin

>
> I think I see where you're coming from on the original testcase, but I
> think you're trying to solve the wrong problem.   In your test case the
> base is an eliminable register, which is likely to be replaced with an
> offset expression during register allocation.  The problem then, I
> think, is that the cost of these virtual registers is treated the same
> as any other pseudo register, when it may really have the cost of a PLUS
> expression.
>
> Perhaps the cost of using an eliminable register should be raised in
> rtx_costs() (treating them as equivalent to (PLUS (reg) (CONST_INT
> (TBD))), so that loop optimizations will try to hoist suitable
> sub-expressions out the loop and replace them with real pseudos.
>
> R.
>
>
>
>
>
>
>
>



-- 
Best Regards.


Re: [PATCH ARM]Refine scaled address expression on ARM

2013-12-01 Thread Bin.Cheng
On Sat, Nov 30, 2013 at 12:34 AM, Richard Earnshaw  wrote:
> On 29/11/13 11:46, Yufeng Zhang wrote:
>> On 11/29/13 07:52, Bin.Cheng wrote:
>>> After thinking twice, I somewhat think we should not re-associate
>>> addresses during expansion, because of the lack of context information.
>>>   Take base + scaled_index + offset as an example in PR57540, we just
>>> don't know if "base+offset" is loop invariant from either backend or
>>> RTL expander.
>>
>> I'm getting less convinced by re-associating base with offset
>> unconditionally.  One counter example is
>>
>> typedef int arr_1[20];
>> void foo (arr_1 a1, int i)
>> {
>>a1[i+10] = 1;
>> }
>>
>> I'm experimenting a patch to get the immediate offset in the above
>> example to be the last addend in the address computing (as mentioned in
>> http://gcc.gnu.org/ml/gcc/2013-11/msg00581.html), aiming to get the
>> following code-gen:
>>
>>  add r1, r0, r1, asl #2
>>  mov r3, #1
>>  str r3, [r1, #40]
>>
>> With your patch applied, the effort will be reverted to
>>
>>  add r0, r0, #40
>>  mov r3, #1
>>  str r3, [r0, r1, asl #2]
>>
>
> And another one is:
>
>
>
> typedef int arr_1[20];
> void foo (arr_1 a1, int i)
> {
>a1[i+10] = 1;
>a1[i+11] = 1;
> }
>
> This should compile to:
>
> add r1, r0, r1, asl #2
> mov r3, #1
> str r3, [r1, #40]
> str r3, [r1, #44]
>
> And which on Thumb2 should then collapse to:
>
> add r1, r0, r1, asl #2
> mov r3, #1
> strd r3, r3, [r1, #40]
>
> With your patch I don't see any chance of being able to get to this
> situation.
>
> (BTW, we currently generate:
>
> mov r3, #1
> add r1, r1, #10
> add r2, r0, r1, asl #2
> str r3, [r0, r1, asl #2]
> str r3, [r2, #4]
>
> which is insane).
>
> I think I see where you're coming from on the original testcase, but I
> think you're trying to solve the wrong problem.   In your test case the
> base is an eliminable register, which is likely to be replaced with an
> offset expression during register allocation.  The problem then, I
> think, is that the cost of these virtual registers is treated the same
> as any other pseudo register, when it may really have the cost of a PLUS
> expression.
>
> Perhaps the cost of using an eliminable register should be raised in
> rtx_costs() (treating them as equivalent to (PLUS (reg) (CONST_INT
> (TBD))), so that loop optimizations will try to hoist suitable
> sub-expressions out the loop and replace them with real pseudos.
>
I now have access to the code.
The gimple before expanding is like:
  :
  # j_26 = PHI 
  # k_29 = PHI 
  j_9 = j_26 + 1;
  k_8 = parent[k_29];
  if (k_8 >= 0)
goto ;
  else
goto ;

The rtl generated after expanding is like:
   88: NOTE_INSN_BASIC_BLOCK 7
   89: r174:SI=r174:SI+0x1
   90: r191:SI=r173:SI<<0x2
   91: r190:SI=r105:SI+r191:SI
   92: r173:SI=[r190:SI-0x810]

with r105 == virtual_stack_vars_rtx, and it will be instantiated into
frame_pointer_rtx in vregs pass:
   88: NOTE_INSN_BASIC_BLOCK 7
   89: r174:SI=r174:SI+0x1
   90: r191:SI=r173:SI<<0x2
   91: r190:SI=sfp:SI+r191:SI
   92: r173:SI=[r190:SI-0x810]

As you pointed out, sfp is not hoisted as a high-cost invariant.  I am
not sure whether loop-invariant motion will hoist a single pseudo reg
even if it is assigned a higher cost.
But even before the invariant problem, the choice made by RTL expansion
is bad because it hides the CSE opportunity: since
(sfp + r173<<2 - 0x810) == (sp + r173<<2), (sfp - 0x810) can be folded
into (sp), and then we can embed the shift in the scaled addressing
mode [sp + r173 << 2].
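
A hypothetical C sketch of the pattern (loosely modelled on the shape of
the testcase; the array size and initialization are invented here):

/* parent lives on the stack, so its address is sfp + constant; folding
   that constant into the stack pointer would let the load use a single
   scaled address like [sp + k*4] instead of needing an extra add.  */
int
walk (int n)
{
  int parent[516];
  for (int i = 0; i < 516; i++)
    parent[i] = i - 1;          /* simple chain ending at -1 */
  int j = 0, k = (n & 511);
  while (k >= 0)
    {
      j++;
      k = parent[k];
    }
  return j;
}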

Thanks,
bin

-- 
Best Regards.


Re: Make C11 _Alignof return least not greatest alignment for a type (PR c/52023)

2013-12-05 Thread Bin.Cheng
On Thu, Dec 5, 2013 at 6:57 AM, Joseph S. Myers  wrote:

> Index: c-family/c-common.c
> ===
> --- c-family/c-common.c (revision 205668)
> +++ c-family/c-common.c (working copy)
> @@ -4921,14 +4921,17 @@ c_common_get_alias_set (tree t)
>  }
>
>  /* Compute the value of 'sizeof (TYPE)' or '__alignof__ (TYPE)', where
> -   the second parameter indicates which OPERATOR is being applied.
> +   the IS_SIZEOF parameter indicates which operator is being applied.
> The COMPLAIN flag controls whether we should diagnose possibly
> ill-formed constructs or not.  LOC is the location of the SIZEOF or
> -   TYPEOF operator.  */
> +   TYPEOF operator.  If MIN_ALIGNOF, the least alignment required for
> +   a type in any context should be returned, rather than the normal
> +   alignment for that type.  */
>
>  tree
>  c_sizeof_or_alignof_type (location_t loc,
> - tree type, bool is_sizeof, int complain)
> + tree type, bool is_sizeof, bool min_alignof,
> + int complain)
>  {
>const char *op_name;
>tree value = NULL;
> @@ -4994,6 +4997,22 @@ c_sizeof_or_alignof_type (location_t loc,
> value = size_binop_loc (loc, CEIL_DIV_EXPR, TYPE_SIZE_UNIT (type),
> size_int (TYPE_PRECISION (char_type_node)
>   / BITS_PER_UNIT));
> +  else if (min_alignof)
> +   {
> + unsigned int align = TYPE_ALIGN (type);
> + align = MIN (align, BIGGEST_ALIGNMENT);
> +#ifdef BIGGEST_FIELD_ALIGNMENT
> + align = MIN (align, BIGGEST_FIELD_ALIGNMENT);
> +#endif
> + tree field = build_decl (UNKNOWN_LOCATION, FIELD_DECL, NULL_TREE,
> +  type);
> + unsigned int field_align = align;
> +#ifdef ADJUST_FIELD_ALIGN
> + field_align = ADJUST_FIELD_ALIGN (field, field_align);
> +#endif

Won't *field* be unused if ADJUST_FIELD_ALIGN is not defined?

Thanks,
bin


-- 
Best Regards.


Re: [Ping]Two pending IVOPT patches

2013-12-06 Thread Bin.Cheng
On Fri, Dec 6, 2013 at 2:10 PM, Jeff Law  wrote:
> On 11/26/13 03:52, Bin.Cheng wrote:
>>
>> On Tue, Nov 26, 2013 at 6:06 AM, Jeff Law  wrote:
>>>
>>> On 11/25/13 02:11, Bin.Cheng wrote:
>>>>
>>>>
>>>>
>>>> Slightly tune to make iv cand choosing algorithm more accurate:
>>>> http://gcc.gnu.org/ml/gcc-patches/2013-11/msg01574.html
>>>
>>>
>>> It would help if you had some sample codes where this patch was useful.
>>> I
>>> can kind-of see what's going on, but I'm way too unfamiliar with the
>>> tree-IV
>>> code to review this change.
>>>
>> Jeff, Thanks to your review.
>> As for example, consider below code on ARM/cortex-m4, (the code itself
>> may be non-sense):
>
> [ ... ]
> So in this testcase, it looks like the effect is to eliminate the IVs for
> the loop counters, instead expressing the loop termination in terms of the
> pointers.
Yes, it's linear function test elimination.
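
For instance (a hypothetical sketch, not the testcase from this thread),
linear function test elimination turns a counted loop like the following
into a pointer-based exit test:

/* Before: the exit test compares the counter i against n; after IVOPT
   the counter can be removed and the loop termination is expressed with
   the pointer IV, comparing src against src + n.  */
void
copy (int *dst, const int *src, int n)
{
  for (int i = 0; i < n; i++)
    dst[i] = src[i];
}

which is the effect discussed above for the iv_ca_narrow change.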

>
> As far as I can tell this is entirely due to the changes to iv_ca_narrow.
Exactly.

>
> Do you have any codes where iv_ca_extend helps?  I can see how that hunk
> appears to be safe, and I'm guessing that setting the cost pair at each step
> could potentially give more accurate costing on the next iteration of the
> loop.   But I'd love to be able to see this effect directly rather than just
> assuming it's helpful.  Given that I'm prepared to approve the iv_ca_extend
> hunk.
Sorry, I can't provide an example for this now.  I remember it was a
case in EEMBC that I encountered, but with current trunk I can't
reproduce it with the iv_ca_extend change.  Maybe recent check-ins have
changed the behavior of IVOPT.  Considering there is no case
demonstrating this change, I am fine with discarding this part of the
patch and continuing with the iv_ca_narrow part.

>
> I realize the changes to iv_ca_extend are of lesser importance, but I'm
> still working my way through the tree IV code to get some basic
> understanding of what it's doing and why.  I hope to be able to look at
> iv_ca_narrow in more detail over the weekend.
Thanks very much.

Thanks,
bin

>
> jeff
>
>



-- 
Best Regards.


[PATCH PR41488]Recognize more induction variables by simplifying PEELED chrec in scalar evolution

2013-12-06 Thread bin.cheng
Hi,
Entry pr41488 shows a case in which an induction variable can't be recognized
or coalesced.  As noted in the bug entry, one possible fix is to teach PRE
not to do a specific transformation.  However, it is by nature a scalar
evolution issue.  Consider the snippet below:

   # i_17 = PHI 
   # _20 = PHI <_5(5), start_4(D)(3)>
   ...
   i_13 = i_17 + 1;
   _5 = start_4(D) + i_13;

Variable _20 appears in the form of a PEELED chrec like (start_4, _5)_LOOP,
meaning it has the value "start_4" in the 1st iteration and then changes to _5
in the following iterations.  The PEELED chrec is not implemented by GCC right
now; it can be simplified into either a POLYNOMIAL or a PERIOD one.  The
POLYNOMIAL chrec is processed by GCC once it is recognized; in the example,
_20 is actually {start_4, 1}_LOOP.
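(For illustration only, since the original test isn't reproduced here: a loop of
roughly this shape can give rise to such a PHI, because after PRE the sum
start + i may end up being carried around the loop in a temporary seeded with
"start" on the preheader edge, which is exactly the peeled form above.)

  void
  foo (int *a, int start, int n)
  {
    int i;

    for (i = 0; i < n; i++)
      a[start + i] = 0;
  }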

This patch modifies scalar evolution by trying to simplify PEELED chrec.
Without this patch, generated code for pr41488.c is like:
foo:
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
cmp r1, r2
bge .L7
push {r4}
mov r3, r1
ldr r4, [r0]
adds r2, r2, #1
adds r1, r1, #1
movs r0, #0
.L3:
str r0, [r4, r3, lsl #2]
mov r3, r1
adds r1, r1, #1
cmp r1, r2
bne .L3
ldr r4, [sp], #4
.L7:
bx  lr
.size   foo, .-foo

Which is improved to:
foo:
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
cmp r1, r2
bge .L1
ldr r3, [r0]
movs r0, #0
add r1, r3, r1, lsl #2
add r2, r3, r2, lsl #2
.L3:
str r0, [r1], #4
cmp r1, r2
bne .L3
.L1:
bx  lr
.size   foo, .-foo

One point needs to be clarified: I use tree_to_aff_combination_expand in
the patch.  The rationale is that the number of these specific PEELED_CHRECs
should be moderate; I also check the equality literally first and only resort
to the affine facility if that fails.  By measuring a bootstrap of GCC and a
run of spec2kint, I collected the number of opportunities caught by this patch
as below:

            literal comparison / affine comparison
GCC              ~1200 / ~250
Spec2Kint           93 / 34

I could back trace the SSA chain instead of using the affine functions, but
that would miss some cases.
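(Roughly, the equality check in simplify_peeled_chrec goes like this; a sketch
of the approach rather than the exact patch text, with aff1/aff2 being aff_trees
declared earlier and peeled_chrec_map being the cache from the patch:)

  /* Transform (init, {left, right}_LOOP)_LOOP into {init, right}_LOOP
     if "left" equals "init + right".  */
  step_val = chrec_fold_plus (type, init_cond, right);
  if (operand_equal_p (left, step_val, 0))
    return build_polynomial_chrec (loop->num, init_cond, right);

  /* The literal comparison failed; try harder by expanding both sides
     into affine form and checking whether their difference is zero.  */
  tree_to_aff_combination_expand (left, type, &aff1, &peeled_chrec_map);
  tree_to_aff_combination_expand (step_val, type, &aff2, &peeled_chrec_map);
  aff_combination_scale (&aff2, double_int_minus_one);
  aff_combination_add (&aff1, &aff2);
  if (aff1.n == 0 && aff1.offset.is_zero ())
    return build_polynomial_chrec (loop->num, init_cond, right);

  return chrec_dont_know;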

Bootstrap and test on x86/x86_64/arm.  Is it OK?

Thanks,
bin

2013-12-06  Bin Cheng  

PR tree-optimization/41488
* tree-ssa-loop-ivopts.c (add_old_iv_candidates): Don't add cand
for PEELED_CHREC kind of IV.
* tree-scalar-evolution.c: Include necessary header files.
(peeled_chrec_map, simplify_peeled_chrec): New.
(analyze_evolution_in_loop): New static variable.
Call simplify_peeled_chrec.
(scev_initialize): Initialize peeled_chrec_map.
(scev_reset, scev_finalize): Reset and release peeled_chrec_map.

gcc/testsuite/ChangeLog
2013-12-06  Bin Cheng  

* gcc.dg/tree-ssa/scev-7.c: New test.
* gcc.dg/pr41488.c: New test.

Index: gcc/tree-scalar-evolution.c
===
--- gcc/tree-scalar-evolution.c (revision 205688)
+++ gcc/tree-scalar-evolution.c (working copy)
@@ -280,6 +280,8 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-ssa.h"
 #include "cfgloop.h"
 #include "tree-chrec.h"
+#include "pointer-set.h"
+#include "tree-affine.h"
 #include "tree-scalar-evolution.h"
 #include "dumpfile.h"
 #include "params.h"
@@ -1380,7 +1382,66 @@ follow_ssa_edge (struct loop *loop, gimple def, gi
 }
 
 
+/* Pointer map used when simplifying PEELED_CHREC into POLYNOMIAL_CHREC.  */
+static pointer_map_t *peeled_chrec_map;
 
+/* Simplify PEELED_CHREC represented by (init_cond, arg) in LOOP.
+   Handle below case and return the corresponding POLYNOMIAL_CHREC:
+
+   # i_17 = PHI 
+   # _20 = PHI <_5(5), start_4(D)(3)>
+   ...
+   i_13 = i_17 + 1;
+   _5 = start_4(D) + i_13;
+
+   Though variable _20 appears as a PEELED_CHREC in the form of
+   (start_4, _5)_LOOP, it's a POLYNOMIAL_CHREC like {start_4, 1}_LOOP.
+
+   See PR41488.  */
+
+static tree
+simplify_peeled_chrec (struct loop *loop, tree arg, tree init_cond)
+{
+  aff_tree aff1, aff2;
+  tree ev, left, right, type, step_val;
+
+  ev = instantiate_parameters (loop, analyze_scalar_evolution (loop, arg));
+  if (ev == NULL_TREE || TREE_CODE (ev) != POLYNOMIAL_CHREC)
+return chrec_dont_know;
+
+  left = CHREC_LEFT (ev);
+  right = CHREC_RIGHT (ev);
+  type = TREE_TYPE (left);
+  step_val = chrec_fold_plus (type, init_cond, right);
+
+  /* Transform (init, {left, right}_LOOP)_LOOP to {init, right}_LOOP
+ if "left" equals to "init + right".  */
+  if (operand_equal_p (left, step_val, 0))
+{
+  if (dump_file && (dump_flags & TDF_S

Re: [PATCH GCC]Pick up more address lowering cases for ivopt and tree-affine.c

2013-12-06 Thread Bin.Cheng
On Fri, Dec 6, 2013 at 6:19 PM, Richard Biener
 wrote:
> On Mon, Nov 25, 2013 at 7:41 PM, Jeff Law  wrote:
>> On 11/25/13 02:22, bin.cheng wrote:
>>
>> Unless there's a PR for this problem, I think this needs to wait.
>
> I agree.  Btw, please split the patch.
Yes, I will get back to this after entering stage 1 again :)

Hi Richard,
I talked with you about cleaning strip_offset_1 up after this series of
base simplification patches, but I realized it's not safe because the
affine facility has its limits, e.g. it can only support 8 elements.
Though the cleanup passes bootstrap and test on x86/x86_64, and most of
the code in strip_offset_1 is rarely executed, I guess we'd better live
with it.  What do you think?

Thanks,
bin

>
> Index: gcc/tree-affine.c
> ===
> --- gcc/tree-affine.c(revision 205087)
> +++ gcc/tree-affine.c(working copy)
> @@ -328,7 +328,19 @@ tree_to_aff_combination (tree expr, tree type, aff
>   double_int::from_uhwi (bitpos / BITS_PER_UNIT));
>core = build_fold_addr_expr (core);
>if (TREE_CODE (core) == ADDR_EXPR)
> -aff_combination_add_elt (comb, core, double_int_one);
> +{
> +  /* Handle &MEM[ptr + CST] in core part of complex reference.  */
> +  if (TREE_CODE (TREE_OPERAND (core, 0)) == MEM_REF)
> +{
> +  core = TREE_OPERAND (core, 0);
> +  tree_to_aff_combination (TREE_OPERAND (core, 0), type, &tmp);
> +  aff_combination_add (comb, &tmp);
> +  tree_to_aff_combination (TREE_OPERAND (core, 1), sizetype, &tmp);
> +  aff_combination_add (comb, &tmp);
> +}
> +  else
> +aff_combination_add_elt (comb, core, double_int_one);
> +}
>else
>  {
>tree_to_aff_combination (core, type, &tmp)
>
> please handle the offset before taking the address, thus
>
>   if (TREE_CODE (core) == MEM_REF)
> {
>add constant offset;
>core = TREE_OPERAND (core, 0);
> }
>   else
> core = build_fold_addr_expr (core);
>
> that simplifies things and avoids the address building.
>
> Richard.
>
>> jeff
>>
>>
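For reference, Richard's suggestion above might be fleshed out roughly as
follows (a sketch only, not the change that was actually committed):

  if (TREE_CODE (core) == MEM_REF)
    {
      /* Fold the constant offset of the MEM_REF into the combination and
         continue from its base pointer, instead of building an ADDR_EXPR
         of the MEM_REF only to take it apart again.  */
      tree_to_aff_combination (TREE_OPERAND (core, 1), sizetype, &tmp);
      aff_combination_add (comb, &tmp);
      core = TREE_OPERAND (core, 0);
    }
  else
    core = build_fold_addr_expr (core);
  if (TREE_CODE (core) == ADDR_EXPR)
    aff_combination_add_elt (comb, core, double_int_one);
  else
    {
      tree_to_aff_combination (core, type, &tmp);
      aff_combination_add (comb, &tmp);
    }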



-- 
Best Regards.


Re: [PATCH GCC]Pick up more address lowering cases for ivopt and tree-affine.c

2013-12-06 Thread Bin.Cheng
On Fri, Dec 6, 2013 at 7:20 PM, Richard Biener
 wrote:
> On Fri, Dec 6, 2013 at 11:40 AM, Bin.Cheng  wrote:
>> On Fri, Dec 6, 2013 at 6:19 PM, Richard Biener
>>  wrote:
>>> On Mon, Nov 25, 2013 at 7:41 PM, Jeff Law  wrote:
>>>> On 11/25/13 02:22, bin.cheng wrote:
>>>>
>>>> Unless there's a PR for this problem, I think this needs to wait.
>>>
>>> I agree.  Btw, please split the patch.
>> Yes, I will get back to this after entering stage 1 again :)
>>
>> Hi Richard,
>> I talked with you about clean strip_offset_1 up after this series of
>> base simplification patches, but I realized it's not safe because
>> affine facility has it's limit, like can only support 8 elements.
>> Though the cleanup passes bootstrap and test on x86/x86_64 and most of
>> codes in strip_offset_1 won't be executed usually, I guess we'd better
>> to live with it, so what do you think?
>
> Not sure - I'm lacking some context here ;)  If you have a cleanup patch
> fine - WRT the affine limit of 8 elements, further elements will just
> add to the rest tree.  This is to limit compile-time.
Yes, so it's possible to have a COMPONENT_REF stored in the rest tree if
we reach the 8-element limit.  In that case the COMPONENT_REF will still
be visited by the code in strip_offset_1, although it sounds unlikely to
happen.
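(A contrived illustration of what I mean, not taken from a real test case; it
assumes the 8-element limit, MAX_AFF_ELTS, in tree-affine.h:)

  struct s { int field; };

  void
  overflow_elts (int *a, struct s *sp, int i0, int i1, int i2, int i3,
                 int i4, int i5, int i6, int i7)
  {
    /* The address below has more distinct variable terms than MAX_AFF_ELTS
       allows, so at least one of them is folded into comb->rest; if that
       term happens to be the COMPONENT_REF sp->field, strip_offset_1 can
       still encounter it.  */
    a[i0 + i1 + i2 + i3 + i4 + i5 + i6 + i7 + sp->field] = 0;
  }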

Thanks,
bin
>
> Richard.


Re: [PATCH PR41488]Recognize more induction variables by simplifying PEELED chrec in scalar evolution

2013-12-06 Thread Bin.Cheng
On Sat, Dec 7, 2013 at 3:20 AM, Jeff Law  wrote:
> On 12/06/13 03:29, bin.cheng wrote:
>>
>> Hi,
>> Entry pr41488 shows a case in which induction variable can't be recognized
>> or coalesced.  As noted in the bug entry, one possible fix is to teach PRE
>> not to do some specific transformation.  However, it's in nature a scalar
>> evolution issue.  Considering below snippet:
>>
>> # i_17 = PHI 
>> # _20 = PHI <_5(5), start_4(D)(3)>
>> ...
>> i_13 = i_17 + 1;
>> _5 = start_4(D) + i_13;
>>
>> Variable _20 appears in the form of PEELED chrec like (start_4, _5)_LOOP,
>> meaning it has value "start_4" in the 1st iteration, then changes to _5 in
>> following iterations.  PEELED chrec is not implemented by GCC right now,
>> it
>> can be simplified into either POLYNOMIAL or PERIOD one.
>
> Right.  PEELED_CHREC was removed from the LNO branch back in 2004,
> presumably before the LNO branch was merged into the mainline.  No reason
> was given.

Maybe we don't have any user for such CHRECs in GCC now, but that's just guessing.

>
> But what you're really discovering here is that _20 is just another standard
> looking IV that appears in the form of a peeled chrec, right?

Exactly, IVOPT handles only polynomial ones, and _20 is one such chrec
only appearing in peeled form.

>
>
>
>  The POLYNOMIAL
>>
>> chrec is processed by GCC after being recognized, as in the examle, _20 is
>> actually {start_4, 1}_LOOP.
>
> Right.  Based on reading the archives, it looks like this stuff is/was
> generated by PRE.  I also suspect jump threading can create them.  There was
> talk of throttling PRE to leave things in a form that the IV analysis could
> more easily digest, but I'm not sure if that was ever completed.

It could also simply be because the source is coded that way, as in the
case I added in the patch.  I found real examples of this in ggc-page.c
in GCC.  But it's always good to clean up the input of an optimization;
I presume that's why Richard tried to move IVOPT later before.

>
> [ snip ]
>
>
>>
>> One point needs to be clarified that I use tree_to_aff_combination_expand
>> in
>> the patch.  Rational is the number of the specific PEELED_CHRECs should be
>> moderate, I also check the equality literally firstly and resort to affine
>> facility if that fails.  By measuring bootstrap of gcc and spec2kint, I
>> collected the number of opportunities caught by this patch like below:
>>  literal comparison/affine comparison
>> GCC  ~1200/~250
>> Spec2Kint  93/34
>>
>> I could back trace the ssa chain instead of using affine functions, but
>> that
>> would miss some cases.
>
> I assume tree_to_aff_combination_expand is relatively expensive, thus the
> two approaches, one which avoids tree_to_aff_combination_expand.
Consider the dump of the case in the patch:

  :
  _16 = start_4(D) + 1000;
  if (end_6(D) < _16)
goto ;
  else
goto ;

  :
  pretmp_22 = sp_7(D)->data;

  :
  # i_17 = PHI 
  # _20 = PHI <_5(5), _16(3)>
  _9 = (unsigned int) _20;
  _10 = _9 * 4;
  _11 = pretmp_22 + _10;
  *_11 = 0;
  i_13 = i_17 + -1;
  _5 = start_4(D) + i_13;
  if (_5 > end_6(D))
goto ;
  else
goto ;

  :
  goto ;

I have to prove that (_16 + -1) equals (start_4 + 999) for _20, so I
either use the affine functions to back trace the definition of _16, or
I have to back trace the SSA chain manually.  Here I use the affine
functions because the number of such cases should be moderate and there
are more complicated cases in which a simple back trace would lose.
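(Worked out for this particular dump; left/right below are the components of
scev(_5), and init is the preheader value of _20:)

  _20 = PHI <_5(5), _16(3)>         so init  = _16
  scev(_5) = {start_4 + 999, -1}    so left  = start_4 + 999, right = -1

  left == init + right ?
    init + right = _16 + (-1)
                 = (start_4 + 1000) + (-1)   after affine expansion of _16
                 = start_4 + 999             which matches left,

so _20 simplifies to {_16, -1}_LOOP.  The literal operand_equal_p check
cannot see this, because one side is written in terms of _16.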

Another question: is it acceptable to add a parameter to
tree_to_aff_combination_expand to limit the number of recursive calls it
makes?  Then we wouldn't need to expand down to the leaf nodes every time.
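(Purely hypothetical, just to make the question concrete; no such parameter
exists today:)

  /* Same interface as now, plus a MAX_DEPTH budget that the expansion
     decrements each time it follows an SSA definition, stopping early
     once it reaches zero.  */
  void tree_to_aff_combination_expand (tree expr, tree type, aff_tree *comb,
                                       struct pointer_map_t **cache,
                                       int max_depth);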

>
>
> In add_old_iv_candidates, is it the case that the only time
> SSA_NAME_DEF_STMT (def) is another PHI node is when it's one of these affine

I suppose so.  Actually IVOPT makes an assertion about IP_ORIGINAL
candidates in function rewrite_use_nonlinear_expr, like:

  /* An important special case -- if we are asked to express value of
 the original iv by itself, just exit; there is no need to
 introduce a new computation (that might also need casting the
 variable to unsigned and back).  */
  if (cand->pos == IP_ORIGINAL
  && cand->incremented_at == use->stmt)
{
  enum tree_code stmt_code;

  gcc_assert (is_gimple_assign (use->stmt));
  ..

The only case I can think of involving another kind of PHI node is a
merging PHI in conditional code like:

LOOP:
x_1 = phi 

if (cond)
   x_3 = x_1 + 1;
else
   x_4

Re: [PATCH PR41488]Recognize more induction variables by simplifying PEELED chrec in scalar evolution

2013-12-08 Thread Bin.Cheng
On Mon, Dec 9, 2013 at 12:00 PM, Jeff Law  wrote:
> On 12/06/13 19:44, Bin.Cheng wrote:
>>>
>>> Right.  Based on reading the archives, it looks like this stuff is/was
>>> generated by PRE.  I also suspect jump threading can create them.  There
>>> was
>>> talk of throttling PRE to leave things in a form that the IV analysis
>>> could
>>> more easily digest, but I'm not sure if that was ever completed.
>>
>>
>> Also could just because it is coded in that way, just as the case I
>> added in patch.  I found real examples from ggc-page.c in GCC.
>> But it's always good to cleanup input of an optimization, I presume
>> that's why Richard tried to move IVOPT later before.
>
> Certainly.  That possibility was mentioned as well in either 41488 or
> elsewhere in my research.
>
>
>>>
>>> I assume tree_to_aff_combination_expand is relatively expensive, thus the
>>> two approaches, one which avoids tree_to_aff_combination_expand.
>>
>> Considering the dump of case in the patch:
>
> [ snip ]
> So it's also a case using the affine stuff gets you get a simpler interface
> to express the value in terms you may be able match.
Yeah.

>
>
>>
>> Another question, is it acceptable to add an parameter for
>> tree_to_aff_combination_expand to limit the number of recursive call
>> for it?  Thus we don't need to expand to leaf node every time.
>
> If there's a compelling reason to limit the recursion, then I'd think such a
> parameter would be reasonable.
Not for now.  The idea just came up given that some recent patches also
use the interface to get back-trace information and it's expensive.
Maybe Richard has some comments here too.

>
>
>
>>
>>>
>>>
>>> In add_old_iv_candidates, is it the case that the only time
>>> SSA_NAME_DEF_STMT (def) is another PHI node is when it's one of these
>>> affine
>>
>>
>> I suppose.  Actually IVOPT make an assert on IP_ORIGINAL candidates in
>> function rewrite_use_nonlinear_expr, like:
>
> Well, the reason I ask is after your patch we'd be ignoring anything where
> SSA_NAME_DEF_STMT (def) is a PHI node and not a PEELED_CHREC.  I want to
> make sure there aren't other expected cases where SSA_NAME_DEF_STMT (def) is
> a PHI that we'd want to call add_candidate_1 for.
>
> It sounds like there aren't other cases you'd expect to see here where
> SSA_NAME_DEF_STMT (def) is a PHI.
Yes.

>
>
>> IVOPT adds original cand and tries to keep it the original IV is
>> because it does have an updating statement.  For IVs picked up by this
>> patch, it doesn't have stepping statement, so no need/way to leaving
>> it untouched.  It will be represented by candidates for IVs from which
>> it is peeled.
>
> Understood.  Thanks for clarifying.  The patch is fine for the trunk.
Thanks very much for reviewing.  Since Sebastian hasn't added any
comments, I will change the patch as suggested by Richard before, then
retest it, and check in the new version if it's ok.

Thanks,
bin

>
> jeff



-- 
Best Regards.


RE: [PATCH PR41488]Recognize more induction variables by simplifying PEELED chrec in scalar evolution

2013-12-09 Thread bin.cheng


> -Original Message-
> From: Bin.Cheng [mailto:amker.ch...@gmail.com]
> Sent: Monday, December 09, 2013 1:15 PM
> To: Jeff Law
> Cc: Bin Cheng; gcc-patches List
> Subject: Re: [PATCH PR41488]Recognize more induction variables by
> simplifying PEELED chrec in scalar evolution
> 
> On Mon, Dec 9, 2013 at 12:00 PM, Jeff Law  wrote:
> > On 12/06/13 19:44, Bin.Cheng wrote:
> >>>
> >>> Right.  Based on reading the archives, it looks like this stuff
> >>> is/was generated by PRE.  I also suspect jump threading can create
> >>> them.  There was talk of throttling PRE to leave things in a form
> >>> that the IV analysis could more easily digest, but I'm not sure if
> >>> that was ever completed.
> >>
> >>
> >> Also could just because it is coded in that way, just as the case I
> >> added in patch.  I found real examples from ggc-page.c in GCC.
> >> But it's always good to cleanup input of an optimization, I presume
> >> that's why Richard tried to move IVOPT later before.
> >
> > Certainly.  That possibility was mentioned as well in either 41488 or
> > elsewhere in my research.
> >
> >
> >>>
> >>> I assume tree_to_aff_combination_expand is relatively expensive,
> >>> thus the two approaches, one which avoids
> tree_to_aff_combination_expand.
> >>
> >> Considering the dump of case in the patch:
> >
> > [ snip ]
> > So it's also a case using the affine stuff gets you get a simpler
> > interface to express the value in terms you may be able match.
> Yeah.
> 
> >
> >
> >>
> >> Another question, is it acceptable to add an parameter for
> >> tree_to_aff_combination_expand to limit the number of recursive call
> >> for it?  Thus we don't need to expand to leaf node every time.
> >
> > If there's a compelling reason to limit the recursion, then I'd think
> > such a parameter would be reasonable.
> Not for now.  Just come up this idea given that some recent patches also use
> the interface to get back trace information and it's expensive.  Maybe
> Richard have some comment here too.
> 
> >
> >
> >
> >>
> >>>
> >>>
> >>> In add_old_iv_candidates, is it the case that the only time
> >>> SSA_NAME_DEF_STMT (def) is another PHI node is when it's one of
> >>> these affine
> >>
> >>
> >> I suppose.  Actually IVOPT make an assert on IP_ORIGINAL candidates
> >> in function rewrite_use_nonlinear_expr, like:
> >
> > Well, the reason I ask is after your patch we'd be ignoring anything
> > where SSA_NAME_DEF_STMT (def) is a PHI node and not a PEELED_CHREC.
> I
> > want to make sure there aren't other expected cases where
> > SSA_NAME_DEF_STMT (def) is a PHI that we'd want to call
> add_candidate_1 for.
> >
> > It sounds like there aren't other cases you'd expect to see here where
> > SSA_NAME_DEF_STMT (def) is a PHI.
> Yes.
> 
> >
> >
> >> IVOPT adds original cand and tries to keep it the original IV is
> >> because it does have an updating statement.  For IVs picked up by
> >> this patch, it doesn't have stepping statement, so no need/way to
> >> leaving it untouched.  It will be represented by candidates for IVs
> >> from which it is peeled.
> >
> > Understood.  Thanks for clarifying.  The patch is fine for the trunk.
> Thanks very much for reviewing.  Since Sebastian hasn't added any
> comments, I will change the patch as suggested by Richard before, then
> retest it, and check in the new version if it's ok.
> 
Hi,
This is the new version of the patch, computing the difference in tree-affine
form and then comparing against integer_zero_node.
Bootstrapped and tested on current upstream.  I will commit it if there is no
other objection.

Thanks,
bin
> >
> > jeff
> 
> 
> 
> --
> Best Regards.
Index: gcc/tree-scalar-evolution.c
===
--- gcc/tree-scalar-evolution.c (revision 205688)
+++ gcc/tree-scalar-evolution.c (working copy)
@@ -280,6 +280,8 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-ssa.h"
 #include "cfgloop.h"
 #include "tree-chrec.h"
+#include "pointer-set.h"
+#include "tree-affine.h"
 #include "tree-scalar-evolution.h"
 #include "dumpfile.h"
 #include "params.h"
@@ -1380,7 +1382,67 @@ follow_ssa_edge (struct loop *loop, gimple def, gi
 }

RE: [Ping]Two pending IVOPT patches

2013-12-09 Thread bin.cheng


> -Original Message-
> From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-
> ow...@gcc.gnu.org] On Behalf Of Jeff Law
> Sent: Tuesday, December 10, 2013 6:31 AM
> To: Bin.Cheng
> Cc: gcc-patches List; Richard Biener; Zdenek Dvorak
> Subject: Re: [Ping]Two pending IVOPT patches
> 
> On 11/26/13 03:52, Bin.Cheng wrote:
> > On Tue, Nov 26, 2013 at 6:06 AM, Jeff Law  wrote:
> >> On 11/25/13 02:11, Bin.Cheng wrote:
> >>>
> >>>
> >>> Slightly tune to make iv cand choosing algorithm more accurate:
> >>> http://gcc.gnu.org/ml/gcc-patches/2013-11/msg01574.html
> >>
> >> It would help if you had some sample codes where this patch was
> >> useful.  I can kind-of see what's going on, but I'm way too
> >> unfamiliar with the tree-IV code to review this change.
> >>
> > Jeff, Thanks to your review.
> > As for example, consider below code on ARM/cortex-m4, (the code itself
> > may be non-sense):
> [ snip ]
> 
> So in iv_ca_narrow you have the following change:
> @@ -5705,13 +5709,15 @@ iv_ca_narrow (struct ivopts_data *data, struct
> iv_
> if (!cp)
>   continue;
> 
> -   if (!iv_ca_has_deps (ivs, cp))
> -continue;
> +   iv_ca_set_cp (data, ivs, use, cp);
> +   acost = iv_ca_cost (ivs);
> +   iv_ca_set_cp (data, ivs, use, old_cp);
> 
> -   if (!cheaper_cost_pair (cp, new_cp))
> - continue;
> -
> -   new_cp = cp;
> +   if (compare_costs (acost, best_cost) < 0)
> + {
> +   best_cost = acost;
> +   new_cp = cp;
> + }
>   }
>   }
> else
> 
> You set a new cost pair (cp), compute the cost into acost, then restore the
> old cost pair (oldcp).  Then you see if you got a cheaper cost and if so, record
> the best cost and set new_cp.
> 
> What's the point behind restoring the old cost pair within the loop?
> Can't you just restore it outside the loop if nothing cheaper was found?

Yes, I updated the patch as attached.  I also skipped the computation for
START in the innermost loop.  Actually there is another chance to reduce
computation, by merging the restore of the old cost pair and the first call of
iv_ca_delta_commit if we don't skip the computation for START, but that should
be done in another patch.

> 
> AFAICT, the calls to iv_ca_has_deps aren't there for correctness, they were
> there to prune out IVs that were not useful in this loop, correct?
Emm, kind of.  The cost of an IV candidate set consists of several parts:
the representation cost in the cost pair; the register pressure cost,
which accounts for dependence on invariant expressions; etc.  Here
iv_ca_has_deps checks whether the new cost pair depends on an invariant
expression that was not involved before; if so, the current algorithm
considers the new cost pair to have higher cost and just skips it.  In
fact, the new cost pair may give us a lower overall cost even if it
introduces a new dependence (similar to the case I gave).  That's why I
used the overall cost comparison instead.

Does this new version of the patch look good to you?  I will re-test if it's OK.

Thanks,
bin
> 
> Jeff
Index: gcc/tree-ssa-loop-ivopts.c
===
--- gcc/tree-ssa-loop-ivopts.c  (revision 205798)
+++ gcc/tree-ssa-loop-ivopts.c  (working copy)
@@ -5718,18 +5718,20 @@ iv_ca_extend (struct ivopts_data *data, struct iv_
 }
 
 /* Try narrowing set IVS by removing CAND.  Return the cost of
-   the new set and store the differences in DELTA.  */
+   the new set and store the differences in DELTA.  START is
+   the candidate with which we start narrowing.  */
 
 static comp_cost
 iv_ca_narrow (struct ivopts_data *data, struct iv_ca *ivs,
- struct iv_cand *cand, struct iv_ca_delta **delta)
+ struct iv_cand *cand, struct iv_cand *start,
+ struct iv_ca_delta **delta)
 {
   unsigned i, ci;
   struct iv_use *use;
   struct cost_pair *old_cp, *new_cp, *cp;
   bitmap_iterator bi;
   struct iv_cand *cnd;
-  comp_cost cost;
+  comp_cost cost, best_cost, acost;
 
   *delta = NULL;
   for (i = 0; i < n_iv_uses (data); i++)
@@ -5740,13 +5742,15 @@ iv_ca_narrow (struct ivopts_data *data, struct iv_
   if (old_cp->cand != cand)
continue;
 
-  new_cp = NULL;
+  best_cost = iv_ca_cost (ivs);
+  /* Start narrowing with START.  */
+  new_cp = get_use_iv_cost (data, use, start);
 
   if (data->consider_all_candidates)
{
  EXECUTE_IF_SET_IN_BITMAP (ivs->cands, 0, ci, bi)
{
- if (ci == cand->id)
+ if (ci == cand->id || (start && ci == start->id))

Re: [PATCH PR41488]Recognize more induction variables by simplifying PEELED chrec in scalar evolution

2013-12-10 Thread Bin.Cheng
On Wed, Dec 11, 2013 at 6:50 AM, Jakub Jelinek  wrote:
> On Tue, Dec 10, 2013 at 10:46:32AM +0800, bin.cheng wrote:
>> This is the new version patch computing the difference in tree affine then
>> comparing against integer_zero_node.
>> Bootstrap and test on current upstream.  I will commit it if there is no
>> other objection.
>
> This breaks bootstrap on x86_64-linux for me too, though in a different
> pass.
> Anyway, the bugs I'm seeing here:
> 1) other passes that use a cache for tree-affine computations
>don't initialize it by pointer_map_create (), tree-affine.c does that
>for you
> 2) the cache shouldn't be destroyed by pointer_map_destroy, but
>instead by free_affine_expand_cache that will actually make sure
>it doesn't leak memory, etc.
> 3) the actual issue why this breaks bootstrap is that you've added
>the pointer_map_destroy (should be free_affine_expand_cache per 2) )
>to scev_finalize, but scev_initialize is performed at the start
>of the loop passes and scev_finalize several passes later, and
>in between passes ggc_collect is called.  As the cache isn't registered
>with GC, if you are unlucky and ggc_collect actually collects
>in between scev_initialize and scev_finalize and garbage collects
>some trees only mentioned in the affine cache peeled_chrec_map,
>things can break completely.
>
> So, IMHO please revert this patch for now and try to find some
> better place for the free_affine_expand_cache (&peeled_chrec_map);
> call so that the cache is never non-NULL when ggc_collect may be called.
> Alternatively you'd need to either tell GC about it (but the structures
> are heap allocated, not GC), or say instruct through some GTY magic
> that during collection free_affine_expand_cache (&peeled_chrec_map);
> should be called.

Sorry for the trouble; I have reverted the patch and will investigate it.

Thanks,
bin
>
> Jakub



-- 
Best Regards.


Re: [PATCH PR41488]Recognize more induction variables by simplifying PEELED chrec in scalar evolution

2013-12-10 Thread Bin.Cheng
On Wed, Dec 11, 2013 at 4:31 AM, H.J. Lu  wrote:
> On Mon, Dec 9, 2013 at 6:46 PM, bin.cheng  wrote:
>>
>>>
>> Hi,
>> This is the new version patch computing the difference in tree affine then
>> comparing against integer_zero_node.
>> Bootstrap and test on current upstream.  I will commit it if there is no
>> other objection.
>>
>
> This caused:
>
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59445
I found out the cause and addressed it in the bug entry.

Thanks,
bin

>
> --
> H.J.



-- 
Best Regards.


Re: [PATCH PR41488]Recognize more induction variables by simplifying PEELED chrec in scalar evolution

2013-12-10 Thread Bin.Cheng
On Wed, Dec 11, 2013 at 6:50 AM, Jakub Jelinek  wrote:
> On Tue, Dec 10, 2013 at 10:46:32AM +0800, bin.cheng wrote:
>> This is the new version patch computing the difference in tree affine then
>> comparing against integer_zero_node.
>> Bootstrap and test on current upstream.  I will commit it if there is no
>> other objection.
>
> This breaks bootstrap on x86_64-linux for me too, though in a different
> pass.
> Anyway, the bugs I'm seeing here:
> 1) other passes that use a cache for tree-affine computations
>don't initialize it by pointer_map_create (), tree-affine.c does that
>for you
> 2) the cache shouldn't be destroyed by pointer_map_destroy, but
>instead by free_affine_expand_cache that will actually make sure
>it doesn't leak memory, etc.
> 3) the actual issue why this breaks bootstrap is that you've added
>the pointer_map_destroy (should be free_affine_expand_cache per 2) )
>to scev_finalize, but scev_initialize is performed at the start
>of the loop passes and scev_finalize several passes later, and
>in between passes ggc_collect is called.  As the cache isn't registered
>with GC, if you are unlucky and ggc_collect actually collects
>in between scev_initialize and scev_finalize and garbage collects
>some trees only mentioned in the affine cache peeled_chrec_map,
>things can break completely.
>
> So, IMHO please revert this patch for now and try to find some
> better place for the free_affine_expand_cache (&peeled_chrec_map);
> call so that the cache is never non-NULL when ggc_collect may be called.
> Alternatively you'd need to either tell GC about it (but the structures
> are heap allocated, not GC), or say instruct through some GTY magic
> that during collection free_affine_expand_cache (&peeled_chrec_map);
> should be called.

Thanks for helping with the issue; I shouldn't have used the affine facility
before fully understanding it.

I know little about GC, so when may ggc_collect be called (between two
passes)?  If so, I have to call free_affine_expand_cache just after the
call to tree_to_aff_combination_expand in SCEV, because SCEV is an
analyzer and, as you pointed out, its analysis results are used by
different optimizers.  The pointer map would otherwise be kept alive
across passes if it's not freed immediately.

Thanks,
bin
>
> Jakub



-- 
Best Regards.


Re: [PATCH PR41488]Recognize more induction variables by simplifying PEELED chrec in scalar evolution

2013-12-11 Thread Bin.Cheng
On Wed, Dec 11, 2013 at 4:17 PM, Jakub Jelinek  wrote:
> On Tue, Dec 10, 2013 at 11:59:16PM -0700, Jeff Law wrote:
>> On 12/10/13 23:35, Bin.Cheng wrote:
>> >I know little about GC, so when ggc_collect may be called (between two
>> >passes)? If so, I have to call free_affine_expand_cache just after the
>> >calling to tree_to_affine_combination_expand in SCEV because it's an
>> >analyzer and as you pointed out, the analyzing result is used by
>> >different optimizers.  The pointer map would be maintained between
>> >different passes if it's not instantly freed.
>> The garbage collector only runs between passes.  If you have roots
>> in static storage, then  you'll need to decorate them so the garbage
>> collector scans them.
>>
>> There's a fairly extensive discussion of the garbage collector in
>> the GCC internals manual.
>
> It isn't that easy, pointer_map isn't a data structure recognized by the
> garbage collector at all (plus, as I've said before, the code is inserting
> XNEW allocated structures into the pointer_map, which isn't GC friendly
> either, but making them GC allocated might increase garbage unnecessarily,
> as all the structs are short lived).
>
> I've looked through scev_{initialize,finalize} callers, and almost all
> passes initialize it and finalize within the same pass, it is only the loop
> optimizations:
>   NEXT_PASS (pass_tree_loop_init);
>   NEXT_PASS (pass_lim);
>   NEXT_PASS (pass_copy_prop);
>   NEXT_PASS (pass_dce_loop);
>   NEXT_PASS (pass_tree_unswitch);
>   NEXT_PASS (pass_scev_cprop);
>   NEXT_PASS (pass_record_bounds);
>   NEXT_PASS (pass_check_data_deps);
>   NEXT_PASS (pass_loop_distribution);
>   NEXT_PASS (pass_copy_prop);
>   NEXT_PASS (pass_graphite);
>   PUSH_INSERT_PASSES_WITHIN (pass_graphite)
>   NEXT_PASS (pass_graphite_transforms);
>   NEXT_PASS (pass_lim);
>   NEXT_PASS (pass_copy_prop);
>   NEXT_PASS (pass_dce_loop);
>   POP_INSERT_PASSES ()
>   NEXT_PASS (pass_iv_canon);
>   NEXT_PASS (pass_parallelize_loops);
>   NEXT_PASS (pass_if_conversion);
>   /* pass_vectorize must immediately follow pass_if_conversion.
>  Please do not add any other passes in between.  */
>   NEXT_PASS (pass_vectorize);
>   PUSH_INSERT_PASSES_WITHIN (pass_vectorize)
>   NEXT_PASS (pass_dce_loop);
>   POP_INSERT_PASSES ()
>   NEXT_PASS (pass_predcom);
>   NEXT_PASS (pass_complete_unroll);
>   NEXT_PASS (pass_slp_vectorize);
>   NEXT_PASS (pass_loop_prefetch);
>   NEXT_PASS (pass_iv_optimize);
>   NEXT_PASS (pass_lim);
>   NEXT_PASS (pass_tree_loop_done);
> where scev is live in between the passes, and unfortunately there
> is no call that each pass calls that you could easily add the free_affine.*
> call to (except for execute_function_todo but you'd need to invent another
> TODO_* flag for it and the question is how you'd ensure it will be handled
> for the non-loop specific passes that are just run in between these
> passes (say pass_copy_prop, would we need pass_copy_prop_loop?).
>
> Perhaps you could replace in tree-affine.c and all it's users
> {,struct} pointer_map_t * with
> struct GTY((user)) affine_expand_cache {
>   pointer_map_t *cache;
> };
> and write custom gt_ggc_mx (affine_expand_cache *) function that would
> would either pointer_map_traverse the cache and for each cache entry
> call gt_ggc_mx on the embedded trees in the structures, or drop the cache
> (dunno if the cache is really just a cache and it doesn't matter if it is
> dropped any time, or if it can affect code generation).  There are also PCH
> related functions that need to be defined, though I guess you could just
> assert there that the cache is NULL and not walk anything.

Thanks to both of you for being patient with this patch.
I went through the documentation quickly and realized that I would have
to modify the pointer-map structure to make it recognized by GC (plus
maybe more of the work suggested by Jakub).  It seems I shouldn't take
that task on in this patch during stage 3; I am thinking of just calling
the free_affine* function at the place the cache is created for SCEV.
Of course, that makes it more expensive.

I just found that the patch also fixes PR58296 in IVOPT, and I do want
to include it in 4.9 if we are OK with it.  So what's your opinion?

Thanks,
bin
>
> Jakub



-- 
Best Regards.


RE: [PATCH PR41488]Recognize more induction variables by simplifying PEELED chrec in scalar evolution

2013-12-11 Thread bin.cheng


> -Original Message-
> From: Richard Biener [mailto:richard.guent...@gmail.com]
> Sent: Wednesday, December 11, 2013 6:04 PM
> To: Jakub Jelinek
> Cc: Bin.Cheng; Jeff Law; Bin Cheng; gcc-patches List
> Subject: Re: [PATCH PR41488]Recognize more induction variables by
> simplifying PEELED chrec in scalar evolution
> 
> On Wed, Dec 11, 2013 at 9:50 AM, Jakub Jelinek  wrote:
> > On Wed, Dec 11, 2013 at 04:31:55PM +0800, Bin.Cheng wrote:
> >> Thank both of you for being patient on this patch.
> >> I went through the documentation quickly and realized that I have to
> >> modify pointer-map structure to make it recognized by GC (maybe more
> >
> > Changing pointer_map at this point is IMHO not appropriate.
> >
> >> work suggested by Jakub).  It seems I shouldn't include that task in
> >> this patch at this stage 3, I am thinking just call free_affine*
> >> function in the place it is created for SCEV.  Of course, that makes
> >> it more expensive.
> >
> > Perhaps you could call free_affine_expand_cache say from
> > analyze_scalar_evolution, that is the innermost enclosing exported API
> > that indirectly calls your new code, but it seems it is also called
> > recursively from analyze_scalar_evolution_1 or functions it calls.
> > So maybe you'd need to rename analyze_scalar_evolution to
> > analyze_scalar_evolution_2 and adjust some calls to
> > analyze_scalar_evolution to the latter in tree-scalar-evolution.c, and
> > add analyze_scalar_evolution wrapper that calls
> analyze_scalar_evolution_2 and free_affine_expand_cache.
> > Whether this would work depends on detailed analysis of the
> > tree-scalar-evolution.c callgraph, there are tons of functions, most
> > of them static, and the question is if there is a single non-static
> > (or at most a few of them) function that cover all calls to your new
> > code in the static functions it (indirectly) calls, i.e. if there
> > isn't any other external entry point that might call your new code
> > without doing free_affine_expand_cache afterwards.
> >
> > If you can find that, I'd say it would be the safest thing to do.
> > But let's see what say Richard thinks about it too.
> 
> I think keeping the SCEV cache live over multiple passes is fragile (we do
re-
> use SSA names and passes do not appropriately call scev_reset[_htab] in
all
> cases).
> 
> With fixing that the easiest approach is to associate a affine-expand
cache
> with the scev info.
> 
> Without that there isn't a very good solution other than discarding the
cache
> after each expansion at the point you use it.  The SCEV cache should
> effectively make most of the affine expansion caching moot (but you can
try
> instrumenting that to verify it).
> 
> That is, I suggest to try freeing the cache right after the second
> tree_to_aff_combination_expand.
> 
> Btw,
> 
> +  /* Try harder to check if they are equal.  */
> + tree_to_aff_combination_expand (left, type, &aff1, &peeled_chrec_map);
> + tree_to_aff_combination_expand (step_val, type, &aff2,
> + &peeled_chrec_map);  aff_combination_scale (&aff2,
> + double_int_minus_one);  aff_combination_add (&aff1, &aff2);  left =
> + fold_convert (type, aff_combination_to_tree (&aff1));
> +
> +  /* Transform (init, {left, right}_LOOP)_LOOP to {init, right}_LOOP
> + if "left" equals to "init + right".  */  if (operand_equal_p
> + (left, integer_zero_node, 0))
> 
> you don't need aff_combination_to_tree, simply check
> 
>  aff1.n == 0 && aff1.offset.is_zero ()
> 
> (you may want to add aff_combination_zero_p or even
> aff_combination_equal_p)

Hi,
This is the new version of the patch, with bug 59445 fixed and the free_affine
cache released just after it is used.  I also added aff_combination_zero_p as
suggested.
The change in add_old_iv_candidates is unnecessary now; since the previous
version of the patch triggered the assertion, I just leave it here as an
additional check.

Lastly, apologies that I missed Java in the bootstrap before.

Bootstrapped and tested on x86/x86_64; arm is still ongoing.  Is it OK if the
tests pass?
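(For reference, the new tree-affine.h predicate presumably looks something like
this; a sketch in double_int terms, not a copy of the attached patch:)

  /* Return true if AFF holds no elements and a zero offset, i.e. it is
     known to represent the value zero.  */

  static inline bool
  aff_combination_zero_p (aff_tree *aff)
  {
    if (!aff)
      return true;

    if (aff->n == 0 && aff->offset.is_zero ())
      return true;

    return false;
  }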

Thanks,
bin

2013-12-12  Bin Cheng  

PR tree-optimization/58296
PR tree-optimization/41488
* tree-scalar-evolution.c: Include necessary header files.
(simplify_peeled_chrec): New function.
(analyze_evolution_in_loop): New static variable.
Call simplify_peeled_chrec.
* tree-ssa-loop-ivopts.c (mark_bivs): Don't mark peeled IV as biv.
(add_old_iv_candidates): Don't add candidate for peeled IV.
* tree-affine.h (aff_combination_zero_p): New function.

gcc/test
