RE: A problem about loop store motion

2012-02-20 Thread Jiangning Liu


> -Original Message-
> From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of
> Jiangning Liu
> Sent: Friday, February 17, 2012 5:53 PM
> To: gcc@gcc.gnu.org
> Subject: A problem about loop store motion
> 
> Hi,
> 
> For this small test case,
> 
> int *l, *r;
> int test_func(void)
> {
>   int i;
>   int direction;
>   static int pos;
> 
>   pos = 0;
>   direction = 1;
> 
>   for ( i = 0; i <= 400; i++ )
>   {
>   if ( direction == 0 )
>   pos = l[pos];
>   else
>   pos = r[pos];
> 
>   if ( pos == -1 )
>   {
>   pos = 0;
>   direction = !direction;
>   }
>   }
>   return i;
> }
> 
> In middle end, I don't see pos is sunk out of loop by loop store motion.
> Any
> idea?
> 
> The dump after lim is like below, and I expect a SSA symbole xxx_lsm
> could
> be created with this pass.
> 
> ;; Function test_func (test_func, funcdef_no=0, decl_uid=4057,
> cgraph_uid=0)
> 
> 
> Symbols to be put in SSA form
> { .MEM }
> Incremental SSA update started at block: 0
> Number of blocks in CFG: 12
> Number of blocks to update: 11 ( 92%)
> 
> 
> test_func ()
> {
>   int pretmp.14;
>   unsigned int pretmp.13;
>   int prephitmp.12;
>   int pretmp.11;
>   unsigned int pretmp.10;
>   int pretmp.9;
>   int D.4088;
>   static int pos;
>   int direction;
>   int i;
>   _Bool D.4082;
>   int pos.5;
>   int * D.4078;
>   int * r.4;
>   int pos.3;
>   int * D.4074;
>   unsigned int D.4073;
>   unsigned int pos.2;
>   int pos.1;
>   int * l.0;
> 
> :
>   pos = 0;
>   l.0_6 = l;
>   r.4_12 = r;
> 
> :
>   # i_32 = PHI 
>   # direction_37 = PHI 
>   # prephitmp.12_35 = PHI 
>   if (direction_37 == 0)
> goto ;
>   else
> goto ;
> 
> :
>   pos.1_7 = prephitmp.12_35;
>   pos.2_8 = (unsigned int) pos.1_7;
>   D.4073_9 = pos.2_8 * 4;
>   D.4074_10 = l.0_6 + D.4073_9;
>   pos.3_11 = *D.4074_10;
>   pos = pos.3_11;
>   goto ;
> 
> :
>   pos.1_13 = prephitmp.12_35;
>   pos.2_14 = (unsigned int) pos.1_13;
>   D.4073_15 = pos.2_14 * 4;
>   D.4078_16 = r.4_12 + D.4073_15;
>   pos.5_17 = *D.4078_16;
>   pos = pos.5_17;
> 
> :
>   # prephitmp.12_31 = PHI 
>   pos.1_18 = prephitmp.12_31;
>   if (pos.1_18 == -1)
> goto ;
>   else
> goto ;
> 
> :
>   goto ;
> 
> :
>   pos = 0;
>   D.4088_36 = direction_37 ^ 1;
>   direction_20 = D.4088_36 & 1;
> 
> :
>   # direction_2 = PHI 
>   i_21 = i_32 + 1;
>   if (i_21 != 401)
> goto ;
>   else
> goto ;
> 
> :
>   pretmp.11_1 = pos;

Hi,

pos in RHS seems to be transformed to MEM[&pos] by the code in
tree-ssa-sccvn.c like below,

case VAR_DECL:
  if (DECL_HARD_REGISTER (ref))
{
  temp.op0 = ref;
  break;
}
  /* Fallthru.  */
case PARM_DECL:
case CONST_DECL:
case RESULT_DECL:
  /* Canonicalize decls to MEM[&decl] which is what we end up with
 when valueizing MEM[ptr] with ptr = &decl.  */
  temp.opcode = MEM_REF;
  temp.op0 = build_int_cst (build_pointer_type (TREE_TYPE (ref)),
0);
  temp.off = 0;
  VEC_safe_push (vn_reference_op_s, heap, *result, &temp);
  temp.opcode = ADDR_EXPR;
  temp.op0 = build_fold_addr_expr (ref);
  temp.type = TREE_TYPE (temp.op0);
  temp.off = -1;
  break;

But the LHS doesn't have this kind of transformation on gimple,

So the code below in gather_mem_refs_stmt of tree-ssa-loop-im.c can't find
MEM[&pos] is just pos,

  hash = iterative_hash_expr (*mem, 0);
  slot = htab_find_slot_with_hash (memory_accesses.refs, *mem, hash,
INSERT);

Finally pos is set to be dependent on pretmp.11_1 at code below in
ref_indep_loop_p_1, and then loop store motion fails to find pos.

  EXECUTE_IF_SET_IN_BITMAP (refs_to_check, 0, i, bi)
{
  aref = VEC_index (mem_ref_p, memory_accesses.refs_list, i);
  if (!refs_independent_p (ref, aref))
{
  ret = false;
  record_indep_loop (loop, aref, false);
  break;
}
}

Should we fix this problem by enhancing iterative_hash_expr, or we should
use MEM[&pos] to describe pos at all?

Any hints? I need to understand why we have to use MEM[&pos] to describe
pos? This looks strange to me.

BTW, this bug caused significant regression for an important benchmark.

Thanks,
-Jiangning

>   goto ;
> 
> :
>   return 401;
> 
> }
> 
> Thanks,
> -Jiangning
> 
> 
> 






Re: [i386] Question about Constraint Modifier character in smaxdf3 pattern.

2012-02-20 Thread Richard Guenther
On Sun, Feb 19, 2012 at 4:43 PM, Marc Glisse  wrote:
> On Sun, 19 Feb 2012, Kumar, Venkataramanan wrote:
>
>> Sphinx3 benchmark segmented when built and ran with -Ofast and
>> -fprefecth-loop-arrays.
>
> [...]
>
>> In my case the register 367 is sNAN.
>
>
> -Ofast implies -ffast-math which implies -ffinite-math-only:
>    Allow optimizations for floating-point arithmetic that assume that
>    arguments and results are not NaNs or +-Infs.
>
> so I am not sure what you expect exactly.

Irrespective of -ffast-math the .md file looks ok - the tree and RTL level
mark their min/max as commutative, so if there is an error the error is
that sminmax is used, not that the pattern for it has its operand marked
as commutative (this may mean that without -fno-signed-zeros and
-ffinite-math-only [s]minmax cannot be used on x86).

Richard.

> --
> Marc Glisse


Re: A problem about loop store motion

2012-02-20 Thread Richard Guenther
On Mon, Feb 20, 2012 at 9:46 AM, Jiangning Liu  wrote:
>
>
>> -Original Message-
>> From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of
>> Jiangning Liu
>> Sent: Friday, February 17, 2012 5:53 PM
>> To: gcc@gcc.gnu.org
>> Subject: A problem about loop store motion
>>
>> Hi,
>>
>> For this small test case,
>>
>> int *l, *r;
>> int test_func(void)
>> {
>>       int i;
>>       int direction;
>>       static int pos;
>>
>>       pos = 0;
>>       direction = 1;
>>
>>       for ( i = 0; i <= 400; i++ )
>>       {
>>               if ( direction == 0 )
>>                       pos = l[pos];
>>               else
>>                       pos = r[pos];
>>
>>               if ( pos == -1 )
>>               {
>>                       pos = 0;
>>                       direction = !direction;
>>               }
>>       }
>>       return i;
>> }
>>
>> In middle end, I don't see pos is sunk out of loop by loop store motion.
>> Any
>> idea?
>>
>> The dump after lim is like below, and I expect a SSA symbole xxx_lsm
>> could
>> be created with this pass.
>>
>> ;; Function test_func (test_func, funcdef_no=0, decl_uid=4057,
>> cgraph_uid=0)
>>
>>
>> Symbols to be put in SSA form
>> { .MEM }
>> Incremental SSA update started at block: 0
>> Number of blocks in CFG: 12
>> Number of blocks to update: 11 ( 92%)
>>
>>
>> test_func ()
>> {
>>   int pretmp.14;
>>   unsigned int pretmp.13;
>>   int prephitmp.12;
>>   int pretmp.11;
>>   unsigned int pretmp.10;
>>   int pretmp.9;
>>   int D.4088;
>>   static int pos;
>>   int direction;
>>   int i;
>>   _Bool D.4082;
>>   int pos.5;
>>   int * D.4078;
>>   int * r.4;
>>   int pos.3;
>>   int * D.4074;
>>   unsigned int D.4073;
>>   unsigned int pos.2;
>>   int pos.1;
>>   int * l.0;
>>
>> :
>>   pos = 0;
>>   l.0_6 = l;
>>   r.4_12 = r;
>>
>> :
>>   # i_32 = PHI 
>>   # direction_37 = PHI 
>>   # prephitmp.12_35 = PHI 
>>   if (direction_37 == 0)
>>     goto ;
>>   else
>>     goto ;
>>
>> :
>>   pos.1_7 = prephitmp.12_35;
>>   pos.2_8 = (unsigned int) pos.1_7;
>>   D.4073_9 = pos.2_8 * 4;
>>   D.4074_10 = l.0_6 + D.4073_9;
>>   pos.3_11 = *D.4074_10;
>>   pos = pos.3_11;
>>   goto ;
>>
>> :
>>   pos.1_13 = prephitmp.12_35;
>>   pos.2_14 = (unsigned int) pos.1_13;
>>   D.4073_15 = pos.2_14 * 4;
>>   D.4078_16 = r.4_12 + D.4073_15;
>>   pos.5_17 = *D.4078_16;
>>   pos = pos.5_17;
>>
>> :
>>   # prephitmp.12_31 = PHI 
>>   pos.1_18 = prephitmp.12_31;
>>   if (pos.1_18 == -1)
>>     goto ;
>>   else
>>     goto ;
>>
>> :
>>   goto ;
>>
>> :
>>   pos = 0;
>>   D.4088_36 = direction_37 ^ 1;
>>   direction_20 = D.4088_36 & 1;
>>
>> :
>>   # direction_2 = PHI 
>>   i_21 = i_32 + 1;
>>   if (i_21 != 401)
>>     goto ;
>>   else
>>     goto ;
>>
>> :
>>   pretmp.11_1 = pos;
>
> Hi,
>
> pos in RHS seems to be transformed to MEM[&pos] by the code in
> tree-ssa-sccvn.c like below,
>
>        case VAR_DECL:
>          if (DECL_HARD_REGISTER (ref))
>            {
>              temp.op0 = ref;
>              break;
>            }
>          /* Fallthru.  */
>        case PARM_DECL:
>        case CONST_DECL:
>        case RESULT_DECL:
>          /* Canonicalize decls to MEM[&decl] which is what we end up with
>             when valueizing MEM[ptr] with ptr = &decl.  */
>          temp.opcode = MEM_REF;
>          temp.op0 = build_int_cst (build_pointer_type (TREE_TYPE (ref)),
> 0);
>          temp.off = 0;
>          VEC_safe_push (vn_reference_op_s, heap, *result, &temp);
>          temp.opcode = ADDR_EXPR;
>          temp.op0 = build_fold_addr_expr (ref);
>          temp.type = TREE_TYPE (temp.op0);
>          temp.off = -1;
>          break;
>
> But the LHS doesn't have this kind of transformation on gimple,
>
> So the code below in gather_mem_refs_stmt of tree-ssa-loop-im.c can't find
> MEM[&pos] is just pos,
>
>  hash = iterative_hash_expr (*mem, 0);
>  slot = htab_find_slot_with_hash (memory_accesses.refs, *mem, hash,
> INSERT);
>
> Finally pos is set to be dependent on pretmp.11_1 at code below in
> ref_indep_loop_p_1, and then loop store motion fails to find pos.
>
>  EXECUTE_IF_SET_IN_BITMAP (refs_to_check, 0, i, bi)
>    {
>      aref = VEC_index (mem_ref_p, memory_accesses.refs_list, i);
>      if (!refs_independent_p (ref, aref))
>        {
>          ret = false;
>          record_indep_loop (loop, aref, false);
>          break;
>        }
>    }
>
> Should we fix this problem by enhancing iterative_hash_expr, or we should
> use MEM[&pos] to describe pos at all?

The MEM form is more "canonical", so the loop SM machinery to detect
equality should be adjusted accordingly.  Alternatively you can teach
PRE insertion to strip off the MEM if possible (though fold_stmt_inplace should
arelady do this if possible).

Richard.

> Any hints? I need to understand why we have to use MEM[&pos] to describe
> pos? This looks strange to me.
>
> BTW, this bug caused significant regression for an important benchmark.
>
> Thanks,
> -Jiangning
>
>>   goto ;
>>
>> :
>>   return

Re: weird optimization in sin+cos, x86 backend

2012-02-20 Thread Vincent Lefevre
On 2012-02-18 14:38:01 +, James Courtier-Dutton wrote:
> Would it be useful to have some lib functions:
> remainder_2pi(x, &remainder) /* returns remainder of x / 2Pi */
> remainder_pi2(x, &remainder, "ient)  /* returns remainder of x /
> (Pi/2)  with a quotient returned so you can tell where in the range [0
> , 2Pi] it would fall.*/
> remainder_pi4(x, &remainder, "ient)  /* returns remainder of x /
> (Pi/4), with a quotient returned.

For high accuracy (including correct rounding), since range reduction
isn't exact, one needs the remainder in a higher precision. The needed
precision can be more than twice the precision of the system.

-- 
Vincent Lefèvre  - Web: 
100% accessible validated (X)HTML - Blog: 
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)


RE: [i386] Question about Constraint Modifier character in smaxdf3 pattern.

2012-02-20 Thread Kumar, Venkataramanan

> -Original Message-
> From: Richard Guenther [mailto:richard.guent...@gmail.com]
> Sent: Monday, February 20, 2012 4:58 PM
> To: gcc@gcc.gnu.org
> Cc: Kumar, Venkataramanan
> Subject: Re: [i386] Question about Constraint Modifier character in smaxdf3
> pattern.
> 
> On Sun, Feb 19, 2012 at 4:43 PM, Marc Glisse  wrote:
> > On Sun, 19 Feb 2012, Kumar, Venkataramanan wrote:
> >
> >> Sphinx3 benchmark segmented when built and ran with -Ofast and
> >> -fprefecth-loop-arrays.
> >
> > [...]
> >
> >> In my case the register 367 is sNAN.
> >
> >
> > -Ofast implies -ffast-math which implies -ffinite-math-only:
> >    Allow optimizations for floating-point arithmetic that assume that
> >    arguments and results are not NaNs or +-Infs.
> >
> > so I am not sure what you expect exactly.
> 
> Irrespective of -ffast-math the .md file looks ok - the tree and RTL level
> mark their min/max as commutative, so if there is an error the error is
> that sminmax is used, not that the pattern for it has its operand marked
> as commutative (this may mean that without -fno-signed-zeros and
> -ffinite-math-only [s]minmax cannot be used on x86).


Since I compile with -Ofast it sets -finite-math-only. Then usage smax of 
pattern is correct.

But in my case smax (sNAN, x) is changed to smax (x, sNAN) at reload. And sNaN 
is returned as result. 

Ok I will check see why sNAN is getting generated at the first place.

> 
> Richard.
> 
> > --
> > Marc Glisse




RE: spill failure after IF-CASE-2 transformation

2012-02-20 Thread Henderson, Stuart
Ping.
>>looks like an invalid transformation, but I suspect rather than setting
>>the CC register, the (*) insn is setting a pseudo (more accurate RTL
>>would be useful). There are some cases in ifcvt.c which check
>>targetm.small_register_classes_for_mode already, this is probably what
>>should be done to prevent this transformation.
>
>You suspect correctly, cc=x sets CC whereas cc=y is a pseudo which can
>only match CC.
>
>Presumably I must check all instructions in the else_bb for
>modifications to small_register_classes_for_mode_p?  e.g. see below.
>
>Does this seem reasonable?

Patch here:
http://gcc.gnu.org/ml/gcc/2012-02/msg00296.html

Thanks,
Stu



Re: spill failure after IF-CASE-2 transformation

2012-02-20 Thread Bernd Schmidt
On 02/14/2012 07:12 PM, Henderson, Stuart wrote:
>> spill_failure does return for asms since we don't want to ICE on bad
>> user code. That's all that's going on here.
> 
> ahh, thanks.
> 
>> It sounds like ifcvt needs to be fixed. Your example:
>>> block 44:
>>> set cc = x;
>>> set cc = y; (*)
>>> if cc jump;
>>
>> looks like an invalid transformation, but I suspect rather than setting
>> the CC register, the (*) insn is setting a pseudo (more accurate RTL
>> would be useful). There are some cases in ifcvt.c which check
>> targetm.small_register_classes_for_mode already, this is probably what
>> should be done to prevent this transformation.
> 
> You suspect correctly, cc=x sets CC whereas cc=y is a pseudo which can only 
> match CC.
> 
> Presumably I must check all instructions in the else_bb for modifications to 
> small_register_classes_for_mode_p?  e.g. see below.
> 
> Does this seem reasonable?

Not really. I think in dead_or_predicable you need to check in the
  /* Try the NCE path if the CE path did not result in any changes.  */
block (I assume this is where we end up in this testcase) that none of
the live hard regs at the point where we are going to insert the insns
are in small register classes. small_register_classes_for_mode_p is
unfortunately not a good interface to answer that question.

It looks like noce_get_condition probably didn't find the instruction
setting CC? Maybe you can paper over the problem for the moment by
making sure it does.


Bernd


Inefficient end-of-loop value computation - missed optimization somewhere?

2012-02-20 Thread Ulrich Weigand
Hello,

we've noticed that the loop optimizer sometimes inserts weirdly
inefficient code to compute the value of an induction variable
at the end of the loop.

As a test case stripped down to the core of the problem, consider:

int test (int start, int end)
{
  int i;

  for (i = start; i < end; i++)
;

  return i;
}

That function really ought to be equivalent to 

int test2 (int start, int end)
{
  return start < end ? end : start;
}

And indeed on i386-linux-gnu, we get mostly identical code
(except for the branch direction prediction).

However, on arm-linux-gnuabi (using -mthumb and -march=armv7-a),
we see for the first function:

test:
cmp r0, r1
bge .L2
addsr3, r0, #1
mvnsr0, r0
addsr1, r1, r0
addsr0, r3, r1
.L2:
bx  lr

instead of simply (as for the second function):

test2:
cmp r1, r0
it  ge
movge   r0, r1
bx  lr



Where does that weird addition-and-negation sequence come from?
In 100t.dceloop we still have:

:
  # i_9 = PHI 
  i_5 = i_9 + 1;
  if (end_4(D) > i_5)
goto ;
  else
goto ;

:
  goto ;

:
  # i_1 = PHI 


But 102t.sccp has:

:
  # i_9 = PHI 
  i_5 = i_9 + 1;
  if (end_4(D) > i_5)
goto ;
  else
goto ;

:
  goto ;

:
  D.4123_10 = start_2(D) + 1;
  D.4124_11 = ~start_2(D);
  D.4125_12 = (unsigned int) D.4124_11;
  D.4126_13 = (unsigned int) end_4(D);
  D.4127_14 = D.4125_12 + D.4126_13;
  D.4128_15 = (int) D.4127_14;
  i_1 = D.4123_10 + D.4128_15;

This is the sequence that ends up in the final assembler code.

Note that this computes:

  i_1 = (start_2(D) + 1)
 + (int)((unsigned int) ~start_2(D) + (unsigned int) end_4(D))

which is (since those casts are no-ops on the assembler level):

  i_1 = start_2(D) + 1 + ~start_2(D) + end_4(D)

which is due to two's-complement arithmetic:

  i_1 = start_2(D) - start_2(D) + end_4(D)

that is simply:

  i_1 = end_4(D)


Unfortunately, no tree-based optimization pass after 102t.sccp is able to
clean up that mess.  This is true even on *i386*: the only reason we get
nice assembler there is due to *RTX* optimization, notably in combine.
This hits on i386 because of an intermediate stage that adds two registers
and the constant 1 (which matches the lea pattern).  On arm, there is no
instruction that would handle that intermediate stage, and combine is not
powerful enough to perform the whole simplification in a single step.


Now that sequence of gimple operations is generated in scev_const_prop
in tree-scalar-evolution.c.  First, the phi node corresponding to the
loop exit value is evaluated:

  def = analyze_scalar_evolution_in_loop (ex_loop, loop, def, NULL);

which returns a chrec of the form

  { 1, +, (start + 1) }

This is then further simplified by

  def = compute_overall_effect_of_inner_loop (ex_loop, def);

which first computes the number of loop iterations:

  tree nb_iter = number_of_latch_executions (loop);

which returns a tree representing

  (unsigned int) ~start + (unsigned int) end

(which at this point seems to be the validly most efficient way to
compute end - start - 1).

The chrec is then evaluated via chrec_apply which basically computes

  (start + 1) + 1 * (int)((unsigned int) ~start + (unsigned int) end)

which ends up being converted to the long gimple sequence seen above.


Now I guess my questions are:

- In the computation shown above, should that tree have been folded
  into just "end" to start with?  The tree is constructed at its
  outermost level in chrec_fold_plus, which simply calls fold_build2.
  While simplify-rtx.c has code to detect sequences of operations
  that cancel out while folding a PLUS or MINUS RTX, there does
  not appear to be corresponding code in fold-const.c to handle
  the equivalent optimization at the tree level.

- If not, should there be another tree pass later on that ought to
  clean this up?  I notice there is code to handle plus/minus
  sequences in tree-ssa-reassoc.c.  But this doesn't seem to be
  able to handle this particular case, possibly because the presence
  of the casts (nop_exprs) ...

- Anywhere else I'm overlooking right now?


I'd appreciate any tips / suggestions how to fix this ...


Thanks,
Ulrich

-- 
  Dr. Ulrich Weigand
  GNU Toolchain for Linux on System z and Cell BE
  ulrich.weig...@de.ibm.com