Predictive commoning leads to register to register moves through memory.

2015-08-28 Thread Simon Dardis
Hello all,

I've been looking at a code generation issue in GCC 5.2 involving register-to-register moves that go through memory when compiling with -O3 -funroll-loops. For reference, the C code is at the end of this mail. The generated code for MIPS is (cut down for clarity; ldc1 and sdc1 are doubleword floating-point loads and stores):

div.d   $f8,$f6,$f4
mul.d   $f2,$f8,$f8
sdc1    $f2,8($7)
$L38:
ldc1    $f0,8($7)   <- load instead of move
li  $11,1   # 0x1


$L49:

div.d   $f8,$f6,$f4
addiu   $11,$10,3
mul.d   $f2,$f8,$f8
sdc1    $f2,8($7)
ldc1    $f0,8($7)   <- load instead of move
$L48:
mul.d   $f2,$f2,$f0




$L45:
mul.d   $f2,$f4,$f4
mov.d   $f8,$f4
j   $L38
sdc1    $f2,8($7)

For basic block $L38, every dominating block stores to 8($7), and that value is then loaded back into another floating-point register.

Disabling predictive commoning (-fno-predictive-commoning) generates:

div.d   $f4,$f18,$f2
mul.d   $f0,$f4,$f4
$L37:
mul.d   $f6,$f0,$f0
li  $10,1   # 0x1
mul.d   $f8,$f0,$f6
mul.d   $f10,$f0,$f8
mul.d   $f12,$f0,$f10
mul.d   $f14,$f0,$f12
mul.d   $f16,$f0,$f14
beq $4,$10,$L38
mul.d   $f20,$f0,$f16

This is the same basic block as before, now without the load through memory.

Following Jeff's advice[1] to extract more information from GCC, I've narrowed the cause down to the predictive commoning pass inserting the load in a loop-header-style basic block. However, the next pass, tree-cunroll, promptly removes the loop and joins the loop header to the body of the (non)loop. More oddly, disabling the conditional store elimination pass, the dominator optimizations pass, or jump threading with --param max-jump-thread-duplication-stmts=0 also produces the assembly code shown above. Any ideas on an approach for this issue?

[1] https://gcc.gnu.org/ml/gcc-help/2015-08/msg00162.html

Thanks,
Simon

double N;
int i1;
double T;
double poly[9];


void 
g (int iterations)
{
  int count = 0;
  for (count = 0; count < iterations; count++)
    {
      if (N > 1)
        {
          T = 1 / N;
        }
      else
        {
          T = N;
        }

      poly[1] = T * T;
      for (i1 = 2; i1 <= 8; i1++)
        {
          poly[i1] = poly[i1 - 1] * poly[1];
        }
    }

  return;
}


RE: Predictive commoning leads to register to register moves through memory.

2015-09-17 Thread Simon Dardis
I've since taken another look at this and tracked the issue down to tree-predcom.c, specifically to ref_at_iteration almost always generating MEM_REFs. With MEM_REFs, GCC's RTL GCSE pass cannot prove the references equal and therefore cannot remove the redundant load. A previous version of the code did generate ARRAY_REFs (before r204458), but that was changed to generate MEM_REFs for PR 58653.
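
To make that concrete: the same element of poly shows up in the GIMPLE in both shapes (the two lines below are lifted from the x86_64 dump Richard posted elsewhere in this thread), and RTL GCSE does not recognise the store through the first form and the load through the second as the same location:

  poly[1] = _12;                                   <- ARRAY_REF store
  poly_I_lsm0.10_38 = MEM[(double *)&poly + 8B];   <- MEM_REF load of the same slot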

Would something like:
--- a/gcc/tree-predcom.c
+++ b/gcc/tree-predcom.c
@@ -1409,7 +1409,21 @@ ref_at_iteration (data_reference_p dr, int iter, gimple_seq *stmts)
 			     addr, alias_ptr),
 		     DECL_SIZE (field), bitsize_zero_node);
     }
-  else
+  /* Generate an ARRAY_REF for array references when all details are
+     INTEGER_CST rather than a MEM_REF so that CSE passes can potentially
+     optimize them.  */
+  else if (TREE_CODE (DR_REF (dr)) == ARRAY_REF
+	   && TREE_CODE (DR_STEP (dr)) == INTEGER_CST
+	   && TREE_CODE (DR_INIT (dr)) == INTEGER_CST
+	   && TREE_CODE (DR_OFFSET (dr)) == INTEGER_CST)
+    {
+      /* Reverse engineer the element from the memory offset.  */
+      tree offset = size_binop (MINUS_EXPR, coff, off);
+      tree sizdiv = TYPE_SIZE (TREE_TYPE (TREE_TYPE (DR_BASE_OBJECT (dr))));
+      sizdiv = div_if_zero_remainder (EXACT_DIV_EXPR, sizdiv,
+				      ssize_int (BITS_PER_UNIT));
+      tree element = div_if_zero_remainder (EXACT_DIV_EXPR, offset, sizdiv);
+      if (element != NULL_TREE)
+	return build4 (ARRAY_REF, TREE_TYPE (DR_REF (dr)), DR_BASE_OBJECT (dr),
+		       element, NULL_TREE, NULL_TREE);
+    }
   return fold_build2 (MEM_REF, TREE_TYPE (DR_REF (dr)), addr, alias_ptr);

be an appropriate start to fixing this? The change appears to work in my testing.

Thanks,
Simon

-Original Message-
From: Richard Biener [mailto:richard.guent...@gmail.com] 
Sent: 31 August 2015 11:40
To: Jeff Law
Cc: Simon Dardis; gcc@gcc.gnu.org
Subject: Re: Predictive commoning leads to register to register moves through 
memory.

On Fri, Aug 28, 2015 at 5:48 PM, Jeff Law  wrote:
> On 08/28/2015 09:43 AM, Simon Dardis wrote:
>
>> Following Jeff's advice[1] to extract more information from GCC, I've 
>> narrowed the cause down to the predictive commoning pass inserting 
>> the load in a loop header style basic block. However, the next pass 
>> in GCC, tree-cunroll promptly removes the loop and joins the loop 
>> header to the body of the (non)loop. More oddly, disabling 
>> conditional store elimination pass or the dominator optimizations 
>> pass or disabling of jump-threading with --param
>> max-jump-thread-duplication-stmts=0 nets the above assembly code. Any 
>> ideas on an approach for this issue?
>
> I'd probably start by looking at the .optimized tree dump in both 
> cases to understand the difference, then (most liklely) tracing that 
> through the RTL optimizers into the register allocator.

It's the known issue of LIM (here the one that runs after pcom and complete unrolling of the inner loop) being too aggressive with store-motion.  Here the complete array is replaced with registers for the outer loop.  Were 'poly' a local variable we'd have optimized it away completely.

  :
  _8 = 1.0e+0 / pretmp_42;
  _12 = _8 * _8;
  poly[1] = _12;

  :
  # prephitmp_30 = PHI <_12(6), _36(9)>
  # T_lsm.8_22 = PHI <_8(6), pretmp_42(9)>
  poly_I_lsm0.10_38 = MEM[(double *)&poly + 8B];
  _2 = prephitmp_30 * poly_I_lsm0.10_38;
  _54 = _2 * poly_I_lsm0.10_38;
  _67 = poly_I_lsm0.10_38 * _54;
  _80 = poly_I_lsm0.10_38 * _67;
  _93 = poly_I_lsm0.10_38 * _80;
  _106 = poly_I_lsm0.10_38 * _93;
  _19 = poly_I_lsm0.10_38 * _106;
  count_23 = count_28 + 1;
  if (count_23 != iterations_6(D))
goto ;
  else
goto ;

  :
  poly[2] = _2;
  poly[3] = _54;
  poly[4] = _67;
  poly[5] = _80;
  poly[6] = _93;
  poly[7] = _106;
  poly[8] = _19;
  i1 = 9;
  T = T_lsm.8_22;

Note that DOM misses CSEing poly[1] (a known defect), but doing that would only increase register pressure even more.

Note the above is on x86_64.

Richard.


> jeff


RE: Predictive commoning leads to register to register moves through memory.

2015-09-22 Thread Simon Dardis
I took an attempt at addressing this through the RTL GCSE pass. The attempt tweaks mem_attrs_eq_p to return true if it is comparing something like poly+8 and MEM[&poly + 8].

Is this a more suitable approach?

Thanks,
Simon

+/* Return true if p and q reference the same location by the same name but
+   through VAR_DECL and MEM_REF.  */
+
+static bool
+mem_locations_match_p (const struct mem_attrs *p, const struct mem_attrs *q)
+{
+  HOST_WIDE_INT var_offset;
+  tree var, memref;
+
+  if (p->expr == NULL_TREE || q->expr == NULL_TREE)
+    return false;
+
+  if (TREE_CODE (p->expr) == MEM_REF && TREE_CODE (q->expr) == VAR_DECL)
+    {
+      var = q->expr;
+      var_offset = q->offset;
+      memref = p->expr;
+    }
+  else if (TREE_CODE (q->expr) == MEM_REF && TREE_CODE (p->expr) == VAR_DECL)
+    {
+      var = p->expr;
+      var_offset = p->offset;
+      memref = q->expr;
+    }
+  else
+    return false;
+
+  if (TREE_OPERAND (TREE_OPERAND (memref, 0), 0) != var)
+    return false;
+
+  if (TREE_TYPE (TREE_TYPE (var)) != TREE_TYPE (memref))
+    return false;
+
+  tree offset = TREE_OPERAND (memref, 1);
+  if ((TREE_CODE (offset) == INTEGER_CST && tree_fits_shwi_p (offset)
+       && tree_to_shwi (offset) == var_offset)
+      || (offset == NULL_TREE && var_offset == 0))
+    return true;
+
+  return false;
+}
+
 /* Return true if the given memory attributes are equal.  */
 
 bool
@@ -254,16 +298,16 @@ mem_attrs_eq_p (const struct mem_attrs *p, const struct mem_attrs *q)
     return false;
   return (p->alias == q->alias
 	  && p->offset_known_p == q->offset_known_p
-	  && (!p->offset_known_p || p->offset == q->offset)
 	  && p->size_known_p == q->size_known_p
 	  && (!p->size_known_p || p->size == q->size)
 	  && p->align == q->align
 	  && p->addrspace == q->addrspace
-	  && (p->expr == q->expr
-	      || (p->expr != NULL_TREE && q->expr != NULL_TREE
-		  && operand_equal_p (p->expr, q->expr, 0))));
+	  && (mem_locations_match_p (p, q)
+	      || ((!p->offset_known_p || p->offset == q->offset)
+		  && (p->expr == q->expr
+		      || (p->expr != NULL_TREE && q->expr != NULL_TREE
+			  && operand_equal_p (p->expr, q->expr, 0))))));
 }


RE: Predictive commoning leads to register to register moves through memory.

2015-09-24 Thread Simon Dardis
It seems to me that extending operand_equal_p to deal with the
MEM_REF/ARRAY_REF case could help.

Your recent change of:

  if (TREE_CODE (arg0) == MEM_REF
  && DECL_P (arg1)
  && TREE_CODE (TREE_OPERAND (arg0, 0)) == ADDR_EXPR
  && TREE_OPERAND (TREE_OPERAND (arg0, 0), 0) == arg1
  && integer_zerop (TREE_OPERAND (arg0, 1)))
return 1;
  else if (TREE_CODE (arg1) == MEM_REF
   && DECL_P (arg0)
   && TREE_CODE (TREE_OPERAND (arg1, 0)) == ADDR_EXPR
   && TREE_OPERAND (TREE_OPERAND (arg1, 0), 0) == arg0
   && integer_zerop (TREE_OPERAND (arg1, 1)))
return 1;
  return 0;

only seems to be handling the case of 'X' versus MEM_REF[&X + 0]. Do you think it would be useful to change this to 'return addr_eq_p (arg0, arg1);', where addr_eq_p also handles the ARRAY_(RANGE_)REF and COMPONENT_REF cases by checking the offset as well?
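
To be concrete, a rough sketch of the ARRAY_REF-vs-MEM_REF half of such a predicate is below. This is only an illustration written for this mail, not the code from the attached patch; addr_eq_p itself and the exact set of checks are hypothetical. It assumes a zero array lower bound and constant index, offset and element size, and a real version would want a proper type-compatibility check rather than the pointer comparison used here.

/* Sketch only: return true if REF0 and REF1 name the same array element,
   one spelled as an ARRAY_REF (poly[2]) and the other as a MEM_REF with a
   constant byte offset (MEM[(double *)&poly + 16B]), in either order.  */

static bool
addr_eq_p (tree ref0, tree ref1)
{
  /* Canonicalise the order: REF0 is the ARRAY_REF, REF1 the MEM_REF.  */
  if (TREE_CODE (ref0) == MEM_REF && TREE_CODE (ref1) == ARRAY_REF)
    return addr_eq_p (ref1, ref0);
  if (TREE_CODE (ref0) != ARRAY_REF || TREE_CODE (ref1) != MEM_REF)
    return false;

  tree base = TREE_OPERAND (ref0, 0);      /* the array, e.g. poly      */
  tree index = TREE_OPERAND (ref0, 1);     /* the index, e.g. 2         */
  tree addr = TREE_OPERAND (ref1, 0);      /* the address, e.g. &poly   */
  tree byte_off = TREE_OPERAND (ref1, 1);  /* the byte offset, e.g. 16  */

  /* The MEM_REF must take the address of the very same array, and the
     index, byte offset and lower bound must be constants we can compare.  */
  if (TREE_CODE (addr) != ADDR_EXPR
      || TREE_OPERAND (addr, 0) != base
      || !integer_zerop (array_ref_low_bound (ref0))
      || !tree_fits_shwi_p (index)
      || !tree_fits_shwi_p (byte_off))
    return false;

  /* Both forms must access the same (element) type.  */
  if (TREE_TYPE (ref0) != TREE_TYPE (ref1))
    return false;

  /* index * element-size must equal the MEM_REF's byte offset.  */
  tree elt_size = TYPE_SIZE_UNIT (TREE_TYPE (ref0));
  if (elt_size == NULL_TREE || !tree_fits_shwi_p (elt_size))
    return false;
  return tree_to_shwi (index) * tree_to_shwi (elt_size)
	 == tree_to_shwi (byte_off);
}

A COMPONENT_REF case could do much the same thing by comparing byte_position of the FIELD_DECL against the MEM_REF offset.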

I've put together an initial version which splits the operand_equal_p change out into its own predicate that just identifies matching base objects, together with the addr_eq_p mentioned above. Also in that patch is the change for mem_attrs_eq_p, since there the offset is not part of the tree expression and has to be handled differently.

Thoughts?

Thanks,
Simon

-Original Message-
From: Richard Biener [mailto:richard.guent...@gmail.com] 
Sent: 22 September 2015 12:12
To: Simon Dardis
Cc: Jeff Law; gcc@gcc.gnu.org
Subject: Re: Predictive commoning leads to register to register moves through 
memory.

On Tue, Sep 22, 2015 at 12:45 PM, Simon Dardis  wrote:
> I took an attempt at addressing this through the RTL GCSE pass. This 
> attempt tweaks mem_attrs_eq_p to return true if its comparing something like 
> poly+8 and MEM [&poly + 8].
>
> Is this a more suitable approach?

I actually recently modified stmt_kills_ref_p for a similar reason...
not that I liked that very much.
I think the more suitable approach would be to not have both forms if they are 
actually equal.
Of course that's way more work.

So splitting out a function that handles semantic equality comparison of MEM_EXPRs sounds good to me.
No comments on your actual implementation yet, but I think we should do it in a 
way to be re-usable by stmt_kills_ref_p.

Richard.


> Thanks,
> Simon


gcse.patch
Description: gcse.patch