[PATCH, rs6000] Remove XFAIL from default_format_denormal_2.f90 for PowerPC on Linux

2014-06-17 Thread William J. Schmidt
Hi,

The testcase gfortran.dg/default_format_denormal_2.f90 has been
reporting XPASS since 4.8 on the powerpc*-unknown-linux-gnu platforms.
This patch removes the XFAIL for powerpc*-*-linux* from the test.  I
believe this pattern doesn't match any other platforms, but please let
me know if I should replace it with a more specific pattern instead.

Verified on powerpc64-unknown-linux-gnu (-m32 and -m64) and
powerpc64le-unknown-linux-gnu (-m64).  Is this ok for trunk, 4.9, and
4.8?

Thanks,
Bill


2014-06-17  Bill Schmidt  

* gfortran.dg/default_format_denormal_2.f90: Remove xfail for
powerpc*-*-linux*.


Index: gcc/testsuite/gfortran.dg/default_format_denormal_2.f90
===
--- gcc/testsuite/gfortran.dg/default_format_denormal_2.f90 (revision 211741)
+++ gcc/testsuite/gfortran.dg/default_format_denormal_2.f90 (working copy)
@@ -1,5 +1,5 @@
 ! { dg-require-effective-target fortran_large_real }
-! { dg-do run { xfail powerpc*-apple-darwin* powerpc*-*-linux* } }
+! { dg-do run { xfail powerpc*-apple-darwin* } }
 ! Test XFAILed on these platforms because the system's printf() lacks
 ! proper support for denormalized long doubles. See PR24685
 !




Re: [PATCH] Fix PR46556 (poor address generation)

2011-10-18 Thread William J. Schmidt
Greetings,

Here is a new revision of the tree portions of this patch.  I moved the
pattern recognizer to expand, and added additional logic to look for the
same pattern in gimple form.  I added two more tests to verify the new
logic.

I didn't run into any problems with the RTL CSE phases.  I can't recall
for certain what caused me to abandon the expand version previously.
There may not have been good reason; too many versions to keep track of
and too many interruptions, I suppose.  In any case, I'm much happier
having this code in the expander.

Paolo's RTL logic for unpropagating the zero-offset case is not going to
work out as is.  It causes a number of performance degradations, which I
suspect are due to the pass reordering.  That's a separate issue,
though, and not needed for this patch.

Bootstrapped and regression-tested on powerpc64-linux.  SPEC cpu2000
shows a number of small improvements and no significant degradations.
SPEC cpu2006 testing is pending.

Thanks,
Bill


2011-10-18  Bill Schmidt  

gcc:

PR rtl-optimization/46556
* expr.c (tree-pretty-print.h): New include.
(restructure_base_and_offset): New function.
(restructure_mem_ref): New function.
(expand_expr_real_1): In MEM_REF case, attempt restructure_mem_ref
first.  In normal_inner_ref case, attempt restructure_base_and_offset
first.
* Makefile.in: Update dependences for expr.o.

gcc/testsuite:

PR rtl-optimization/46556
* gcc.dg/tree-ssa/pr46556-1.c: New test.
* gcc.dg/tree-ssa/pr46556-2.c: Likewise.
* gcc.dg/tree-ssa/pr46556-3.c: Likewise.
* gcc.dg/tree-ssa/pr46556-4.c: Likewise.
* gcc.dg/tree-ssa/pr46556-5.c: Likewise.


Index: gcc/testsuite/gcc.dg/tree-ssa/pr46556-1.c
===
--- gcc/testsuite/gcc.dg/tree-ssa/pr46556-1.c   (revision 0)
+++ gcc/testsuite/gcc.dg/tree-ssa/pr46556-1.c   (revision 0)
@@ -0,0 +1,22 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-rtl-expand" } */
+
+struct x
+{
+  int a[16];
+  int b[16];
+  int c[16];
+};
+
+extern void foo (int, int, int);
+
+void
+f (struct x *p, unsigned int n)
+{
+  foo (p->a[n], p->c[n], p->b[n]);
+}
+
+/* { dg-final { scan-rtl-dump-times "\\(mem/s:SI \\(plus:" 2 "expand" } } */
+/* { dg-final { scan-rtl-dump-times "const_int 128" 1 "expand" } } */
+/* { dg-final { scan-rtl-dump-times "const_int 64 \\\[0x40\\\]\\)\\) \\\[" 1 "expand" } } */
+/* { dg-final { cleanup-rtl-dump "expand" } } */
Index: gcc/testsuite/gcc.dg/tree-ssa/pr46556-2.c
===
--- gcc/testsuite/gcc.dg/tree-ssa/pr46556-2.c   (revision 0)
+++ gcc/testsuite/gcc.dg/tree-ssa/pr46556-2.c   (revision 0)
@@ -0,0 +1,26 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-rtl-expand" } */
+
+struct x
+{
+  int a[16];
+  int b[16];
+  int c[16];
+};
+
+extern void foo (int, int, int);
+
+void
+f (struct x *p, unsigned int n)
+{
+  foo (p->a[n], p->c[n], p->b[n]);
+  if (n > 12)
+foo (p->a[n], p->c[n], p->b[n]);
+  else if (n > 3)
+foo (p->b[n], p->a[n], p->c[n]);
+}
+
+/* { dg-final { scan-rtl-dump-times "\\(mem/s:SI \\(plus:" 6 "expand" } } */
+/* { dg-final { scan-rtl-dump-times "const_int 128" 3 "expand" } } */
+/* { dg-final { scan-rtl-dump-times "const_int 64 \\\[0x40\\\]\\)\\) \\\[" 3 "expand" } } */
+/* { dg-final { cleanup-rtl-dump "expand" } } */
Index: gcc/testsuite/gcc.dg/tree-ssa/pr46556-3.c
===
--- gcc/testsuite/gcc.dg/tree-ssa/pr46556-3.c   (revision 0)
+++ gcc/testsuite/gcc.dg/tree-ssa/pr46556-3.c   (revision 0)
@@ -0,0 +1,27 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-rtl-expand" } */
+struct x
+{
+  int a[16];
+  int b[16];
+  int c[16];
+};
+
+extern void foo (int, int, int);
+
+void
+f (struct x *p, unsigned int n)
+{
+  foo (p->a[n], p->c[n], p->b[n]);
+  if (n > 3)
+{
+  foo (p->a[n], p->c[n], p->b[n]);
+  if (n > 12)
+   foo (p->b[n], p->a[n], p->c[n]);
+}
+}
+
+/* { dg-final { scan-rtl-dump-times "\\(mem/s:SI \\(plus:" 6 "expand" } } */
+/* { dg-final { scan-rtl-dump-times "const_int 128" 3 "expand" } } */
+/* { dg-final { scan-rtl-dump-times "const_int 64 \\\[0x40\\\]\\)\\) \\\[" 3 "expand" } } */
+/* { dg-final { cleanup-rtl-dump "expand" } } */
Index: gcc/testsuite/gcc.dg/tree-ssa/pr46556-4.c
===
--- gcc/testsuite/gcc.dg/tree-ssa/pr46556-4.c   (revision 0)
+++ gcc/testsuite/gcc.dg/tree-ssa/pr46556-4.c   (revision 0)
@@ -0,0 +1,28 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-rtl-expand" } */
+
+struct x
+{
+  int a[16];
+  int b[16];
+  int c[16];
+};
+
+extern void foo (int, int, int);
+
+void
+f (struct x *p, unsigned int n)
+{
+  foo (*((int *)p + n*4), *((int *)p + 32 + n*4), *((int *)p + 16 + n*4));
+  if (n > 3)
+{
+  foo (*((int *)p + n*4), *((int *)p + 32 + n*4), *

Re: [PATCH] Fix PR46556 (poor address generation)

2011-10-21 Thread William J. Schmidt


On Fri, 2011-10-21 at 11:26 +0200, Richard Guenther wrote:
> On Tue, Oct 18, 2011 at 4:14 PM, William J. Schmidt
>  wrote:



> > +
> > +  /* We don't use get_def_for_expr for S1 because TER doesn't forward
> > + S1 in some situations where this transform is useful, such as
> > + when S1 is the base of two MEM_REFs fitting the pattern.  */
> > +  s1_stmt = SSA_NAME_DEF_STMT (TREE_OPERAND (exp, 0));
> 
> You can't do this - this will possibly generate wrong code.  You _do_
> have to use get_def_for_expr.  Or do it when we are still in "true" SSA 
> form...
> 
> Richard.
> 

OK.  get_def_for_expr always returns NULL here for the cases I was
targeting, so doing this in expand isn't going to be helpful.

Rather than cram this in somewhere else upstream, it might be better to
just wait and let this case be handled by the new strength reduction
pass.  This is one of the easy cases with explicit multiplies in the
instruction stream, so it shouldn't require any special handling there.
Seem reasonable?

Bill



Re: [PATCH] Fix PR46556 (poor address generation)

2011-10-24 Thread William J. Schmidt
OK, I've removed the pointer-arithmetic case from expand, to be handled
later by straight-line strength reduction.  Here's the patch to deal
with just the specific pattern of PR46556 (which will also eventually be
handled by strength reduction, but not as quickly).

(FYI, I've been thinking through the strength reduction pass, and my
plan is to stage in some of the easiest cases first, hopefully for 4.7,
and gradually add the more complex pieces.  Explicit multiplies in the
IL with known constants can be done pretty easily.  More complexity is
added when the multiplier is a variable, when conditional increments are
present, and when multiplies are hidden in addressing expressions.)
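
As a concrete sketch of the easiest case (illustrative only, not code
from the patch; use() is a placeholder):

extern void use (int, int, int);

void
f (int n)
{
  int a = n * 4;        /* first multiply establishes a basis */
  int b = (n + 1) * 4;  /* = a + 4 after strength reduction   */
  int c = (n + 2) * 4;  /* = a + 8 after strength reduction   */
  use (a, b, c);
}

Since (n + k) * 4 differs from n * 4 by the compile-time constant
4 * k, the later multiplies can be rewritten as additions off the
first result.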

The present patch was bootstrapped and regression-tested on
powerpc64-linux.  OK for trunk?

Thanks,
Bill


2011-10-24  Bill Schmidt  

gcc:

PR rtl-optimization/46556
* expr.c (restructure_base_and_offset): New function.
(expand_expr_real_1): Replace result of get_inner_reference
with result of restructure_base_and_offset when applicable.
* Makefile.in (expr.o): Update dependencies.

gcc/testsuite:

PR rtl-optimization/46556
* gcc.dg/tree-ssa/pr46556-1.c: New testcase.
* gcc.dg/tree-ssa/pr46556-2.c: Likewise.
* gcc.dg/tree-ssa/pr46556-3.c: Likewise.


Index: gcc/testsuite/gcc.dg/tree-ssa/pr46556-1.c
===
--- gcc/testsuite/gcc.dg/tree-ssa/pr46556-1.c   (revision 0)
+++ gcc/testsuite/gcc.dg/tree-ssa/pr46556-1.c   (revision 0)
@@ -0,0 +1,22 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-rtl-expand" } */
+
+struct x
+{
+  int a[16];
+  int b[16];
+  int c[16];
+};
+
+extern void foo (int, int, int);
+
+void
+f (struct x *p, unsigned int n)
+{
+  foo (p->a[n], p->c[n], p->b[n]);
+}
+
+/* { dg-final { scan-rtl-dump-times "\\(mem/s:SI \\(plus:" 2 "expand" } } */
+/* { dg-final { scan-rtl-dump-times "const_int 128" 1 "expand" } } */
+/* { dg-final { scan-rtl-dump-times "const_int 64 \\\[0x40\\\]\\)\\) \\\[" 1 "expand" } } */
+/* { dg-final { cleanup-rtl-dump "expand" } } */
Index: gcc/testsuite/gcc.dg/tree-ssa/pr46556-2.c
===
--- gcc/testsuite/gcc.dg/tree-ssa/pr46556-2.c   (revision 0)
+++ gcc/testsuite/gcc.dg/tree-ssa/pr46556-2.c   (revision 0)
@@ -0,0 +1,26 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-rtl-expand" } */
+
+struct x
+{
+  int a[16];
+  int b[16];
+  int c[16];
+};
+
+extern void foo (int, int, int);
+
+void
+f (struct x *p, unsigned int n)
+{
+  foo (p->a[n], p->c[n], p->b[n]);
+  if (n > 12)
+foo (p->a[n], p->c[n], p->b[n]);
+  else if (n > 3)
+foo (p->b[n], p->a[n], p->c[n]);
+}
+
+/* { dg-final { scan-rtl-dump-times "\\(mem/s:SI \\(plus:" 6 "expand" } } */
+/* { dg-final { scan-rtl-dump-times "const_int 128" 3 "expand" } } */
+/* { dg-final { scan-rtl-dump-times "const_int 64 \\\[0x40\\\]\\)\\) \\\[" 3 "expand" } } */
+/* { dg-final { cleanup-rtl-dump "expand" } } */
Index: gcc/testsuite/gcc.dg/tree-ssa/pr46556-3.c
===
--- gcc/testsuite/gcc.dg/tree-ssa/pr46556-3.c   (revision 0)
+++ gcc/testsuite/gcc.dg/tree-ssa/pr46556-3.c   (revision 0)
@@ -0,0 +1,27 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-rtl-expand" } */
+struct x
+{
+  int a[16];
+  int b[16];
+  int c[16];
+};
+
+extern void foo (int, int, int);
+
+void
+f (struct x *p, unsigned int n)
+{
+  foo (p->a[n], p->c[n], p->b[n]);
+  if (n > 3)
+{
+  foo (p->a[n], p->c[n], p->b[n]);
+  if (n > 12)
+   foo (p->b[n], p->a[n], p->c[n]);
+}
+}
+
+/* { dg-final { scan-rtl-dump-times "\\(mem/s:SI \\(plus:" 6 "expand" } } */
+/* { dg-final { scan-rtl-dump-times "const_int 128" 3 "expand" } } */
+/* { dg-final { scan-rtl-dump-times "const_int 64 \\\[0x40\\\]\\)\\) \\\[" 3 "expand" } } */
+/* { dg-final { cleanup-rtl-dump "expand" } } */
Index: gcc/expr.c
===
--- gcc/expr.c  (revision 180378)
+++ gcc/expr.c  (working copy)
@@ -56,6 +56,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "ssaexpand.h"
 #include "target-globals.h"
 #include "params.h"
+#include "tree-pretty-print.h"
 
 /* Decide whether a function's arguments should be processed
from first to last or from last to first.
@@ -7648,7 +7649,66 @@ expand_constructor (tree exp, rtx target, enum exp
   return target;
 }
 
+/* Given BASE, OFFSET, and BITPOS derived from EXPR, determine whether
+   there is a profitable opportunity to restructure address arithmetic
+   within BASE and OFFSET.  If so, produce such a restructuring and
+   return it.  */
+/* TODO: This belongs more properly in a separate pass that performs
+   general strength reduction on straight-line code.  Eventually move
+   this there.  */
 
+static tree
+restructure_base_and_offset (tree expr, tree base, tree offset,
+

[PATCH] Straight-line strength reduction, stage 1

2011-10-30 Thread William J. Schmidt
Greetings,

IVOPTS handles strength reduction of induction variables, but GCC does
not currently perform strength reduction in straight-line code.  This
has been noted before in PR22586 and PR35308.  PR46556 is also a case
that could be handled with strength reduction.  This patch adds a pass
to perform strength reduction along dominator paths on the easiest
cases, where replacements are obviously profitable.
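
For concreteness, here is a minimal sketch of the sort of code the pass
targets (it mirrors the slsr-1.c test below; the rewrite described in
the comments assumes the pass fires):

extern void foo (int);

void
f (int *p, unsigned int n)
{
  foo (*(p + n * 4));       /* the multiply n * 4 (scaled to bytes)
                               establishes a basis                   */
  foo (*(p + 32 + n * 4));  /* same basis; reduced to the first
                               address plus 128 bytes, so the second
                               multiply disappears                   */
}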

My intent is to add subsequent installments to handle more involved
cases, as described in the code commentary.  The cases not yet covered
will require target-specific cost analysis to determine profitability.
It is likely that this will leverage some of the cost function
infrastructure in tree-ssa-loop-ivopts.c.

I've bootstrapped and tested the code on powerpc64-linux with no
regressions.  I've also run SPEC CPU2006 to compare results.  32-bit
PowerPC gains about 1% on integer code.  Other results are in the noise
range.  64-bit integer code would have also shown gains, except for one
bad result (400.perlbench degraded 4%).  Initial analysis shows that
very different control flow is created for regmatch with the patch than
without.  I will be investigating further this week, but presumably some
touchy threshold was no longer met for a downstream optimization --
probably an indirect effect.

My hope is to commit this first stage as part of 4.7, with the remainder
to follow in 4.8.

Thanks for your consideration,

Bill


2011-10-30  Bill Schmidt  

gcc:

* tree-pass.h (pass_strength_reduction): New declaration.
* timevar.def (TV_TREE_SLSR): New time-var.
* tree-ssa-strength-reduction.c: New file.
* Makefile.in: New dependences.
* passes.c (init_optimization_passes): Add new pass.

gcc/testsuite:

* gcc.dg/tree-ssa/slsr-1.c: New test case.
* gcc.dg/tree-ssa/slsr-2.c: New test case.
* gcc.dg/tree-ssa/slsr-3.c: New test case.
* gcc.dg/tree-ssa/slsr-4.c: New test case.


Index: gcc/tree-pass.h
===
--- gcc/tree-pass.h (revision 180617)
+++ gcc/tree-pass.h (working copy)
@@ -449,6 +449,7 @@ extern struct gimple_opt_pass pass_tracer;
 extern struct gimple_opt_pass pass_warn_unused_result;
 extern struct gimple_opt_pass pass_split_functions;
 extern struct gimple_opt_pass pass_feedback_split_functions;
+extern struct gimple_opt_pass pass_strength_reduction;
 
 /* IPA Passes */
 extern struct simple_ipa_opt_pass pass_ipa_lower_emutls;
Index: gcc/testsuite/gcc.dg/tree-ssa/slsr-1.c
===
--- gcc/testsuite/gcc.dg/tree-ssa/slsr-1.c  (revision 0)
+++ gcc/testsuite/gcc.dg/tree-ssa/slsr-1.c  (revision 0)
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -fdump-tree-optimized" } */
+
+extern void foo (int);
+
+void
+f (int *p, unsigned int n)
+{
+  foo (*(p + n * 4));
+  foo (*(p + 32 + n * 4));
+  if (n > 3)
+foo (*(p + 16 + n * 4));
+  else
+foo (*(p + 48 + n * 4));
+}
+
+/* { dg-final { scan-tree-dump-times "\\+ 128" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "\\+ 64" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "\\+ 192" 1 "optimized" } } */
+/* { dg-final { cleanup-tree-dump "optimized" } } */
Index: gcc/testsuite/gcc.dg/tree-ssa/slsr-2.c
===
--- gcc/testsuite/gcc.dg/tree-ssa/slsr-2.c  (revision 0)
+++ gcc/testsuite/gcc.dg/tree-ssa/slsr-2.c  (revision 0)
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -fdump-tree-optimized" } */
+
+extern void foo (int);
+
+void
+f (int *p, int n)
+{
+  foo (*(p + n++ * 4));
+  foo (*(p + 32 + n++ * 4));
+  foo (*(p + 16 + n * 4));
+}
+
+/* { dg-final { scan-tree-dump-times "\\+ 144" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "\\+ 96" 1 "optimized" } } */
+/* { dg-final { cleanup-tree-dump "optimized" } } */
Index: gcc/testsuite/gcc.dg/tree-ssa/slsr-3.c
===
--- gcc/testsuite/gcc.dg/tree-ssa/slsr-3.c  (revision 0)
+++ gcc/testsuite/gcc.dg/tree-ssa/slsr-3.c  (revision 0)
@@ -0,0 +1,22 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -fdump-tree-optimized" } */
+
+int
+foo (int a[], int b[], int i)
+{
+  a[i] = b[i] + 2;
+  i++;
+  a[i] = b[i] + 2;
+  i++;
+  a[i] = b[i] + 2;
+  i++;
+  a[i] = b[i] + 2;
+  i++;
+  return i;
+}
+
+/* { dg-final { scan-tree-dump-times "\\* 4" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "\\+ 4" 2 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "\\+ 8" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "\\+ 12" 1 "optimized" } } */
+/* { dg-final { cleanup-tree-dump "optimized" } } */
Index: gcc/testsuite/gcc.dg/tree-ssa/slsr-4.c
===
--- gcc/testsuite/gcc.dg/tree-ssa/slsr-4.c  (revision 0)
+++ gcc/testsuite/gcc.dg/tree-

[PING] Re: [PATCH] Fix PR46556 (poor address generation)

2011-10-30 Thread William J. Schmidt
Ping.

On Mon, 2011-10-24 at 08:38 -0500, William J. Schmidt wrote:
> OK, I've removed the pointer-arithmetic case from expand, to be handled
> later by straight-line strength reduction.  Here's the patch to deal
> with just the specific pattern of PR46556 (which will also eventually be
> handled by strength reduction, but not as quickly).
> 
> (FYI, I've been thinking through the strength reduction pass, and my
> plan is to stage in some of the easiest cases first, hopefully for 4.7,
> and gradually add the more complex pieces.  Explicit multiplies in the
> IL with known constants can be done pretty easily.  More complexity is
> added when the multiplier is a variable, when conditional increments are
> present, and when multiplies are hidden in addressing expressions.)
> 
> The present patch was bootstrapped and regression-tested on
> powerpc64-linux.  OK for trunk?
> 
> Thanks,
> Bill
> 
> 
> 2011-10-24  Bill Schmidt  
> 
> gcc:
> 
>   PR rtl-optimization/46556
>   * expr.c (restructure_base_and_offset): New function.
>   (expand_expr_real_1): Replace result of get_inner_reference
>   with result of restructure_base_and_offset when applicable.
>   * Makefile.in (expr.o): Update dependencies.
> 
> gcc/testsuite:
> 
>   PR rtl-optimization/46556
>   * gcc.dg/tree-ssa/pr46556-1.c: New testcase.
>   * gcc.dg/tree-ssa/pr46556-2.c: Likewise.
>   * gcc.dg/tree-ssa/pr46556-3.c: Likewise.
> 
> 
> Index: gcc/testsuite/gcc.dg/tree-ssa/pr46556-1.c
> ===
> --- gcc/testsuite/gcc.dg/tree-ssa/pr46556-1.c (revision 0)
> +++ gcc/testsuite/gcc.dg/tree-ssa/pr46556-1.c (revision 0)
> @@ -0,0 +1,22 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-rtl-expand" } */
> +
> +struct x
> +{
> +  int a[16];
> +  int b[16];
> +  int c[16];
> +};
> +
> +extern void foo (int, int, int);
> +
> +void
> +f (struct x *p, unsigned int n)
> +{
> +  foo (p->a[n], p->c[n], p->b[n]);
> +}
> +
> +/* { dg-final { scan-rtl-dump-times "\\(mem/s:SI \\(plus:" 2 "expand" } } */
> +/* { dg-final { scan-rtl-dump-times "const_int 128" 1 "expand" } } */
> +/* { dg-final { scan-rtl-dump-times "const_int 64 \\\[0x40\\\]\\)\\) \\\[" 1 "expand" } } */
> +/* { dg-final { cleanup-rtl-dump "expand" } } */
> Index: gcc/testsuite/gcc.dg/tree-ssa/pr46556-2.c
> ===
> --- gcc/testsuite/gcc.dg/tree-ssa/pr46556-2.c (revision 0)
> +++ gcc/testsuite/gcc.dg/tree-ssa/pr46556-2.c (revision 0)
> @@ -0,0 +1,26 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-rtl-expand" } */
> +
> +struct x
> +{
> +  int a[16];
> +  int b[16];
> +  int c[16];
> +};
> +
> +extern void foo (int, int, int);
> +
> +void
> +f (struct x *p, unsigned int n)
> +{
> +  foo (p->a[n], p->c[n], p->b[n]);
> +  if (n > 12)
> +foo (p->a[n], p->c[n], p->b[n]);
> +  else if (n > 3)
> +foo (p->b[n], p->a[n], p->c[n]);
> +}
> +
> +/* { dg-final { scan-rtl-dump-times "\\(mem/s:SI \\(plus:" 6 "expand" } } */
> +/* { dg-final { scan-rtl-dump-times "const_int 128" 3 "expand" } } */
> +/* { dg-final { scan-rtl-dump-times "const_int 64 \\\[0x40\\\]\\)\\) \\\[" 3 "expand" } } */
> +/* { dg-final { cleanup-rtl-dump "expand" } } */
> Index: gcc/testsuite/gcc.dg/tree-ssa/pr46556-3.c
> ===
> --- gcc/testsuite/gcc.dg/tree-ssa/pr46556-3.c (revision 0)
> +++ gcc/testsuite/gcc.dg/tree-ssa/pr46556-3.c (revision 0)
> @@ -0,0 +1,27 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-rtl-expand" } */
> +struct x
> +{
> +  int a[16];
> +  int b[16];
> +  int c[16];
> +};
> +
> +extern void foo (int, int, int);
> +
> +void
> +f (struct x *p, unsigned int n)
> +{
> +  foo (p->a[n], p->c[n], p->b[n]);
> +  if (n > 3)
> +{
> +  foo (p->a[n], p->c[n], p->b[n]);
> +  if (n > 12)
> + foo (p->b[n], p->a[n], p->c[n]);
> +}
> +}
> +
> +/* { dg-final { scan-rtl-dump-times "\\(mem/s:SI \\(plus:" 6 "expand" } } */
> +/* { dg-final { scan-rtl-dump-times "const_int 128" 3 "expand" } } */
> +/* { dg-final { scan-rtl-dump-times "const_int 64 \\\[0x40\\\]\\)\\) \\\[" 3 "expand" } } */
> +/* { dg-final { cleanup-rtl-dump "expand" } } */

Re: [PING] Re: [PATCH] Fix PR46556 (poor address generation)

2011-11-02 Thread William J. Schmidt
On Wed, 2011-11-02 at 12:55 +0100, Richard Guenther wrote:
> On Sun, 30 Oct 2011, William J. Schmidt wrote:
> 
> > Ping.
> > 
> > On Mon, 2011-10-24 at 08:38 -0500, William J. Schmidt wrote:
> > > OK, I've removed the pointer-arithmetic case from expand, to be handled
> > > later by straight-line strength reduction.  Here's the patch to deal
> > > with just the specific pattern of PR46556 (which will also eventually be
> > > handled by strength reduction, but not as quickly).
> > > 
> > > (FYI, I've been thinking through the strength reduction pass, and my
> > > plan is to stage in some of the easiest cases first, hopefully for 4.7,
> > > and gradually add the more complex pieces.  Explicit multiplies in the
> > > IL with known constants can be done pretty easily.  More complexity is
> > > added when the multiplier is a variable, when conditional increments are
> > > present, and when multiplies are hidden in addressing expressions.)
> > > 
> > > The present patch was bootstrapped and regression-tested on
> > > powerpc64-linux.  OK for trunk?
> 
> Hmmm ...
> 
> > > Thanks,
> > > Bill
> > > 
> > > 
> > > 2011-10-24  Bill Schmidt  
> > > 
> > > gcc:
> > > 
> > >   PR rtl-optimization/46556
> > >   * expr.c (restructure_base_and_offset): New function.
> > >   (expand_expr_real_1): Replace result of get_inner_reference
> > >   with result of restructure_base_and_offset when applicable.
> > >   * Makefile.in (expr.o): Update dependencies.
> > > 
> > > gcc/testsuite:
> > > 
> > >   PR rtl-optimization/46556
> > >   * gcc.dg/tree-ssa/pr46556-1.c: New testcase.
> > >   * gcc.dg/tree-ssa/pr46556-2.c: Likewise.
> > >   * gcc.dg/tree-ssa/pr46556-3.c: Likewise.
> > > 
> > > 
> > > Index: gcc/testsuite/gcc.dg/tree-ssa/pr46556-1.c
> > > ===
> > > --- gcc/testsuite/gcc.dg/tree-ssa/pr46556-1.c (revision 0)
> > > +++ gcc/testsuite/gcc.dg/tree-ssa/pr46556-1.c (revision 0)
> > > @@ -0,0 +1,22 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2 -fdump-rtl-expand" } */
> > > +
> > > +struct x
> > > +{
> > > +  int a[16];
> > > +  int b[16];
> > > +  int c[16];
> > > +};
> > > +
> > > +extern void foo (int, int, int);
> > > +
> > > +void
> > > +f (struct x *p, unsigned int n)
> > > +{
> > > +  foo (p->a[n], p->c[n], p->b[n]);
> > > +}
> > > +
> > > +/* { dg-final { scan-rtl-dump-times "\\(mem/s:SI \\(plus:" 2 "expand" } } */
> > > +/* { dg-final { scan-rtl-dump-times "const_int 128" 1 "expand" } } */
> > > +/* { dg-final { scan-rtl-dump-times "const_int 64 \\\[0x40\\\]\\)\\) \\\[" 1 "expand" } } */
> > > +/* { dg-final { cleanup-rtl-dump "expand" } } */
> > > Index: gcc/testsuite/gcc.dg/tree-ssa/pr46556-2.c
> > > ===
> > > --- gcc/testsuite/gcc.dg/tree-ssa/pr46556-2.c (revision 0)
> > > +++ gcc/testsuite/gcc.dg/tree-ssa/pr46556-2.c (revision 0)
> > > @@ -0,0 +1,26 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2 -fdump-rtl-expand" } */
> > > +
> > > +struct x
> > > +{
> > > +  int a[16];
> > > +  int b[16];
> > > +  int c[16];
> > > +};
> > > +
> > > +extern void foo (int, int, int);
> > > +
> > > +void
> > > +f (struct x *p, unsigned int n)
> > > +{
> > > +  foo (p->a[n], p->c[n], p->b[n]);
> > > +  if (n > 12)
> > > +foo (p->a[n], p->c[n], p->b[n]);
> > > +  else if (n > 3)
> > > +foo (p->b[n], p->a[n], p->c[n]);
> > > +}
> > > +
> > > +/* { dg-final { scan-rtl-dump-times "\\(mem/s:SI \\(plus:" 6 "expand" } } */
> > > +/* { dg-final { scan-rtl-dump-times "const_int 128" 3 "expand" } } */
> > > +/* { dg-final { scan-rtl-dump-times "const_int 64 \\\[0x40\\\]\\)\\) \\\[" 3 "expand" } } */
> > > +/* { dg-final { cleanup-rtl-dump "expand" } } */
> > > Index: g

Re: [PATCH] Straight-line strength reduction, stage 1

2011-11-04 Thread William J. Schmidt
Richard, thanks for the quick reply!  I realize there's a lot of traffic
on gcc-patches right now, so I appreciate your time.

On Fri, 2011-11-04 at 14:55 +0100, Richard Guenther wrote:
> On Sun, Oct 30, 2011 at 5:10 PM, William J. Schmidt
>  wrote:
> > Greetings,
> >
> > IVOPTS handles strength reduction of induction variables, but GCC does
> > not currently perform strength reduction in straight-line code.  This
> > has been noted before in PR22586 and PR35308.  PR46556 is also a case
> > that could be handled with strength reduction.  This patch adds a pass
> > to perform strength reduction along dominator paths on the easiest
> > cases, where replacements are obviously profitable.
> >
> > My intent is to add subsequent installments to handle more involved
> > cases, as described in the code commentary.  The cases not yet covered
> > will require target-specific cost analysis to determine profitability.
> > It is likely that this will leverage some of the cost function
> > infrastructure in tree-ssa-loop-ivopts.c.
> >
> > I've bootstrapped and tested the code on powerpc64-linux with no
> > regressions.  I've also run SPEC CPU2006 to compare results.  32-bit
> > PowerPC gains about 1% on integer code.  Other results are in the noise
> > range.  64-bit integer code would have also shown gains, except for one
> > bad result (400.perlbench degraded 4%).  Initial analysis shows that
> > very different control flow is created for regmatch with the patch than
> > without.  I will be investigating further this week, but presumably some
> > touchy threshold was no longer met for a downstream optimization --
> > probably an indirect effect.
> >
> > My hope is to commit this first stage as part of 4.7, with the remainder
> > to follow in 4.8.
> >
> > Thanks for your consideration,
> 
> I've had a quick look for now and noted two things
> 
> 1) you try to handle casting already - for the specific patterns, it seems
> the constraints are the same as for detecting when using a widening
> multiplication/add is possible?  I think we should have some common
> code for the legality checks.

I took a quick look at the logic in the widening mult/add code this
morning.  I don't think I can use the same logic, since it doesn't
appear to constrain a widening operation on unsigned when wrap semantics
might be in play.  I found cases in libiberty (IIRC) where this breaks
code.

The issue is for code like:

unsigned a;
long unsigned b, c;
b1 = (long unsigned) a1;
c1 = b1 * 10;
a2 = a + 1; /* may wrap */
b2 = (long unsigned) a2;
c2 = b2 * 10;

We'd like to replace the last statement with

c2 = c1 + 10;

but this isn't correct when a is max(unsigned).  I am not sure why the
widening mult/add code wouldn't have similar concerns.  Perhaps I missed
something there.
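
The hazard can be observed with a self-contained example (illustrative
only; it assumes an LP64 target where long is wider than int -- on
ILP32 the two results happen to coincide, which is exactly why the
relative type sizes matter):

#include <limits.h>
#include <stdio.h>

int
main (void)
{
  unsigned a = UINT_MAX;
  unsigned long b1 = (unsigned long) a;
  unsigned long c1 = b1 * 10;
  unsigned a2 = a + 1;                 /* wraps to 0 */
  unsigned long b2 = (unsigned long) a2;
  unsigned long c2 = b2 * 10;          /* 0 */
  unsigned long c2_bad = c1 + 10;      /* 42949672960 on LP64 */
  printf ("%lu vs %lu\n", c2, c2_bad);
  return 0;
}

Here c2_bad is the value the proposed replacement c2 = c1 + 10 would
compute, which differs from the correct c2 once the addition wraps.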

> 
> 2) you do not handle POINTER_PLUS_EXPR - which looks most
> interesting for the pure pointer arithmetic case.

For the patterns I'm looking at here, the PLUS_EXPR is creating an
integer index that feeds an integer multiply, which doesn't make sense
for addresses.  I think the issue here is that I am handling one set of
cases and you are thinking about some others that I need to pick up as
well.  More on this below...

>   (and you do not
> treat PLUS_EXPR as commutative, thus a + b*c you handle but
> not b*c + a(?), for a*b + c*d we'd have two candidates)

The cases I'm looking at in this first patch are of the form (x + c) *
d, where x is an SSA name and c and d are constants.  I was under the
impression that I could expect the constant to always be the second RHS
operand for an add of this form; if this is not the case, then I do need
to handle the commutativity of the operands.  (I thought this was the
case for multiplies as well, but I just ran across a case where the
vectorizer produced x = 10 * y, so I know now that I can't count on it
there.)
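
For reference, the exact shape in question as a compilable sketch
(sink() is a placeholder, not from the patch):

extern void sink (int, int);

void
f (int x)
{
  int y1 = (x + 4) * 10;  /* y1 = 10 * x + 40      */
  int y2 = (x + 6) * 10;  /* = y1 + 20, basis y1   */
  sink (y1, y2);
}

If canonicalization cannot be relied on, the second statement might
equally arrive as y2 = 10 * (x + 6), so the commutative form would
need handling as well.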

> 
> I'm not sure I follow the dominance checks - usually instead of
> 
> + || !dominated_by_p (CDI_DOMINATORS,
> + gimple_bb (c->cand_stmt),
> + gimple_bb (use_stmt)))
> 
> you'd special case gimple_bb (c->cand_stmt) == gimple_bb (use_stmt)
> and have stmt uids to resolve dominance inside a basic block.

The statement uids aren't needed because of the partial ordering
established on the candidate list.  If one candidate dominates another,
it will have a lower candidate number.  This allows me to avoid a pass
to assign stmt uids.

The first time this routine is called for a candidate, it isn't yet in
the table; all candidates in the table that may serve as it

Re: [PATCH] Straight-line strength reduction, stage 1

2011-11-05 Thread William J. Schmidt
On Fri, 2011-11-04 at 14:55 +0100, Richard Guenther wrote:
> On Sun, Oct 30, 2011 at 5:10 PM, William J. Schmidt
>  wrote:
> > 
> 
> You do not seem to transform stmts on-the-fly, so for
> 
> a1 = c + d;
> a2 = c + 2*d;
> a3 = c + 3*d;
> 
> are you able to generate
> 
> a1 = c + d;
> a2 = a1 + d;
> a3 = a2 + d;
> 
> ?  On-the-fly operation would also handle this if the candidate info
> for a2 is kept as c + 2*d.  Though it's probably worth lookign at
> 
> a1 = c + d;
> a2 = a1 + d;
> a3 = c + 3*d;
> 
> and see if that still figures out that a3 = a2 + d (thus, do you,
> when you find the candidate a1 + 1 * d, fold in candidate information
> for a1?  thus, try to reduce it to an ultimate base?)
> 
> Thanks,
> Richard.

Just a couple of quick thoughts here.

As I mentioned, this patch is only for the cases where the stride is a
constant.  The only interesting patterns I could think of for that case
is what I'm currently handling, where an add-immediate feeds a multiply,
e.g., y = (x + c) * d where c and d are constants.

Once the stride is a variable, we have not only those cases, but also
cases like you show here where the multiply feeds into an add.  Those
can be handled with the existing infrastructure in a slightly different
way.  The main differences are:

 - The cand_stmt will be the add in this case.  We always want the
candidate to be the statement that we hope to replace.

 - The base_name will be the "ultimate base," so that all the original
candidates in your example will have c for the base.  This may involve
looking back through casts.

 - The index will be the multiplier applied to the stride.

The logic for finding the nearest dominating basis will be pretty much
identical.  

The candidate table again contains enough information that we don't need
to do on-the-fly replacement, but can examine all the related candidates
at once.  This will be important for the add-feeding-multiply case with
an SSA name stride, since we sometimes need to introduce multiplies by a
constant in order to remove general multiplies of two registers.
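
A compilable version of the multiply-feeds-add case under discussion
(use() is a placeholder; the comments show the candidate bookkeeping
sketched above, assuming an ultimate base of c):

extern void use (long, long, long);

void
f (long c, long d)
{
  long a1 = c + d;      /* base c, index 1, stride d      */
  long a2 = c + 2 * d;  /* base c, index 2: a2 = a1 + d   */
  long a3 = c + 3 * d;  /* base c, index 3: a3 = a2 + d   */
  use (a1, a2, a3);
}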

But again, that's all for a follow-on patch.  My thought was to get this
one set of easy candidates handled in a first patch so you could get a
look at the general infrastructure without having to review a huge chunk
of code at once.  Once that patch is in place, the next stages would be:

2. SSA-name strides, both multiply-add and add-multiply forms.
3. Cases involving conditional increments (looking through PHIs).
4. Cases where the multiplies are hidden in addressing expressions.

I have a pretty good idea where I'm going with stages 2 and 3.  Stage 4
is where things are likely to get a bit bloodier, and I will be glad for
any advice about the best way to handle those as we get to that point.

Thanks again,
Bill



Re: [PATCH] Strength reduction part 3 of 4: candidates with unknown strides

2012-08-08 Thread William J. Schmidt
On Wed, 2012-08-08 at 15:35 -0700, Janis Johnson wrote:
> On 08/08/2012 03:27 PM, Andrew Pinski wrote:
> > On Wed, Aug 8, 2012 at 3:25 PM, H.J. Lu  wrote:
> >> On Wed, Aug 1, 2012 at 10:36 AM, William J. Schmidt
> >>  wrote:
> >>> Greetings,
> >>>
> >>> Thanks for the review of part 2!  Here's another chunk of the SLSR code
> >>> (I feel I owe you a few beers at this point).  This performs analysis
> >>> and replacement on groups of related candidates having an SSA name
> >>> (rather than a constant) for a stride.
> >>>
> >>> This leaves only the conditional increment (CAND_PHI) case, which will
> >>> be handled in the last patch of the series.
> >>>
> >>> Bootstrapped and tested on powerpc64-unknown-linux-gnu with no new
> >>> regressions.  Ok for trunk?
> >>>
> >>> Thanks,
> >>> Bill
> >>>
> >>>
> >>> gcc:
> >>>
> >>> 2012-08-01  Bill Schmidt  
> >>>
> >>> * gimple-ssa-strength-reduction.c (struct incr_info_d): New struct.
> >>> (incr_vec): New static var.
> >>> (incr_vec_len): Likewise.
> >>> (address_arithmetic_p): Likewise.
> >>> (stmt_cost): Remove dead assignment.
> >>> (dump_incr_vec): New function.
> >>> (cand_abs_increment): Likewise.
> >>> (lazy_create_slsr_reg): Likewise.
> >>> (incr_vec_index): Likewise.
> >>> (count_candidates): Likewise.
> >>> (record_increment): Likewise.
> >>> (record_increments): Likewise.
> >>> (unreplaced_cand_in_tree): Likewise.
> >>> (optimize_cands_for_speed_p): Likewise.
> >>> (lowest_cost_path): Likewise.
> >>> (total_savings): Likewise.
> >>> (analyze_increments): Likewise.
> >>> (ncd_for_two_cands): Likewise.
> >>> (nearest_common_dominator_for_cands): Likewise.
> >>> (profitable_increment_p): Likewise.
> >>> (insert_initializers): Likewise.
> >>> (introduce_cast_before_cand): Likewise.
> >>> (replace_rhs_if_not_dup): Likewise.
> >>> (replace_one_candidate): Likewise.
> >>> (replace_profitable_candidates): Likewise.
> >>> (analyze_candidates_and_replace): Handle candidates with SSA-name
> >>> strides.
> >>>
> >>> gcc/testsuite:
> >>>
> >>> 2012-08-01  Bill Schmidt  
> >>>
> >>> * gcc.dg/tree-ssa/slsr-5.c: New.
> >>> * gcc.dg/tree-ssa/slsr-6.c: New.
> >>> * gcc.dg/tree-ssa/slsr-7.c: New.
> >>> * gcc.dg/tree-ssa/slsr-8.c: New.
> >>> * gcc.dg/tree-ssa/slsr-9.c: New.
> >>> * gcc.dg/tree-ssa/slsr-10.c: New.
> >>> * gcc.dg/tree-ssa/slsr-11.c: New.
> >>> * gcc.dg/tree-ssa/slsr-12.c: New.
> >>> * gcc.dg/tree-ssa/slsr-13.c: New.
> >>> * gcc.dg/tree-ssa/slsr-14.c: New.
> >>> * gcc.dg/tree-ssa/slsr-15.c: New.
> >>> * gcc.dg/tree-ssa/slsr-16.c: New.
> >>> * gcc.dg/tree-ssa/slsr-17.c: New.
> >>> * gcc.dg/tree-ssa/slsr-18.c: New.
> >>> * gcc.dg/tree-ssa/slsr-19.c: New.
> >>> * gcc.dg/tree-ssa/slsr-20.c: New.
> >>> * gcc.dg/tree-ssa/slsr-21.c: New.
> >>> * gcc.dg/tree-ssa/slsr-22.c: New.
> >>> * gcc.dg/tree-ssa/slsr-23.c: New.
> >>> * gcc.dg/tree-ssa/slsr-24.c: New.
> >>> * gcc.dg/tree-ssa/slsr-25.c: New.
> >>> * gcc.dg/tree-ssa/slsr-26.c: New.
> >>> * gcc.dg/tree-ssa/slsr-30.c: New.
> >>> * gcc.dg/tree-ssa/slsr-31.c: New.
> >>>
> >>>
> >> ==
> >>> --- gcc/testsuite/gcc.dg/tree-ssa/slsr-30.c (revision 0)
> >>> +++ gcc/testsuite/gcc.dg/tree-ssa/slsr-30.c (revision 0)
> >>> @@ -0,0 +1,25 @@
> >>> +/* Verify straight-line strength reduction fails for simple integer addition
> >>> +   with casts thrown in when -fwrapv is used.  */
> >>> +
> >>> +/* { dg-do compile } */
> >>> +/* { dg-options "-O3 -fdump-tree-dom2 -fwrapv" } 

Re: [PATCH] Strength reduction part 3 of 4: candidates with unknown strides

2012-08-09 Thread William J. Schmidt
On Wed, 2012-08-08 at 19:22 -0700, Janis Johnson wrote:
> On 08/08/2012 06:41 PM, William J. Schmidt wrote:
> > On Wed, 2012-08-08 at 15:35 -0700, Janis Johnson wrote:
> >> On 08/08/2012 03:27 PM, Andrew Pinski wrote:
> >>> On Wed, Aug 8, 2012 at 3:25 PM, H.J. Lu  wrote:
> >>>> On Wed, Aug 1, 2012 at 10:36 AM, William J. Schmidt
> >>>>  wrote:
> 
> >>>>> +/* { dg-do compile } */
> >>>>> +/* { dg-options "-O3 -fdump-tree-dom2 -fwrapv" } */
> >>>>> +/* { dg-skip-if "" { ilp32 } { "-m32" } { "" } } */
> >>>>> +
> >>>>
> >>>> This doesn't work on x32 nor Linux/ia32 since -m32
> >>>> may not be needed for ILP32.  This patch works for
> >>>> me.  OK to install?
> >>>
> >>> This also does not work for mips64 where the options are either
> >>> -mabi=32 or -mabi=n32 for ILP32.
> >>>
> >>> HJL's patch looks correct.
> >>>
> >>> Thanks,
> >>> Andrew
> >>
> >> There are GCC targets with 16-bit integers.  What's the actual
> >> set of targets on which this test is meant to run?  There's a list
> >> of effective-target names based on data type sizes in
> >> <http://gcc.gnu.org/onlinedocs/gccint/Effective_002dTarget-Keywords.html#Effective_002dTarget-Keywords>.
> > 
> > Yes, sorry.  The test really is only valid when int and long have
> > different sizes.  So according to that link we should skip ilp32 and
> > llp64 at a minimum.  It isn't clear what we should do for int16 since
> > the size of long isn't specified, so I suppose we should skip that as
> > well.  So, perhaps modify HJ's patch to have
> > 
> > /* { dg-do compile { target { ! { ilp32 llp64 int16 } } } } */
> > 
> > ?
> > 
> > Thanks,
> > Bill
> 
> That's confusing.  Perhaps what you really need is a new effective
> target for "sizeof(int) != sizeof(long)".

Good idea.  I'll work up a patch when I get a moment.

Thanks,
Bill

> 
> Janis
> 



[PATCH] Fix PR54211

2012-08-09 Thread William J. Schmidt
Fix a thinko in strength reduction.  I was checking the type of the
wrong operand to determine whether address arithmetic should be used in
replacing expressions.  This produced a spurious POINTER_PLUS_EXPR when
an address was converted to an unsigned long and back again.
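
The problematic shape reduces to something like this (step() is a
hand-written illustration; the PR's actual reduced testcase appears in
the patch below):

unsigned char *
step (unsigned char *p, unsigned long b)
{
  unsigned long v = (unsigned long) p + b;  /* address as integer */
  return (unsigned char *) v;               /* ...and back again  */
}

The candidate's base expression has pointer type while the candidate's
own type is integral, so consulting the wrong one makes integer
arithmetic look like address arithmetic.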

Bootstrapped and tested on powerpc64-unknown-linux-gnu with no new
regressions.  Ok for trunk?

Thanks,
Bill


gcc:

2012-08-09  Bill Schmidt  

PR middle-end/54211
* gimple-ssa-strength-reduction.c (analyze_candidates_and_replace):
Use cand_type to determine whether pointer arithmetic will be generated.

gcc/testsuite:

2012-08-09  Bill Schmidt  

PR middle-end/54211
* gcc.dg/tree-ssa/pr54211.c: New test.


Index: gcc/testsuite/gcc.dg/tree-ssa/pr54211.c
===
--- gcc/testsuite/gcc.dg/tree-ssa/pr54211.c (revision 0)
+++ gcc/testsuite/gcc.dg/tree-ssa/pr54211.c (revision 0)
@@ -0,0 +1,28 @@
+/* { dg-do compile } */
+/* { dg-options "-Os" } */
+
+int a, b;
+unsigned char e;
+void fn1 ()
+{
+    unsigned char *c = 0;
+    for (;; a++)
+    {
+        unsigned char d = *(c + b);
+        for (; &e < &d; c++)
+            goto Found_Top;
+    }
+Found_Top:
+    if (0)
+        goto Empty_Bitmap;
+    for (;; a++)
+    {
+        unsigned char *e = c + b;
+        for (; c < e; c++)
+            goto Found_Bottom;
+        c -= b;
+    }
+Found_Bottom:
+Empty_Bitmap:
+    ;
+}
Index: gcc/gimple-ssa-strength-reduction.c
===
--- gcc/gimple-ssa-strength-reduction.c (revision 190260)
+++ gcc/gimple-ssa-strength-reduction.c (working copy)
@@ -2534,7 +2534,7 @@ analyze_candidates_and_replace (void)
  /* Determine whether we'll be generating pointer arithmetic
 when replacing candidates.  */
  address_arithmetic_p = (c->kind == CAND_ADD
- && POINTER_TYPE_P (TREE_TYPE (c->base_expr)));
+ && POINTER_TYPE_P (c->cand_type));
 
  /* If all candidates have already been replaced under other
 interpretations, nothing remains to be done.  */




[PATCH, testsuite] New effective target long_neq_int

2012-08-09 Thread William J. Schmidt
As suggested by Janis regarding testsuite/gcc.dg/tree-ssa/slsr-30.c,
this patch adds a new effective target for machines having long and int
of differing sizes.

Tested on powerpc64-unknown-linux-gnu, where the test passes for -m64
and is skipped for -m32.  Ok for trunk?

Thanks,
Bill


doc:

2012-08-09  Bill Schmidt  

* sourcebuild.texi: Document long_neq_int effective target.


testsuite:

2012-08-09  Bill Schmidt  

* lib/target-supports.exp (check_effective_target_long_neq_int): New.
* gcc.dg/tree-ssa/slsr-30.c: Check for long_neq_int effective target.


Index: gcc/doc/sourcebuild.texi
===
--- gcc/doc/sourcebuild.texi (revision 190260)
+++ gcc/doc/sourcebuild.texi (working copy)
@@ -1303,6 +1303,9 @@ Target has @code{int} that is at 32 bits or longer
 @item int16
 Target has @code{int} that is 16 bits or shorter.
 
+@item long_neq_int
+Target has @code{int} and @code{long} with different sizes.
+
 @item large_double
 Target supports @code{double} that is longer than @code{float}.
 
Index: gcc/testsuite/lib/target-supports.exp
===
--- gcc/testsuite/lib/target-supports.exp   (revision 190260)
+++ gcc/testsuite/lib/target-supports.exp   (working copy)
@@ -1689,6 +1689,15 @@ proc check_effective_target_llp64 { } {
 }]
 }
 
+# Return 1 if long and int have different sizes,
+# 0 otherwise.
+
+proc check_effective_target_long_neq_int { } {
+return [check_no_compiler_messages long_ne_int object {
+   int dummy[sizeof (int) != sizeof (long) ? 1 : -1];
+}]
+}
+
 # Return 1 if the target supports long double larger than double,
 # 0 otherwise.
 
Index: gcc/testsuite/gcc.dg/tree-ssa/slsr-30.c
===
--- gcc/testsuite/gcc.dg/tree-ssa/slsr-30.c (revision 190260)
+++ gcc/testsuite/gcc.dg/tree-ssa/slsr-30.c (working copy)
@@ -1,7 +1,7 @@
 /* Verify straight-line strength reduction fails for simple integer addition
with casts thrown in when -fwrapv is used.  */
 
-/* { dg-do compile { target { ! { ilp32 } } } } */
+/* { dg-do compile { target { long_neq_int } } } */
 /* { dg-options "-O3 -fdump-tree-dom2 -fwrapv" } */
 
 long




[PATCH] Fix PR54240

2012-08-14 Thread William J. Schmidt
Replace the once vacuously true, and now vacuously false, test for
existence of a conditional move instruction for a given mode, with one
that actually checks what it's supposed to.  Add a test case so we don't
miss such things in future.

The test is powerpc-specific.  It would be good to have an i386 version
of the test as well, if someone can help with that.

Bootstrapped and tested on powerpc64-unknown-linux-gnu with no new
regressions.  Ok for trunk?

Thanks,
Bill


gcc:

2012-08-13  Bill Schmidt  

PR tree-optimization/54240
* tree-ssa-phiopt.c (hoist_adjacent_loads): Correct test for
existence of conditional move with given mode.


gcc/testsuite:

2012-08-13  Bill Schmidt  

PR tree-optimization/54240
* gcc.target/powerpc/pr54240.c: New test.


Index: gcc/testsuite/gcc.target/powerpc/pr54240.c
===
--- gcc/testsuite/gcc.target/powerpc/pr54240.c  (revision 0)
+++ gcc/testsuite/gcc.target/powerpc/pr54240.c  (revision 0)
@@ -0,0 +1,27 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -misel -fdump-tree-phiopt-details" } */
+
+typedef struct s {
+  int v;
+  int b;
+  struct s *l;
+  struct s *r;
+} S;
+
+
+int foo(S *s)
+{
+  S *this;
+  S *next;
+
+  this = s;
+  if (this->b)
+next = this->l;
+  else
+next = this->r;
+
+  return next->v;
+}
+
+/* { dg-final { scan-tree-dump "Hoisting adjacent loads" "phiopt1" } } */
+/* { dg-final { cleanup-tree-dump "phiopt1" } } */
Index: gcc/tree-ssa-phiopt.c
===
--- gcc/tree-ssa-phiopt.c   (revision 190305)
+++ gcc/tree-ssa-phiopt.c   (working copy)
@@ -1843,7 +1843,8 @@ hoist_adjacent_loads (basic_block bb0, basic_block
 
   /* Check the mode of the arguments to be sure a conditional move
 can be generated for it.  */
-  if (!optab_handler (cmov_optab, TYPE_MODE (TREE_TYPE (arg1))))
+  if (optab_handler (movcc_optab, TYPE_MODE (TREE_TYPE (arg1)))
+      == CODE_FOR_nothing)
continue;
 
   /* Both statements must be assignments whose RHS is a COMPONENT_REF.  */




Re: [PATCH] Fix PR54240

2012-08-14 Thread William J. Schmidt
Thanks, Andrew!

Bill

On Tue, 2012-08-14 at 14:17 -0700, Andrew Pinski wrote:
> On Tue, Aug 14, 2012 at 2:15 PM, Andrew Pinski  wrote:
> > On Tue, Aug 14, 2012 at 2:11 PM, William J. Schmidt
> >  wrote:
> >> Replace the once vacuously true, and now vacuously false, test for
> >> existence of a conditional move instruction for a given mode, with one
> >> that actually checks what it's supposed to.  Add a test case so we don't
> >> miss such things in future.
> >>
> >> The test is powerpc-specific.  It would be good to have an i386 version
> >> of the test as well, if someone can help with that.
> >>
> >> Bootstrapped and tested on powerpc64-unknown-linux-gnu with no new
> >> regressions.  Ok for trunk?
> >
> > Here is one which can go into gcc.target/mips :
> > /* { dg-do compile } */
> > /* { dg-options "-O2 -fdump-tree-phiopt-details" } */
> 
> Sorry the dg-options should be:
>  /* { dg-options "-O2 -fdump-tree-phiopt-details isa>=4" } */
> 
> Thanks,
> Andrew
> 
> >
> > typedef struct s {
> >   int v;
> >   int b;
> >   struct s *l;
> >   struct s *r;
> > } S;
> >
> >
> > int foo(S *s)
> > {
> >   S *this;
> >   S *next;
> >
> >   this = s;
> >   if (this->b)
> > next = this->l;
> >   else
> > next = this->r;
> >
> >   return next->v;
> > }
> >
> > /* { dg-final { scan-tree-dump "Hoisting adjacent loads" "phiopt1" } } */
> > /* { dg-final { cleanup-tree-dump "phiopt1" } } */
> 



[PATCH] Fix PR54245

2012-08-14 Thread William J. Schmidt
Currently we can insert an initializer that performs a multiply in too
small a type for correctness.  For now, detect the problem and avoid
the optimization when this would happen.  Eventually I will fix this up
to cause the multiply to be performed in a sufficiently wide type.
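
The precision hazard in standalone form (illustrative only; assumes a
16-bit short, a 32-bit int, and GCC's modulo behavior for the
narrowing conversion):

#include <stdio.h>

int
main (void)
{
  short s = 20000;
  int wide = (int) s * 10;               /* multiply in int: 200000  */
  short narrow = (short) ((int) s * 10); /* 200000 mod 65536 = 3392  */
  printf ("%d %d\n", wide, narrow);
  return 0;
}

An initializer whose multiply is performed in the narrower type
computes the second value where the original code computed the first.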

Bootstrapped and tested on powerpc64-unknown-linux-gnu with no new
regressions.  Ok for trunk?

Thanks,
Bill


gcc:

2012-08-14  Bill Schmidt  

PR tree-optimization/54245
* gimple-ssa-strength-reduction.c (legal_cast_p_1): New function.
(legal_cast_p): Split out logic to legal_cast_p_1.
(analyze_increments): Avoid introducing multiplies in smaller types.


gcc/testsuite:

2012-08-14  Bill Schmidt  

PR tree-optimization/54245
* gcc.dg/tree-ssa/pr54245.c: New test.


Index: gcc/testsuite/gcc.dg/tree-ssa/pr54245.c
===
--- gcc/testsuite/gcc.dg/tree-ssa/pr54245.c (revision 0)
+++ gcc/testsuite/gcc.dg/tree-ssa/pr54245.c (revision 0)
@@ -0,0 +1,49 @@
+/* { dg-do compile } */
+/* { dg-options "-O1 -fdump-tree-slsr-details" } */
+
+#include <stdio.h>
+
+#define W1  22725
+#define W2  21407
+#define W3  19266
+#define W6  8867
+
+void idct_row(short *row, int *dst)
+{
+    int a0, a1, b0, b1;
+
+    a0 = W1 * row[0];
+    a1 = a0;
+
+    a0 += W2 * row[2];
+    a1 += W6 * row[2];
+
+    b0 = W1 * row[1];
+    b1 = W3 * row[1];
+
+    dst[0] = a0 + b0;
+    dst[1] = a0 - b0;
+    dst[2] = a1 + b1;
+    dst[3] = a1 - b1;
+}
+
+static short block[8] = { 1, 2, 3, 4 };
+
+int main(void)
+{
+    int out[4];
+    int i;
+
+    idct_row(block, out);
+
+    for (i = 0; i < 4; i++)
+        printf("%d\n", out[i]);
+
+    return !(out[2] == 87858 && out[3] == 10794);
+}
+
+/* For now, disable inserting an initializer when the multiplication will
+   take place in a smaller type than originally.  This test may be deleted
+   in future when this case is handled more precisely.  */
+/* { dg-final { scan-tree-dump-times "Inserting initializer" 0 "slsr" } } */
+/* { dg-final { cleanup-tree-dump "slsr" } } */
Index: gcc/gimple-ssa-strength-reduction.c
===
--- gcc/gimple-ssa-strength-reduction.c (revision 190305)
+++ gcc/gimple-ssa-strength-reduction.c (working copy)
@@ -1089,6 +1089,32 @@ slsr_process_neg (gimple gs, tree rhs1, bool speed
   add_cand_for_stmt (gs, c);
 }
 
+/* Helper function for legal_cast_p, operating on two trees.  Checks
+   whether it's allowable to cast from RHS to LHS.  See legal_cast_p
+   for more details.  */
+
+static bool
+legal_cast_p_1 (tree lhs, tree rhs)
+{
+  tree lhs_type, rhs_type;
+  unsigned lhs_size, rhs_size;
+  bool lhs_wraps, rhs_wraps;
+
+  lhs_type = TREE_TYPE (lhs);
+  rhs_type = TREE_TYPE (rhs);
+  lhs_size = TYPE_PRECISION (lhs_type);
+  rhs_size = TYPE_PRECISION (rhs_type);
+  lhs_wraps = TYPE_OVERFLOW_WRAPS (lhs_type);
+  rhs_wraps = TYPE_OVERFLOW_WRAPS (rhs_type);
+
+  if (lhs_size < rhs_size
+  || (rhs_wraps && !lhs_wraps)
+  || (rhs_wraps && lhs_wraps && rhs_size != lhs_size))
+return false;
+
+  return true;
+}
+
 /* Return TRUE if GS is a statement that defines an SSA name from
a conversion and is legal for us to combine with an add and multiply
in the candidate table.  For example, suppose we have:
@@ -1129,28 +1155,11 @@ slsr_process_neg (gimple gs, tree rhs1, bool speed
 static bool
 legal_cast_p (gimple gs, tree rhs)
 {
-  tree lhs, lhs_type, rhs_type;
-  unsigned lhs_size, rhs_size;
-  bool lhs_wraps, rhs_wraps;
-
   if (!is_gimple_assign (gs)
   || !CONVERT_EXPR_CODE_P (gimple_assign_rhs_code (gs)))
 return false;
 
-  lhs = gimple_assign_lhs (gs);
-  lhs_type = TREE_TYPE (lhs);
-  rhs_type = TREE_TYPE (rhs);
-  lhs_size = TYPE_PRECISION (lhs_type);
-  rhs_size = TYPE_PRECISION (rhs_type);
-  lhs_wraps = TYPE_OVERFLOW_WRAPS (lhs_type);
-  rhs_wraps = TYPE_OVERFLOW_WRAPS (rhs_type);
-
-  if (lhs_size < rhs_size
-  || (rhs_wraps && !lhs_wraps)
-  || (rhs_wraps && lhs_wraps && rhs_size != lhs_size))
-return false;
-
-  return true;
+  return legal_cast_p_1 (gimple_assign_lhs (gs), rhs);
 }
 
 /* Given GS which is a cast to a scalar integer type, determine whether
@@ -1996,6 +2005,31 @@ analyze_increments (slsr_cand_t first_dep, enum ma
   != POINTER_PLUS_EXPR)))
incr_vec[i].cost = COST_NEUTRAL;
   
+  /* FORNOW: If we need to add an initializer, give up if a cast from
+the candidate's type to its stride's type can lose precision.
+This could eventually be handled better by expressly retaining the
+result of a cast to a wider type in the stride.  Example:
+
+   short int _1;
+  _2 = (int) _1;
+  _3 = _2 * 10;
+  _4 = x + _3;    ADD: x + (10 * _1) : int
+  _5 = _2 * 15;
+  _6 = x + _5;    ADD: x + (15 * _1) : int
+
+ Right now replacing _6 would cause insertion of an initializ

Re: [patch] rs6000: plug a leak

2012-08-22 Thread William J. Schmidt
On Thu, 2012-08-23 at 00:53 +0200, Steven Bosscher wrote:
> Hello Bill,
> 
> This patch plugs a leak in rs6000.c:rs6000_density_test(). You have to
> free the array that get_loop_body returns. Noticed while going over
> all uses of get_loop_body (it's a common mistake to leak the return
> array).
> 
> Patch is completely untested because I don't know when/how this
> function is used. You've added this function:
> 
> 2012-07-31  Bill Schmidt  <...>
> 
> * config/rs6000/rs6000.c (rs6000_builtin_vectorization_cost): Revise
> costs for vec_perm and vec_promote_demote down to more natural values.
> (struct _rs6000_cost_data): New data structure.
> -->(rs6000_density_test): New function
> 
> so I suppose you know what it's for and how to test this patch :-)
> 
> Could you test this for me and commit it if nothing strange happens?

Sure thing!  Thanks for catching this.

Bill

> 
> Thanks,
> 
> Ciao!
> Steven
> 
> 
> 
> Index: config/rs6000/rs6000.c
> ===
> --- config/rs6000/rs6000.c  (revision 190601)
> +++ config/rs6000/rs6000.c  (working copy)
> @@ -3509,6 +3509,7 @@ rs6000_density_test (rs6000_cost_data *d
> not_vec_cost++;
> }
>  }
> +  free (bbs);
> 
>density_pct = (vec_cost * 100) / (vec_cost + not_vec_cost);
> 



Re: [patch] rs6000: plug a leak

2012-08-23 Thread William J. Schmidt
On Thu, 2012-08-23 at 00:53 +0200, Steven Bosscher wrote:
> Hello Bill,
> 
> This patch plugs a leak in rs6000.c:rs6000_density_test(). You have to
> free the array that get_loop_body returns. Noticed while going over
> all uses of get_loop_body (it's a common mistake to leak the return
> array).
> 
> Patch is completely untested because I don't know when/how this
> function is used. You've added this function:
> 
> 2012-07-31  Bill Schmidt  <...>
> 
> * config/rs6000/rs6000.c (rs6000_builtin_vectorization_cost): Revise
> costs for vec_perm and vec_promote_demote down to more natural values.
> (struct _rs6000_cost_data): New data structure.
> -->(rs6000_density_test): New function
> 
> so I suppose you know what it's for and how to test this patch :-)
> 
> Could you test this for me and commit it if nothing strange happens?

Hi Steven,

Regstrapped with no additional failures on powerpc64-unknown-linux-gnu.
Built CPU2006 without error.  Committed as obvious.

Thanks again,
Bill

> 
> Thanks,
> 
> Ciao!
> Steven
> 
> 
> 
> Index: config/rs6000/rs6000.c
> ===
> --- config/rs6000/rs6000.c  (revision 190601)
> +++ config/rs6000/rs6000.c  (working copy)
> @@ -3509,6 +3509,7 @@ rs6000_density_test (rs6000_cost_data *d
> not_vec_cost++;
> }
>  }
> +  free (bbs);
> 
>density_pct = (vec_cost * 100) / (vec_cost + not_vec_cost);
> 



[PATCH] Fix PR54492

2012-09-10 Thread William J. Schmidt
Richard found some N^2 behavior in SLSR that has to be suppressed.
Searching for the best possible basis is overkill when there are
hundreds of thousands of possibilities.  This patch constrains the
search to "good enough" in such cases.

Bootstrapped and tested on powerpc64-unknown-linux-gnu with no
regressions.  Ok for trunk?

Thanks,
Bill


2012-08-10  Bill Schmidt  

* gimple-ssa-strength-reduction.c (find_basis_for_candidate): Limit
the time spent searching for a basis.


Index: gcc/gimple-ssa-strength-reduction.c
===
--- gcc/gimple-ssa-strength-reduction.c (revision 191135)
+++ gcc/gimple-ssa-strength-reduction.c (working copy)
@@ -353,10 +353,14 @@ find_basis_for_candidate (slsr_cand_t c)
   cand_chain_t chain;
   slsr_cand_t basis = NULL;
 
+  // Limit potential of N^2 behavior for long candidate chains.
+  int iters = 0;
+  const int MAX_ITERS = 50;
+
   mapping_key.base_expr = c->base_expr;
   chain = (cand_chain_t) htab_find (base_cand_map, &mapping_key);
 
-  for (; chain; chain = chain->next)
+  for (; chain && iters < MAX_ITERS; chain = chain->next, ++iters)
 {
   slsr_cand_t one_basis = chain->cand;
 




Re: [PATCH] Fix PR54492

2012-09-10 Thread William J. Schmidt
On Mon, 2012-09-10 at 16:45 +0200, Richard Guenther wrote:
> On Mon, 10 Sep 2012, William J. Schmidt wrote:
> 
> > Richard found some N^2 behavior in SLSR that has to be suppressed.
> > Searching for the best possible basis is overkill when there are
> > hundreds of thousands of possibilities.  This patch constrains the
> > search to "good enough" in such cases.
> > 
> > Bootstrapped and tested on powerpc64-unknown-linux-gnu with no
> > regressions.  Ok for trunk?
> 
> Hm, rather than stopping the search, can we stop adding new candidates
> instead so the list never grows that long?  If that's not easy
> the patch is ok as-is.

I think this way is probably better.  Right now the potential bases are
organized as a stack with new ones added to the front and considered
first.  To disable it there would require adding state to keep a count,
and then we would only be looking at the most distant ones.  This way
the 50 most recently added potential bases (most likely to be local) are
considered.

Thanks,
Bill

> 
> Thanks,
> Richard.
> 
> > Thanks,
> > Bill
> > 
> > 
> > 2012-08-10  Bill Schmidt  
> > 
> > * gimple-ssa-strength-reduction.c (find_basis_for_candidate): Limit
> > the time spent searching for a basis.
> > 
> > 
> > Index: gcc/gimple-ssa-strength-reduction.c
> > ===
> > --- gcc/gimple-ssa-strength-reduction.c (revision 191135)
> > +++ gcc/gimple-ssa-strength-reduction.c (working copy)
> > @@ -353,10 +353,14 @@ find_basis_for_candidate (slsr_cand_t c)
> >cand_chain_t chain;
> >slsr_cand_t basis = NULL;
> >  
> > +  // Limit potential of N^2 behavior for long candidate chains.
> > +  int iters = 0;
> > +  const int MAX_ITERS = 50;
> > +
> >mapping_key.base_expr = c->base_expr;
> >chain = (cand_chain_t) htab_find (base_cand_map, &mapping_key);
> >  
> > -  for (; chain; chain = chain->next)
> > +  for (; chain && iters < MAX_ITERS; chain = chain->next, ++iters)
> >  {
> >slsr_cand_t one_basis = chain->cand;
> >  
> > 
> > 
> > 
> 




Re: [PATCH] Fix PR54492

2012-09-10 Thread William J. Schmidt
On Mon, 2012-09-10 at 16:56 +0200, Richard Guenther wrote:
> On Mon, 10 Sep 2012, Jakub Jelinek wrote:
> 
> > On Mon, Sep 10, 2012 at 04:45:24PM +0200, Richard Guenther wrote:
> > > On Mon, 10 Sep 2012, William J. Schmidt wrote:
> > > 
> > > > Richard found some N^2 behavior in SLSR that has to be suppressed.
> > > > Searching for the best possible basis is overkill when there are
> > > > hundreds of thousands of possibilities.  This patch constrains the
> > > > search to "good enough" in such cases.
> > > > 
> > > > Bootstrapped and tested on powerpc64-unknown-linux-gnu with no
> > > > regressions.  Ok for trunk?
> > > 
> > > Hm, rather than stopping the search, can we stop adding new candidates
> > > instead so the list never grows that long?  If that's not easy
> > > the patch is ok as-is.
> > 
> > Don't we want a param for that, or is a hardcoded magic constant fine here?
> 
> I suppose a param for it would be nice.

OK, I'll get a param in place and get back to you.  Thanks...

Bill

> 
> Richard.
> 
> > > > 2012-08-10  Bill Schmidt  
> > > > 
> > > > * gimple-ssa-strength-reduction.c (find_basis_for_candidate): Limit
> > > > the time spent searching for a basis.
> > > > 
> > > > 
> > > > Index: gcc/gimple-ssa-strength-reduction.c
> > > > ===
> > > > --- gcc/gimple-ssa-strength-reduction.c (revision 191135)
> > > > +++ gcc/gimple-ssa-strength-reduction.c (working copy)
> > > > @@ -353,10 +353,14 @@ find_basis_for_candidate (slsr_cand_t c)
> > > >cand_chain_t chain;
> > > >slsr_cand_t basis = NULL;
> > > >  
> > > > +  // Limit potential of N^2 behavior for long candidate chains.
> > > > +  int iters = 0;
> > > > +  const int MAX_ITERS = 50;
> > > > +
> > > >mapping_key.base_expr = c->base_expr;
> > > >chain = (cand_chain_t) htab_find (base_cand_map, &mapping_key);
> > > >  
> > > > -  for (; chain; chain = chain->next)
> > > > +  for (; chain && iters < MAX_ITERS; chain = chain->next, ++iters)
> > > >  {
> > > >slsr_cand_t one_basis = chain->cand;
> > 
> > Jakub
> > 
> > 
> 




Re: [PATCH] Fix PR54492

2012-09-10 Thread William J. Schmidt
Here's the revised patch with a param.  Bootstrapped and tested in the
same manner.  Ok for trunk?

Thanks,
Bill


2012-08-10  Bill Schmidt  

* doc/invoke.texi (max-slsr-cand-scan): New description.
* gimple-ssa-strength-reduction.c (find_basis_for_candidate): Limit
the time spent searching for a basis.
* params.def (PARAM_MAX_SLSR_CANDIDATE_SCAN): New param.


Index: gcc/doc/invoke.texi
===
--- gcc/doc/invoke.texi (revision 191135)
+++ gcc/doc/invoke.texi (working copy)
@@ -9407,6 +9407,11 @@ having a regular register file and accurate regist
 See @file{haifa-sched.c} in the GCC sources for more details.
 
 The default choice depends on the target.
+
+@item max-slsr-cand-scan
+Set the maximum number of existing candidates that will be considered when
+seeking a basis for a new straight-line strength reduction candidate.
+
 @end table
 @end table
 
Index: gcc/gimple-ssa-strength-reduction.c
===
--- gcc/gimple-ssa-strength-reduction.c (revision 191135)
+++ gcc/gimple-ssa-strength-reduction.c (working copy)
@@ -54,6 +54,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "domwalk.h"
 #include "pointer-set.h"
 #include "expmed.h"
+#include "params.h"
 
 /* Information about a strength reduction candidate.  Each statement
in the candidate table represents an expression of one of the
@@ -353,10 +354,14 @@ find_basis_for_candidate (slsr_cand_t c)
   cand_chain_t chain;
   slsr_cand_t basis = NULL;
 
+  // Limit potential of N^2 behavior for long candidate chains.
+  int iters = 0;
+  int max_iters = PARAM_VALUE (PARAM_MAX_SLSR_CANDIDATE_SCAN);
+
   mapping_key.base_expr = c->base_expr;
   chain = (cand_chain_t) htab_find (base_cand_map, &mapping_key);
 
-  for (; chain; chain = chain->next)
+  for (; chain && iters < max_iters; chain = chain->next, ++iters)
 {
   slsr_cand_t one_basis = chain->cand;
 
Index: gcc/params.def
===
--- gcc/params.def  (revision 191135)
+++ gcc/params.def  (working copy)
@@ -973,6 +973,13 @@ DEFPARAM (PARAM_SCHED_PRESSURE_ALGORITHM,
  "Which -fsched-pressure algorithm to apply",
  1, 1, 2)
 
+/* Maximum length of candidate scans in straight-line strength reduction.  */
+DEFPARAM (PARAM_MAX_SLSR_CANDIDATE_SCAN,
+ "max-slsr-cand-scan",
+ "Maximum length of candidate scans for straight-line "
+ "strength reduction",
+ 50, 1, 99)
+
 /*
 Local variables:
 mode:c




[PATCH] Fix PR54674

2012-09-24 Thread William J. Schmidt
In cases where pointers and ints are cast back and forth, SLSR can be
tricked into introducing a multiply where one of the operands is of
pointer type.  Don't do that!
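
For illustration, the kind of cast scenario meant here looks loosely
like the following (a hypothetical sketch, not the reduced testcase
from the PR):

/* Hypothetical sketch: an address round-trips through an integer
   type, so the value feeding the multiply traces back to a pointer,
   and SLSR must not use it as a stride in a replacement.  */
long
f (char *p, long i)
{
  long s = (long) p;   /* pointer cast to integer */
  return s * i;        /* multiply whose operand derives from a pointer */
}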

Verified that the reduced test case in the PR is fixed with a
cross-compile to sh4-unknown-linux-gnu with -Os, which is the only known
situation where the replacement looks profitable.  (It appears multiply
costs are underestimated.)

Bootstrapped and tested on powerpc64-unknown-linux-gnu with no new
regressions.  Ok for trunk?

Thanks,
Bill


2012-09-24  Bill Schmidt  

* gimple-ssa-strength-reduction.c (analyze_increments): Don't
introduce a multiplication with a pointer operand.


Index: gcc/gimple-ssa-strength-reduction.c
===
--- gcc/gimple-ssa-strength-reduction.c (revision 191665)
+++ gcc/gimple-ssa-strength-reduction.c (working copy)
@@ -2028,6 +2028,15 @@ analyze_increments (slsr_cand_t first_dep, enum ma
 
incr_vec[i].cost = COST_INFINITE;
 
+  /* If we need to add an initializer, make sure we don't introduce
+a multiply by a pointer type, which can happen in certain cast
+scenarios.  */
+  else if (!incr_vec[i].initializer
+  && TREE_CODE (first_dep->stride) != INTEGER_CST
+  && POINTER_TYPE_P (TREE_TYPE (first_dep->stride)))
+
+   incr_vec[i].cost = COST_INFINITE;
+
   /* For any other increment, if this is a multiply candidate, we
 must introduce a temporary T and initialize it with
 T_0 = stride * increment.  When optimizing for speed, walk the




Re: [PATCH] Fix PR54674

2012-09-25 Thread William J. Schmidt


On Tue, 2012-09-25 at 09:14 +0200, Richard Guenther wrote:
> On Mon, 24 Sep 2012, William J. Schmidt wrote:
> 
> > In cases where pointers and ints are cast back and forth, SLSR can be
> > tricked into introducing a multiply where one of the operands is of
> > pointer type.  Don't do that!
> > 
> > Verified that the reduced test case in the PR is fixed with a
> > cross-compile to sh4-unknown-linux-gnu with -Os, which is the only known
> > situation where the replacement looks profitable.  (It appears multiply
> > costs are underestimated.)
> > 
> > Bootstrapped and tested on powerpc64-unknown-linux-gnu with no new
> > regressions.  Ok for trunk?
> 
> Ok.  Btw, a multiply by/of a pointer in GIMPLE is done by casting
> to an appropriate unsigned type, doing the multiply, and then
> casting back to the pointer type.  Just in case it _is_ profitable
> to do the transform (the patch seems to try to avoid the situation
> only?)

Ok, that's good to know, thanks.  There's a general to-do in that area
to make the whole casting part better than it is right now, and that
should be addressed when I can get back to GCC and work on some of these
things.  I'll add a comment to that effect.  Appreciate the information!
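
For the record, a C-level sketch of the idiom Richard describes (purely
illustrative, not compiler output):

/* To multiply by/of a pointer in GIMPLE terms: cast to an appropriate
   unsigned type, do the multiply there, and cast back.  */
int *
scale_pointer (int *p, unsigned long k)
{
  unsigned long t = (unsigned long) p;  /* pointer -> unsigned */
  t = t * k;                            /* multiply in the unsigned type */
  return (int *) t;                     /* unsigned -> pointer */
}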

Thanks,
Bill

> 
> Thanks,
> Richard.
> 
> > Thanks,
> > Bill
> > 
> > 
> > 2012-09-24  Bill Schmidt  
> > 
> > * gimple-ssa-strength-reduction.c (analyze_increments): Don't
> > introduce a multiplication with a pointer operand.
> > 
> > 
> > Index: gcc/gimple-ssa-strength-reduction.c
> > ===
> > --- gcc/gimple-ssa-strength-reduction.c (revision 191665)
> > +++ gcc/gimple-ssa-strength-reduction.c (working copy)
> > @@ -2028,6 +2028,15 @@ analyze_increments (slsr_cand_t first_dep, enum ma
> >  
> > incr_vec[i].cost = COST_INFINITE;
> >  
> > +  /* If we need to add an initializer, make sure we don't introduce
> > +a multiply by a pointer type, which can happen in certain cast
> > +scenarios.  */
> > +  else if (!incr_vec[i].initializer
> > +  && TREE_CODE (first_dep->stride) != INTEGER_CST
> > +  && POINTER_TYPE_P (TREE_TYPE (first_dep->stride)))
> > +
> > +   incr_vec[i].cost = COST_INFINITE;
> > +
> >/* For any other increment, if this is a multiply candidate, we
> >  must introduce a temporary T and initialize it with
> >  T_0 = stride * increment.  When optimizing for speed, walk the
> > 
> > 
> > 
> 



[PATCH] Some vector cost model cleanup

2012-06-13 Thread William J. Schmidt
This is just some general maintenance to the vectorizer's cost model
code:

 * Corrects a typo in a function name;
 * Eliminates an unnecessary function;
 * Combines some duplicate inline functions.

Bootstrapped and tested on powerpc64-unknown-linux-gnu, no new
regressions.  Ok for trunk?

Thanks,
Bill


2012-06-13  Bill Schmidt  

* tree-vectorizer.h (vect_get_stmt_cost): Move from tree-vect-stmts.c.
(cost_for_stmt): Remove decl.
(vect_get_single_scalar_iteration_cost): Correct typo in name.
* tree-vect-loop.c (vect_get_cost): Remove.
(vect_get_single_scalar_iteration_cost): Correct typo in name; use
vect_get_stmt_cost rather than vect_get_cost.
(vect_get_known_peeling_cost): Use vect_get_stmt_cost rather than
vect_get_cost.
(vect_estimate_min_profitable_iters): Correct typo in call to
vect_get_single_scalar_iteration_cost; use vect_get_stmt_cost rather
than vect_get_cost.
(vect_model_reduction_cost): Use vect_get_stmt_cost rather than
vect_get_cost.
(vect_model_induction_cost): Likewise.
* tree-vect-data-refs.c (vect_peeling_hash_get_lowest_cost): Correct
typo in call to vect_get_single_scalar_iteration_cost.
* tree-vect-stmts.c (vect_get_stmt_cost): Move to tree-vectorizer.h.
(cost_for_stmt): Remove unnecessary function.
* Makefile.in (TREE_VECTORIZER_H): Update dependencies.


Index: gcc/tree-vectorizer.h
===
--- gcc/tree-vectorizer.h   (revision 188507)
+++ gcc/tree-vectorizer.h   (working copy)
@@ -23,6 +23,7 @@ along with GCC; see the file COPYING3.  If not see
 #define GCC_TREE_VECTORIZER_H
 
 #include "tree-data-ref.h"
+#include "target.h"
 
 typedef source_location LOC;
 #define UNKNOWN_LOC UNKNOWN_LOCATION
@@ -769,6 +770,18 @@ vect_pow2 (int x)
   return res;
 }
 
+/* Get cost by calling cost target builtin.  */
+
+static inline
+int vect_get_stmt_cost (enum vect_cost_for_stmt type_of_cost)
+{
+  tree dummy_type = NULL;
+  int dummy = 0;
+
+  return targetm.vectorize.builtin_vectorization_cost (type_of_cost,
+   dummy_type, dummy);
+}
+
 /*-*/
 /* Info on data references alignment.  */
 /*-*/
@@ -843,7 +856,6 @@ extern void vect_model_load_cost (stmt_vec_info, i
 extern void vect_finish_stmt_generation (gimple, gimple,
  gimple_stmt_iterator *);
 extern bool vect_mark_stmts_to_be_vectorized (loop_vec_info);
-extern int cost_for_stmt (gimple);
 extern tree vect_get_vec_def_for_operand (tree, gimple, tree *);
 extern tree vect_init_vector (gimple, tree, tree,
   gimple_stmt_iterator *);
@@ -919,7 +931,7 @@ extern int vect_estimate_min_profitable_iters (loo
 extern tree get_initial_def_for_reduction (gimple, tree, tree *);
 extern int vect_min_worthwhile_factor (enum tree_code);
 extern int vect_get_known_peeling_cost (loop_vec_info, int, int *, int);
-extern int vect_get_single_scalar_iteraion_cost (loop_vec_info);
+extern int vect_get_single_scalar_iteration_cost (loop_vec_info);
 
 /* In tree-vect-slp.c.  */
 extern void vect_free_slp_instance (slp_instance);
Index: gcc/tree-vect-loop.c
===
--- gcc/tree-vect-loop.c(revision 188507)
+++ gcc/tree-vect-loop.c(working copy)
@@ -1201,19 +1201,6 @@ vect_analyze_loop_form (struct loop *loop)
 }
 
 
-/* Get cost by calling cost target builtin.  */
-
-static inline int
-vect_get_cost (enum vect_cost_for_stmt type_of_cost)
-{
-  tree dummy_type = NULL;
-  int dummy = 0;
-
-  return targetm.vectorize.builtin_vectorization_cost (type_of_cost,
-   dummy_type, dummy);
-}
-
- 
 /* Function vect_analyze_loop_operations.
 
Scan the loop stmts and make sure they are all vectorizable.  */
@@ -2385,7 +2372,7 @@ vect_force_simple_reduction (loop_vec_info loop_in
 
 /* Calculate the cost of one scalar iteration of the loop.  */
 int
-vect_get_single_scalar_iteraion_cost (loop_vec_info loop_vinfo)
+vect_get_single_scalar_iteration_cost (loop_vec_info loop_vinfo)
 {
   struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
   basic_block *bbs = LOOP_VINFO_BBS (loop_vinfo);
@@ -2434,12 +2421,12 @@ int
   if (STMT_VINFO_DATA_REF (vinfo_for_stmt (stmt)))
 {
   if (DR_IS_READ (STMT_VINFO_DATA_REF (vinfo_for_stmt (stmt
-   stmt_cost = vect_get_cost (scalar_load);
+   stmt_cost = vect_get_stmt_cost (scalar_load);
  else
-   stmt_cost = vect_get_cost (scalar_store);
+   stmt_cost = vect_get_stmt_cost (scalar_store);
 }
   else
-st

Re: [Patch ping] Strength reduction

2012-06-14 Thread William J. Schmidt
Pro forma ping. :)

Thanks,
Bill

On Sun, 2012-04-29 at 18:17 -0500, William J. Schmidt wrote:
> Thought I'd ping http://gcc.gnu.org/ml/gcc-patches/2012-03/msg01225.html
> since it's been about six weeks.  Any initial feedback would be very
> much appreciated!
> 
> Thanks,
> Bill




[PATCH] Fix PR53703

2012-06-17 Thread William J. Schmidt
The test case exposes a bug that occurs only when a diamond control flow
pattern has the arguments of the joining phi in a different order from
the successor arcs of the entry block.  My logic for setting
bb_for_def[12] was just brain-dead.  This cleans that up and also
prevents wasting time examining phis of virtual ops, which I noticed
happening while debugging this.
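
For anyone following along, the shape in question is just a diamond like
this (a minimal illustration; the committed testcase below is much
larger):

/* Minimal diamond: the joining PHI's argument order need not match the
   order of the entry block's successor edges, which is what the broken
   bb_for_def[12] logic assumed.  */
int
diamond (int c, int a, int b)
{
  int x;
  if (c)        /* entry block with two successors */
    x = a;      /* then-block defines one PHI argument */
  else
    x = b;      /* else-block defines the other */
  return x;     /* join block: x = PHI <a, b>, in either order */
}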

Bootstrapped and regtested on powerpc64-unknown-linux-gnu with no new
failures.  Ok for trunk?

Thanks,
Bill


gcc:

2012-06-17  Bill Schmidt  

PR tree-optimization/53703
* tree-ssa-phiopt.c (hoist_adjacent_loads): Skip virtual phis;
correctly set bb_for_def[12].

gcc/testsuite:

2012-06-17  Bill Schmidt  

PR tree-optimization/53703
* gcc.dg/torture/pr53703.c: New test.


Index: gcc/testsuite/gcc.dg/torture/pr53703.c
===
--- gcc/testsuite/gcc.dg/torture/pr53703.c  (revision 0)
+++ gcc/testsuite/gcc.dg/torture/pr53703.c  (revision 0)
@@ -0,0 +1,149 @@
+/* Reduced test case from PR53703.  Used to ICE.  */
+
+/* { dg-do compile } */
+/* { dg-options "-w" } */
+
+typedef long unsigned int size_t;
+typedef unsigned short int sa_family_t;
+struct sockaddr   {};
+typedef unsigned char __u8;
+typedef unsigned short __u16;
+typedef unsigned int __u32;
+struct nlmsghdr {
+  __u32 nlmsg_len;
+  __u16 nlmsg_type;
+};
+struct ifaddrmsg {
+  __u8 ifa_family;
+};
+enum {
+  IFA_ADDRESS,
+  IFA_LOCAL,
+};
+enum {
+  RTM_NEWLINK = 16,
+  RTM_NEWADDR = 20,
+};
+struct rtattr {
+  unsigned short rta_len;
+  unsigned short rta_type;
+};
+struct ifaddrs {
+  struct ifaddrs *ifa_next;
+  unsigned short ifa_flags;
+};
+typedef unsigned short int uint16_t;
+typedef unsigned int uint32_t;
+struct nlmsg_list {
+  struct nlmsg_list *nlm_next;
+  int size;
+};
+struct rtmaddr_ifamap {
+  void *address;
+  void *local;
+  int address_len;
+  int local_len;
+};
+int usagi_getifaddrs (struct ifaddrs **ifap)
+{
+  struct nlmsg_list *nlmsg_list, *nlmsg_end, *nlm;
+  size_t dlen, xlen, nlen;
+  int build;
+  for (build = 0; build <= 1; build++)
+{
+  struct ifaddrs *ifl = ((void *)0), *ifa = ((void *)0);
+  struct nlmsghdr *nlh, *nlh0;
+  uint16_t *ifflist = ((void *)0);
+  struct rtmaddr_ifamap ifamap;
+  for (nlm = nlmsg_list; nlm; nlm = nlm->nlm_next)
+   {
+ int nlmlen = nlm->size;
+ for (nlh = nlh0;
+  ((nlmlen) >= (int)sizeof(struct nlmsghdr)
+   && (nlh)->nlmsg_len >= sizeof(struct nlmsghdr)
+   && (nlh)->nlmsg_len <= (nlmlen));
+  nlh = ((nlmlen) -= ( (((nlh)->nlmsg_len)+4U -1) & ~(4U -1) ),
+ (struct nlmsghdr*)(((char*)(nlh))
++ ( (((nlh)->nlmsg_len)+4U -1)
+& ~(4U -1) 
+   {
+ struct ifinfomsg *ifim = ((void *)0);
+ struct ifaddrmsg *ifam = ((void *)0);
+ struct rtattr *rta;
+ sa_family_t nlm_family = 0;
+ uint32_t nlm_scope = 0, nlm_index = 0;
+ memset (&ifamap, 0, sizeof (ifamap));
+ switch (nlh->nlmsg_type)
+   {
+   case RTM_NEWLINK:
+ ifim = (struct ifinfomsg *)
+   ((void*)(((char*)nlh)
++ ((0)+( int)
+( ((sizeof(struct nlmsghdr))+4U -1)
+  & ~(4U -1) )))+4U -1)
+ & ~(4U -1) ;
+   case RTM_NEWADDR:
+ ifam = (struct ifaddrmsg *)
+   ((void*)(((char*)nlh)
++ ((0)+( int)
+( ((sizeof(struct nlmsghdr))+4U -1)
+  & ~(4U -1) )))+4U -1)
+ & ~(4U -1) ;
+ nlm_family = ifam->ifa_family;
+ if (build)
+   ifa->ifa_flags = ifflist[nlm_index];
+ break;
+   default:
+ continue;
+   }
+ if (!build)
+   {
+ void *rtadata = ((void*)(((char*)(rta))
+  + (( ((sizeof(struct rtattr))+4 -1)
+   & ~(4 -1) ) + (0;
+ size_t rtapayload = ((int)((rta)->rta_len)
+  - (( ((sizeof(struct rtattr))+4 -1)
+   & ~(4 -1) ) + (0)));
+ switch (nlh->nlmsg_type)
+   {
+   case RTM_NEWLINK:
+ break;
+   case RTM_NEWADDR:
+ if (nlm_family == 17)
+   break;
+ switch (rta->rta_type)
+   {
+   case IFA_ADDRESS:

Re: [PATCH] Add vector cost model density heuristic

2012-06-18 Thread William J. Schmidt
On Mon, 2012-06-11 at 13:40 +0200, Richard Guenther wrote:
> On Fri, 8 Jun 2012, William J. Schmidt wrote:
> 

> 
> Hmm.  I don't like this patch or its general idea too much.  Instead
> I'd like us to move more of the cost model detail to the target, giving
> it a chance to look at the whole loop before deciding on a cost.  ISTR
> posting the overall idea at some point, but let me repeat it here instead
> of trying to find that e-mail.
> 
> The basic interface of the cost model should be, in targetm.vectorize
> 
>   /* Tell the target to start cost analysis of a loop or a basic-block
>  (if the loop argument is NULL).  Returns an opaque pointer to
>  target-private data.  */
>   void *init_cost (struct loop *loop);
> 
>   /* Add cost for N vectorized-stmt-kind statements in vector_mode.  */
>   void add_stmt_cost (void *data, unsigned n,
> vectorized-stmt-kind,
>   enum machine_mode vector_mode);
> 
>   /* Tell the target to compute and return the cost of the accumulated
>  statements and free any target-private data.  */
>   unsigned finish_cost (void *data);
> 
> with eventually slightly different signatures for add_stmt_cost
> (like pass in the original scalar stmt?).
> 
> It allows the target, at finish_cost time, to evaluate things like
> register pressure and resource utilization.
> 
> Thanks,
> Richard.

I've been looking at this in between other projects.  I wanted to be
sure I understood the SLP infrastructure and whether it would cause any
problems.  It looks to me like it will be mostly ok.  One issue I
noticed is a possible difference in the order in which SLP instructions
are analyzed and the order in which the instructions are "issued" during
transformation.

For both loop analysis and basic block analysis, SLP trees are
constructed and analyzed prior to examining other vectorizable
instructions.  Their costs are calculated and stored in the SLP trees at
this time.  Later, when transforming statements to their vector
equivalents, instructions in the block (or loop body) are processed in
order until the first instruction that's part of an SLP tree is
encountered.  At that point, every instruction that's part of any SLP
tree is transformed; then the vectorizer continues with the remaining
non-SLP vectorizable statements.

So if we do the natural and easy thing of placing calls to add_stmt_cost
everywhere that costs are calculated today, the order that those costs
are presented to the back end model will possibly be different than the
order they are actually "emitted."

For a first cut at this, I suggest ignoring the problem other than to
document it as an opportunity for improvement.  Later we could improve
it by using an add_stmt_slp_cost () interface (or adding an is_slp
flag), and another interface to be called at the time during analysis
when the SLP statements will be issued during transformation.  This
would allow the back end model to queue up the SLP costs in a separate
vector and later place them in its internal structures at the
appropriate place.
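
Roughly, I'd imagine the extra interface looking something like this
(names and signatures are just a sketch, patterned on the hooks quoted
above):

/* Sketch only.  Queue the cost of N SLP statements during analysis...  */
void add_stmt_slp_cost (void *data, unsigned n,
                        enum vect_cost_for_stmt kind,
                        enum machine_mode vector_mode);

/* ...and tell the model when, in emission order, the queued SLP
   statements will actually be issued during transformation.  */
void issue_slp_costs (void *data);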

It should eventually be possible to remove these fields/accessors:

 * STMT_VINFO_{IN,OUT}SIDE_OF_LOOP_COST
 * SLP_TREE_{IN,OUT}SIDE_OF_LOOP_COST
 * SLP_INSTANCE_{IN,OUT}SIDE_OF_LOOP_COST

However, I think this should be delayed until we have the basic
infrastructure for the new model in place and well tested.

The other issue is that we should have the model track both the inside
and outside costs if we're going to get everything into the target
model.  For a first pass we can ignore this and keep the existing logic
for the outside costs.  Later we should add some interfaces analogous to
add_stmt_cost such as add_stmt_prolog_cost and add_stmt_epilog_cost so
the model can track this stuff as carefully as it wants to.

So, I'd propose going at this in several phases:

(1) Add calls to the new interface without disturbing existing logic;
modify the profitability algorithms to query the new model for inside
costs.  Default algorithm for the model is to just sum costs as is done
today.
(x) Add heuristics to target models as desired.
(2) Handle the SLP ordering problem.
(3) Handle outside costs in the target model.
(4) Remove the now unnecessary cost fields and the calls that set them.

Item (x) can happen anytime after item (1).

I don't think this work is terribly difficult, just a bit tedious.  The
only really time-consuming aspect of it will be in very careful testing
to keep from changing existing behavior.

All comments welcome -- please let me know what you think.

Thanks,
Bill





Re: [PATCH] Add vector cost model density heuristic

2012-06-18 Thread William J. Schmidt
On Mon, 2012-06-18 at 13:49 -0500, William J. Schmidt wrote:
> On Mon, 2012-06-11 at 13:40 +0200, Richard Guenther wrote:
> > On Fri, 8 Jun 2012, William J. Schmidt wrote:
> > 
> 
> > 
> > Hmm.  I don't like this patch or its general idea too much.  Instead
> > I'd like us to move more of the cost model detail to the target, giving
> > it a chance to look at the whole loop before deciding on a cost.  ISTR
> > posting the overall idea at some point, but let me repeat it here instead
> > of trying to find that e-mail.
> > 
> > The basic interface of the cost model should be, in targetm.vectorize
> > 
> >   /* Tell the target to start cost analysis of a loop or a basic-block
> >  (if the loop argument is NULL).  Returns an opaque pointer to
> >  target-private data.  */
> >   void *init_cost (struct loop *loop);
> > 
> >   /* Add cost for N vectorized-stmt-kind statements in vector_mode.  */
> >   void add_stmt_cost (void *data, unsigned n,
> >   vectorized-stmt-kind,
> >   enum machine_mode vector_mode);
> > 
> >   /* Tell the target to compute and return the cost of the accumulated
> >  statements and free any target-private data.  */
> >   unsigned finish_cost (void *data);

By the way, I don't see much point in passing the void *data around
here.  Too many levels of interfaces that we'd have to pass it around in
the vectorizer, so it would just sit in a static variable.  Might as
well let the data be wholly private to the target.
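
Something along these lines is what I have in mind (just a sketch;
names are illustrative):

/* Variant without the void *data parameter; the target keeps its
   accumulator in private state.  */
static unsigned vect_cost_accumulator;  /* target-private */

void
init_cost (struct loop *loop)
{
  vect_cost_accumulator = 0;
}

unsigned
finish_cost (void)
{
  return vect_cost_accumulator;
}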
> > 
> > with eventually slightly different signatures for add_stmt_cost
> > (like pass in the original scalar stmt?).
> > 
> > It allows the target, at finish_cost time, to evaluate things like
> > register pressure and resource utilization.
> > 
> > Thanks,
> > Richard.




Re: [PATCH] Add vector cost model density heuristic

2012-06-19 Thread William J. Schmidt
On Tue, 2012-06-19 at 12:08 +0200, Richard Guenther wrote:
> On Mon, 18 Jun 2012, William J. Schmidt wrote:
> 
> > On Mon, 2012-06-11 at 13:40 +0200, Richard Guenther wrote:
> > > On Fri, 8 Jun 2012, William J. Schmidt wrote:
> > > 
> > 
> > > 
> > > Hmm.  I don't like this patch or its general idea too much.  Instead
> > > I'd like us to move more of the cost model detail to the target, giving
> > > it a chance to look at the whole loop before deciding on a cost.  ISTR
> > > posting the overall idea at some point, but let me repeat it here instead
> > > of trying to find that e-mail.
> > > 
> > > The basic interface of the cost model should be, in targetm.vectorize
> > > 
> > >   /* Tell the target to start cost analysis of a loop or a basic-block
> > >  (if the loop argument is NULL).  Returns an opaque pointer to
> > >  target-private data.  */
> > >   void *init_cost (struct loop *loop);
> > > 
> > >   /* Add cost for N vectorized-stmt-kind statements in vector_mode.  */
> > >   void add_stmt_cost (void *data, unsigned n,
> > > vectorized-stmt-kind,
> > >   enum machine_mode vector_mode);
> > > 
> > >   /* Tell the target to compute and return the cost of the accumulated
> > >  statements and free any target-private data.  */
> > >   unsigned finish_cost (void *data);
> > > 
> > > with eventually slightly different signatures for add_stmt_cost
> > > (like pass in the original scalar stmt?).
> > > 
> > > It allows the target, at finish_cost time, to evaluate things like
> > > register pressure and resource utilization.
> > > 
> > > Thanks,
> > > Richard.
> > 
> > I've been looking at this in between other projects.  I wanted to be
> > sure I understood the SLP infrastructure and whether it would cause any
> > problems.  It looks to me like it will be mostly ok.  One issue I
> > noticed is a possible difference in the order in which SLP instructions
> > are analyzed and the order in which the instructions are "issued" during
> > transformation.
> > 
> > For both loop analysis and basic block analysis, SLP trees are
> > constructed and analyzed prior to examining other vectorizable
> > instructions.  Their costs are calculated and stored in the SLP trees at
> > this time.  Later, when transforming statements to their vector
> > equivalents, instructions in the block (or loop body) are processed in
> > order until the first instruction that's part of an SLP tree is
> > encountered.  At that point, every instruction that's part of any SLP
> > tree is transformed; then the vectorizer continues with the remaining
> > non-SLP vectorizable statements.
> > 
> > So if we do the natural and easy thing of placing calls to add_stmt_cost
> > everywhere that costs are calculated today, the order that those costs
> > are presented to the back end model will possibly be different than the
> > order they are actually "emitted."
> 
> Interesting.  But I suppose this is similar to how pattern statements
> are handled?  Thus, the whole pattern sequence is processed when
> we encounter the "main" pattern statement?

Yes, but the difference is that both vect_analyze_stmt and
vect_transform_loop handle the pattern statements in the same order
(thankfully -- I would hate to have to deal with the pattern mess).
With SLP, all SLP statements are analyzed ahead of time, but they aren't
transformed until one of them is encountered in the statement walk.

> 
> > For a first cut at this, I suggest ignoring the problem other than to
> > document it as an opportunity for improvement.  Later we could improve
> > it by using an add_stmt_slp_cost () interface (or adding an is_slp
> > flag), and another interface to be called at the time during analysis
> > when the SLP statements will be issued during transformation.  This
> > would allow the back end model to queue up the SLP costs in a separate
> > vector and later place them in its internal structures at the
> > appropriate place.
> >
> > It should eventually be possible to remove these fields/accessors:
> > 
> >  * STMT_VINFO_{IN,OUT}SIDE_OF_LOOP_COST
> >  * SLP_TREE_{IN,OUT}SIDE_OF_LOOP_COST
> >  * SLP_INSTANCE_{IN,OUT}SIDE_OF_LOOP_COST
> > 
> > However, I think this should be delayed until we have the basic
> > infrastructure in place for the new model and well-tested.
> 
> Indeed.
> 
> > The other issue is t

Re: [PATCH] Add vector cost model density heuristic

2012-06-19 Thread William J. Schmidt
On Tue, 2012-06-19 at 12:10 +0200, Richard Guenther wrote:
> On Mon, 18 Jun 2012, William J. Schmidt wrote:
> 
> > On Mon, 2012-06-18 at 13:49 -0500, William J. Schmidt wrote:
> > > On Mon, 2012-06-11 at 13:40 +0200, Richard Guenther wrote:
> > > > On Fri, 8 Jun 2012, William J. Schmidt wrote:
> > > > 
> > > 
> > > > 
> > > > Hmm.  I don't like this patch or its general idea too much.  Instead
> > > > I'd like us to move more of the cost model detail to the target, giving
> > > > it a chance to look at the whole loop before deciding on a cost.  ISTR
> > > > posting the overall idea at some point, but let me repeat it here 
> > > > instead
> > > > of trying to find that e-mail.
> > > > 
> > > > The basic interface of the cost model should be, in targetm.vectorize
> > > > 
> > > >   /* Tell the target to start cost analysis of a loop or a basic-block
> > > >  (if the loop argument is NULL).  Returns an opaque pointer to
> > > >  target-private data.  */
> > > >   void *init_cost (struct loop *loop);
> > > > 
> > > >   /* Add cost for N vectorized-stmt-kind statements in vector_mode.  */
> > > >   void add_stmt_cost (void *data, unsigned n,
> > > >   vectorized-stmt-kind,
> > > >   enum machine_mode vector_mode);
> > > > 
> > > >   /* Tell the target to compute and return the cost of the accumulated
> > > >  statements and free any target-private data.  */
> > > >   unsigned finish_cost (void *data);
> > 
> > By the way, I don't see much point in passing the void *data around
> > here.  Too many levels of interfaces that we'd have to pass it around in
> > the vectorizer, so it would just sit in a static variable.  Might as
> > well let the data be wholly private to the target.
> 
> Ok, so you'd have void init_cost (struct loop *) and
> unsigned finish_cost (void); then?  Static variables are of couse
> not properly "abstracted" so we can't ever compute two set of costs
> at the same time ... but that's true all-over-the-place in GCC ...

It's a fair point, and perhaps I'll decide to pass the data pointer
around anyway to keep that option open.  We'll see which looks uglier.

> 
> With previous discussion the add_stmt_cost hook would be split up
> to also allow passing the operation code for example.

I remember having this discussion, and I was looking for it to check on
the details, but I can't seem to find it either in my inbox or in the
archives.  Can you please point me to that again?  Sorry for the bother.

Thanks,
Bill

> 
> Richard.
> 



Re: [PATCH] Add vector cost model density heuristic

2012-06-19 Thread William J. Schmidt
On Tue, 2012-06-19 at 14:48 +0200, Richard Guenther wrote:
> On Tue, 19 Jun 2012, William J. Schmidt wrote:
> 
> > I remember having this discussion, and I was looking for it to check on
> > the details, but I can't seem to find it either in my inbox or in the
> > archives.  Can you please point me to that again?  Sorry for the bother.
> 
> It was in the "Correct cost model for strided loads" thread.

Ah, right, thanks.  I think it will be best to make that a separate
patch in the series.  Like so:

(1) Add calls to the new interface without disturbing existing logic;
modify the profitability algorithms to query the new model for inside
costs.  Default algorithm for the model is to just sum costs as is done
today.
(1a) Split up the cost hooks (one for loads/stores with misalign parm,
one for vector_stmt with tree_code, etc.; see the sketch after this list).
(x) Add heuristics to target models as desired.
(2) Handle the SLP ordering problem.
(3) Handle outside costs in the target model.
(4) Remove the now unnecessary cost fields and the calls that set them.
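
For (1a), I'm picturing something roughly like the following (a sketch
only; names and parameters are illustrative, not a settled interface):

/* Split-up cost hooks: one flavor for loads/stores that carries the
   misalignment, one for vector statements that carries the tree code.  */
void add_load_store_cost (void *data, unsigned n,
                          enum vect_cost_for_stmt kind,
                          int misalign);
void add_vector_stmt_cost (void *data, unsigned n,
                           enum tree_code code,
                           enum machine_mode vector_mode);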

I'll start work on this series of patches as I have time between other
projects.

Thanks,
Bill

> 
> Richard.
> 



Re: [Patch ping] Strength reduction

2012-06-20 Thread William J. Schmidt
On Wed, 2012-06-20 at 13:11 +0200, Richard Guenther wrote:
> On Thu, Jun 14, 2012 at 3:21 PM, William J. Schmidt
>  wrote:
> > Pro forma ping. :)
> 
> ;)
> 
> I notice (with all of these functions)
> 
> +unsigned
> +negate_cost (enum machine_mode mode, bool speed)
> +{
> +  static unsigned costs[NUM_MACHINE_MODES];
> +  rtx seq;
> +  unsigned cost;
> +
> +  if (costs[mode])
> +return costs[mode];
> +
> +  start_sequence ();
> +  force_operand (gen_rtx_fmt_e (NEG, mode,
> + gen_raw_REG (mode, LAST_VIRTUAL_REGISTER + 1)),
> +  NULL_RTX);
> +  seq = get_insns ();
> +  end_sequence ();
> +
> +  cost = seq_cost (seq, speed);
> +  if (!cost)
> +cost = 1;
> 
> that the cost[] array is independent on the speed argument.  Thus whatever
> comes first determines the cost.  Odd, and probably not good.  A fix
> would be appreciated (even for the current code ...) - simply make the
> array costs[NUM_MACHINE_MODES][2].
> 
> As for the renaming - can you name the functions consistently?  Thus
> the above would be negate_reg_cost?  And maybe rename the other
> FIXME function, too?

I agree with all this.  I'll prepare all the cost model changes as a
separate preliminaries patch.

> 
> Index: gcc/tree-ssa-strength-reduction.c
> ===
> --- gcc/tree-ssa-strength-reduction.c (revision 0)
> +++ gcc/tree-ssa-strength-reduction.c (revision 0)
> @@ -0,0 +1,1611 @@
> +/* Straight-line strength reduction.
> +   Copyright (C) 2012  Free Software Foundation, Inc.
> 
> I know we have these 'tree-ssa-' names, but really this is gimple-ssa now ;)
> So, please name it gimple-ssa-strength-reduction.c.

Will do.  Vive la revolution? ;)

> 
> +  /* Access to the statement for subsequent modification.  Cached to
> + save compile time.  */
> +  gimple_stmt_iterator cand_gsi;
> 
> this is a iterator for cand_stmt?  Then caching it is no longer necessary
> as the iterator is the stmt itself after recent infrastructure changes.

Oh yeah, I remember seeing that go by.  Nice.  Will change.

> 
> +/* Hash table embodying a mapping from statements to candidates.  */
> +static htab_t stmt_cand_map;
> ...
> +static hashval_t
> +stmt_cand_hash (const void *p)
> +{
> +  return htab_hash_pointer (((const_slsr_cand_t) p)->cand_stmt);
> +}
> 
> use a pointer-map instead.
> 
> +/* Callback to produce a hash value for a candidate chain header.  */
> +
> +static hashval_t
> +base_cand_hash (const void *p)
> +{
> +  tree ssa_name = ((const_cand_chain_t) p)->base_name;
> +
> +  if (TREE_CODE (ssa_name) != SSA_NAME)
> +return (hashval_t) 0;
> +
> +  return (hashval_t) SSA_NAME_VERSION (ssa_name);
> +}
> 
> does it ever happen that ssa_name is not an SSA_NAME?  

Not in this patch, but when I introduce CAND_REF in a later patch it
could happen since the base field of a CAND_REF is a MEM_REF.  It's a
safety valve in case of misuse.  I'll think about this some more.

> I'm not sure
> the memory savings over simply using a fixed-size (num_ssa_names)
> array indexed by SSA_NAME_VERSION pointing to the chain is worth
> using a hashtable for this?

That's reasonable.  I'll do that.

> 
> +  node = (cand_chain_t) pool_alloc (chain_pool);
> +  node->base_name = c->base_name;
> 
> If you never free pool entries it's more efficient to use an obstack.
> alloc-pool
> only pays off if you get freed item re-use.

OK.  I'll change both cand_pool and chain_pool to obstacks.

> 
> +  switch (gimple_assign_rhs_code (gs))
> +{
> +case MULT_EXPR:
> +  rhs2 = gimple_assign_rhs2 (gs);
> +
> +  if (TREE_CODE (rhs2) == INTEGER_CST)
> + return multiply_by_cost (TREE_INT_CST_LOW (rhs2), lhs_mode, speed);
> +
> +  if (TREE_CODE (rhs1) == INTEGER_CST)
> + return multiply_by_cost (TREE_INT_CST_LOW (rhs1), lhs_mode, speed);
> 
> In theory all commutative statements should have constant operands only
> at rhs2 ...

I'm glad I'm not the only one who thought that was the theory. ;)  I
wasn't sure, and I've seen violations of this come up in practice.
Should I assert when that happens instead, and track down the offending
optimizations?

> 
> Also you do not verify that the constant fits in a host-wide-int - but maybe
> you do not care?  Thus, I'd do
> 
>if (host_integerp (rhs2, 0))
>  return multiply_by_cost (TREE_INT_CST_LOW (rhs2), lhs_mode, speed);
> 
> or make multiply_by[_const?]_cost take a double-int instead.  Likewise below
> for add.

Ok.  Name change looks good also, I'll include that in the cost mode

Re: [Patch ping] Strength reduction

2012-06-20 Thread William J. Schmidt
On Wed, 2012-06-20 at 11:52 -0700, Richard Henderson wrote:
> On 06/20/2012 04:11 AM, Richard Guenther wrote:
> > I notice (with all of these functions)
> > 
> > +unsigned
> > +negate_cost (enum machine_mode mode, bool speed)
> > +{
> > +  static unsigned costs[NUM_MACHINE_MODES];
> > +  rtx seq;
> > +  unsigned cost;
> > +
> > +  if (costs[mode])
> > +return costs[mode];
> > +
> > +  start_sequence ();
> > +  force_operand (gen_rtx_fmt_e (NEG, mode,
> > +   gen_raw_REG (mode, LAST_VIRTUAL_REGISTER + 1)),
> > +NULL_RTX);
> 
> I don't suppose there's any way to share data with what init_expmed computes?
> 
> Not, strictly speaking, the cleanest thing to include expmed.h here, but 
> surely
> a tad better than re-computing identical data (and without the clever rtl
> garbage avoidance tricks).

Interesting.  I was building on what ivopts already has; not sure of the
history there.  It looks like there is some overlap in function, but
expmed doesn't have everything ivopts uses today (particularly the hash
table of costs for multiplies by various constants).  The stuff I need
for type promotion/demotion is also not present (which I'm computing on
demand for whatever mode pairs are encountered).  Not sure how great it
would be to precompute that for all pairs, and obviously precomputing
costs of multiplying by all constants isn't going to work.  So if the
two functionalities were to be combined, it would seem to require some
modification to how expmed works.

Thanks,
Bill
> 
> 
> r~
> 



Re: [PATCH] Add vector cost model density heuristic

2012-06-21 Thread William J. Schmidt
On Tue, 2012-06-19 at 16:20 +0200, Richard Guenther wrote:
> On Tue, 19 Jun 2012, William J. Schmidt wrote:
> 
> > On Tue, 2012-06-19 at 14:48 +0200, Richard Guenther wrote:
> > > On Tue, 19 Jun 2012, William J. Schmidt wrote:
> > > 
> > > > I remember having this discussion, and I was looking for it to check on
> > > > the details, but I can't seem to find it either in my inbox or in the
> > > > archives.  Can you please point me to that again?  Sorry for the bother.
> > > 
> > > It was in the "Correct cost model for strided loads" thread.
> > 
> > Ah, right, thanks.  I think it will be best to make that a separate
> > patch in the series.  Like so:
> > 
> > (1) Add calls to the new interface without disturbing existing logic;
> > modify the profitability algorithms to query the new model for inside
> > costs.  Default algorithm for the model is to just sum costs as is done
> > today.

Just FYI, this is not quite as straightforward as I thought.  There is
some code in tree-vect-data-refs.c that computes costs for various
peeling options and picks one of them.  In most other places we can just
pass the instructions to the back end at the same place that the costs
are currently calculated, but not here.  This will require some more
major surgery to save the instructions needed from each peeling option
and only pass along the ones that end up being chosen.

The upside is the same sort of "delayed emit" is needed for the SLP
ordering problem, so the infrastructure for this will be reusable for
that problem.

Grumble.

Bill

> > (1a) Split up the cost hooks (one for loads/stores with misalign parm,
> > one for vector_stmt with tree_code, etc.).
> > (x) Add heuristics to target models as desired.
> > (2) Handle the SLP ordering problem.
> > (3) Handle outside costs in the target model.
> > (4) Remove the now unnecessary cost fields and the calls that set them.
> > 
> > I'll start work on this series of patches as I have time between other
> > projects.
> 
> Thanks!
> Richard.
> 



[PATCH] Strength reduction preliminaries

2012-06-21 Thread William J. Schmidt
As promised, this breaks out the changes to the IVOPTS cost model and
the added function in double-int.c.  Please let me know if you would
rather see me attempt to consolidate the IVOPTS logic into expmed.c per
Richard H's suggestion.

I ran into a glitch with multiply_by_const_cost.  The original code
declared a static htab_t in the function and allocated it on demand.
When I tried adding a second one in the same manner, I ran into a
locking problem in the memory management library code during a call to
delete_htab.  The original implementation seemed a bit dicey to me
anyway, so I changed this to explicitly allocate and deallocate the hash
tables on (entry to/exit from) IVOPTS.

This reduces the scope of the hash table from a compilation unit to each
individual function.  If it's preferred to maintain compilation unit
scope, then the initialization/finalization of the htabs can be pushed
out to do_compile.  But I doubt it's worth that.
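
Concretely, the shape of the change is along these lines (a sketch with
simplified signatures; the actual code is in the patch):

/* mult_costs is the hash table of multiply-by-constant costs, keyed
   by the mbc_entry_hash/mbc_entry_eq callbacks from the ChangeLog.  */
static htab_t mult_costs;

void
tree_ssa_iv_optimize_init (void)
{
  /* ... existing initialization ...  */
  mult_costs = htab_create (100, mbc_entry_hash, mbc_entry_eq, free);
}

void
tree_ssa_iv_optimize_finalize (void)
{
  /* ... existing finalization ...  */
  htab_delete (mult_costs);
}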

Bootstrapped and tested on powerpc64-unknown-linux-gnu with no new
regressions.  Ok for trunk?

Thanks,
Bill


2012-06-21  Bill Schmidt  

* double-int.c (double_int_multiple_of): New function.
* double-int.h (double_int_multiple_of): New decl.
* tree-ssa-loop-ivopts.c (add_cost, zero_cost): Remove undefs.
(mbc_entry_hash): New forward decl.
(mbc_entry_eq): Likewise.
(zero_cost): Change to no_cost.
(mult_costs): New static var.
(tree_ssa_iv_optimize_init): Initialize mult_costs.
(add_cost): Change to add_regs_cost; distinguish costs by speed.
(multiply_regs_cost): New function.
(add_const_cost): Likewise.
(extend_or_trunc_reg_cost): Likewise.
(negate_reg_cost): Likewise.
(multiply_by_cost): Change to multiply_by_const_cost; distinguish
costs by speed.
(get_address_cost): Change add_cost to add_regs_cost; change
multiply_by_cost to multiply_by_const_cost.
(force_expr_to_var_cost): Change zero_cost to no_cost; change
add_cost to add_regs_cost; change multiply_by_cost to
multiply_by_const_cost.
(split_cost): Change zero_cost to no_cost.
(ptr_difference_cost): Likewise.
(difference_cost): Change zero_cost to no_cost; change multiply_by_cost
to multiply_by_const_cost.
(get_computation_cost_at): Change add_cost to add_regs_cost; change
multiply_by_cost to multiply_by_const_cost.
(determine_use_iv_cost_generic): Change zero_cost to no_cost.
(determine_iv_cost): Change add_cost to add_regs_cost.
(iv_ca_new): Change zero_cost to no_cost.
(tree_ssa_iv_optimize_finalize): Release storage for mult_costs.
* tree-ssa-address.c (most_expensive_mult_to_index): Change
multiply_by_cost to multiply_by_const_cost.
* tree-flow.h (multiply_by_cost): Change to multiply_by_const_cost.
(add_regs_cost): New decl.
(multiply_regs_cost): Likewise.
(add_const_cost): Likewise.
(extend_or_trunc_reg_cost): Likewise.
(negate_reg_cost): Likewise.


Index: gcc/double-int.c
===
--- gcc/double-int.c(revision 188839)
+++ gcc/double-int.c(working copy)
@@ -865,6 +865,26 @@ double_int_umod (double_int a, double_int b, unsig
   return double_int_mod (a, b, true, code);
 }
 
+/* Return TRUE iff PRODUCT is an integral multiple of FACTOR, and return
+   the multiple in *MULTIPLE.  Otherwise return FALSE and leave *MULTIPLE
+   unchanged.  */
+
+bool
+double_int_multiple_of (double_int product, double_int factor,
+   bool unsigned_p, double_int *multiple)
+{
+  double_int remainder;
+  double_int quotient = double_int_divmod (product, factor, unsigned_p,
+  TRUNC_DIV_EXPR, &remainder);
+  if (double_int_zero_p (remainder))
+{
+  *multiple = quotient;
+  return true;
+}
+
+  return false;
+}
+
 /* Set BITPOS bit in A.  */
 double_int
 double_int_setbit (double_int a, unsigned bitpos)
Index: gcc/double-int.h
===
--- gcc/double-int.h(revision 188839)
+++ gcc/double-int.h(working copy)
@@ -150,6 +150,8 @@ double_int double_int_divmod (double_int, double_i
 double_int double_int_sdivmod (double_int, double_int, unsigned, double_int *);
 double_int double_int_udivmod (double_int, double_int, unsigned, double_int *);
 
+bool double_int_multiple_of (double_int, double_int, bool, double_int *);
+
 double_int double_int_setbit (double_int, unsigned);
 int double_int_ctz (double_int);
 
Index: gcc/tree-ssa-loop-ivopts.c
===
--- gcc/tree-ssa-loop-ivopts.c  (revision 188839)
+++ gcc/tree-ssa-loop-ivopts.c  (working copy)
@@ -89,13 +89,11 @@ along with GCC; see the file COPYING3.  If not see
 #include "target.h"
 #include "tree-inline.h"
 #include "tree-

Re: [PATCH] Strength reduction preliminaries

2012-06-22 Thread William J. Schmidt
On Fri, 2012-06-22 at 10:44 +0200, Richard Guenther wrote:
> On Thu, 21 Jun 2012, William J. Schmidt wrote:
> 
> > As promised, this breaks out the changes to the IVOPTS cost model and
> > the added function in double-int.c.  Please let me know if you would
> > rather see me attempt to consolidate the IVOPTS logic into expmed.c per
> > Richard H's suggestion.
> 
> If we start to use it from multiple places that definitely makes sense,
> but you can move the stuff as a followup.

OK, I'll put it on my list.

> 
> > I ran into a glitch with multiply_by_const_cost.  The original code
> > declared a static htab_t in the function and allocated it on demand.
> > When I tried adding a second one in the same manner, I ran into a
> > locking problem in the memory management library code during a call to
> > delete_htab.  The original implementation seemed a bit dicey to me
> > anyway, so I changed this to explicitly allocate and deallocate the hash
> > tables on (entry to/exit from) IVOPTS.
> 
> Huh.  That's weird and should not happen.  Still it makes sense to
> move this to a per-function cache given that its size is basically
> unbound.
> 
> Can you introduce a initialize_costs () / finalize_costs () function
> pair that allocates / frees the tables and sets a global flag that
> you can then assert in the functions using those tables?

Ok.

> 
> > +  if (speed)
> > +speed = 1;
> 
> I suppose this is because bool is not bool when building with a
> C compiler?  It really looks weird and if such is necessary I'd
> prefer something like
> 
> > +add_regs_cost (enum machine_mode mode, bool speed)
> >  {
> > +  static unsigned costs[NUM_MACHINE_MODES][2];
> >rtx seq;
> >unsigned cost;
>  unsigned sidx = speed ? 0 : 1;
> >  
> > +  if (costs[mode][sidx])
> > +return costs[mode][sidx];
> > +
> 
> instead.

I'm always paranoid about misuse of bools in C, but I suppose this is
overkill.  I'll just remove the code.

Thanks,
Bill

> 
> Otherwise the patch is ok.
> 
> Thanks,
> Richard.
> 
> > This reduces the scope of the hash table from a compilation unit to each
> > individual function.  If it's preferred to maintain compilation unit
> > scope, then the initialization/finalization of the htabs can be pushed
> > out to do_compile.  But I doubt it's worth that.
> > 
> > Bootstrapped and tested on powerpc64-unknown-linux-gnu with no new
> > regressions.  Ok for trunk?
> > 
> > Thanks,
> > Bill
> > 
> > 
> > 2012-06-21  Bill Schmidt  
> > 
> > * double-int.c (double_int_multiple_of): New function.
> > * double-int.h (double_int_multiple_of): New decl.
> > * tree-ssa-loop-ivopts.c (add_cost, zero_cost): Remove undefs.
> > (mbc_entry_hash): New forward decl.
> > (mbc_entry_eq): Likewise.
> > (zero_cost): Change to no_cost.
> > (mult_costs): New static var.
> > (tree_ssa_iv_optimize_init): Initialize mult_costs.
> > (add_cost): Change to add_regs_cost; distinguish costs by speed.
> > (multiply_regs_cost): New function.
> > (add_const_cost): Likewise.
> > (extend_or_trunc_reg_cost): Likewise.
> > (negate_reg_cost): Likewise.
> > (multiply_by_cost): Change to multiply_by_const_cost; distinguish
> > costs by speed.
> > (get_address_cost): Change add_cost to add_regs_cost; change
> > multiply_by_cost to multiply_by_const_cost.
> > (force_expr_to_var_cost): Change zero_cost to no_cost; change
> > add_cost to add_regs_cost; change multiply_by_cost to
> > multiply_by_const_cost.
> > (split_cost): Change zero_cost to no_cost.
> > (ptr_difference_cost): Likewise.
> > (difference_cost): Change zero_cost to no_cost; change multiply_by_cost
> > to multiply_by_const_cost.
> > (get_computation_cost_at): Change add_cost to add_regs_cost; change
> > multiply_by_cost to multiply_by_const_cost.
> > (determine_use_iv_cost_generic): Change zero_cost to no_cost.
> > (determine_iv_cost): Change add_cost to add_regs_cost.
> > (iv_ca_new): Change zero_cost to no_cost.
> > (tree_ssa_iv_optimize_finalize): Release storage for mult_costs.
> > * tree-ssa-address.c (most_expensive_mult_to_index): Change
> > multiply_by_cost to multiply_by_const_cost.
> > * tree-flow.h (multiply_by_cost): Change to multiply_by_const_cost.
> > (add_regs_cost): New decl.
> > (multiply_regs_cost): Likewise.
> > (add_const_cost): Likewise.
> > (extend_or_

Re: [PATCH] Strength reduction preliminaries

2012-06-22 Thread William J. Schmidt
On Fri, 2012-06-22 at 10:44 +0200, Richard Guenther wrote:
> On Thu, 21 Jun 2012, William J. Schmidt wrote:

> > I ran into a glitch with multiply_by_const_cost.  The original code
> > declared a static htab_t in the function and allocated it on demand.
> > When I tried adding a second one in the same manner, I ran into a
> > locking problem in the memory management library code during a call to
> > delete_htab.  The original implementation seemed a bit dicey to me
> > anyway, so I changed this to explicitly allocate and deallocate the hash
> > tables on (entry to/exit from) IVOPTS.
> 
> Huh.  That's weird and should not happen.  Still it makes sense to
> move this to a per-function cache given that its size is basically
> unbound.
> 

Hm, this appears not to be related to my changes.  I ran into the same
issue when bootstrapping some other change without any of the IVOPTS
changes committed.  In both cases the stuck lock occurred when compiling
tree-vect-stmts.c.  I'll try to debug this when I get some time, unless
somebody else figures it out sooner.

Bill



[PATCH] Strength reduction

2012-06-25 Thread William J. Schmidt
Here's a new version of the main strength reduction patch, addressing
previous comments.  A couple of quick notes:

* I opened PR53773 and PR53774 for the cases where commutative
operations were encountered with a constant in rhs1.  This version of
the patch still has the gcc_asserts in place to catch those cases, but
I'll plan to remove those once the patch is approved.

 * You previously asked:

>>
>> +static slsr_cand_t
>> +base_cand_from_table (tree base_in)
>> +{
>> +  slsr_cand mapping_key;
>> +
>> +  gimple def = SSA_NAME_DEF_STMT (base_in);
>> +  if (!def)
>> +return (slsr_cand_t) NULL;
>> +
>> +  mapping_key.cand_stmt = def;
>> +  return (slsr_cand_t) htab_find (stmt_cand_map, &mapping_key);
>>
>> isn't that reachable via the base-name -> chain mapping for base_in?

I had to review this a bit, but the answer is no.  If you look at one of
the algebraic manipulations in create_mul_ssa_cand as an example,
base_in corresponds to Y.  base_cand_from_table is looking for a
candidate that has Y for its LHS.  The base-name -> chain mapping is
used to find all candidates that have B as the base_name.

 * I added a detailed explanation of what's going on with legal_cast_p.
Hopefully this will be easier to understand now.

I've bootstrapped this on powerpc64-unknown-linux-gnu with three new
regressions (for which I opened the two bug reports).  Ok for trunk
after removing the asserts?

Thanks,
Bill



gcc:

2012-06-25  Bill Schmidt  

* tree-pass.h (pass_strength_reduction): New decl.
* tree-ssa-loop-ivopts.c (initialize_costs): Make non-static.
(finalize_costs): Likewise.
* timevar.def (TV_TREE_SLSR): New timevar.
* gimple-ssa-strength-reduction.c: New.
* tree-flow.h (initialize_costs): New decl.
(finalize_costs): Likewise.
* Makefile.in (tree-ssa-strength-reduction.o): New dependencies.
* passes.c (init_optimization_passes): Add pass_strength_reduction.

gcc/testsuite:

2012-06-25  Bill Schmidt  

* gcc.dg/tree-ssa/slsr-1.c: New test.
* gcc.dg/tree-ssa/slsr-2.c: Likewise.
* gcc.dg/tree-ssa/slsr-3.c: Likewise.
* gcc.dg/tree-ssa/slsr-4.c: Likewise.



Index: gcc/tree-pass.h
===
--- gcc/tree-pass.h (revision 188890)
+++ gcc/tree-pass.h (working copy)
@@ -452,6 +452,7 @@ extern struct gimple_opt_pass pass_tm_memopt;
 extern struct gimple_opt_pass pass_tm_edges;
 extern struct gimple_opt_pass pass_split_functions;
 extern struct gimple_opt_pass pass_feedback_split_functions;
+extern struct gimple_opt_pass pass_strength_reduction;
 
 /* IPA Passes */
 extern struct simple_ipa_opt_pass pass_ipa_lower_emutls;
Index: gcc/testsuite/gcc.dg/tree-ssa/slsr-1.c
===
--- gcc/testsuite/gcc.dg/tree-ssa/slsr-1.c  (revision 0)
+++ gcc/testsuite/gcc.dg/tree-ssa/slsr-1.c  (revision 0)
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -fdump-tree-optimized" } */
+
+extern void foo (int);
+
+void
+f (int *p, unsigned int n)
+{
+  foo (*(p + n * 4));
+  foo (*(p + 32 + n * 4));
+  if (n > 3)
+foo (*(p + 16 + n * 4));
+  else
+foo (*(p + 48 + n * 4));
+}
+
+/* { dg-final { scan-tree-dump-times "\\+ 128" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "\\+ 64" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "\\+ 192" 1 "optimized" } } */
+/* { dg-final { cleanup-tree-dump "optimized" } } */
Index: gcc/testsuite/gcc.dg/tree-ssa/slsr-2.c
===
--- gcc/testsuite/gcc.dg/tree-ssa/slsr-2.c  (revision 0)
+++ gcc/testsuite/gcc.dg/tree-ssa/slsr-2.c  (revision 0)
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -fdump-tree-optimized" } */
+
+extern void foo (int);
+
+void
+f (int *p, int n)
+{
+  foo (*(p + n++ * 4));
+  foo (*(p + 32 + n++ * 4));
+  foo (*(p + 16 + n * 4));
+}
+
+/* { dg-final { scan-tree-dump-times "\\+ 144" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "\\+ 96" 1 "optimized" } } */
+/* { dg-final { cleanup-tree-dump "optimized" } } */
Index: gcc/testsuite/gcc.dg/tree-ssa/slsr-3.c
===
--- gcc/testsuite/gcc.dg/tree-ssa/slsr-3.c  (revision 0)
+++ gcc/testsuite/gcc.dg/tree-ssa/slsr-3.c  (revision 0)
@@ -0,0 +1,22 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -fdump-tree-optimized" } */
+
+int
+foo (int a[], int b[], int i)
+{
+  a[i] = b[i] + 2;
+  i++;
+  a[i] = b[i] + 2;
+  i++;
+  a[i] = b[i] + 2;
+  i++;
+  a[i] = b[i] + 2;
+  i++;
+  return i;
+}
+
+/* { dg-final { scan-tree-dump-times "\\* 4" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "\\+ 4" 2 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "\\+ 8" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "\\+ 12" 1 "optimized" } } */
/* { dg-final { cleanup-tree-dump "optimized" } } */

[PATCH] Fix PR46556 (straight-line strength reduction, part 2)

2012-06-28 Thread William J. Schmidt
Here's a relatively small piece of strength reduction that solves that
pesky addressing bug that got me looking at this in the first place...

The main part of the code is the stuff that was reviewed last year, but
which needed to find a good home.  So hopefully that's in pretty good
shape.  I recast base_cand_map as an htab again since I now need to look
up trees other than SSA names.  I plan to put together a follow-up patch
to change code and commentary references so that "base_name" becomes
"base_expr".  Doing that now would clutter up the patch too much.

Bootstrapped and tested on powerpc64-linux-gnu with no new regressions.
Ok for trunk?

Thanks,
Bill


gcc:

PR tree-optimization/46556
* gimple-ssa-strength-reduction.c (enum cand_kind): Add CAND_REF.
(base_cand_map): Change to hash table.
(base_cand_hash): New function.
(base_cand_free): Likewise.
(base_cand_eq): Likewise.
(lookup_cand): Change base_cand_map to hash table.
(find_basis_for_candidate): Likewise.
(base_cand_from_table): Exclude CAND_REF.
(restructure_reference): New function.
(slsr_process_ref): Likewise.
(find_candidates_in_block): Call slsr_process_ref.
(dump_candidate): Handle CAND_REF.
(base_cand_dump_callback): New function.
(dump_cand_chains): Change base_cand_map to hash table.
(replace_ref): New function.
(replace_refs): Likewise.
(analyze_candidates_and_replace): Call replace_refs.
(execute_strength_reduction): Change base_cand_map to hash table.

gcc/testsuite:

PR tree-optimization/46556
* testsuite/gcc.dg/tree-ssa/slsr-27.c: New.
* testsuite/gcc.dg/tree-ssa/slsr-28.c: New.
* testsuite/gcc.dg/tree-ssa/slsr-29.c: New.


Index: gcc/testsuite/gcc.dg/tree-ssa/slsr-27.c
===
--- gcc/testsuite/gcc.dg/tree-ssa/slsr-27.c (revision 0)
+++ gcc/testsuite/gcc.dg/tree-ssa/slsr-27.c (revision 0)
@@ -0,0 +1,22 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-dom2" } */
+
+struct x
+{
+  int a[16];
+  int b[16];
+  int c[16];
+};
+
+extern void foo (int, int, int);
+
+void
+f (struct x *p, unsigned int n)
+{
+  foo (p->a[n], p->c[n], p->b[n]);
+}
+
+/* { dg-final { scan-tree-dump-times "\\* 4;" 1 "dom2" } } */
+/* { dg-final { scan-tree-dump-times "p_\\d\+\\(D\\) \\+ D" 1 "dom2" } } */
+/* { dg-final { scan-tree-dump-times "MEM\\\[\\(struct x \\*\\)D" 3 "dom2" } } 
*/
+/* { dg-final { cleanup-tree-dump "dom2" } } */
Index: gcc/testsuite/gcc.dg/tree-ssa/slsr-28.c
===
--- gcc/testsuite/gcc.dg/tree-ssa/slsr-28.c (revision 0)
+++ gcc/testsuite/gcc.dg/tree-ssa/slsr-28.c (revision 0)
@@ -0,0 +1,26 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-dom2" } */
+
+struct x
+{
+  int a[16];
+  int b[16];
+  int c[16];
+};
+
+extern void foo (int, int, int);
+
+void
+f (struct x *p, unsigned int n)
+{
+  foo (p->a[n], p->c[n], p->b[n]);
+  if (n > 12)
+foo (p->a[n], p->c[n], p->b[n]);
+  else if (n > 3)
+foo (p->b[n], p->a[n], p->c[n]);
+}
+
+/* { dg-final { scan-tree-dump-times "\\* 4;" 1 "dom2" } } */
+/* { dg-final { scan-tree-dump-times "p_\\d\+\\(D\\) \\+ D" 1 "dom2" } } */
+/* { dg-final { scan-tree-dump-times "MEM\\\[\\(struct x \\*\\)D" 9 "dom2" } } 
*/
+/* { dg-final { cleanup-tree-dump "dom2" } } */
Index: gcc/testsuite/gcc.dg/tree-ssa/slsr-29.c
===
--- gcc/testsuite/gcc.dg/tree-ssa/slsr-29.c (revision 0)
+++ gcc/testsuite/gcc.dg/tree-ssa/slsr-29.c (revision 0)
@@ -0,0 +1,28 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-dom2" } */
+
+struct x
+{
+  int a[16];
+  int b[16];
+  int c[16];
+};
+
+extern void foo (int, int, int);
+
+void
+f (struct x *p, unsigned int n)
+{
+  foo (p->a[n], p->c[n], p->b[n]);
+  if (n > 3)
+{
+  foo (p->a[n], p->c[n], p->b[n]);
+  if (n > 12)
+   foo (p->b[n], p->a[n], p->c[n]);
+}
+}
+
+/* { dg-final { scan-tree-dump-times "\\* 4;" 1 "dom2" } } */
+/* { dg-final { scan-tree-dump-times "p_\\d\+\\(D\\) \\+ D" 1 "dom2" } } */
+/* { dg-final { scan-tree-dump-times "MEM\\\[\\(struct x \\*\\)D" 9 "dom2" } } 
*/
+/* { dg-final { cleanup-tree-dump "dom2" } } */
Index: gcc/gimple-ssa-strength-reduction.c
===
--- gcc/gimple-ssa-strength-reduction.c (revision 189025)
+++ gcc/gimple-ssa-strength-reduction.c (working copy)
@@ -32,7 +32,7 @@ along with GCC; see the file COPYING3.  If not see
2) Explicit multiplies, unknown constant multipliers,
   no conditional increments. (data gathering complete,
   replacements pending)
-   3) Implicit multiplies in addressing expressions. (pending)
+   3) Implicit multiplies in addressing expressions. (complete)
4) Explicit multiplies, conditional inc

Re: [PATCH, RFC] New target interface for vectorizer cost model

2012-07-03 Thread William J. Schmidt


On Tue, 2012-07-03 at 10:00 -0500, William J. Schmidt wrote:
> On Tue, 2012-07-03 at 15:59 +0200, Richard Guenther wrote:
> > On Tue, 3 Jul 2012, William J. Schmidt wrote:



> > > +@deftypefn {Target Hook} int TARGET_VECTORIZE_FINISH_COST (void *@var{})
> > > +This hook should complete calculations of the cost of vectorizing a loop
> > > +or basic block, and return that cost as an integer.  It should also release
> > > +any target-specific data structures allocated by TARGET_VECTORIZE_INIT_COST.
> > > +The default returns the value of the accumulator and releases it.
> > > +@end deftypefn
> > 
> > Should return unsigned int I think.
> 
> So did I, until I started running into all the existing places where
> costs are signed for no good reason; it quickly became pretty bothersome
> to be casting back and forth.  I'll look at it again.  Probably all the
> mess will eventually go away if we can get rid of the existing cost
> fields.

This isn't as bad as I thought, fixing.
> 
> > 
> > > +@deftypefn {Target Hook} void TARGET_VECTORIZE_DESTROY_COST_DATA (void *@var{})
> > > +This hook should release any target-specific data structures allocated by
> > > +TARGET_VECTORIZE_INIT_COST.  The default releases the accumulator.
> > > +@end deftypefn
> > > +
> > 
> > Any reason this is not unified into one?  finish also destroys the data,
> > so are you merely saving time in the not vectorized case?
> 
> This interface is for all the exceptional exit paths where we need to
> free the memory but don't care to do the calculations associated with
> finish_cost, which may become more expensive over time.  For cleanliness
> of interface we could remove releasing the data from finish_cost's
> responsibilities.

Changing so finish_cost doesn't release the data.



> > > +/* Opaque pointer to target-specific cost model data.  */
> > > +void *target_cost_data;
> > 
> > Put that into _loop_vec_info / _bb_vec_info?
> 
> I think it's problematic to do this.  If you look at record_stmt_cost,
> it needs to reference target_cost_data but currently has no way of
> knowing whether the vectorizer is processing a loop or a block.  So we'd
> need to pass the _{loop,bb}_vec_info around through a lot of interfaces
> (record_stmt_cost is called all over), or create some more global state.
> 
> The simplest thing would be for the target cost model to keep its state
> private so the rest of the vectorizer never sees target_cost_data.  This
> would give up on a certain level of "reentrancy" as we discussed, but
> I'm not convinced that's a practical issue.  Do we really need to allow
> the cost model to handle more than one loop and/or block at a time?  I'd
> rather just hide the state and simplify these issues.

Never mind all this.  I found that, since I have a stmt_info in
record_stmt_cost, this can be done without a lot of mess.  I'm working
on a revision that ties the lifetime of target_cost_data to that of the
_loop_vec_info or _bb_vec_info.  Much cleaner, thanks.
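
Roughly, the shape I have in mind (a sketch only; exact names could
change in the final patch):

/* The opaque cost-model data lives in the loop_vec_info and shares
   its lifetime.  */
struct _loop_vec_info
{
  /* ... existing fields ... */

  /* Data used by the target cost model.  */
  void *target_cost_data;
};

#define LOOP_VINFO_TARGET_COST_DATA(L) (L)->target_cost_data

/* In new_loop_vec_info:
     LOOP_VINFO_TARGET_COST_DATA (res) = init_cost (loop);
   In destroy_loop_vec_info:
     destroy_cost_data (LOOP_VINFO_TARGET_COST_DATA (loop_vinfo));  */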

Thanks,
Bill



Re: [PATCH, RFC] New target interface for vectorizer cost model

2012-07-04 Thread William J. Schmidt
On Wed, 2012-07-04 at 10:49 +0200, Richard Guenther wrote:
> On Tue, 3 Jul 2012, William J. Schmidt wrote:
> 
> > Hi Richard,
> > 
> > Here's a revision incorporating changes addressing your comments.  As
> > before it passes bootstrap and regression testing on powerpc64-linux-gnu
> > and compiles SPEC cpu2000 and cpu2006 with identical cost model
> > results.  
> > 
> > Before committing the patch I would remove the two gcc_asserts that
> > verify the cost models match.  I think a follow-up patch should then fix
> 
> Will you also remove the then "dead" code computing the old cost?  I think
> it's odd to have a state committed where we compute both but only
> use one ...
> 
> > the costs that appear to be incorrectly not counted by the old model (by
> > un-commenting-out the two chunks of code identified in the patch).  I'd
> > want to verify this doesn't cause any bad changes of behavior, since it
> > could result in fewer vectorized loops.
> > 
> > Ok for trunk?
> 
> ... so I'd say yes, ok for trunk, but please wait until you have
> figured out that the followup "fixing" the existing bugs by commenting
> out the code works and that another followup removing the old cost
> stuff works (which I am confident in that both will work).
> 
> Which means we'd eventually have a single commit doing these three
> things (or three adjacent commits).

OK, sounds like a plan.  Today's a holiday here, so I'll probably hold
off committing anything until after Cauldron.

Thanks,
Bill

> 
> Thanks,
> Richard.




[PATCH, committed] Fix PR53955

2012-07-13 Thread William J. Schmidt
Configure with --disable-build-poststage1-with-cxx exposed functions
that should have been marked static.  Bootstrapped on
powerpc-unknown-linux-gnu, committed as obvious.

Thanks,
Bill


2012-07-13  Bill Schmidt  

PR bootstrap/53955
* config/spu/spu.c (spu_init_cost): Mark static.
(spu_add_stmt_cost): Likewise.
(spu_finish_cost): Likewise.
(spu_destroy_cost_data): Likewise.
* config/i386/i386.c (ix86_init_cost): Mark static.
(ix86_add_stmt_cost): Likewise.
(ix86_finish_cost): Likewise.
(ix86_destroy_cost_data): Likewise.
* config/rs6000/rs6000.c (rs6000_init_cost): Mark static.
(rs6000_add_stmt_cost): Likewise.
(rs6000_finish_cost): Likewise.
(rs6000_destroy_cost_data): Likewise.


Index: gcc/config/spu/spu.c
===
--- gcc/config/spu/spu.c(revision 189460)
+++ gcc/config/spu/spu.c(working copy)
@@ -6919,7 +6919,7 @@ spu_builtin_vectorization_cost (enum vect_cost_for
 
 /* Implement targetm.vectorize.init_cost.  */
 
-void *
+static void *
 spu_init_cost (struct loop *loop_info ATTRIBUTE_UNUSED)
 {
   unsigned *cost = XNEW (unsigned);
@@ -6929,7 +6929,7 @@ spu_init_cost (struct loop *loop_info ATTRIBUTE_UN
 
 /* Implement targetm.vectorize.add_stmt_cost.  */
 
-unsigned
+static unsigned
 spu_add_stmt_cost (void *data, int count, enum vect_cost_for_stmt kind,
   struct _stmt_vec_info *stmt_info, int misalign)
 {
@@ -6956,7 +6956,7 @@ spu_add_stmt_cost (void *data, int count, enum vec
 
 /* Implement targetm.vectorize.finish_cost.  */
 
-unsigned
+static unsigned
 spu_finish_cost (void *data)
 {
   return *((unsigned *) data);
@@ -6964,7 +6964,7 @@ spu_finish_cost (void *data)
 
 /* Implement targetm.vectorize.destroy_cost_data.  */
 
-void
+static void
 spu_destroy_cost_data (void *data)
 {
   free (data);
Index: gcc/config/i386/i386.c
===
--- gcc/config/i386/i386.c  (revision 189460)
+++ gcc/config/i386/i386.c  (working copy)
@@ -40066,7 +40066,7 @@ ix86_autovectorize_vector_sizes (void)
 
 /* Implement targetm.vectorize.init_cost.  */
 
-void *
+static void *
 ix86_init_cost (struct loop *loop_info ATTRIBUTE_UNUSED)
 {
   unsigned *cost = XNEW (unsigned);
@@ -40076,7 +40076,7 @@ ix86_init_cost (struct loop *loop_info ATTRIBUTE_U
 
 /* Implement targetm.vectorize.add_stmt_cost.  */
 
-unsigned
+static unsigned
 ix86_add_stmt_cost (void *data, int count, enum vect_cost_for_stmt kind,
struct _stmt_vec_info *stmt_info, int misalign)
 {
@@ -40103,7 +40103,7 @@ ix86_add_stmt_cost (void *data, int count, enum ve
 
 /* Implement targetm.vectorize.finish_cost.  */
 
-unsigned
+static unsigned
 ix86_finish_cost (void *data)
 {
   return *((unsigned *) data);
@@ -40111,7 +40111,7 @@ ix86_finish_cost (void *data)
 
 /* Implement targetm.vectorize.destroy_cost_data.  */
 
-void
+static void
 ix86_destroy_cost_data (void *data)
 {
   free (data);
Index: gcc/config/rs6000/rs6000.c
===
--- gcc/config/rs6000/rs6000.c  (revision 189460)
+++ gcc/config/rs6000/rs6000.c  (working copy)
@@ -3522,7 +3522,7 @@ rs6000_preferred_simd_mode (enum machine_mode mode
 
 /* Implement targetm.vectorize.init_cost.  */
 
-void *
+static void *
 rs6000_init_cost (struct loop *loop_info ATTRIBUTE_UNUSED)
 {
   unsigned *cost = XNEW (unsigned);
@@ -3532,7 +3532,7 @@ rs6000_init_cost (struct loop *loop_info ATTRIBUTE
 
 /* Implement targetm.vectorize.add_stmt_cost.  */
 
-unsigned
+static unsigned
 rs6000_add_stmt_cost (void *data, int count, enum vect_cost_for_stmt kind,
  struct _stmt_vec_info *stmt_info, int misalign)
 {
@@ -3559,7 +3559,7 @@ rs6000_add_stmt_cost (void *data, int count, enum
 
 /* Implement targetm.vectorize.finish_cost.  */
 
-unsigned
+static unsigned
 rs6000_finish_cost (void *data)
 {
   return *((unsigned *) data);
@@ -3567,7 +3567,7 @@ rs6000_finish_cost (void *data)
 
 /* Implement targetm.vectorize.destroy_cost_data.  */
 
-void
+static void
 rs6000_destroy_cost_data (void *data)
 {
   free (data);




[PATCH] Enable vectorizer cost model by default at -O3

2012-07-15 Thread William J. Schmidt
The auto-vectorizer is overly aggressive when not constrained by the
vectorizer cost model.  Although the cost model is by no means perfect,
it does a reasonable job of avoiding many poor vectorization decisions.
Since the auto-vectorizer is enabled by default at -O3 and above, we
should also enable the vectorizer cost model by default at -O3 and
above.

Bootstrapped and tested on powerpc64-unknown-linux-gnu with no new
regressions.  Ok for trunk?

Thanks,
Bill


2012-07-15  Bill Schmidt  

* opts.c (default_option): Add -fvect-cost-model to default options
at -O3 and above.


Index: gcc/opts.c
===
--- gcc/opts.c  (revision 189481)
+++ gcc/opts.c  (working copy)
@@ -501,6 +501,7 @@ static const struct default_options default_option
 { OPT_LEVELS_3_PLUS, OPT_funswitch_loops, NULL, 1 },
 { OPT_LEVELS_3_PLUS, OPT_fgcse_after_reload, NULL, 1 },
 { OPT_LEVELS_3_PLUS, OPT_ftree_vectorize, NULL, 1 },
+{ OPT_LEVELS_3_PLUS, OPT_fvect_cost_model, NULL, 1 },
 { OPT_LEVELS_3_PLUS, OPT_fipa_cp_clone, NULL, 1 },
 { OPT_LEVELS_3_PLUS, OPT_ftree_partial_pre, NULL, 1 },
 




[PATCH] Add flag to control straight-line strength reduction

2012-07-17 Thread William J. Schmidt
I overlooked adding a pass-control flag for strength reduction, added
here.  I named it -ftree-slsr for consistency with other -ftree- flags,
but could change it to -fgimple-slsr if you prefer that for a pass named
gimple-ssa-...

Bootstrapped and tested on powerpc-unknown-linux-gnu with no new
regressions.  Ok for trunk?

Thanks,
Bill


2012-07-17  Bill Schmidt  

* opts.c (default_option): Make -ftree-slsr default at -O1 and above.
* gimple-ssa-strength-reduction.c (gate_strength_reduction): Use
flag_tree_slsr.
* common.opt: Add -ftree-slsr with flag_tree_slsr.


Index: gcc/opts.c
===
--- gcc/opts.c  (revision 189574)
+++ gcc/opts.c  (working copy)
@@ -452,6 +452,7 @@ static const struct default_options default_option
 { OPT_LEVELS_1_PLUS, OPT_ftree_ch, NULL, 1 },
 { OPT_LEVELS_1_PLUS, OPT_fcombine_stack_adjustments, NULL, 1 },
 { OPT_LEVELS_1_PLUS, OPT_fcompare_elim, NULL, 1 },
+{ OPT_LEVELS_1_PLUS, OPT_ftree_slsr, NULL, 1 },
 
 /* -O2 optimizations.  */
 { OPT_LEVELS_2_PLUS, OPT_finline_small_functions, NULL, 1 },
Index: gcc/gimple-ssa-strength-reduction.c
===
--- gcc/gimple-ssa-strength-reduction.c (revision 189574)
+++ gcc/gimple-ssa-strength-reduction.c (working copy)
@@ -1501,7 +1501,7 @@ execute_strength_reduction (void)
 static bool
 gate_strength_reduction (void)
 {
-  return optimize > 0;
+  return flag_tree_slsr;
 }
 
 struct gimple_opt_pass pass_strength_reduction =
Index: gcc/common.opt
===
--- gcc/common.opt  (revision 189574)
+++ gcc/common.opt  (working copy)
@@ -2080,6 +2080,10 @@ ftree-sink
 Common Report Var(flag_tree_sink) Optimization
 Enable SSA code sinking on trees
 
+ftree-slsr
+Common Report Var(flag_tree_slsr) Optimization
+Perform straight-line strength reduction
+
 ftree-sra
 Common Report Var(flag_tree_sra) Optimization
 Perform scalar replacement of aggregates




Re: [PATCH] Add flag to control straight-line strength reduction

2012-07-18 Thread William J. Schmidt
On Wed, 2012-07-18 at 11:01 +0200, Richard Guenther wrote:
> On Wed, 18 Jul 2012, Steven Bosscher wrote:
> 
> > On Wed, Jul 18, 2012 at 9:59 AM, Richard Guenther  wrote:
> > > On Tue, 17 Jul 2012, William J. Schmidt wrote:
> > >
> > >> I overlooked adding a pass-control flag for strength reduction, added
> > >> here.  I named it -ftree-slsr for consistency with other -ftree- flags,
> > >> but could change it to -fgimple-slsr if you prefer that for a pass named
> > >> gimple-ssa-...
> > >>
> > >> Bootstrapped and tested on powerpc-unknown-linux-gnu with no new
> > >> regressions.  Ok for trunk?
> > >
> > > The switch needs documentation in doc/invoke.texi.  Other than that
> > > it's fine to stick with -ftree-..., even that exposes details to our
> > > users that are not necessary (RTL passes didn't have -frtl-... either).
> > > So in the end, why not re-use -fstrength-reduce that is already available
> > > (but stubbed out)?
> > 
> > In the past, -fstrength-reduce applied to loop strength reduction in
> > loop.c. I don't think it should be re-used for a completely different
> > code transformation.
> 
> Ok.  I suppose -ftree-slsr is ok then.

It turns out I was looking at a very old copy of the manual, and the
-ftree... stuff is not as prevalent now as it once was.  I'll just go
with -fslsr to be consistent with -fgcse, -fipa-sra, etc.

Thanks for the pointer to doc/invoke.texi -- it appears I also failed to
document -fhoist-adjacent-loads, so I will go ahead and do that as well.

Thanks!
Bill

> 
> Thanks,
> Richard.
> 




Re: [PATCH] Add flag to control straight-line strength reduction

2012-07-18 Thread William J. Schmidt
On Wed, 2012-07-18 at 08:24 -0500, William J. Schmidt wrote:
> On Wed, 2012-07-18 at 11:01 +0200, Richard Guenther wrote:
> > On Wed, 18 Jul 2012, Steven Bosscher wrote:
> > 
> > > On Wed, Jul 18, 2012 at 9:59 AM, Richard Guenther  
> > > wrote:
> > > > On Tue, 17 Jul 2012, William J. Schmidt wrote:
> > > >
> > > >> I overlooked adding a pass-control flag for strength reduction, added
> > > >> here.  I named it -ftree-slsr for consistency with other -ftree- flags,
> > > >> but could change it to -fgimple-slsr if you prefer that for a pass 
> > > >> named
> > > >> gimple-ssa-...
> > > >>
> > > >> Bootstrapped and tested on powerpc-unknown-linux-gnu with no new
> > > >> regressions.  Ok for trunk?
> > > >
> > > > The switch needs documentation in doc/invoke.texi.  Other than that
> > > > it's fine to stick with -ftree-..., even that exposes details to our
> > > > users that are not necessary (RTL passes didn't have -frtl-... either).
> > > > So in the end, why not re-use -fstrength-reduce that is already 
> > > > available
> > > > (but stubbed out)?
> > > 
> > > In the past, -fstrength-reduce applied to loop strength reduction in
> > > loop.c. I don't think it should be re-used for a completely different
> > > code transformation.
> > 
> > Ok.  I suppose -ftree-slsr is ok then.
> 
> It turns out I was looking at a very old copy of the manual, and the
> -ftree... stuff is not as prevalent now as it once was.  I'll just go
> with -fslsr to be consistent with -fgcse, -fipa-sra, etc.

Well, posted too fast.  Paging down I see that isn't true, sorry.  I'll
use the tree- for consistency even though it is useless information.

Thanks,
Bill

> 
> Thanks for the pointer to doc/invoke.texi -- it appears I also failed to
> document -fhoist-adjacent-loads, so I will go ahead and do that as well.
> 
> Thanks!
> Bill
> 
> > 
> > Thanks,
> > Richard.
> > 
> 




Re: [PATCH] Add flag to control straight-line strength reduction

2012-07-18 Thread William J. Schmidt
Here's the patch with documentation changes included.  I also took care
of work missing from a couple of my previous patches, so
-fhoist-adjacent-loads is documented now, and -fvect-cost-model is added
to the list of options on by default at -O3.

Ok for trunk?

Thanks,
Bill


2012-07-18  Bill Schmidt  

* doc/invoke.texi: Add -fhoist-adjacent-loads and -ftree-slsr to list
of flags controlling optimization; add -ftree-slsr to list of flags
enabled by default at -O; add -fhoist-adjacent-loads to list of flags
enabled by default at -O2; add -fvect-cost-model to list of flags
enabled by default at -O3; document -fhoist-adjacent-loads and
-ftree-slsr.
* opts.c (default_option): Make -ftree-slsr default at -O1 and above.
* gimple-ssa-strength-reduction.c (gate_strength_reduction): Use
flag_tree_slsr.
* common.opt: Add -ftree-slsr with flag_tree_slsr.


Index: gcc/doc/invoke.texi
===
--- gcc/doc/invoke.texi (revision 189574)
+++ gcc/doc/invoke.texi (working copy)
@@ -364,7 +364,8 @@ Objective-C and Objective-C++ Dialects}.
 -ffast-math -ffinite-math-only -ffloat-store -fexcess-precision=@var{style} @gol
 -fforward-propagate -ffp-contract=@var{style} -ffunction-sections @gol
 -fgcse -fgcse-after-reload -fgcse-las -fgcse-lm -fgraphite-identity @gol
--fgcse-sm -fif-conversion -fif-conversion2 -findirect-inlining @gol
+-fgcse-sm -fhoist-adjacent-loads -fif-conversion @gol
+-fif-conversion2 -findirect-inlining @gol
 -finline-functions -finline-functions-called-once -finline-limit=@var{n} @gol
 -finline-small-functions -fipa-cp -fipa-cp-clone -fipa-matrix-reorg @gol
 -fipa-pta -fipa-profile -fipa-pure-const -fipa-reference @gol
@@ -413,8 +414,8 @@ Objective-C and Objective-C++ Dialects}.
 -ftree-phiprop -ftree-loop-distribution -ftree-loop-distribute-patterns @gol
 -ftree-loop-ivcanon -ftree-loop-linear -ftree-loop-optimize @gol
 -ftree-parallelize-loops=@var{n} -ftree-pre -ftree-partial-pre -ftree-pta @gol
--ftree-reassoc @gol
--ftree-sink -ftree-sra -ftree-switch-conversion -ftree-tail-merge @gol
+-ftree-reassoc -ftree-sink -ftree-slsr -ftree-sra @gol
+-ftree-switch-conversion -ftree-tail-merge @gol
 -ftree-ter -ftree-vect-loop-version -ftree-vectorize -ftree-vrp @gol
 -funit-at-a-time -funroll-all-loops -funroll-loops @gol
 -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops @gol
@@ -6259,6 +6260,7 @@ compilation time.
 -ftree-forwprop @gol
 -ftree-fre @gol
 -ftree-phiprop @gol
+-ftree-slsr @gol
 -ftree-sra @gol
 -ftree-pta @gol
 -ftree-ter @gol
@@ -6286,6 +6288,7 @@ also turns on the following optimization flags:
 -fdevirtualize @gol
 -fexpensive-optimizations @gol
 -fgcse  -fgcse-lm  @gol
+-fhoist-adjacent-loads @gol
 -finline-small-functions @gol
 -findirect-inlining @gol
 -fipa-sra @gol
@@ -6311,6 +6314,7 @@ Optimize yet more.  @option{-O3} turns on all opti
 by @option{-O2} and also turns on the @option{-finline-functions},
 @option{-funswitch-loops}, @option{-fpredictive-commoning},
 @option{-fgcse-after-reload}, @option{-ftree-vectorize},
+@option{-fvect-cost-model},
 @option{-ftree-partial-pre} and @option{-fipa-cp-clone} options.
 
 @item -O0
@@ -7129,6 +7133,13 @@ This flag is enabled by default at @option{-O} and
 Perform hoisting of loads from conditional pointers on trees.  This
 pass is enabled by default at @option{-O} and higher.
 
+@item -fhoist-adjacent-loads
+@opindex hoist-adjacent-loads
+Speculatively hoist loads from both branches of an if-then-else if the
+loads are from adjacent locations in the same structure and the target
+architecture has a conditional move instruction.  This flag is enabled
+by default at @option{-O2} and higher.
+
 @item -ftree-copy-prop
 @opindex ftree-copy-prop
 Perform copy propagation on trees.  This pass eliminates unnecessary
@@ -7529,6 +7540,13 @@ defining expression.  This results in non-GIMPLE c
 much more complex trees to work on resulting in better RTL generation.  This is
 enabled by default at @option{-O} and higher.
 
+@item -ftree-slsr
+@opindex ftree-slsr
+Perform straight-line strength reduction on trees.  This recognizes related
+expressions involving multiplications and replaces them by less expensive
+calculations when possible.  This is enabled by default at @option{-O} and
+higher.
+
 @item -ftree-vectorize
 @opindex ftree-vectorize
 Perform loop vectorization on trees. This flag is enabled by default at
@@ -7550,7 +7568,8 @@ except at level @option{-Os} where it is disabled.
 
 @item -fvect-cost-model
 @opindex fvect-cost-model
-Enable cost model for vectorization.
+Enable cost model for vectorization.  This option is enabled by default at
+@option{-O3}.
 
 @item -ftree-vrp
 @opindex ftree-vrp
Index: gcc/opts.c
===
--- gcc/opts.c  (revision 189574)
+++ gcc/opts.c  (working copy)
@@ -452,6 +452,7 @@ static const struct defa

Ping: [PATCH] Fix PR46556 (straight-line strength reduction, part 2)

2012-07-22 Thread William J. Schmidt
Ping...

On Thu, 2012-06-28 at 16:45 -0500, William J. Schmidt wrote:
> Here's a relatively small piece of strength reduction that solves that
> pesky addressing bug that got me looking at this in the first place...
> 
> The main part of the code is the stuff that was reviewed last year, but
> which needed to find a good home.  So hopefully that's in pretty good
> shape.  I recast base_cand_map as an htab again since I now need to look
> up trees other than SSA names.  I plan to put together a follow-up patch
> to change code and commentary references so that "base_name" becomes
> "base_expr".  Doing that now would clutter up the patch too much.
> 
> Bootstrapped and tested on powerpc64-linux-gnu with no new regressions.
> Ok for trunk?
> 
> Thanks,
> Bill
> 
> 
> gcc:
> 
>   PR tree-optimization/46556
>   * gimple-ssa-strength-reduction.c (enum cand_kind): Add CAND_REF.
>   (base_cand_map): Change to hash table.
>   (base_cand_hash): New function.
>   (base_cand_free): Likewise.
>   (base_cand_eq): Likewise.
>   (lookup_cand): Change base_cand_map to hash table.
>   (find_basis_for_candidate): Likewise.
>   (base_cand_from_table): Exclude CAND_REF.
>   (restructure_reference): New function.
>   (slsr_process_ref): Likewise.
>   (find_candidates_in_block): Call slsr_process_ref.
>   (dump_candidate): Handle CAND_REF.
>   (base_cand_dump_callback): New function.
>   (dump_cand_chains): Change base_cand_map to hash table.
>   (replace_ref): New function.
>   (replace_refs): Likewise.
>   (analyze_candidates_and_replace): Call replace_refs.
>   (execute_strength_reduction): Change base_cand_map to hash table.
> 
> gcc/testsuite:
> 
>   PR tree-optimization/46556
>   * gcc.dg/tree-ssa/slsr-27.c: New.
>   * gcc.dg/tree-ssa/slsr-28.c: New.
>   * gcc.dg/tree-ssa/slsr-29.c: New.
> 
> 
> Index: gcc/testsuite/gcc.dg/tree-ssa/slsr-27.c
> ===
> --- gcc/testsuite/gcc.dg/tree-ssa/slsr-27.c   (revision 0)
> +++ gcc/testsuite/gcc.dg/tree-ssa/slsr-27.c   (revision 0)
> @@ -0,0 +1,22 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-dom2" } */
> +
> +struct x
> +{
> +  int a[16];
> +  int b[16];
> +  int c[16];
> +};
> +
> +extern void foo (int, int, int);
> +
> +void
> +f (struct x *p, unsigned int n)
> +{
> +  foo (p->a[n], p->c[n], p->b[n]);
> +}
> +
> +/* { dg-final { scan-tree-dump-times "\\* 4;" 1 "dom2" } } */
> +/* { dg-final { scan-tree-dump-times "p_\\d\+\\(D\\) \\+ D" 1 "dom2" } } */
> +/* { dg-final { scan-tree-dump-times "MEM\\\[\\(struct x \\*\\)D" 3 "dom2" } 
> } */
> +/* { dg-final { cleanup-tree-dump "dom2" } } */
> Index: gcc/testsuite/gcc.dg/tree-ssa/slsr-28.c
> ===
> --- gcc/testsuite/gcc.dg/tree-ssa/slsr-28.c   (revision 0)
> +++ gcc/testsuite/gcc.dg/tree-ssa/slsr-28.c   (revision 0)
> @@ -0,0 +1,26 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-dom2" } */
> +
> +struct x
> +{
> +  int a[16];
> +  int b[16];
> +  int c[16];
> +};
> +
> +extern void foo (int, int, int);
> +
> +void
> +f (struct x *p, unsigned int n)
> +{
> +  foo (p->a[n], p->c[n], p->b[n]);
> +  if (n > 12)
> +foo (p->a[n], p->c[n], p->b[n]);
> +  else if (n > 3)
> +foo (p->b[n], p->a[n], p->c[n]);
> +}
> +
> +/* { dg-final { scan-tree-dump-times "\\* 4;" 1 "dom2" } } */
> +/* { dg-final { scan-tree-dump-times "p_\\d\+\\(D\\) \\+ D" 1 "dom2" } } */
> +/* { dg-final { scan-tree-dump-times "MEM\\\[\\(struct x \\*\\)D" 9 "dom2" } 
> } */
> +/* { dg-final { cleanup-tree-dump "dom2" } } */
> Index: gcc/testsuite/gcc.dg/tree-ssa/slsr-29.c
> ===
> --- gcc/testsuite/gcc.dg/tree-ssa/slsr-29.c   (revision 0)
> +++ gcc/testsuite/gcc.dg/tree-ssa/slsr-29.c   (revision 0)
> @@ -0,0 +1,28 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-dom2" } */
> +
> +struct x
> +{
> +  int a[16];
> +  int b[16];
> +  int c[16];
> +};
> +
> +extern void foo (int, int, int);
> +
> +void
> +f (struct x *p, unsigned int n)
> +{
> +  foo (p->a[n], p->c[n], p->b[n]);
> +  if (n > 3)
> +{
> +  foo (p->a[n], p->c[n], p->b[n]);
> +  if (n >

Re: [PATCH] Vectorizer cost model outside-cost changes

2012-07-24 Thread William J. Schmidt
On Tue, 2012-07-24 at 10:57 +0200, Richard Guenther wrote:
> On Mon, 23 Jul 2012, William J. Schmidt wrote:
> 
> > This patch completes the conversion of the vectorizer cost model to use
> > target hooks for recording vectorization information and calculating
> > costs.  Previous work handled the costs inside the loop body or basic
> > block being vectorized.  This patch similarly converts the prologue and
> > epilogue costs.
> > 
> > As before, I first verified that the new model provides the same results
> > as the old model on the regression testsuite and on SPEC CPU2006.  I
> > then removed the old model, rather than submitting an intermediate patch
> > with both present.  I have a patch that shows both if it's needed for
> > reference.
> > 
> > Also as before, I found an error in the old cost model wherein prologue
> > costs of phi reduction statements were not being considered during the
> > final vectorization decision.  I have fixed this in the new model; thus,
> > this version of the cost model will be slightly more conservative than
> > the original.  I am currently running SPEC tests to ensure there aren't
> > any resulting degradations.
> > 
> > One thing that could be done in future for further cleanup would be to
> > handle the scalar iteration cost in a similar manner.  Right now this is
> > dealt with by recording N scalar_stmts, where N is the length of the
> > scalar iteration; as with the old model, there is no attempt to
> > differentiate between different scalar statements.  This results in some
> > hackish stuff in, e.g., tree-vect-stmts.c:record_stmt_cost (), where we
> > have to deal with the fact that we may not have a stmt_info for the
> > statement being recorded.  This is only true for these aggregated
> > scalar_stmt costs.
> > 
> > Bootstrapped and tested on powerpc-unknown-linux-gnu with no new
> > regressions.  Assuming the SPEC performance tests come out ok, is this
> > ok for trunk?
> 
> So all costs we query from the backend even for the prologue/epilogue
> are costs for vector stmts (like inits of invariant vectors or
> outer-loop parts in outer loop vectorization)?

Yes, with the exception of copies of scalar iterations introduced by
loop peeling (the N * scalar_stmt business).

There are comments in several places indicating opportunities for
improvement in the modeling, including for the outer-loop case, but for
now your statement holds otherwise.

Thanks,
Bill

> 
> Ok in that case.
> 
> Thanks,
> Richard.
> 
> > Thanks!
> > Bill
> > 




[PATCH] Change IVOPTS and strength reduction to use expmed cost model

2012-07-25 Thread William J. Schmidt
Per Richard Henderson's suggestion
(http://gcc.gnu.org/ml/gcc-patches/2012-06/msg01370.html), this patch
changes the IVOPTS and straight-line strength reduction passes to make
use of data computed by init_expmed.  This required adding a new
convert_cost array in expmed to store the costs of converting between
various scalar integer modes, and exposing expmed's multiplication hash
table for external use (new function mult_by_coeff_cost).  Richard H,
I'd appreciate it if you could look at what I did there and make sure
it's correct.  Thanks!

I decided it wasn't worth distinguishing between reg-reg add costs and
reg-constant add costs, so I simplified the strength reduction
calculations rather than adding another array to expmed for this
purpose.  But I can make this distinction if that's preferable.
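
To make the new interfaces concrete, here is roughly how a pass can
price an expression like "(long)(i * 4) + base" with the expmed data.
This is an illustrative sketch only, spelling the accessors the way
this version of the patch defines them (array lookups for add_cost and
convert_cost); the real uses are in stmt_cost and
get_computation_cost_at:

/* Sketch: cost of "(long) (i * 4) + base", where FROM_MODE is the
   mode of i (say SImode), TO_MODE is the mode of base (say DImode),
   and SPEED selects optimize-for-speed vs. -size costs.  */
static int
widened_mul_add_cost (enum machine_mode from_mode,
		      enum machine_mode to_mode, bool speed)
{
  int cost = mult_by_coeff_cost (4, from_mode, speed); /* i * 4  */
  cost += convert_cost[speed][to_mode][from_mode];     /* widen   */
  cost += add_cost[speed][to_mode];                    /* + base  */
  return cost;
}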

Bootstrapped and tested on powerpc64-unknown-linux-gnu with no new
regressions.  Ok for trunk?

Thanks,
Bill


2012-07-25  Bill Schmidt  

* tree-ssa-loop-ivopts.c (mbc_entry_hash): Remove.
(mbc_entry_eq): Likewise.
(mult_costs): Likewise.
(cost_tables_exist): Likewise.
(initialize_costs): Likewise.
(finalize_costs): Likewise.
(tree_ssa_iv_optimize_init): Remove call to initialize_costs.
(add_regs_cost): Remove.
(multiply_regs_cost): Likewise.
(add_const_cost): Likewise.
(extend_or_trunc_reg_cost): Likewise.
(negate_reg_cost): Likewise.
(struct mbc_entry): Likewise.
(multiply_by_const_cost): Likewise.
(get_address_cost): Change add_regs_cost calls to add_cost lookups;
change multiply_by_const_cost to mult_by_coeff_cost.
(force_expr_to_var_cost): Likewise.
(difference_cost): Change multiply_by_const_cost to mult_by_coeff_cost.
(get_computation_cost_at): Change add_regs_cost calls to add_cost
lookups; change multiply_by_const_cost to mult_by_coeff_cost.
(determine_iv_cost): Change add_regs_cost calls to add_cost lookups.
(tree_ssa_iv_optimize_finalize): Remove call to finalize_costs.
* tree-ssa-address.c (expmed.h): New #include.
(most_expensive_mult_to_index): Change multiply_by_const_cost to
mult_by_coeff_cost.
* gimple-ssa-strength-reduction.c (expmed.h): New #include.
(stmt_cost): Change to use mult_by_coeff_cost, mul_cost, add_cost,
neg_cost, and convert_cost instead of IVOPTS interfaces.
(execute_strength_reduction): Remove calls to initialize_costs and
finalize_costs.
* expmed.c (struct init_expmed_rtl): Add convert rtx_def.
(init_expmed_one_mode): Initialize convert rtx_def; initialize
convert_cost for related modes.
(mult_by_coeff_cost): New function.
* expmed.h (struct target_expmed): Add x_convert_cost matrix.
(convert_cost): New #define.
(mult_by_coeff_cost): New extern decl.
* tree-flow.h (initialize_costs): Remove decl.
(finalize_costs): Likewise.
(multiply_by_const_cost): Likewise.
(add_regs_cost): Likewise.
(multiply_regs_cost): Likewise.
(add_const_cost): Likewise.
(extend_or_trunc_reg_cost): Likewise.
(negate_reg_cost): Likewise.


Index: gcc/tree-ssa-loop-ivopts.c
===
--- gcc/tree-ssa-loop-ivopts.c  (revision 189845)
+++ gcc/tree-ssa-loop-ivopts.c  (working copy)
@@ -88,9 +88,6 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-ssa-propagate.h"
 #include "expmed.h"
 
-static hashval_t mbc_entry_hash (const void *);
-static int mbc_entry_eq (const void*, const void *);
-
 /* FIXME: Expressions are expanded to RTL in this pass to determine the
cost of different addressing modes.  This should be moved to a TBD
interface between the GIMPLE and RTL worlds.  */
@@ -381,11 +378,6 @@ struct iv_ca_delta
 
 static VEC(tree,heap) *decl_rtl_to_reset;
 
-/* Cached costs for multiplies by constants, and a flag to indicate
-   when they're valid.  */
-static htab_t mult_costs[2];
-static bool cost_tables_exist = false;
-
 static comp_cost force_expr_to_var_cost (tree, bool);
 
 /* Number of uses recorded in DATA.  */
@@ -851,26 +843,6 @@ htab_inv_expr_hash (const void *ent)
   return expr->hash;
 }
 
-/* Allocate data structures for the cost model.  */
-
-void
-initialize_costs (void)
-{
-  mult_costs[0] = htab_create (100, mbc_entry_hash, mbc_entry_eq, free);
-  mult_costs[1] = htab_create (100, mbc_entry_hash, mbc_entry_eq, free);
-  cost_tables_exist = true;
-}
-
-/* Release data structures for the cost model.  */
-
-void
-finalize_costs (void)
-{
-  cost_tables_exist = false;
-  htab_delete (mult_costs[0]);
-  htab_delete (mult_costs[1]);
-}
-
 /* Initializes data structures used by the iv optimization pass, stored
in DATA.  */
 
@@ -889,8 +861,6 @@ tree_ssa_iv_optimize_init (struct ivopts_data *dat
 htab_inv_expr_eq, free)

Re: [PATCH] Change IVOPTS and strength reduction to use expmed cost model

2012-07-25 Thread William J. Schmidt
On Wed, 2012-07-25 at 09:59 -0700, Richard Henderson wrote:
> On 07/25/2012 09:13 AM, William J. Schmidt wrote:
> > Per Richard Henderson's suggestion
> > (http://gcc.gnu.org/ml/gcc-patches/2012-06/msg01370.html), this patch
> > changes the IVOPTS and straight-line strength reduction passes to make
> > use of data computed by init_expmed.  This required adding a new
> > convert_cost array in expmed to store the costs of converting between
> > various scalar integer modes, and exposing expmed's multiplication hash
> > table for external use (new function mult_by_coeff_cost).  Richard H,
> > I'd appreciate it if you could look at what I did there and make sure
> > it's correct.  Thanks!
> 
> Correctness looks good.
> 
> > I decided it wasn't worth distinguishing between reg-reg add costs and
> > reg-constant add costs, so I simplified the strength reduction
> > calculations rather than adding another array to expmed for this
> > purpose.  But I can make this distinction if that's preferable.
> 
> I don't think this is worth thinking about at this level.  This is
> something that some rtl-level optimization ought to be able to fix
> up trivially, e.g. cse.
> 
> > Index: gcc/expmed.h
> > ===
> > --- gcc/expmed.h(revision 189845)
> > +++ gcc/expmed.h(working copy)
> > @@ -155,6 +155,11 @@ struct target_expmed {
> >int x_udiv_cost[2][NUM_MACHINE_MODES];
> >int x_mul_widen_cost[2][NUM_MACHINE_MODES];
> >int x_mul_highpart_cost[2][NUM_MACHINE_MODES];
> > +
> > +  /* Conversion costs are only defined between two scalar integer modes
> > + of different sizes.  The first machine mode is the destination mode,
> > + and the second is the source mode.  */
> > +  int x_convert_cost[2][NUM_MACHINE_MODES][NUM_MACHINE_MODES];
> >  };
> 
> 2 * NUM_MACHINE_MODES is quite large...  I think we could do better with
> 
> #define NUM_MODE_INT (MAX_MODE_INT - MIN_MODE_INT + 1)
> 
>   x_convert_cost[2][NUM_MODE_INT][NUM_MODE_INT];
> 
> though really that could be done with all of these fields all at once.
> 
> That does suggest it would be better to leave at least inline functions
> to access these elements, rather than open code the array access.
> 
> 
> r~
> 

Thanks for the quick review!  Excellent point about the array size.  The
attached revised patch follows your suggestion to limit the size.

I only did this for the new field, as changing all the existing
accessors to inline functions is more effort than I have time for right
now.  This is left as an exercise for the reader. ;)

Bootstrapped and tested on powerpc64-unknown-linux-gnu with no new
failures.  Is this ok?

Thanks,
Bill


2012-07-25  Bill Schmidt  

* tree-ssa-loop-ivopts.c (mbc_entry_hash): Remove.
(mbc_entry_eq): Likewise.
(mult_costs): Likewise.
(cost_tables_exist): Likewise.
(initialize_costs): Likewise.
(finalize_costs): Likewise.
(tree_ssa_iv_optimize_init): Remove call to initialize_costs.
(add_regs_cost): Remove.
(multiply_regs_cost): Likewise.
(add_const_cost): Likewise.
(extend_or_trunc_reg_cost): Likewise.
(negate_reg_cost): Likewise.
(struct mbc_entry): Likewise.
(multiply_by_const_cost): Likewise.
(get_address_cost): Change add_regs_cost calls to add_cost lookups;
change multiply_by_const_cost to mult_by_coeff_cost.
(force_expr_to_var_cost): Likewise.
(difference_cost): Change multiply_by_const_cost to mult_by_coeff_cost.
(get_computation_cost_at): Change add_regs_cost calls to add_cost
lookups; change multiply_by_const_cost to mult_by_coeff_cost.
(determine_iv_cost): Change add_regs_cost calls to add_cost lookups.
(tree_ssa_iv_optimize_finalize): Remove call to finalize_costs.
* tree-ssa-address.c (expmed.h): New #include.
(most_expensive_mult_to_index): Change multiply_by_const_cost to
mult_by_coeff_cost.
* gimple-ssa-strength-reduction.c (expmed.h): New #include.
(stmt_cost): Change to use mult_by_coeff_cost, mul_cost, add_cost,
neg_cost, and convert_cost instead of IVOPTS interfaces.
(execute_strength_reduction): Remove calls to initialize_costs and
finalize_costs.
* expmed.c (struct init_expmed_rtl): Add convert rtx_def.
(init_expmed_one_mode): Initialize convert rtx_def; initialize
x_convert_cost for related modes.
(mult_by_coeff_cost): New function.
* expmed.h (NUM_MODE_INT): New #define.
(struct target_expmed): Add x_convert_cost matrix.
(set_convert_cost): New inline function.
(convert_cos

Re: [PING] Re: [RFC, ivopts] fix bugs in ivopts address cost computation

2012-07-25 Thread William J. Schmidt
On Wed, 2012-07-25 at 13:39 -0600, Sandra Loosemore wrote:
> On 07/17/2012 05:22 AM, Richard Guenther wrote:
> > On Wed, Jul 4, 2012 at 6:35 PM, Sandra Loosemore
> >   wrote:
> >>
> >> Ping?  Original post with patch is here:
> >>
> >> http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00319.html
> >
> > Can you update the patch and numbers based on what Bill did for
> > straight-line strength reduction which re-uses this analysis/caching part?
> 
> I will try to take another look at this once Bill has finished his work 
> that touches on this; it's been hard for me to track a moving target.  I 
> was wondering if it might be more consistent with Bill's work to defer 
> some of the address cost computation to new target hooks, after all.
> 
> -Sandra
> 

Hi Sandra,

I apologize for the mess.  I should be done causing distress to this
part of the code as soon as the patch I submitted today is committed.

Sorry!
Bill



Re: [patch[ Add explanations to sbitmap, bitmap, and sparseset

2012-07-27 Thread William J. Schmidt
On Fri, 2012-07-27 at 15:40 +0200, Richard Guenther wrote:
> On Thu, Jul 26, 2012 at 11:57 AM, Steven Bosscher  
> wrote:
> > On Thu, Jul 26, 2012 at 11:23 AM, Richard Guenther
> >  wrote:
> >> Ok!  Thanks for adding this exhaustive documentation.
> >
> > There's more to come! I want to add some explanations to ebitmap,
> > pointer-set, fibheap, and splay-tree as sets, and add a chapter in the
> > gccint manual too.
> >
> > Now if only you'd document those loop changes... ;-)
> 
> Eh ...
> 
> >
> >> Btw, ebitmap is unused since it was added - maybe we should simply remove
> >> it ...?
> >
> > I wouldn't remove it just yet. I'm going to make sure that bitmap.[ch]
> > and ebitmap.[ch] provide the same interface and see if there are
> > places where ebitmap is a better choice than bitmap or sbitmap (cprop
> > and gcse.c come to mind).
> 
> Btw, just looking over sparseset.h what needs to be documented is that
> iterating over the set is faster than for an sbitmap but element ordering
> is random!  Also it looks less efficient than sbitmap in the case when
> your main operation is adding to the set and querying the set randomly.
> Its space overhead is really huge - for smaller universes a smaller
> SPARSESET_ELT_TYPE would be nice, templates to the rescue!  I
> wonder in which cases an unsigned HOST_WIDEST_FAST_INT sized
> universe is even useful (but a short instead of an int is probably too
> small ...)

Another option for sparse sets would be a templatized version of Pugh's
skip lists.  Iteration is the same as a linked list and random access is
logarithmic in the size of the set (not the universe).  Expected space
overhead is modest (about two forward pointers per element).  The
potential downside is that it involves pointers.
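
For the curious, the core of such a set is small.  A hypothetical C
sketch (ignoring the templatization question, with the usual p = 1/2
level promotion; iteration just walks set->head.next[0]):

#include <stdlib.h>

#define SL_MAX_LEVEL 16

struct sl_node
{
  unsigned key;
  struct sl_node *next[SL_MAX_LEVEL];  /* next[i] links level i.  */
};

struct sl_set
{
  struct sl_node head;  /* Sentinel; key field unused.  */
  int level;            /* Highest level currently in use.  */
};

/* Choose a level for a new node; each level is half as likely.  */
static int
sl_random_level (void)
{
  int lvl = 1;
  while (lvl < SL_MAX_LEVEL && (rand () & 1))
    lvl++;
  return lvl;
}

/* Return nonzero iff KEY is in SET; expected O(log n).  */
static int
sl_contains (struct sl_set *set, unsigned key)
{
  struct sl_node *x = &set->head;
  int i;

  for (i = set->level - 1; i >= 0; i--)
    while (x->next[i] && x->next[i]->key < key)
      x = x->next[i];
  x = x->next[0];
  return x && x->key == key;
}

/* Add KEY to SET, ignoring duplicates; expected O(log n).  */
static void
sl_add (struct sl_set *set, unsigned key)
{
  struct sl_node *update[SL_MAX_LEVEL];
  struct sl_node *x = &set->head;
  int i, lvl;

  for (i = set->level - 1; i >= 0; i--)
    {
      while (x->next[i] && x->next[i]->key < key)
	x = x->next[i];
      update[i] = x;
    }
  if (x->next[0] && x->next[0]->key == key)
    return;

  lvl = sl_random_level ();
  for (i = set->level; i < lvl; i++)
    update[i] = &set->head;
  if (lvl > set->level)
    set->level = lvl;

  x = (struct sl_node *) calloc (1, sizeof (struct sl_node));
  x->key = key;
  for (i = 0; i < lvl; i++)
    {
      x->next[i] = update[i]->next[i];
      update[i]->next[i] = x;
    }
}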

Bill

> 
> Richard.
> 
> > Ciao!
> > Steven



[PATCH] Fix PR53773

2012-07-30 Thread William J. Schmidt
This fixes the de-canonicalization of commutative GIMPLE operations in
the vectorizer that occurs when processing reductions.  A loop_vec_info
is flagged for cleanup when a de-canonicalization has occurred in that
loop, and the cleanup is done when the loop_vec_info is destroyed.
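
For illustration (not from the patch), the kind of statement left
behind looks like this:

/* Canonical GIMPLE for a commutative op puts the constant in rhs2:  */
integral.1_5 = integral_8 * 10;

/* After the reduction-detection swap, the constant lands in rhs1,
   which other phases do not expect:  */
integral.1_5 = 10 * integral_8;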

Bootstrapped on powerpc64-unknown-linux-gnu with no new regressions.  Ok
for trunk?

Thanks,
Bill


gcc:

2012-07-30  Bill Schmidt  

PR tree-optimization/53773
* tree-vectorizer.h (struct _loop_vec_info): Add operands_swapped.
(LOOP_VINFO_OPERANDS_SWAPPED): New macro.
* tree-vect-loop.c (new_loop_vec_info): Initialize
LOOP_VINFO_OPERANDS_SWAPPED field.
(destroy_loop_vec_info): Restore canonical form.
(vect_is_slp_reduction): Set LOOP_VINFO_OPERANDS_SWAPPED field.
(vect_is_simple_reduction_1): Likewise.

gcc/testsuite:

2012-07-30  Bill Schmidt  

PR tree-optimization/53773
* testsuite/gcc.dg/vect/pr53773.c: New test.


Index: gcc/testsuite/gcc.dg/vect/pr53773.c
===
--- gcc/testsuite/gcc.dg/vect/pr53773.c (revision 0)
+++ gcc/testsuite/gcc.dg/vect/pr53773.c (revision 0)
@@ -0,0 +1,19 @@
+/* { dg-do compile } */
+
+int
+foo (int integral, int decimal, int power_ten)
+{
+  while (power_ten > 0)
+{
+  integral *= 10;
+  decimal *= 10;
+  power_ten--;
+}
+
+  return integral+decimal;
+}
+
+/* Two occurrences in annotations, two in code.  */
+/* { dg-final { scan-tree-dump-times "\\* 10" 4 "vect" } } */
+/* { dg-final { cleanup-tree-dump "vect" } } */
+
Index: gcc/tree-vectorizer.h
===
--- gcc/tree-vectorizer.h   (revision 189938)
+++ gcc/tree-vectorizer.h   (working copy)
@@ -296,6 +296,12 @@ typedef struct _loop_vec_info {
  this.  */
   bool peeling_for_gaps;
 
+  /* Reductions are canonicalized so that the last operand is the reduction
+ operand.  If this places a constant into RHS1, this decanonicalizes
+ GIMPLE for other phases, so we must track when this has occurred and
+ fix it up.  */
+  bool operands_swapped;
+
 } *loop_vec_info;
 
 /* Access Functions.  */
@@ -326,6 +332,7 @@ typedef struct _loop_vec_info {
 #define LOOP_VINFO_PEELING_HTAB(L) (L)->peeling_htab
 #define LOOP_VINFO_TARGET_COST_DATA(L) (L)->target_cost_data
 #define LOOP_VINFO_PEELING_FOR_GAPS(L) (L)->peeling_for_gaps
+#define LOOP_VINFO_OPERANDS_SWAPPED(L) (L)->operands_swapped
 
 #define LOOP_REQUIRES_VERSIONING_FOR_ALIGNMENT(L) \
 VEC_length (gimple, (L)->may_misalign_stmts) > 0
Index: gcc/tree-vect-loop.c
===
--- gcc/tree-vect-loop.c(revision 189938)
+++ gcc/tree-vect-loop.c(working copy)
@@ -853,6 +853,7 @@ new_loop_vec_info (struct loop *loop)
   LOOP_VINFO_PEELING_HTAB (res) = NULL;
   LOOP_VINFO_TARGET_COST_DATA (res) = init_cost (loop);
   LOOP_VINFO_PEELING_FOR_GAPS (res) = false;
+  LOOP_VINFO_OPERANDS_SWAPPED (res) = false;
 
   return res;
 }
@@ -873,6 +874,7 @@ destroy_loop_vec_info (loop_vec_info loop_vinfo, b
   int j;
   VEC (slp_instance, heap) *slp_instances;
   slp_instance instance;
+  bool swapped;
 
   if (!loop_vinfo)
 return;
@@ -881,6 +883,7 @@ destroy_loop_vec_info (loop_vec_info loop_vinfo, b
 
   bbs = LOOP_VINFO_BBS (loop_vinfo);
   nbbs = loop->num_nodes;
+  swapped = LOOP_VINFO_OPERANDS_SWAPPED (loop_vinfo);
 
   if (!clean_stmts)
 {
@@ -905,6 +908,22 @@ destroy_loop_vec_info (loop_vec_info loop_vinfo, b
   for (si = gsi_start_bb (bb); !gsi_end_p (si); )
 {
   gimple stmt = gsi_stmt (si);
+
+ /* We may have broken canonical form by moving a constant
+into RHS1 of a commutative op.  Fix such occurrences.  */
+ if (swapped && is_gimple_assign (stmt))
+   {
+ enum tree_code code = gimple_assign_rhs_code (stmt);
+
+ if ((code == PLUS_EXPR
+  || code == POINTER_PLUS_EXPR
+  || code == MULT_EXPR)
+ && CONSTANT_CLASS_P (gimple_assign_rhs1 (stmt)))
+   swap_tree_operands (stmt,
+   gimple_assign_rhs1_ptr (stmt),
+   gimple_assign_rhs2_ptr (stmt));
+   }
+
  /* Free stmt_vec_info.  */
  free_stmt_vec_info (stmt);
   gsi_next (&si);
@@ -1920,6 +1939,9 @@ vect_is_slp_reduction (loop_vec_info loop_info, gi
  gimple_assign_rhs1_ptr (next_stmt),
   gimple_assign_rhs2_ptr (next_stmt));
  update_stmt (next_stmt);
+
+ if (CONSTANT_CLASS_P (gimple_assign_rhs1 (next_stmt)))
+   LOOP_VINFO_OPERANDS_SWAPPED (loop_info) = true;
}
  else
return false;
@@ -2324,6 +2346,9 @@ vect_is_simple_reduction_1 (loop_vec_info loop_inf
 
   

[PATCH, rs6000] Vectorizer heuristic

2012-07-31 Thread William J. Schmidt
Now that the vectorizer cost model is set up to facilitate per-target
heuristics, I'm revisiting the "density" heuristic I submitted
previously.  This allows the vec_perm and vec_promote_demote costs to
be set to their natural values, but inhibits vectorization in cases like
sphinx3 where vectorizing a loop leads to issue stalls from
overcommitted resources.

Bootstrapped on powerpc64-unknown-linux-gnu with no new regressions.
Measured performance on cpu2000 and cpu2006 with no significant changes
in performance.  Ok for trunk?

Thanks,
Bill


2012-07-31  Bill Schmidt  

* config/rs6000/rs6000.c (rs6000_builtin_vectorization_cost): Revise
costs for vec_perm and vec_promote_demote down to more natural values.
(struct _rs6000_cost_data): New data structure.
(rs6000_density_test): New function.
(rs6000_init_cost): Change to use rs6000_cost_data.
(rs6000_add_stmt_cost): Likewise.
(rs6000_finish_cost): Perform density test when vectorizing a loop.


Index: gcc/config/rs6000/rs6000.c
===
--- gcc/config/rs6000/rs6000.c  (revision 189845)
+++ gcc/config/rs6000/rs6000.c  (working copy)
@@ -60,6 +60,7 @@
 #include "params.h"
 #include "tm-constrs.h"
 #include "opts.h"
+#include "tree-vectorizer.h"
 #if TARGET_XCOFF
 #include "xcoffout.h"  /* get declarations of xcoff_*_section_name */
 #endif
@@ -3378,13 +3379,13 @@ rs6000_builtin_vectorization_cost (enum vect_cost_
 
   case vec_perm:
if (TARGET_VSX)
- return 4;
+ return 3;
else
  return 1;
 
   case vec_promote_demote:
 if (TARGET_VSX)
-  return 5;
+  return 4;
 else
   return 1;
 
@@ -3520,14 +3521,71 @@ rs6000_preferred_simd_mode (enum machine_mode mode
   return word_mode;
 }
 
+typedef struct _rs6000_cost_data
+{
+  struct loop *loop_info;
+  unsigned cost[3];
+} rs6000_cost_data;
+
+/* Test for likely overcommitment of vector hardware resources.  If a
+   loop iteration is relatively large, and too large a percentage of
+   instructions in the loop are vectorized, the cost model may not
+   adequately reflect delays from unavailable vector resources.
+   Penalize the loop body cost for this case.  */
+
+static void
+rs6000_density_test (rs6000_cost_data *data)
+{
+  const int DENSITY_PCT_THRESHOLD = 85;
+  const int DENSITY_SIZE_THRESHOLD = 70;
+  const int DENSITY_PENALTY = 10;
+  struct loop *loop = data->loop_info;
+  basic_block *bbs = get_loop_body (loop);
+  int nbbs = loop->num_nodes;
+  int vec_cost = data->cost[vect_body], not_vec_cost = 0;
+  int i, density_pct;
+
+  for (i = 0; i < nbbs; i++)
+{
+  basic_block bb = bbs[i];
+  gimple_stmt_iterator gsi;
+
+  for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+   {
+ gimple stmt = gsi_stmt (gsi);
+ stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
+
+ if (!STMT_VINFO_RELEVANT_P (stmt_info)
+ && !STMT_VINFO_IN_PATTERN_P (stmt_info))
+   not_vec_cost++;
+   }
+}
+
+  density_pct = (vec_cost * 100) / (vec_cost + not_vec_cost);
+
+  if (density_pct > DENSITY_PCT_THRESHOLD
+  && vec_cost + not_vec_cost > DENSITY_SIZE_THRESHOLD)
+{
+  data->cost[vect_body] = vec_cost * (100 + DENSITY_PENALTY) / 100;
+  if (vect_print_dump_info (REPORT_DETAILS))
+   fprintf (vect_dump,
+"density %d%%, cost %d exceeds threshold, penalizing "
+"loop body cost by %d%%", density_pct, 
+vec_cost + not_vec_cost, DENSITY_PENALTY);
+}
+}
+
 /* Implement targetm.vectorize.init_cost.  */
 
 static void *
-rs6000_init_cost (struct loop *loop_info ATTRIBUTE_UNUSED)
+rs6000_init_cost (struct loop *loop_info)
 {
-  unsigned *cost = XNEWVEC (unsigned, 3);
-  cost[vect_prologue] = cost[vect_body] = cost[vect_epilogue] = 0;
-  return cost;
+  rs6000_cost_data *data = XNEW (struct _rs6000_cost_data);
+  data->loop_info = loop_info;
+  data->cost[vect_prologue] = 0;
+  data->cost[vect_body] = 0;
+  data->cost[vect_epilogue] = 0;
+  return data;
 }
 
 /* Implement targetm.vectorize.add_stmt_cost.  */
@@ -3537,7 +3595,7 @@ rs6000_add_stmt_cost (void *data, int count, enum
  struct _stmt_vec_info *stmt_info, int misalign,
  enum vect_cost_model_location where)
 {
-  unsigned *cost = (unsigned *) data;
+  rs6000_cost_data *cost_data = (rs6000_cost_data*) data;
   unsigned retval = 0;
 
   if (flag_vect_cost_model)
@@ -3552,7 +3610,7 @@ rs6000_add_stmt_cost (void *data, int count, enum
count *= 50;  /* FIXME.  */
 
   retval = (unsigned) (count * stmt_cost);
-  cost[where] += retval;
+  cost_data->cost[where] += retval;
 }
 
   return retval;
@@ -3564,10 +3622,14 @@ static void
 rs6000_finish_cost (void *data, unsigned *prologue_cost,
unsigned *body_cost, unsigned *epilogue_cost)
 {
-  unsigned *cost 

[PATCH, committed] Strength reduction clean-up (base name => base expr)

2012-08-01 Thread William J. Schmidt
This cleans up terminology in strength reduction.  What used to be a
base SSA name is now sometimes other tree expressions, so the term "base
name" is replaced by "base expression" throughout.

Bootstrapped and tested with no new regressions on
powerpc64-unknown-linux-gnu; committed as obvious.

Thanks,
Bill


2012-08-01  Bill Schmidt  

* gimple-ssa-strength-reduction.c (struct slsr_cand_d): Change
base_name to base_expr.
(struct cand_chain_d): Likewise.
(base_cand_hash): Likewise.
(base_cand_eq): Likewise.
(record_potential_basis): Likewise.
(alloc_cand_and_find_basis): Likewise.
(create_mul_ssa_cand): Likewise.
(create_mul_imm_cand): Likewise.
(create_add_ssa_cand): Likewise.
(create_add_imm_cand): Likewise.
(slsr_process_cast): Likewise.
(slsr_process_copy): Likewise.
(dump_candidate): Likewise.
(base_cand_dump_callback): Likewise.
(unconditional_cands_with_known_stride_p): Likewise.
(cand_increment): Likewise.


Index: gcc/gimple-ssa-strength-reduction.c
===
--- gcc/gimple-ssa-strength-reduction.c (revision 190037)
+++ gcc/gimple-ssa-strength-reduction.c (working copy)
@@ -166,8 +166,8 @@ struct slsr_cand_d
   /* The candidate statement S1.  */
   gimple cand_stmt;
 
-  /* The base SSA name B.  */
-  tree base_name;
+  /* The base expression B:  often an SSA name, but not always.  */
+  tree base_expr;
 
   /* The stride S.  */
   tree stride;
@@ -175,7 +175,7 @@ struct slsr_cand_d
   /* The index constant i.  */
   double_int index;
 
-  /* The type of the candidate.  This is normally the type of base_name,
+  /* The type of the candidate.  This is normally the type of base_expr,
  but casts may have occurred when combining feeding instructions.
  A candidate can only be a basis for candidates of the same final type.
  (For CAND_REFs, this is the type to be used for operand 1 of the
@@ -216,12 +216,13 @@ typedef struct slsr_cand_d slsr_cand, *slsr_cand_t
 typedef const struct slsr_cand_d *const_slsr_cand_t;
 
 /* Pointers to candidates are chained together as part of a mapping
-   from SSA names to the candidates that use them as a base name.  */
+   from base expressions to the candidates that use them.  */
 
 struct cand_chain_d
 {
-  /* SSA name that serves as a base name for the chain of candidates.  */
-  tree base_name;
+  /* Base expression for the chain of candidates:  often, but not
+ always, an SSA name.  */
+  tree base_expr;
 
   /* Pointer to a candidate.  */
   slsr_cand_t cand;
@@ -253,7 +254,7 @@ static struct pointer_map_t *stmt_cand_map;
 /* Obstack for candidates.  */
 static struct obstack cand_obstack;
 
-/* Hash table embodying a mapping from base names to chains of candidates.  */
+/* Hash table embodying a mapping from base exprs to chains of candidates.  */
 static htab_t base_cand_map;
 
 /* Obstack for candidate chains.  */
@@ -272,7 +273,7 @@ lookup_cand (cand_idx idx)
 static hashval_t
 base_cand_hash (const void *p)
 {
-  tree base_expr = ((const_cand_chain_t) p)->base_name;
+  tree base_expr = ((const_cand_chain_t) p)->base_expr;
   return iterative_hash_expr (base_expr, 0);
 }
 
@@ -291,10 +292,10 @@ base_cand_eq (const void *p1, const void *p2)
 {
   const_cand_chain_t const chain1 = (const_cand_chain_t) p1;
   const_cand_chain_t const chain2 = (const_cand_chain_t) p2;
-  return operand_equal_p (chain1->base_name, chain2->base_name, 0);
+  return operand_equal_p (chain1->base_expr, chain2->base_expr, 0);
 }
 
-/* Use the base name from candidate C to look for possible candidates
+/* Use the base expr from candidate C to look for possible candidates
that can serve as a basis for C.  Each potential basis must also
appear in a block that dominates the candidate statement and have
the same stride and type.  If more than one possible basis exists,
@@ -308,7 +309,7 @@ find_basis_for_candidate (slsr_cand_t c)
   cand_chain_t chain;
   slsr_cand_t basis = NULL;
 
-  mapping_key.base_name = c->base_name;
+  mapping_key.base_expr = c->base_expr;
   chain = (cand_chain_t) htab_find (base_cand_map, &mapping_key);
 
   for (; chain; chain = chain->next)
@@ -337,8 +338,8 @@ find_basis_for_candidate (slsr_cand_t c)
   return 0;
 }
 
-/* Record a mapping from the base name of C to C itself, indicating that
-   C may potentially serve as a basis using that base name.  */
+/* Record a mapping from the base expression of C to C itself, indicating that
+   C may potentially serve as a basis using that base expression.  */
 
 static void
 record_potential_basis (slsr_cand_t c)
@@ -347,7 +348,7 @@ record_potential_basis (slsr_cand_t c)
   void **slot;
 
   node = (cand_chain_t) obstack_alloc (&chain_obstack, sizeof (cand_chain));
-  node->base_name = c->base_name;
+  node->base_expr = c->base_expr;
   node->cand = c;
   node->next = NULL;
   slot = htab_find_slot (

[PATCH] Strength reduction part 3 of 4: candidates with unknown strides

2012-08-01 Thread William J. Schmidt
Greetings,

Thanks for the review of part 2!  Here's another chunk of the SLSR code
(I feel I owe you a few beers at this point).  This performs analysis
and replacement on groups of related candidates having an SSA name
(rather than a constant) for a stride.

This leaves only the conditional increment (CAND_PHI) case, which will
be handled in the last patch of the series.
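
As a quick illustration of the transformation (cf. slsr-10.c below),
given candidates sharing the SSA-name stride s:

/* Before SLSR (three multiplies):  */
a1 = 2 * s;  x1 = a1 + c;
a2 = 4 * s;  x2 = c + a2;
a3 = 6 * s;  x3 = a3 + c;

/* After replacement (one multiply; each candidate is rewritten in
   terms of its basis):  */
a1 = 2 * s;
x1 = a1 + c;
x2 = x1 + a1;
x3 = x2 + a1;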

Bootstrapped and tested on powerpc64-unknown-linux-gnu with no new
regressions.  Ok for trunk?

Thanks,
Bill


gcc:

2012-08-01  Bill Schmidt  

* gimple-ssa-strength-reduction.c (struct incr_info_d): New struct.
(incr_vec): New static var.
(incr_vec_len): Likewise.
(address_arithmetic_p): Likewise.
(stmt_cost): Remove dead assignment.
(dump_incr_vec): New function.
(cand_abs_increment): Likewise.
(lazy_create_slsr_reg): Likewise.
(incr_vec_index): Likewise.
(count_candidates): Likewise.
(record_increment): Likewise.
(record_increments): Likewise.
(unreplaced_cand_in_tree): Likewise.
(optimize_cands_for_speed_p): Likewise.
(lowest_cost_path): Likewise.
(total_savings): Likewise.
(analyze_increments): Likewise.
(ncd_for_two_cands): Likewise.
(nearest_common_dominator_for_cands): Likewise.
(profitable_increment_p): Likewise.
(insert_initializers): Likewise.
(introduce_cast_before_cand): Likewise.
(replace_rhs_if_not_dup): Likewise.
(replace_one_candidate): Likewise.
(replace_profitable_candidates): Likewise.
(analyze_candidates_and_replace): Handle candidates with SSA-name
strides.

gcc/testsuite:

2012-08-01  Bill Schmidt  

* gcc.dg/tree-ssa/slsr-5.c: New.
* gcc.dg/tree-ssa/slsr-6.c: New.
* gcc.dg/tree-ssa/slsr-7.c: New.
* gcc.dg/tree-ssa/slsr-8.c: New.
* gcc.dg/tree-ssa/slsr-9.c: New.
* gcc.dg/tree-ssa/slsr-10.c: New.
* gcc.dg/tree-ssa/slsr-11.c: New.
* gcc.dg/tree-ssa/slsr-12.c: New.
* gcc.dg/tree-ssa/slsr-13.c: New.
* gcc.dg/tree-ssa/slsr-14.c: New.
* gcc.dg/tree-ssa/slsr-15.c: New.
* gcc.dg/tree-ssa/slsr-16.c: New.
* gcc.dg/tree-ssa/slsr-17.c: New.
* gcc.dg/tree-ssa/slsr-18.c: New.
* gcc.dg/tree-ssa/slsr-19.c: New.
* gcc.dg/tree-ssa/slsr-20.c: New.
* gcc.dg/tree-ssa/slsr-21.c: New.
* gcc.dg/tree-ssa/slsr-22.c: New.
* gcc.dg/tree-ssa/slsr-23.c: New.
* gcc.dg/tree-ssa/slsr-24.c: New.
* gcc.dg/tree-ssa/slsr-25.c: New.
* gcc.dg/tree-ssa/slsr-26.c: New.
* gcc.dg/tree-ssa/slsr-30.c: New.
* gcc.dg/tree-ssa/slsr-31.c: New.


Index: gcc/testsuite/gcc.dg/tree-ssa/slsr-10.c
===
--- gcc/testsuite/gcc.dg/tree-ssa/slsr-10.c (revision 0)
+++ gcc/testsuite/gcc.dg/tree-ssa/slsr-10.c (revision 0)
@@ -0,0 +1,23 @@
+/* Verify straight-line strength reduction for simple integer addition
+   with stride reversed on 1st and 3rd instances.  */
+
+/* { dg-do compile } */
+/* { dg-options "-O3 -fdump-tree-optimized" } */
+
+int
+f (int s, int c)
+{
+  int a1, a2, a3, x1, x2, x3, x;
+
+  a1 = 2 * s;
+  x1 = a1 + c;
+  a2 = 4 * s;
+  x2 = c + a2;
+  a3 = 6 * s;
+  x3 = a3 + c;
+  x = x1 + x2 + x3;
+  return x;
+}
+
+/* { dg-final { scan-tree-dump-times " \\* " 1 "optimized" } } */
+/* { dg-final { cleanup-tree-dump "optimized" } } */
Index: gcc/testsuite/gcc.dg/tree-ssa/slsr-11.c
===
--- gcc/testsuite/gcc.dg/tree-ssa/slsr-11.c (revision 0)
+++ gcc/testsuite/gcc.dg/tree-ssa/slsr-11.c (revision 0)
@@ -0,0 +1,24 @@
+/* Verify straight-line strength reduction for simple integer addition
+   with casts thrown in.  */
+
+/* { dg-do compile } */
+/* { dg-options "-O3 -fdump-tree-optimized" } */
+
+long
+f (int s, long c)
+{
+  int a1, a2, a3;
+  long x1, x2, x3, x;
+
+  a1 = 2 * s;
+  x1 = c + a1;
+  a2 = 4 * s;
+  x2 = c + a2;
+  a3 = 6 * s;
+  x3 = c + a3;
+  x = x1 + x2 + x3;
+  return x;
+}
+
+/* { dg-final { scan-tree-dump-times " \\* " 1 "optimized" } } */
+/* { dg-final { cleanup-tree-dump "optimized" } } */
Index: gcc/testsuite/gcc.dg/tree-ssa/slsr-20.c
===
--- gcc/testsuite/gcc.dg/tree-ssa/slsr-20.c (revision 0)
+++ gcc/testsuite/gcc.dg/tree-ssa/slsr-20.c (revision 0)
@@ -0,0 +1,21 @@
+/* Verify straight-line strength reduction for multiply candidates
+   with stride in inconsistent positions.  */
+
+/* { dg-do compile } */
+/* { dg-options "-O3 -fdump-tree-optimized" } */
+
+int
+f (int c, int s)
+{
+  int x1, x2, y1, y2;
+
+  y1 = c + 2;
+  x1 = y1 * s;
+  y2 = y1 + 2;
+  x2 = s * y2;
+  return x1 + x2;
+}
+
+/* { dg-final { scan-tree-dump-times " \\* s" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times " \\* 2" 1 "optimized" } } */
+/

[PATCH, committed] Fix PR53773

2012-08-03 Thread William J. Schmidt
Change this test case to use the optimized dump so that the unreliable
vect-details dump can't cause different behavior on different targets.
Verified on powerpc64-unknown-linux-gnu, committed as obvious.

Thanks,
Bill


2012-08-03  Bill Schmidt  

* testsuite/gcc.dg/vect/pr53773.c: Change to use optimized dump.


Index: gcc/testsuite/gcc.dg/vect/pr53773.c
===
--- gcc/testsuite/gcc.dg/vect/pr53773.c (revision 190018)
+++ gcc/testsuite/gcc.dg/vect/pr53773.c (working copy)
@@ -1,4 +1,5 @@
 /* { dg-do compile } */
+/* { dg-options "-fdump-tree-optimized" } */
 
 int
 foo (int integral, int decimal, int power_ten)
@@ -13,7 +14,7 @@ foo (int integral, int decimal, int power_ten)
   return integral+decimal;
 }
 
-/* Two occurrences in annotations, two in code.  */
-/* { dg-final { scan-tree-dump-times "\\* 10" 4 "vect" } } */
+/* { dg-final { scan-tree-dump-times "\\* 10" 2 "optimized" } } */
 /* { dg-final { cleanup-tree-dump "vect" } } */
+/* { dg-final { cleanup-tree-dump "optimized" } } */
 




[Patch ping] Strength reduction

2012-04-29 Thread William J. Schmidt
Thought I'd ping http://gcc.gnu.org/ml/gcc-patches/2012-03/msg01225.html
since it's been about six weeks.  Any initial feedback would be very
much appreciated!

Thanks,
Bill



Re: [PATCH] Improve COND_EXPR expansion

2012-05-02 Thread William J. Schmidt
On Mon, 2012-04-30 at 20:22 -0700, Andrew Pinski wrote:
> Hi,
>   This patch improves the expansion of COND_EXPR into RTL, directly
> using conditional moves.
> I had to fix a bug in the x86 backend where emit_conditional_move
> could cause a crash as we had a comparison mode of DImode which is not
> handled by the 32bit part.  can_conditionally_move_p return true as we
> had an SImode for the other operands.
> Note other targets might need a similar fix as x86 had but I could not
> test those targets and this is really the first time where
> emit_conditional_move is being called with different modes for the
> comparison and the other operands mode and the comparison mode is not
> of the CC class.

Hi Andrew,

I verified your patch on powerpc64-unknown-linux-gnu.  There were no new
testcase regressions, and SPEC cpu2006 built ok with your changes.

Hope this helps!

Bill
> 
> The main reasoning to do this conversion early rather than wait for
> ifconv as the resulting code is slightly better.  Also the compiler is
> slightly faster.
> 
> OK?  Bootstrapped and tested on both mips64-linux-gnu (where it was
> originally written for) and x86_64-linux-gnu.
> 
> Thanks,
> Andrew Pinski
> 
> ChangeLog:
> * expr.c (convert_tree_comp_to_rtx): New function.
> (expand_expr_real_2): Try using conditional moves for COND_EXPRs if they 
> exist.
> * config/i386/i386.c (ix86_expand_int_movcc): Disallow comparison
> modes of DImode for 32bits and TImode.



[PATCH] Hoist adjacent pointer loads

2012-05-03 Thread William J. Schmidt
This patch was posted for comment back in February during stage 4.  It
addresses a performance issue noted in the EEMBC routelookup benchmark
on a common idiom:

  if (...)
x = y->left;
  else
x = y->right;

If the two loads can be hoisted out of the if/else, the if/else can be
replaced by a conditional move instruction on architectures that support
one.  Because this speculates one of the loads, the patch constrains the
optimization to avoid introducing page faults.
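
For the curious, a minimal compilable example of the targeted shape
(the offsets in the comments are illustrative and depend on the
target ABI):

struct node
{
  struct node *left;    /* e.g. offset 0 on a 64-bit target */
  struct node *right;   /* e.g. offset 8: adjacent, same 16-byte block */
};

struct node *
pick (struct node *y, int cond)
{
  struct node *x;
  /* Both loads can be hoisted above the branch, which then becomes a
     conditional move (e.g. PowerPC isel) on capable targets.  */
  if (cond)
    x = y->left;
  else
    x = y->right;
  return x;
}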

Bootstrapped and regression tested on powerpc-unknown-linux-gnu with no
new failures.  The patch provides significant improvement to the
routelookup benchmark, and is neutral on SPEC cpu2000/cpu2006.

One question is what optimization level should be required for this.
Because of the speculation, -O3 might be in order.  I don't believe
-Ofast is required as there is no potential correctness issue involved.
Right now the patch doesn't check the optimization level (like the rest
of the phi-opt transforms), which is likely a poor choice.

Ok for trunk?

Thanks,
Bill


2012-05-03  Bill Schmidt  

* tree-ssa-phiopt.c (tree_ssa_phiopt_worker): Add argument to forward
declaration.
(hoist_adjacent_loads, gate_hoist_loads): New forward declarations.
(tree_ssa_phiopt): Call gate_hoist_loads.
(tree_ssa_cs_elim): Add parm to tree_ssa_phiopt_worker call.
(tree_ssa_phiopt_worker): Add do_hoist_loads to formal arg list; call
hoist_adjacent_loads.
(local_reg_dependence): New function.
(local_mem_dependence): Likewise.
(hoist_adjacent_loads): Likewise.
(gate_hoist_loads): Likewise.
* common.opt (fhoist-adjacent-loads): New switch.
* Makefile.in (tree-ssa-phiopt.o): Added dependencies.
* params.def (PARAM_MIN_CMOVE_STRUCT_ALIGN): New param.


Index: gcc/tree-ssa-phiopt.c
===
--- gcc/tree-ssa-phiopt.c   (revision 187057)
+++ gcc/tree-ssa-phiopt.c   (working copy)
@@ -37,9 +37,17 @@ along with GCC; see the file COPYING3.  If not see
 #include "cfgloop.h"
 #include "tree-data-ref.h"
 #include "tree-pretty-print.h"
+#include "gimple-pretty-print.h"
+#include "insn-config.h"
+#include "expr.h"
+#include "optabs.h"
 
+#ifndef HAVE_conditional_move
+#define HAVE_conditional_move (0)
+#endif
+
 static unsigned int tree_ssa_phiopt (void);
-static unsigned int tree_ssa_phiopt_worker (bool);
+static unsigned int tree_ssa_phiopt_worker (bool, bool);
 static bool conditional_replacement (basic_block, basic_block,
 edge, edge, gimple, tree, tree);
 static int value_replacement (basic_block, basic_block,
@@ -53,6 +61,9 @@ static bool cond_store_replacement (basic_block, b
 static bool cond_if_else_store_replacement (basic_block, basic_block, 
basic_block);
 static struct pointer_set_t * get_non_trapping (void);
 static void replace_phi_edge_with_variable (basic_block, edge, gimple, tree);
+static void hoist_adjacent_loads (basic_block, basic_block,
+ basic_block, basic_block);
+static bool gate_hoist_loads (void);
 
 /* This pass tries to replaces an if-then-else block with an
assignment.  We have four kinds of transformations.  Some of these
@@ -138,12 +149,56 @@ static void replace_phi_edge_with_variable (basic_
  bb2:
     x = PHI <x' (bb0), ...>;
 
-   A similar transformation is done for MAX_EXPR.  */
+   A similar transformation is done for MAX_EXPR.
 
+
+   This pass also performs a fifth transformation of a slightly different
+   flavor.
+
+   Adjacent Load Hoisting
+   ----------------------
+   
+   This transformation replaces
+
+ bb0:
+   if (...) goto bb2; else goto bb1;
+ bb1:
+   x1 = (<expr>).field1;
+   goto bb3;
+ bb2:
+   x2 = (<expr>).field2;
+ bb3:
+   # x = PHI <x1, x2>;
+
+   with
+
+ bb0:
+   x1 = (<expr>).field1;
+   x2 = (<expr>).field2;
+   if (...) goto bb2; else goto bb1;
+ bb1:
+   goto bb3;
+ bb2:
+ bb3:
+   # x = PHI <x1, x2>;
+
+   The purpose of this transformation is to enable generation of conditional
+   move instructions such as Intel CMOVE or PowerPC ISEL.  Because one of
+   the loads is speculative, the transformation is restricted to very
+   specific cases to avoid introducing a page fault.  We are looking for
+   the common idiom:
+
+ if (...)
+   x = y->left;
+ else
+   x = y->right;
+
+   where left and right are typically adjacent pointers in a tree structure.  */
+
 static unsigned int
 tree_ssa_phiopt (void)
 {
-  return tree_ssa_phiopt_worker (false);
+  return tree_ssa_phiopt_worker (false, gate_hoist_loads ());
 }
 
 /* This pass tries to transform conditional stores into unconditional
@@ -190,7 +245,7 @@ tree_ssa_phiopt (void)
 static unsigned int
 tree_ssa_cs_elim (void)
 {
-  return tree_ssa_phiopt_worker (true);
+  return tree_ssa_phiopt_worker (true, false);
 }
 
/* Return the singleton PHI in the SEQ of PHIs for edges E0 and E1. */

Re: [PATCH] Hoist adjacent pointer loads

2012-05-03 Thread William J. Schmidt
On Thu, 2012-05-03 at 09:40 -0600, Jeff Law wrote:
> On 05/03/2012 08:33 AM, William J. Schmidt wrote:
> > This patch was posted for comment back in February during stage 4.  It
> > addresses a performance issue noted in the EEMBC routelookup benchmark
> > on a common idiom:
> >
> >if (...)
> >  x = y->left;
> >else
> >  x = y->right;
> >
> > If the two loads can be hoisted out of the if/else, the if/else can be
> > replaced by a conditional move instruction on architectures that support
> > one.  Because this speculates one of the loads, the patch constrains the
> > optimization to avoid introducing page faults.
> >
> > Bootstrapped and regression tested on powerpc-unknown-linux-gnu with no
> > new failures.  The patch provides significant improvement to the
> > routelookup benchmark, and is neutral on SPEC cpu2000/cpu2006.
> >
> > One question is what optimization level should be required for this.
> > Because of the speculation, -O3 might be in order.  I don't believe
> > -Ofast is required as there is no potential correctness issue involved.
> > Right now the patch doesn't check the optimization level (like the rest
> > of the phi-opt transforms), which is likely a poor choice.
> Doesn't this need to be conditionalized on the memory model that's 
> currently active?
> 
Yes and no.  What's important is that you don't want to introduce page
faults (or less urgently, cache misses) by speculating the load.  So the
patch is currently extremely constrained, and likely will always stay
that way.  Only fields that are pointers and that are strictly adjacent
are hoisted, and only if they're in the same 16-byte block.  (The number
16 is a parameter that can be adjusted.)
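
To make "same 16-byte block" concrete, here is a hypothetical sketch
of the containment test (the helper name, offsets, and parameter are
illustrative, not the exact implementation):

/* Both field offsets must land in the same naturally aligned block
   of BLOCK bytes for the speculated load to be considered safe.  */
static int
same_block_p (unsigned off1, unsigned off2, unsigned block)
{
  return off1 / block == off2 / block;
}

/* same_block_p (0, 8, 16)  -> 1: left/right of a 64-bit tree node.
   same_block_p (8, 16, 16) -> 0: the pair straddles a boundary.  */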

Hopefully I didn't miss your point -- let me know if I did and I'll try
again. :)

Thanks,
Bill

> jeff
> 



Re: [PATCH] Hoist adjacent pointer loads

2012-05-03 Thread William J. Schmidt


On Thu, 2012-05-03 at 11:44 -0600, Jeff Law wrote:
> On 05/03/2012 10:47 AM, William J. Schmidt wrote:
> >>
> > Yes and no.  What's important is that you don't want to introduce page
> > faults (or less urgently, cache misses) by speculating the load.  So the
> > patch is currently extremely constrained, and likely will always stay
> > that way.  Only fields that are pointers and that are strictly adjacent
> > are hoisted, and only if they're in the same 16-byte block.  (The number
> > 16 is a parameter that can be adjusted.)
> >
> > Hopefully I didn't miss your point -- let me know if I did and I'll try
> > again. :)
> You missed the point :-)
> 
> Under the C++11 memory model you can't introduce new data races on 
> objects which might be visible to multiple threads.  This requirement 
> can restrict speculation in many cases.  Furthermore, it sounds like C11 
> will have similar constraints.
> 
> I believe there's a wiki page which touches on these kinds of issues.
> 
> That doesn't mean we can't ever do the optimization, just that we have 
> to be more careful than we have in the past when mucking around with 
> memory optimizations.

OK, thanks!  Looks like I have some reading to do about the new memory
models.

However, from the wiki page I see:  "A speculative load which has its
results thrown away are considered to not have changed the semantics of
the program, and are therefore allowed."  That seems to cover the case
here: the load is hoisted, but if the path where it was originally
loaded is not executed, its result is discarded.
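
If I'm reading that correctly, the distinction is roughly the
following (an illustrative sketch, not text from the wiki):

/* Introducing a load whose result may be discarded is allowed under
   that reading, since no new store is performed.  */
int
load_ok (int *p, int c)
{
  int t = *p;          /* speculative load */
  return c ? t : 0;    /* result thrown away when !c */
}

/* The problematic case is introducing a store the abstract machine
   would not perform, e.g. rewriting
     if (c) *p = 1;
   as
     t = *p; *p = 1; if (!c) *p = t;
   which can race with another thread touching *p when c is false.  */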

If needed, though, what flags/detection mechanisms are available for
determining that the load speculation should be disabled?

Thanks,
Bill
> 
> jeff
> 



[PATCH] Fix PR53217

2012-05-08 Thread William J. Schmidt
This fixes another statement-placement issue when reassociating
expressions with repeated factors.  Multiplies feeding into
__builtin_powi calls were not getting placed properly ahead of them in
some cases.
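
A reduced illustration of the shape involved (hypothetical; the PR
testcase below is the real reproducer):

/* Under -O1 -ffast-math, the repeated factors x*x*x become
   __builtin_powi (x, 3).  The base x is itself a multiply, so when
   the powi call is moved, its feeding multiply must be moved ahead
   of it as well, or the SSA definition of x lands after its use and
   verify_ssa fails.  */
double
f (double a, double b)
{
  double x = a * b;       /* feeding multiply */
  return x * x * x * a;   /* x*x*x -> __builtin_powi (x, 3) */
}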

Bootstrapped and tested on powerpc64-unknown-linux-gnu with no new
regressions.  I've also run SPEC cpu2006 with no build or correctness
issues.  OK for trunk?

Thanks,
Bill


gcc:

2012-05-08  Bill Schmidt  

PR tree-optimization/53217
* tree-ssa-reassoc.c (bip_map): New static variable.
(possibly_move_powi): Move feeding multiplies with __builtin_powi call.
(attempt_builtin_powi): Save feeding multiplies on a stack.
(reassociate_bb): Create and destroy bip_map.

gcc/testsuite:

2012-05-08  Bill Schmidt  

PR tree-optimization/53217
* gfortran.dg/pr53217.f90: New test.


Index: gcc/testsuite/gfortran.dg/pr53217.f90
===
--- gcc/testsuite/gfortran.dg/pr53217.f90   (revision 0)
+++ gcc/testsuite/gfortran.dg/pr53217.f90   (revision 0)
@@ -0,0 +1,28 @@
+! { dg-do compile }
+! { dg-options "-O1 -ffast-math" }
+
+! This tests only for compile-time failure, which formerly occurred
+! when statements were emitted out of order, failing verify_ssa.
+
+MODULE xc_cs1
+  INTEGER, PARAMETER :: dp=KIND(0.0D0)
+  REAL(KIND=dp), PARAMETER :: a = 0.04918_dp, &
+  c = 0.2533_dp, &
+  d = 0.349_dp
+CONTAINS
+  SUBROUTINE cs1_u_2 ( rho, grho, r13, e_rho_rho, e_rho_ndrho, e_ndrho_ndrho,&
+   npoints, error)
+REAL(KIND=dp), DIMENSION(*), &
+  INTENT(INOUT)  :: e_rho_rho, e_rho_ndrho, &
+e_ndrho_ndrho
+DO ip = 1, npoints
+  IF ( rho(ip) > eps_rho ) THEN
+ oc = 1.0_dp/(r*r*r3*r3 + c*g*g)
+ d2rF4 = c4p*f13*f23*g**4*r3/r * (193*d*r**5*r3*r3+90*d*d*r**5*r3 &
+ -88*g*g*c*r**3*r3-100*d*d*c*g*g*r*r*r3*r3 &
+ +104*r**6)*od**3*oc**4
+ e_rho_rho(ip) = e_rho_rho(ip) + d2F1 + d2rF2 + d2F3 + d2rF4
+  END IF
+END DO
+  END SUBROUTINE cs1_u_2
+END MODULE xc_cs1
Index: gcc/tree-ssa-reassoc.c
===
--- gcc/tree-ssa-reassoc.c  (revision 187117)
+++ gcc/tree-ssa-reassoc.c  (working copy)
@@ -200,6 +200,10 @@ static long *bb_rank;
 /* Operand->rank hashtable.  */
 static struct pointer_map_t *operand_rank;
 
+/* Map from inserted __builtin_powi calls to multiply chains that
+   feed them.  */
+static struct pointer_map_t *bip_map;
+
 /* Forward decls.  */
 static long get_rank (tree);
 
@@ -2249,7 +2253,7 @@ remove_visited_stmt_chain (tree var)
 static void
 possibly_move_powi (gimple stmt, tree op)
 {
-  gimple stmt2;
+  gimple stmt2, *mpy;
   tree fndecl;
   gimple_stmt_iterator gsi1, gsi2;
 
@@ -2278,9 +2282,39 @@ possibly_move_powi (gimple stmt, tree op)
   return;
 }
 
+  /* Move the __builtin_powi.  */
   gsi1 = gsi_for_stmt (stmt);
   gsi2 = gsi_for_stmt (stmt2);
   gsi_move_before (&gsi2, &gsi1);
+
+  /* See if there are multiplies feeding the __builtin_powi base
+ argument that must also be moved.  */
+  while ((mpy = (gimple *) pointer_map_contains (bip_map, stmt2)) != NULL)
+{
+  /* If we've already moved this statement, we're done.  This is
+ identified by a NULL entry for the statement in bip_map.  */
+  gimple *next = (gimple *) pointer_map_contains (bip_map, *mpy);
+  if (next && !*next)
+   return;
+
+  stmt = stmt2;
+  stmt2 = *mpy;
+  gsi1 = gsi_for_stmt (stmt);
+  gsi2 = gsi_for_stmt (stmt2);
+  gsi_move_before (&gsi2, &gsi1);
+
+  /* The moved multiply may be DAG'd from multiple calls if it
+was the result of a cached multiply.  Only move it once.
+Rank order ensures we move it to the right place the first
+time.  */
+  if (next)
+   *next = NULL;
+  else
+   {
+ next = (gimple *) pointer_map_insert (bip_map, *mpy);
+ *next = NULL;
+   }
+}
 }
 
 /* This function checks three consequtive operands in
@@ -3281,6 +3315,7 @@ attempt_builtin_powi (gimple stmt, VEC(operand_ent
   while (true)
 {
   HOST_WIDE_INT power;
+  gimple last_mul = NULL;
 
   /* First look for the largest cached product of factors from
 preceding iterations.  If found, create a builtin_powi for
@@ -3318,16 +3353,25 @@ attempt_builtin_powi (gimple stmt, VEC(operand_ent
}
  else
{
+ gimple *value;
+
  iter_result = get_reassoc_pow_ssa_name (target, type);
  pow_stmt = gimple_build_call (powi_fndecl, 2, rf1->repr, 
build_int_cst (integer_type_node,
   power));
  gimple_call_set_lhs (pow_stmt, iter_result);
  gimple_set_l

[PATCH, 4.7] Backport fix to [un]signed_type_for

2012-05-10 Thread William J. Schmidt
Backporting this patch to 4.7 fixes a problem building Fedora 17.
Bootstrapped and regression tested on powerpc64-unknown-linux-gnu.  Is
the backport OK?

Thanks,
Bill


2012-05-10  Bill Schmidt  

Backport from trunk:
2012-03-12  Richard Guenther  

* tree.c (signed_or_unsigned_type_for): Use
build_nonstandard_integer_type.
(signed_type_for): Adjust documentation.
(unsigned_type_for): Likewise.
* tree-pretty-print.c (dump_generic_node): Use standard names
for non-standard integer types if available.


Index: gcc/tree-pretty-print.c
===
--- gcc/tree-pretty-print.c (revision 187368)
+++ gcc/tree-pretty-print.c (working copy)
@@ -723,11 +723,41 @@ dump_generic_node (pretty_printer *buffer, tree no
  }
else if (TREE_CODE (node) == INTEGER_TYPE)
  {
-   pp_string (buffer, (TYPE_UNSIGNED (node)
-   ? "<unnamed-unsigned:"
-   : "<unnamed-signed:"));
-   pp_decimal_int (buffer, TYPE_PRECISION (node));
-   pp_string (buffer, ">");
+   if (TYPE_PRECISION (node) == CHAR_TYPE_SIZE)
+ pp_string (buffer, (TYPE_UNSIGNED (node)
+ ? "unsigned char"
+ : "signed char"));
+   else if (TYPE_PRECISION (node) == SHORT_TYPE_SIZE)
+ pp_string (buffer, (TYPE_UNSIGNED (node)
+ ? "unsigned short"
+ : "signed short"));
+   else if (TYPE_PRECISION (node) == INT_TYPE_SIZE)
+ pp_string (buffer, (TYPE_UNSIGNED (node)
+ ? "unsigned int"
+ : "signed int"));
+   else if (TYPE_PRECISION (node) == LONG_TYPE_SIZE)
+ pp_string (buffer, (TYPE_UNSIGNED (node)
+ ? "unsigned long"
+ : "signed long"));
+   else if (TYPE_PRECISION (node) == LONG_LONG_TYPE_SIZE)
+ pp_string (buffer, (TYPE_UNSIGNED (node)
+ ? "unsigned long long"
+ : "signed long long"));
+   else if (TYPE_PRECISION (node) >= CHAR_TYPE_SIZE
+&& exact_log2 (TYPE_PRECISION (node)))
+ {
+   pp_string (buffer, (TYPE_UNSIGNED (node) ? "uint" : "int"));
+   pp_decimal_int (buffer, TYPE_PRECISION (node));
+   pp_string (buffer, "_t");
+ }
+   else
+ {
+   pp_string (buffer, (TYPE_UNSIGNED (node)
+   ? "<unnamed-unsigned:"
+   : "<unnamed-signed:"));
+   pp_decimal_int (buffer, TYPE_PRECISION (node));
+   pp_string (buffer, ">");
+ }
  }
else if (TREE_CODE (node) == COMPLEX_TYPE)
  {
Index: gcc/tree.c
===
--- gcc/tree.c  (revision 187368)
+++ gcc/tree.c  (working copy)
@@ -10162,32 +10162,26 @@ widest_int_cst_value (const_tree x)
   return val;
 }
 
-/* If TYPE is an integral type, return an equivalent type which is
-unsigned iff UNSIGNEDP is true.  If TYPE is not an integral type,
-return TYPE itself.  */
+/* If TYPE is an integral or pointer type, return an integer type with
+   the same precision which is unsigned iff UNSIGNEDP is true, or itself
+   if TYPE is already an integer type of signedness UNSIGNEDP.  */
 
 tree
 signed_or_unsigned_type_for (int unsignedp, tree type)
 {
-  tree t = type;
-  if (POINTER_TYPE_P (type))
-{
-  /* If the pointer points to the normal address space, use the
-size_type_node.  Otherwise use an appropriate size for the pointer
-based on the named address space it points to.  */
-  if (!TYPE_ADDR_SPACE (TREE_TYPE (t)))
-   t = size_type_node;
-  else
-   return lang_hooks.types.type_for_size (TYPE_PRECISION (t), unsignedp);
-}
+  if (TREE_CODE (type) == INTEGER_TYPE && TYPE_UNSIGNED (type) == unsignedp)
+return type;
 
-  if (!INTEGRAL_TYPE_P (t) || TYPE_UNSIGNED (t) == unsignedp)
-return t;
+  if (!INTEGRAL_TYPE_P (type)
+  && !POINTER_TYPE_P (type))
+return NULL_TREE;
 
-  return lang_hooks.types.type_for_size (TYPE_PRECISION (t), unsignedp);
+  return build_nonstandard_integer_type (TYPE_PRECISION (type), unsignedp);
 }
 
-/* Returns unsigned variant of TYPE.  */
+/* If TYPE is an integral or pointer type, return an integer type with
+   the same precision which is unsigned, or itself if TYPE is already an
+   unsigned integer type.  */
 
 tree
 unsigned_type_for (tree type)
@@ -10195,7 +10189,9 @@ unsigned_type_for (tree type)
   return signed_or_unsigned_type_for (1, type);
 }
 
-/* Returns signed variant of TYPE.  */
+/* If TYPE is an integral or pointer type, return an integer type with
+   the same precision which is signed, or itself if TYPE is already a
+   signed integer type.  */
 
 tree
 signed_type_for (tree type)


Re: [PATCH, 4.7] Backport fix to [un]signed_type_for

2012-05-10 Thread William J. Schmidt
On Thu, 2012-05-10 at 18:49 +0200, Jakub Jelinek wrote:
> On Thu, May 10, 2012 at 11:44:27AM -0500, William J. Schmidt wrote:
> > Backporting this patch to 4.7 fixes a problem building Fedora 17.
> > Bootstrapped and regression tested on powerpc64-unknown-linux-gnu.  Is
> > the backport OK?
> 
> For 4.7 I'd very much prefer a less intrusive change (i.e. change
> the java langhook) instead, but I'll defer to Richard if he prefers
> this over that.

OK.  If that's desired, this is the possible change to the langhook:

Index: gcc/java/typeck.c
===
--- gcc/java/typeck.c   (revision 187158)
+++ gcc/java/typeck.c   (working copy)
@@ -189,6 +189,12 @@ java_type_for_size (unsigned bits, int unsignedp)
 return unsignedp ? unsigned_int_type_node : int_type_node;
   if (bits <= TYPE_PRECISION (long_type_node))
 return unsignedp ? unsigned_long_type_node : long_type_node;
+  /* A 64-bit target with TImode requires 128-bit type definitions
+ for bitsizetype.  */
+  if (int128_integer_type_node
+  && bits == TYPE_PRECISION (int128_integer_type_node))
+return (unsignedp ? int128_unsigned_type_node
+   : int128_integer_type_node);
   return 0;
 }

which also fixed the problem and bootstraps without regressions.
Whichever you guys prefer is fine with me.

Thanks,
Bill
> 
> > 2012-05-10  Bill Schmidt  
> > 
> > Backport from trunk:
> > 2012-03-12  Richard Guenther  
> > 
> > * tree.c (signed_or_unsigned_type_for): Use
> > build_nonstandard_integer_type.
> > (signed_type_for): Adjust documentation.
> > (unsigned_type_for): Likewise.
> > * tree-pretty-print.c (dump_generic_node): Use standard names
> > for non-standard integer types if available.
> 
>   Jakub
> 



PING: [PATCH] Fix PR53217

2012-05-15 Thread William J. Schmidt
Ping.

Thanks,
Bill

On Tue, 2012-05-08 at 22:04 -0500, William J. Schmidt wrote:
> This fixes another statement-placement issue when reassociating
> expressions with repeated factors.  Multiplies feeding into
> __builtin_powi calls were not getting placed properly ahead of them in
> some cases.
> 
> Bootstrapped and tested on powerpc64-unknown-linux-gnu with no new
> regressions.  I've also run SPEC cpu2006 with no build or correctness
> issues.  OK for trunk?
> 
> Thanks,
> Bill
> 
> 
> gcc:
> 
> 2012-05-08  Bill Schmidt  
> 
>   PR tree-optimization/53217
>   * tree-ssa-reassoc.c (bip_map): New static variable.
>   (possibly_move_powi): Move feeding multiplies with __builtin_powi call.
>   (attempt_builtin_powi): Save feeding multiplies on a stack.
>   (reassociate_bb): Create and destroy bip_map.
> 
> gcc/testsuite:
> 
> 2012-05-08  Bill Schmidt  
> 
>   PR tree-optimization/53217
>   * gfortran.dg/pr53217.f90: New test.
> 
> 
> Index: gcc/testsuite/gfortran.dg/pr53217.f90
> ===
> --- gcc/testsuite/gfortran.dg/pr53217.f90 (revision 0)
> +++ gcc/testsuite/gfortran.dg/pr53217.f90 (revision 0)
> @@ -0,0 +1,28 @@
> +! { dg-do compile }
> +! { dg-options "-O1 -ffast-math" }
> +
> +! This tests only for compile-time failure, which formerly occurred
> +! when statements were emitted out of order, failing verify_ssa.
> +
> +MODULE xc_cs1
> +  INTEGER, PARAMETER :: dp=KIND(0.0D0)
> +  REAL(KIND=dp), PARAMETER :: a = 0.04918_dp, &
> +  c = 0.2533_dp, &
> +  d = 0.349_dp
> +CONTAINS
> +  SUBROUTINE cs1_u_2 ( rho, grho, r13, e_rho_rho, e_rho_ndrho, 
> e_ndrho_ndrho,&
> +   npoints, error)
> +REAL(KIND=dp), DIMENSION(*), &
> +  INTENT(INOUT)  :: e_rho_rho, e_rho_ndrho, &
> +e_ndrho_ndrho
> +DO ip = 1, npoints
> +  IF ( rho(ip) > eps_rho ) THEN
> + oc = 1.0_dp/(r*r*r3*r3 + c*g*g)
> + d2rF4 = c4p*f13*f23*g**4*r3/r * (193*d*r**5*r3*r3+90*d*d*r**5*r3 &
> + -88*g*g*c*r**3*r3-100*d*d*c*g*g*r*r*r3*r3 &
> + +104*r**6)*od**3*oc**4
> + e_rho_rho(ip) = e_rho_rho(ip) + d2F1 + d2rF2 + d2F3 + d2rF4
> +  END IF
> +END DO
> +  END SUBROUTINE cs1_u_2
> +END MODULE xc_cs1
> Index: gcc/tree-ssa-reassoc.c
> ===
> --- gcc/tree-ssa-reassoc.c(revision 187117)
> +++ gcc/tree-ssa-reassoc.c(working copy)
> @@ -200,6 +200,10 @@ static long *bb_rank;
>  /* Operand->rank hashtable.  */
>  static struct pointer_map_t *operand_rank;
> 
> +/* Map from inserted __builtin_powi calls to multiply chains that
> +   feed them.  */
> +static struct pointer_map_t *bip_map;
> +
>  /* Forward decls.  */
>  static long get_rank (tree);
> 
> @@ -2249,7 +2253,7 @@ remove_visited_stmt_chain (tree var)
>  static void
>  possibly_move_powi (gimple stmt, tree op)
>  {
> -  gimple stmt2;
> +  gimple stmt2, *mpy;
>tree fndecl;
>gimple_stmt_iterator gsi1, gsi2;
> 
> @@ -2278,9 +2282,39 @@ possibly_move_powi (gimple stmt, tree op)
>return;
>  }
> 
> +  /* Move the __builtin_powi.  */
>gsi1 = gsi_for_stmt (stmt);
>gsi2 = gsi_for_stmt (stmt2);
>gsi_move_before (&gsi2, &gsi1);
> +
> +  /* See if there are multiplies feeding the __builtin_powi base
> + argument that must also be moved.  */
> +  while ((mpy = (gimple *) pointer_map_contains (bip_map, stmt2)) != NULL)
> +{
> +  /* If we've already moved this statement, we're done.  This is
> + identified by a NULL entry for the statement in bip_map.  */
> +  gimple *next = (gimple *) pointer_map_contains (bip_map, *mpy);
> +  if (next && !*next)
> + return;
> +
> +  stmt = stmt2;
> +  stmt2 = *mpy;
> +  gsi1 = gsi_for_stmt (stmt);
> +  gsi2 = gsi_for_stmt (stmt2);
> +  gsi_move_before (&gsi2, &gsi1);
> +
> +  /* The moved multiply may be DAG'd from multiple calls if it
> +  was the result of a cached multiply.  Only move it once.
> +  Rank order ensures we move it to the right place the first
> +  time.  */
> +  if (next)
> + *next = NULL;
> +  else
> + {
> +   next = (gimple *) pointer_map_insert (bip_map, *mpy);
> +   *next = NULL;
> + }
> +}
>  }
> 
>  /* This function checks three consequtive operands in
> @@ -3281,6 +3315,7 @@ attempt_b

Re: PING: [PATCH] Fix PR53217

2012-05-16 Thread William J. Schmidt
On Wed, 2012-05-16 at 11:45 +0200, Richard Guenther wrote:
> On Tue, 15 May 2012, William J. Schmidt wrote:
> 
> > Ping.
> 
> I don't like it too much - but pondering a bit over it I can't find
> a nicer solution.
> 
> So, ok.
> 
> Thanks,
> Richard.
> 
Agreed.  I'm not fond of it either, and I feel it's a bit fragile.

An alternative would be to go back to handling the exponentiation
expressions outside of the ops list (generating an explicit multiply to
hook them up with the results of normal linear/parallel expansion).  In
hindsight, placing the exponentiation results in the ops list and
letting the rank order handle things introduces some complexity as well
as saving some.  The DAG'd nature of the exponentiation expressions
isn't a perfect fit for the pure tree form of the reassociated
multiplies.

Let me know if you'd like me to pursue that instead.

Thanks,
Bill

> > Thanks,
> > Bill
> > 
> > On Tue, 2012-05-08 at 22:04 -0500, William J. Schmidt wrote:
> > > This fixes another statement-placement issue when reassociating
> > > expressions with repeated factors.  Multiplies feeding into
> > > __builtin_powi calls were not getting placed properly ahead of them in
> > > some cases.
> > > 
> > > Bootstrapped and tested on powerpc64-unknown-linux-gnu with no new
> > > regressions.  I've also run SPEC cpu2006 with no build or correctness
> > > issues.  OK for trunk?
> > > 
> > > Thanks,
> > > Bill
> > > 
> > > 
> > > gcc:
> > > 
> > > 2012-05-08  Bill Schmidt  
> > > 
> > >   PR tree-optimization/53217
> > >   * tree-ssa-reassoc.c (bip_map): New static variable.
> > >   (possibly_move_powi): Move feeding multiplies with __builtin_powi call.
> > >   (attempt_builtin_powi): Save feeding multiplies on a stack.
> > >   (reassociate_bb): Create and destroy bip_map.
> > > 
> > > gcc/testsuite:
> > > 
> > > 2012-05-08  Bill Schmidt  
> > > 
> > >   PR tree-optimization/53217
> > >   * gfortran.dg/pr53217.f90: New test.
> > > 
> > > 
> > > Index: gcc/testsuite/gfortran.dg/pr53217.f90
> > > ===
> > > --- gcc/testsuite/gfortran.dg/pr53217.f90 (revision 0)
> > > +++ gcc/testsuite/gfortran.dg/pr53217.f90 (revision 0)
> > > @@ -0,0 +1,28 @@
> > > +! { dg-do compile }
> > > +! { dg-options "-O1 -ffast-math" }
> > > +
> > > +! This tests only for compile-time failure, which formerly occurred
> > > +! when statements were emitted out of order, failing verify_ssa.
> > > +
> > > +MODULE xc_cs1
> > > +  INTEGER, PARAMETER :: dp=KIND(0.0D0)
> > > +  REAL(KIND=dp), PARAMETER :: a = 0.04918_dp, &
> > > +  c = 0.2533_dp, &
> > > +  d = 0.349_dp
> > > +CONTAINS
> > > +  SUBROUTINE cs1_u_2 ( rho, grho, r13, e_rho_rho, e_rho_ndrho, 
> > > e_ndrho_ndrho,&
> > > +   npoints, error)
> > > +REAL(KIND=dp), DIMENSION(*), &
> > > +  INTENT(INOUT)  :: e_rho_rho, e_rho_ndrho, &
> > > +e_ndrho_ndrho
> > > +DO ip = 1, npoints
> > > +  IF ( rho(ip) > eps_rho ) THEN
> > > + oc = 1.0_dp/(r*r*r3*r3 + c*g*g)
> > > + d2rF4 = c4p*f13*f23*g**4*r3/r * 
> > > (193*d*r**5*r3*r3+90*d*d*r**5*r3 &
> > > + -88*g*g*c*r**3*r3-100*d*d*c*g*g*r*r*r3*r3 &
> > > + +104*r**6)*od**3*oc**4
> > > + e_rho_rho(ip) = e_rho_rho(ip) + d2F1 + d2rF2 + d2F3 + d2rF4
> > > +  END IF
> > > +END DO
> > > +  END SUBROUTINE cs1_u_2
> > > +END MODULE xc_cs1
> > > Index: gcc/tree-ssa-reassoc.c
> > > ===
> > > --- gcc/tree-ssa-reassoc.c(revision 187117)
> > > +++ gcc/tree-ssa-reassoc.c(working copy)
> > > @@ -200,6 +200,10 @@ static long *bb_rank;
> > >  /* Operand->rank hashtable.  */
> > >  static struct pointer_map_t *operand_rank;
> > > 
> > > +/* Map from inserted __builtin_powi calls to multiply chains that
> > > +   feed them.  */
> > > +static struct pointer_map_t *bip_map;
> > > +
> > >  /* Forward decls.  */
> > >  static long get_rank (tree);
> > > 

Re: PING: [PATCH] Fix PR53217

2012-05-16 Thread William J. Schmidt
On Wed, 2012-05-16 at 14:05 +0200, Richard Guenther wrote:
> On Wed, 16 May 2012, William J. Schmidt wrote:
> 
> > On Wed, 2012-05-16 at 11:45 +0200, Richard Guenther wrote:
> > > On Tue, 15 May 2012, William J. Schmidt wrote:
> > > 
> > > > Ping.
> > > 
> > > I don't like it too much - but pondering a bit over it I can't find
> > > a nicer solution.
> > > 
> > > So, ok.
> > > 
> > > Thanks,
> > > Richard.
> > > 
> > Agreed.  I'm not fond of it either, and I feel it's a bit fragile.
> > 
> > An alternative would be to go back to handling the exponentiation
> > expressions outside of the ops list (generating an explicit multiply to
> > hook them up with the results of normal linear/parallel expansion).  In
> > hindsight, placing the exponentiation results in the ops list and
> > letting the rank order handle things introduces some complexity as well
> > as saving some.  The DAG'd nature of the exponentiation expressions
> > isn't a perfect fit for the pure tree form of the reassociated
> > multiplies.
> 
> True.
> 
> > Let me know if you'd like me to pursue that instead.
> 
> You can try - if the result looks better I'm all for it ;)
> 
OK. :)  I'll commit this for now to deal with the fallout, and work on
the alternative version in my spare time.

Thanks,
Bill

> Thanks,
> Richard.
> 
> > Thanks,
> > Bill
> > 
> > > > Thanks,
> > > > Bill
> > > > 
> > > > On Tue, 2012-05-08 at 22:04 -0500, William J. Schmidt wrote:
> > > > > This fixes another statement-placement issue when reassociating
> > > > > expressions with repeated factors.  Multiplies feeding into
> > > > > __builtin_powi calls were not getting placed properly ahead of them in
> > > > > some cases.
> > > > > 
> > > > > Bootstrapped and tested on powerpc64-unknown-linux-gnu with no new
> > > > > regressions.  I've also run SPEC cpu2006 with no build or correctness
> > > > > issues.  OK for trunk?
> > > > > 
> > > > > Thanks,
> > > > > Bill
> > > > > 
> > > > > 
> > > > > gcc:
> > > > > 
> > > > > 2012-05-08  Bill Schmidt  
> > > > > 
> > > > >   PR tree-optimization/53217
> > > > >   * tree-ssa-reassoc.c (bip_map): New static variable.
> > > > >   (possibly_move_powi): Move feeding multiplies with 
> > > > > __builtin_powi call.
> > > > >   (attempt_builtin_powi): Save feeding multiplies on a stack.
> > > > >   (reassociate_bb): Create and destroy bip_map.
> > > > > 
> > > > > gcc/testsuite:
> > > > > 
> > > > > 2012-05-08  Bill Schmidt  
> > > > > 
> > > > >   PR tree-optimization/53217
> > > > >   * gfortran.dg/pr53217.f90: New test.
> > > > > 
> > > > > 
> > > > > Index: gcc/testsuite/gfortran.dg/pr53217.f90
> > > > > ===
> > > > > --- gcc/testsuite/gfortran.dg/pr53217.f90 (revision 0)
> > > > > +++ gcc/testsuite/gfortran.dg/pr53217.f90 (revision 0)
> > > > > @@ -0,0 +1,28 @@
> > > > > +! { dg-do compile }
> > > > > +! { dg-options "-O1 -ffast-math" }
> > > > > +
> > > > > +! This tests only for compile-time failure, which formerly occurred
> > > > > +! when statements were emitted out of order, failing verify_ssa.
> > > > > +
> > > > > +MODULE xc_cs1
> > > > > +  INTEGER, PARAMETER :: dp=KIND(0.0D0)
> > > > > +  REAL(KIND=dp), PARAMETER :: a = 0.04918_dp, &
> > > > > +  c = 0.2533_dp, &
> > > > > +  d = 0.349_dp
> > > > > +CONTAINS
> > > > > +  SUBROUTINE cs1_u_2 ( rho, grho, r13, e_rho_rho, e_rho_ndrho, 
> > > > > e_ndrho_ndrho,&
> > > > > +   npoints, error)
> > > > > +REAL(KIND=dp), DIMENSION(*), &
> > > > > +  INTENT(INOUT)  :: e_rho_rho, 
> > > > > e_rho_ndrho, &
> > > > > +e_ndrho_ndrho
> > >

Ping: [PATCH] Hoist adjacent pointer loads

2012-05-16 Thread William J. Schmidt
Ping.

Thanks,
Bill

On Thu, 2012-05-03 at 09:33 -0500, William J. Schmidt wrote:
> This patch was posted for comment back in February during stage 4.  It
> addresses a performance issue noted in the EEMBC routelookup benchmark
> on a common idiom:
> 
>   if (...)
> x = y->left;
>   else
> x = y->right;
> 
> If the two loads can be hoisted out of the if/else, the if/else can be
> replaced by a conditional move instruction on architectures that support
> one.  Because this speculates one of the loads, the patch constrains the
> optimization to avoid introducing page faults.
> 
> Bootstrapped and regression tested on powerpc-unknown-linux-gnu with no
> new failures.  The patch provides significant improvement to the
> routelookup benchmark, and is neutral on SPEC cpu2000/cpu2006.
> 
> One question is what optimization level should be required for this.
> Because of the speculation, -O3 might be in order.  I don't believe
> -Ofast is required as there is no potential correctness issue involved.
> Right now the patch doesn't check the optimization level (like the rest
> of the phi-opt transforms), which is likely a poor choice.
> 
> Ok for trunk?
> 
> Thanks,
> Bill
> 
> 
> 2012-05-03  Bill Schmidt  
> 
>   * tree-ssa-phiopt.c (tree_ssa_phiopt_worker): Add argument to forward
>   declaration.
>   (hoist_adjacent_loads, gate_hoist_loads): New forward declarations.
>   (tree_ssa_phiopt): Call gate_hoist_loads.
>   (tree_ssa_cs_elim): Add parm to tree_ssa_phiopt_worker call.
>   (tree_ssa_phiopt_worker): Add do_hoist_loads to formal arg list; call
>   hoist_adjacent_loads.
>   (local_reg_dependence): New function.
>   (local_mem_dependence): Likewise.
>   (hoist_adjacent_loads): Likewise.
>   (gate_hoist_loads): Likewise.
>   * common.opt (fhoist-adjacent-loads): New switch.
>   * Makefile.in (tree-ssa-phiopt.o): Added dependencies.
>   * params.def (PARAM_MIN_CMOVE_STRUCT_ALIGN): New param.
> 
> 
> Index: gcc/tree-ssa-phiopt.c
> ===
> --- gcc/tree-ssa-phiopt.c (revision 187057)
> +++ gcc/tree-ssa-phiopt.c (working copy)
> @@ -37,9 +37,17 @@ along with GCC; see the file COPYING3.  If not see
>  #include "cfgloop.h"
>  #include "tree-data-ref.h"
>  #include "tree-pretty-print.h"
> +#include "gimple-pretty-print.h"
> +#include "insn-config.h"
> +#include "expr.h"
> +#include "optabs.h"
> 
> +#ifndef HAVE_conditional_move
> +#define HAVE_conditional_move (0)
> +#endif
> +
>  static unsigned int tree_ssa_phiopt (void);
> -static unsigned int tree_ssa_phiopt_worker (bool);
> +static unsigned int tree_ssa_phiopt_worker (bool, bool);
>  static bool conditional_replacement (basic_block, basic_block,
>edge, edge, gimple, tree, tree);
>  static int value_replacement (basic_block, basic_block,
> @@ -53,6 +61,9 @@ static bool cond_store_replacement (basic_block, b
>  static bool cond_if_else_store_replacement (basic_block, basic_block, 
> basic_block);
>  static struct pointer_set_t * get_non_trapping (void);
>  static void replace_phi_edge_with_variable (basic_block, edge, gimple, tree);
> +static void hoist_adjacent_loads (basic_block, basic_block,
> +   basic_block, basic_block);
> +static bool gate_hoist_loads (void);
> 
>  /* This pass tries to replaces an if-then-else block with an
> assignment.  We have four kinds of transformations.  Some of these
> @@ -138,12 +149,56 @@ static void replace_phi_edge_with_variable (basic_
>   bb2:
> x = PHI ;
> 
> -   A similar transformation is done for MAX_EXPR.  */
> +   A similar transformation is done for MAX_EXPR.
> 
> +
> +   This pass also performs a fifth transformation of a slightly different
> +   flavor.
> +
> +   Adjacent Load Hoisting
> +   ----------------------
> +   
> +   This transformation replaces
> +
> + bb0:
> +   if (...) goto bb2; else goto bb1;
> + bb1:
> +   x1 = (<expr>).field1;
> +   goto bb3;
> + bb2:
> +   x2 = (<expr>).field2;
> + bb3:
> +   # x = PHI <x1, x2>;
> +
> +   with
> +
> + bb0:
> +   x1 = (<expr>).field1;
> +   x2 = (<expr>).field2;
> +   if (...) goto bb2; else goto bb1;
> + bb1:
> +   goto bb3;
> + bb2:
> + bb3:
> +   # x = PHI <x1, x2>;
> +
> +   The purpose of this transformation is to enable generation of conditional
> +   move instructions such as Intel CMOVE or PowerPC ISEL.  Because one of
> +   the loads is speculative, the transformation 

Re: [PATCH][1/n] Improve vectorization in PR53355

2012-05-16 Thread William J. Schmidt
On Tue, 2012-05-15 at 14:17 +0200, Richard Guenther wrote:
> This is the first patch to make the generated code for the testcase
> in PR53355 better.  It teaches VRP about LSHIFT_EXPRs (albeit only
> of a very simple form).
> 
> Bootstrapped on x86_64-unknown-linux-gnu, testing in progress.

This appears to have caused
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53385.

Thanks,
Bill

> 
> Richard.
> 
> 2012-05-15  Richard Guenther  
> 
>   PR tree-optimization/53355
>   * tree-vrp.c (extract_range_from_binary_expr_1): Handle LSHIFT_EXPRs
>   by constants.
> 
>   * gcc.dg/tree-ssa/vrp67.c: New testcase.
> 
> Index: gcc/tree-vrp.c
> ===
> *** gcc/tree-vrp.c(revision 187503)
> --- gcc/tree-vrp.c(working copy)
> *** extract_range_from_binary_expr_1 (value_
> *** 2403,2408 
> --- 2403,2409 
> && code != ROUND_DIV_EXPR
> && code != TRUNC_MOD_EXPR
> && code != RSHIFT_EXPR
> +   && code != LSHIFT_EXPR
> && code != MIN_EXPR
> && code != MAX_EXPR
> && code != BIT_AND_EXPR
> *** extract_range_from_binary_expr_1 (value_
> *** 2596,2601 
> --- 2597,2636 
> extract_range_from_multiplicative_op_1 (vr, code, &vr0, &vr1);
> return;
>   }
> +   else if (code == LSHIFT_EXPR)
> + {
> +   /* If we have a LSHIFT_EXPR with any shift values outside [0..prec-1],
> +  then drop to VR_VARYING.  Outside of this range we get undefined
> +  behavior from the shift operation.  We cannot even trust
> +  SHIFT_COUNT_TRUNCATED at this stage, because that applies to rtl
> +  shifts, and the operation at the tree level may be widened.  */
> +   if (vr1.type != VR_RANGE
> +   || !value_range_nonnegative_p (&vr1)
> +   || TREE_CODE (vr1.max) != INTEGER_CST
> +   || compare_tree_int (vr1.max, TYPE_PRECISION (expr_type) - 1) == 1)
> + {
> +   set_value_range_to_varying (vr);
> +   return;
> + }
> + 
> +   /* We can map shifts by constants to MULT_EXPR handling.  */
> +   if (range_int_cst_singleton_p (&vr1))
> + {
> +   value_range_t vr1p = { VR_RANGE, NULL_TREE, NULL_TREE, NULL };
> +   vr1p.min
> + = double_int_to_tree (expr_type,
> +   double_int_lshift (double_int_one,
> +  TREE_INT_CST_LOW (vr1.min),
> +  TYPE_PRECISION (expr_type),
> +  false));
> +   vr1p.max = vr1p.min;
> +   extract_range_from_multiplicative_op_1 (vr, MULT_EXPR, &vr0, &vr1p);
> +   return;
> + }
> + 
> +   set_value_range_to_varying (vr);
> +   return;
> + }
> else if (code == TRUNC_DIV_EXPR
>  || code == FLOOR_DIV_EXPR
>  || code == CEIL_DIV_EXPR
> Index: gcc/testsuite/gcc.dg/tree-ssa/vrp67.c
> ===
> *** gcc/testsuite/gcc.dg/tree-ssa/vrp67.c (revision 0)
> --- gcc/testsuite/gcc.dg/tree-ssa/vrp67.c (revision 0)
> ***
> *** 0 
> --- 1,38 
> + /* { dg-do compile } */
> + /* { dg-options "-O2 -fdump-tree-vrp1" } */
> + 
> + unsigned foo (unsigned i)
> + {
> +   if (i == 2)
> + {
> +   i = i << 2;
> +   if (i != 8)
> + link_error ();
> + }
> +   return i;
> + }
> + unsigned bar (unsigned i)
> + {
> +   if (i == 1 << (sizeof (unsigned) * 8 - 1))
> + {
> +   i = i << 1;
> +   if (i != 0)
> + link_error ();
> + }
> +   return i;
> + }
> + unsigned baz (unsigned i)
> + {
> +   i = i & 15;
> +   if (i == 0)
> + return 0;
> +   i = 1000 - i;
> +   i >>= 1;
> +   i <<= 1;
> +   if (i == 0)
> + link_error ();
> +   return i;
> + }
> + 
> + /* { dg-final { scan-tree-dump-times "Folding predicate" 3 "vrp1" } } */
> + /* { dg-final { cleanup-tree-dump "vrp1" } } */



[PATCH] Simplify attempt_builtin_powi logic

2012-05-17 Thread William J. Schmidt
This patch gives up on using the reassociation rank algorithm to
correctly place __builtin_powi calls and their feeding multiplies.  In
the end this proved to introduce more complexity than it saved, due in
part to the poor fit of introducing DAG expressions into the
reassociated operand tree.  This patch returns to generating explicit
multiplies to bind the builtin calls together and to the results of the
expression tree rewrite.  I feel this version is smaller, easier to
understand, and less fragile than the existing code.
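
Conceptually, for a chain like x*x*x*y*z (an illustrative sketch of
the intended rewrite under -ffast-math, not the patch text):

/* The pass now emits
     t = __builtin_powi (x, 3);    -- absorbed repeated factors
     u = y * z;                    -- ordinary reassociation
     r = t * u;                    -- explicit final multiply
   instead of threading t back through the ops vector and relying on
   rank order to place it.  */
double
f (double x, double y, double z)
{
  return x * x * x * y * z;
}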

Bootstrapped and tested on powerpc64-unknown-linux-gnu with no new
regressions.  Ok for trunk?

Thanks,
Bill


2012-05-17  Bill Schmidt  

* tree-ssa-reassoc.c (bip_map): Remove decl.
(completely_remove_stmt): Remove function.
(remove_def_if_absorbed_call): Remove function.
(remove_visited_stmt_chain): Remove __builtin_powi handling.
(possibly_move_powi): Remove function.
(rewrite_expr_tree): Remove calls to possibly_move_powi.
(rewrite_expr_tree_parallel): Likewise.
(attempt_builtin_powi): Build multiplies explicitly rather than
relying on the ops vector and rank system.
(transform_stmt_to_copy): New function.
(transform_stmt_to_multiply): Likewise.
(reassociate_bb): Handle leftover operations after __builtin_powi
optimization; build a final multiply if necessary.


Index: gcc/tree-ssa-reassoc.c
===
--- gcc/tree-ssa-reassoc.c  (revision 187626)
+++ gcc/tree-ssa-reassoc.c  (working copy)
@@ -200,10 +200,6 @@ static long *bb_rank;
 /* Operand->rank hashtable.  */
 static struct pointer_map_t *operand_rank;
 
-/* Map from inserted __builtin_powi calls to multiply chains that
-   feed them.  */
-static struct pointer_map_t *bip_map;
-
 /* Forward decls.  */
 static long get_rank (tree);
 
@@ -2184,32 +2180,6 @@ is_phi_for_stmt (gimple stmt, tree operand)
   return false;
 }
 
-/* Remove STMT, unlink its virtual defs, and release its SSA defs.  */
-
-static inline void
-completely_remove_stmt (gimple stmt)
-{
-  gimple_stmt_iterator gsi = gsi_for_stmt (stmt);
-  gsi_remove (&gsi, true);
-  unlink_stmt_vdef (stmt);
-  release_defs (stmt);
-}
-
-/* If OP is defined by a builtin call that has been absorbed by
-   reassociation, remove its defining statement completely.  */
-
-static inline void
-remove_def_if_absorbed_call (tree op)
-{
-  gimple stmt;
-
-  if (TREE_CODE (op) == SSA_NAME
-  && has_zero_uses (op)
-  && is_gimple_call ((stmt = SSA_NAME_DEF_STMT (op)))
-  && gimple_visited_p (stmt))
-completely_remove_stmt (stmt);
-}
-
 /* Remove def stmt of VAR if VAR has zero uses and recurse
on rhs1 operand if so.  */
 
@@ -2218,7 +2188,6 @@ remove_visited_stmt_chain (tree var)
 {
   gimple stmt;
   gimple_stmt_iterator gsi;
-  tree var2;
 
   while (1)
 {
@@ -2228,95 +2197,15 @@ remove_visited_stmt_chain (tree var)
   if (is_gimple_assign (stmt) && gimple_visited_p (stmt))
{
  var = gimple_assign_rhs1 (stmt);
- var2 = gimple_assign_rhs2 (stmt);
  gsi = gsi_for_stmt (stmt);
  gsi_remove (&gsi, true);
  release_defs (stmt);
- /* A multiply whose operands are both fed by builtin pow/powi
-calls must check whether to remove rhs2 as well.  */
- remove_def_if_absorbed_call (var2);
}
-  else if (is_gimple_call (stmt) && gimple_visited_p (stmt))
-   {
- completely_remove_stmt (stmt);
- return;
-   }
   else
return;
 }
 }
 
-/* If OP is an SSA name, find its definition and determine whether it
-   is a call to __builtin_powi.  If so, move the definition prior to
-   STMT.  Only do this during early reassociation.  */
-
-static void
-possibly_move_powi (gimple stmt, tree op)
-{
-  gimple stmt2, *mpy;
-  tree fndecl;
-  gimple_stmt_iterator gsi1, gsi2;
-
-  if (!first_pass_instance
-  || !flag_unsafe_math_optimizations
-  || TREE_CODE (op) != SSA_NAME)
-return;
-  
-  stmt2 = SSA_NAME_DEF_STMT (op);
-
-  if (!is_gimple_call (stmt2)
-  || !has_single_use (gimple_call_lhs (stmt2)))
-return;
-
-  fndecl = gimple_call_fndecl (stmt2);
-
-  if (!fndecl
-  || DECL_BUILT_IN_CLASS (fndecl) != BUILT_IN_NORMAL)
-return;
-
-  switch (DECL_FUNCTION_CODE (fndecl))
-{
-CASE_FLT_FN (BUILT_IN_POWI):
-  break;
-default:
-  return;
-}
-
-  /* Move the __builtin_powi.  */
-  gsi1 = gsi_for_stmt (stmt);
-  gsi2 = gsi_for_stmt (stmt2);
-  gsi_move_before (&gsi2, &gsi1);
-
-  /* See if there are multiplies feeding the __builtin_powi base
- argument that must also be moved.  */
-  while ((mpy = (gimple *) pointer_map_contains (bip_map, stmt2)) != NULL)
-{
-  /* If we've already moved this statement, we're done.  This is
- identified by a NULL entry for the statement in bip_map.  */
-  gimple *next = (gimple *) pointer_map_con

[PATCH, rs6000] Fix PR53385

2012-05-18 Thread William J. Schmidt
This repairs the bootstrap issue due to unsafe signed overflow
assumptions.  Bootstrapped and tested on powerpc64-unknown-linux-gnu
with no new regressions.  Ok for trunk?
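
For reference, the hazard in isolation (a hypothetical reduction, not
the rs6000 code itself):

#include <limits.h>

/* The idiom being removed: left-shift a signed value until the sign
   bit is set.  The shift that sets the sign bit overflows signed
   arithmetic, which is undefined behavior.  */
static int
leading_zeros_bad (long val)
{
  int i;
  if (val < 0)
    return 0;
  for (i = 0; i < (int) (sizeof (long) * CHAR_BIT); i++)
    if ((val <<= 1) < 0)   /* UB on the iteration that breaks */
      break;
  return i + 1;
}

/* Well-defined replacement in the spirit of clz_hwi: count leading
   zeros on the unsigned reinterpretation (val must be nonzero).  */
static int
leading_zeros_ok (long val)
{
  return __builtin_clzl ((unsigned long) val);
}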

Thanks,
Bill


2012-05-18  Bill Schmidt  

* config/rs6000/rs6000.c (print_operand): Revise code that unsafely
relied on signed overflow behavior.


Index: gcc/config/rs6000/rs6000.c
===
--- gcc/config/rs6000/rs6000.c  (revision 187651)
+++ gcc/config/rs6000/rs6000.c  (working copy)
@@ -14679,7 +14679,6 @@ void
 print_operand (FILE *file, rtx x, int code)
 {
   int i;
-  HOST_WIDE_INT val;
   unsigned HOST_WIDE_INT uval;
 
   switch (code)
@@ -15120,34 +15119,17 @@ print_operand (FILE *file, rtx x, int code)
 
 case 'W':
   /* MB value for a PowerPC64 rldic operand.  */
-  val = (GET_CODE (x) == CONST_INT
-? INTVAL (x) : CONST_DOUBLE_HIGH (x));
+  i = clz_hwi (GET_CODE (x) == CONST_INT
+  ? INTVAL (x) : CONST_DOUBLE_HIGH (x));
 
-  if (val < 0)
-   i = -1;
-  else
-   for (i = 0; i < HOST_BITS_PER_WIDE_INT; i++)
- if ((val <<= 1) < 0)
-   break;
-
 #if HOST_BITS_PER_WIDE_INT == 32
-  if (GET_CODE (x) == CONST_INT && i >= 0)
+  if (GET_CODE (x) == CONST_INT && i > 0)
i += 32;  /* zero-extend high-part was all 0's */
   else if (GET_CODE (x) == CONST_DOUBLE && i == 32)
-   {
- val = CONST_DOUBLE_LOW (x);
-
- gcc_assert (val);
- if (val < 0)
-   --i;
- else
-   for ( ; i < 64; i++)
- if ((val <<= 1) < 0)
-   break;
-   }
+   i = clz_hwi (CONST_DOUBLE_LOW (x)) + 32;
 #endif
 
-  fprintf (file, "%d", i + 1);
+  fprintf (file, "%d", i);
   return;
 
 case 'x':




Re: [PATCH] Hoist adjacent pointer loads

2012-05-21 Thread William J. Schmidt
On Mon, 2012-05-21 at 14:17 +0200, Richard Guenther wrote:
> On Thu, May 3, 2012 at 4:33 PM, William J. Schmidt
>  wrote:
> > This patch was posted for comment back in February during stage 4.  It
> > addresses a performance issue noted in the EEMBC routelookup benchmark
> > on a common idiom:
> >
> >  if (...)
> >x = y->left;
> >  else
> >x = y->right;
> >
> > If the two loads can be hoisted out of the if/else, the if/else can be
> > replaced by a conditional move instruction on architectures that support
> > one.  Because this speculates one of the loads, the patch constrains the
> > optimization to avoid introducing page faults.
> >
> > Bootstrapped and regression tested on powerpc-unknown-linux-gnu with no
> > new failures.  The patch provides significant improvement to the
> > routelookup benchmark, and is neutral on SPEC cpu2000/cpu2006.
> >
> > One question is what optimization level should be required for this.
> > Because of the speculation, -O3 might be in order.  I don't believe
> > -Ofast is required as there is no potential correctness issue involved.
> > Right now the patch doesn't check the optimization level (like the rest
> > of the phi-opt transforms), which is likely a poor choice.
> >
> > Ok for trunk?
> >
> > Thanks,
> > Bill
> >
> >
> > 2012-05-03  Bill Schmidt  
> >
> >* tree-ssa-phiopt.c (tree_ssa_phiopt_worker): Add argument to forward
> >declaration.
> >(hoist_adjacent_loads, gate_hoist_loads): New forward declarations.
> >(tree_ssa_phiopt): Call gate_hoist_loads.
> >(tree_ssa_cs_elim): Add parm to tree_ssa_phiopt_worker call.
> >(tree_ssa_phiopt_worker): Add do_hoist_loads to formal arg list; call
> >hoist_adjacent_loads.
> >(local_reg_dependence): New function.
> >(local_mem_dependence): Likewise.
> >(hoist_adjacent_loads): Likewise.
> >(gate_hoist_loads): Likewise.
> >* common.opt (fhoist-adjacent-loads): New switch.
> >* Makefile.in (tree-ssa-phiopt.o): Added dependencies.
> >* params.def (PARAM_MIN_CMOVE_STRUCT_ALIGN): New param.
> >
> >
> > Index: gcc/tree-ssa-phiopt.c
> > ===
> > --- gcc/tree-ssa-phiopt.c   (revision 187057)
> > +++ gcc/tree-ssa-phiopt.c   (working copy)
> > @@ -37,9 +37,17 @@ along with GCC; see the file COPYING3.  If not see
> >  #include "cfgloop.h"
> >  #include "tree-data-ref.h"
> >  #include "tree-pretty-print.h"
> > +#include "gimple-pretty-print.h"
> > +#include "insn-config.h"
> > +#include "expr.h"
> > +#include "optabs.h"
> >
> > +#ifndef HAVE_conditional_move
> > +#define HAVE_conditional_move (0)
> > +#endif
> > +
> >  static unsigned int tree_ssa_phiopt (void);
> > -static unsigned int tree_ssa_phiopt_worker (bool);
> > +static unsigned int tree_ssa_phiopt_worker (bool, bool);
> >  static bool conditional_replacement (basic_block, basic_block,
> > edge, edge, gimple, tree, tree);
> >  static int value_replacement (basic_block, basic_block,
> > @@ -53,6 +61,9 @@ static bool cond_store_replacement (basic_block, b
> >  static bool cond_if_else_store_replacement (basic_block, basic_block, 
> > basic_block);
> >  static struct pointer_set_t * get_non_trapping (void);
> >  static void replace_phi_edge_with_variable (basic_block, edge, gimple, 
> > tree);
> > +static void hoist_adjacent_loads (basic_block, basic_block,
> > + basic_block, basic_block);
> > +static bool gate_hoist_loads (void);
> >
> >  /* This pass tries to replaces an if-then-else block with an
> >assignment.  We have four kinds of transformations.  Some of these
> > @@ -138,12 +149,56 @@ static void replace_phi_edge_with_variable (basic_
> >  bb2:
> >x = PHI ;
> >
> > -   A similar transformation is done for MAX_EXPR.  */
> > +   A similar transformation is done for MAX_EXPR.
> >
> > +
> > +   This pass also performs a fifth transformation of a slightly different
> > +   flavor.
> > +
> > +   Adjacent Load Hoisting
> > +   --
> > +   ----------------------
> > +   This transformation replaces
> > +
> > + bb0:
> > +   if (...) goto bb2; else goto bb1;
> > + bb1:
> > +   x1 = ().field1;
> > +   x1 = (<expr>).field1;

Re: [PATCH] Hoist adjacent pointer loads

2012-05-22 Thread William J. Schmidt
Here's a revision of the hoist-adjacent-loads patch.  Besides hopefully
addressing all your comments, I added a gate of at least -O2 for this
transformation.  Let me know if you prefer a different minimum opt
level.

I'm still running SPEC tests to make sure there are no regressions when
opening this up to non-pointer arguments.  The code bootstraps on
powerpc64-unknown-linux-gnu with no regressions.  Assuming the SPEC
numbers come out as expected, is this ok?

Thanks,
Bill


2012-05-22  Bill Schmidt  

* tree-ssa-phiopt.c (tree_ssa_phiopt_worker): Add argument to forward
declaration.
(hoist_adjacent_loads, gate_hoist_loads): New forward declarations.
(tree_ssa_phiopt): Call gate_hoist_loads.
(tree_ssa_cs_elim): Add parm to tree_ssa_phiopt_worker call.
(tree_ssa_phiopt_worker): Add do_hoist_loads to formal arg list; call
hoist_adjacent_loads.
(local_mem_dependence): New function.
(hoist_adjacent_loads): Likewise.
(gate_hoist_loads): Likewise.
* common.opt (fhoist-adjacent-loads): New switch.
* Makefile.in (tree-ssa-phiopt.o): Added dependencies.
* params.def (PARAM_MIN_CMOVE_STRUCT_ALIGN): New param.


Index: gcc/tree-ssa-phiopt.c
===
--- gcc/tree-ssa-phiopt.c   (revision 187728)
+++ gcc/tree-ssa-phiopt.c   (working copy)
@@ -37,9 +37,17 @@ along with GCC; see the file COPYING3.  If not see
 #include "cfgloop.h"
 #include "tree-data-ref.h"
 #include "tree-pretty-print.h"
+#include "gimple-pretty-print.h"
+#include "insn-config.h"
+#include "expr.h"
+#include "optabs.h"
 
+#ifndef HAVE_conditional_move
+#define HAVE_conditional_move (0)
+#endif
+
 static unsigned int tree_ssa_phiopt (void);
-static unsigned int tree_ssa_phiopt_worker (bool);
+static unsigned int tree_ssa_phiopt_worker (bool, bool);
 static bool conditional_replacement (basic_block, basic_block,
 edge, edge, gimple, tree, tree);
 static int value_replacement (basic_block, basic_block,
@@ -53,6 +61,9 @@ static bool cond_store_replacement (basic_block, b
 static bool cond_if_else_store_replacement (basic_block, basic_block, 
basic_block);
 static struct pointer_set_t * get_non_trapping (void);
 static void replace_phi_edge_with_variable (basic_block, edge, gimple, tree);
+static void hoist_adjacent_loads (basic_block, basic_block,
+ basic_block, basic_block);
+static bool gate_hoist_loads (void);
 
 /* This pass tries to replaces an if-then-else block with an
assignment.  We have four kinds of transformations.  Some of these
@@ -138,12 +149,56 @@ static void replace_phi_edge_with_variable (basic_
  bb2:
     x = PHI <x' (bb0), ...>;
 
-   A similar transformation is done for MAX_EXPR.  */
+   A similar transformation is done for MAX_EXPR.
 
+
+   This pass also performs a fifth transformation of a slightly different
+   flavor.
+
+   Adjacent Load Hoisting
+   ----------------------
+   
+   This transformation replaces
+
+ bb0:
+   if (...) goto bb2; else goto bb1;
+ bb1:
+   x1 = (<expr>).field1;
+   goto bb3;
+ bb2:
+   x2 = (<expr>).field2;
+ bb3:
+   # x = PHI <x1, x2>;
+
+   with
+
+ bb0:
+   x1 = (<expr>).field1;
+   x2 = (<expr>).field2;
+   if (...) goto bb2; else goto bb1;
+ bb1:
+   goto bb3;
+ bb2:
+ bb3:
+   # x = PHI <x1, x2>;
+
+   The purpose of this transformation is to enable generation of conditional
+   move instructions such as Intel CMOVE or PowerPC ISEL.  Because one of
+   the loads is speculative, the transformation is restricted to very
+   specific cases to avoid introducing a page fault.  We are looking for
+   the common idiom:
+
+ if (...)
+   x = y->left;
+ else
+   x = y->right;
+
+   where left and right are typically adjacent pointers in a tree structure.  */
+
 static unsigned int
 tree_ssa_phiopt (void)
 {
-  return tree_ssa_phiopt_worker (false);
+  return tree_ssa_phiopt_worker (false, gate_hoist_loads ());
 }
 
 /* This pass tries to transform conditional stores into unconditional
@@ -190,7 +245,7 @@ tree_ssa_phiopt (void)
 static unsigned int
 tree_ssa_cs_elim (void)
 {
-  return tree_ssa_phiopt_worker (true);
+  return tree_ssa_phiopt_worker (true, false);
 }
 
 /* Return the singleton PHI in the SEQ of PHIs for edges E0 and E1. */
@@ -227,9 +282,11 @@ static tree condstoretemp;
 /* The core routine of conditional store replacement and normal
phi optimizations.  Both share much of the infrastructure in how
to match applicable basic block patterns.  DO_STORE_ELIM is true
-   when we want to do conditional store replacement, false otherwise.  */
+   when we want to do conditional store replacement, false otherwise.
+   DO_HOIST_LOADS is true when we want to hoist adjacent loads out 
+   of diamond control flow patterns, false otherwise.  */
 static unsigned int
-tree_ssa_phiopt_worker (bool do_store_elim)
+tree_ss

Re: [PATCH] Hoist adjacent pointer loads

2012-05-23 Thread William J. Schmidt
On Wed, 2012-05-23 at 13:25 +0200, Richard Guenther wrote:
> On Tue, 22 May 2012, William J. Schmidt wrote:
> 
> > Here's a revision of the hoist-adjacent-loads patch.  Besides hopefully
> > addressing all your comments, I added a gate of at least -O2 for this
> > transformation.  Let me know if you prefer a different minimum opt
> > level.
> > 
> > I'm still running SPEC tests to make sure there are no regressions when
> > opening this up to non-pointer arguments.  The code bootstraps on
> > powerpc64-unknown-linux-gnu with no regressions.  Assuming the SPEC
> > numbers come out as expected, is this ok?
> > 
> > Thanks,
> > Bill
> > 
> > 
> > 2012-05-22  Bill Schmidt  
> > 
> > * tree-ssa-phiopt.c (tree_ssa_phiopt_worker): Add argument to forward
> > declaration.
> > (hoist_adjacent_loads, gate_hoist_loads): New forward declarations.
> > (tree_ssa_phiopt): Call gate_hoist_loads.
> > (tree_ssa_cs_elim): Add parm to tree_ssa_phiopt_worker call.
> > (tree_ssa_phiopt_worker): Add do_hoist_loads to formal arg list; call
> > hoist_adjacent_loads.
> > (local_mem_dependence): New function.
> > (hoist_adjacent_loads): Likewise.
> > (gate_hoist_loads): Likewise.
> > * common.opt (fhoist-adjacent-loads): New switch.
> > * Makefile.in (tree-ssa-phiopt.o): Added dependencies.
> > * params.def (PARAM_MIN_CMOVE_STRUCT_ALIGN): New param.
> > 
> > 
> > Index: gcc/tree-ssa-phiopt.c
> > ===
> > --- gcc/tree-ssa-phiopt.c   (revision 187728)
> > +++ gcc/tree-ssa-phiopt.c   (working copy)
> > @@ -37,9 +37,17 @@ along with GCC; see the file COPYING3.  If not see
> >  #include "cfgloop.h"
> >  #include "tree-data-ref.h"
> >  #include "tree-pretty-print.h"
> > +#include "gimple-pretty-print.h"
> > +#include "insn-config.h"
> > +#include "expr.h"
> > +#include "optabs.h"
> >  
> > +#ifndef HAVE_conditional_move
> > +#define HAVE_conditional_move (0)
> > +#endif
> > +
> >  static unsigned int tree_ssa_phiopt (void);
> > -static unsigned int tree_ssa_phiopt_worker (bool);
> > +static unsigned int tree_ssa_phiopt_worker (bool, bool);
> >  static bool conditional_replacement (basic_block, basic_block,
> >  edge, edge, gimple, tree, tree);
> >  static int value_replacement (basic_block, basic_block,
> > @@ -53,6 +61,9 @@ static bool cond_store_replacement (basic_block, b
> >  static bool cond_if_else_store_replacement (basic_block, basic_block, 
> > basic_block);
> >  static struct pointer_set_t * get_non_trapping (void);
> >  static void replace_phi_edge_with_variable (basic_block, edge, gimple, 
> > tree);
> > +static void hoist_adjacent_loads (basic_block, basic_block,
> > + basic_block, basic_block);
> > +static bool gate_hoist_loads (void);
> >  
> >  /* This pass tries to replace an if-then-else block with an
> > assignment.  We have four kinds of transformations.  Some of these
> > @@ -138,12 +149,56 @@ static void replace_phi_edge_with_variable (basic_
> >   bb2:
> > x = PHI ;
> >  
> > -   A similar transformation is done for MAX_EXPR.  */
> > +   A similar transformation is done for MAX_EXPR.
> >  
> > +
> > +   This pass also performs a fifth transformation of a slightly different
> > +   flavor.
> > +
> > +   Adjacent Load Hoisting
> > +   ----------------------
> > +
> > +   This transformation replaces
> > +
> > + bb0:
> > +   if (...) goto bb2; else goto bb1;
> > + bb1:
> > +   x1 = (<expr>).field1;
> > +   goto bb3;
> > + bb2:
> > +   x2 = (<expr>).field2;
> > + bb3:
> > +   # x = PHI <x1, x2>;
> > +
> > +   with
> > +
> > + bb0:
> > +   x1 = (<expr>).field1;
> > +   x2 = (<expr>).field2;
> > +   if (...) goto bb2; else goto bb1;
> > + bb1:
> > +   goto bb3;
> > + bb2:
> > + bb3:
> > +   # x = PHI <x1, x2>;
> > +
> > +   The purpose of this transformation is to enable generation of 
> > conditional
> > +   move instructions such as Intel CMOVE or PowerPC ISEL.  Because one of
> > +   the loads is speculative, the transformation is restricted to very
> > +   specific cases to avoid introducing a page fault.  We are looking for
> > +   the common idiom

Re: [PATCH] Hoist adjacent pointer loads

2012-06-04 Thread William J. Schmidt
Hi Richard,

Here's a revision of the hoist-adjacent-loads patch.  I'm sorry for the
delay since the last revision, but my performance testing has been
blocked waiting for a fix to PR53487.  I ended up applying a test
version of the patch to 4.7 and ran performance numbers with that
instead, with no degradations.

In addition to addressing your comments, this patch contains one bug fix
where local_mem_dependence was called on the wrong blocks after swapping
def1 and def2.
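
For reviewers joining the thread late, the source-level shape the pass
targets looks like the sketch below.  The struct and function names are
invented for illustration; the point is that left and right sit side by
side in one structure, so when the non-speculative load is safe the
hoisted one cannot fault on a new page.

struct node
{
  struct node *left;
  struct node *right;
};

struct node *
pick (struct node *y, int cond)
{
  struct node *x;

  /* Before phiopt, one of these loads sits behind the branch; after
     hoisting, both loads issue unconditionally and the result is
     selected with a conditional move (e.g. PowerPC isel).  */
  if (cond)
    x = y->left;
  else
    x = y->right;
  return x;
}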

Bootstrapped with no regressions on powerpc64-unknown-linux-gnu.  Is
this version ok for trunk?  I won't commit it until I can do final
testing on trunk in conjunction with a fix for PR53487.

Thanks,
Bill


2012-06-04  Bill Schmidt  

* opts.c: Add -fhoist-adjacent-loads to -O2 and above.
* tree-ssa-phiopt.c (tree_ssa_phiopt_worker): Add argument to forward
declaration.
(hoist_adjacent_loads, gate_hoist_loads): New forward declarations.
(tree_ssa_phiopt): Call gate_hoist_loads.
(tree_ssa_cs_elim): Add parm to tree_ssa_phiopt_worker call.
(tree_ssa_phiopt_worker): Add do_hoist_loads to formal arg list; call
hoist_adjacent_loads.
(local_mem_dependence): New function.
(hoist_adjacent_loads): Likewise.
(gate_hoist_loads): Likewise.
* common.opt (fhoist-adjacent-loads): New switch.
* Makefile.in (tree-ssa-phiopt.o): Added dependencies.
* params.def (PARAM_MIN_CMOVE_STRUCT_ALIGN): New param.


Index: gcc/opts.c
===
--- gcc/opts.c  (revision 187805)
+++ gcc/opts.c  (working copy)
@@ -489,6 +489,7 @@ static const struct default_options default_option
 { OPT_LEVELS_2_PLUS, OPT_falign_functions, NULL, 1 },
 { OPT_LEVELS_2_PLUS, OPT_ftree_tail_merge, NULL, 1 },
 { OPT_LEVELS_2_PLUS_SPEED_ONLY, OPT_foptimize_strlen, NULL, 1 },
+{ OPT_LEVELS_2_PLUS, OPT_fhoist_adjacent_loads, NULL, 1 },
 
 /* -O3 optimizations.  */
 { OPT_LEVELS_3_PLUS, OPT_ftree_loop_distribute_patterns, NULL, 1 },
Index: gcc/tree-ssa-phiopt.c
===
--- gcc/tree-ssa-phiopt.c   (revision 187805)
+++ gcc/tree-ssa-phiopt.c   (working copy)
@@ -37,9 +37,17 @@ along with GCC; see the file COPYING3.  If not see
 #include "cfgloop.h"
 #include "tree-data-ref.h"
 #include "tree-pretty-print.h"
+#include "gimple-pretty-print.h"
+#include "insn-config.h"
+#include "expr.h"
+#include "optabs.h"
 
+#ifndef HAVE_conditional_move
+#define HAVE_conditional_move (0)
+#endif
+
 static unsigned int tree_ssa_phiopt (void);
-static unsigned int tree_ssa_phiopt_worker (bool);
+static unsigned int tree_ssa_phiopt_worker (bool, bool);
 static bool conditional_replacement (basic_block, basic_block,
 edge, edge, gimple, tree, tree);
 static int value_replacement (basic_block, basic_block,
@@ -53,6 +61,9 @@ static bool cond_store_replacement (basic_block, b
 static bool cond_if_else_store_replacement (basic_block, basic_block, 
basic_block);
 static struct pointer_set_t * get_non_trapping (void);
 static void replace_phi_edge_with_variable (basic_block, edge, gimple, tree);
+static void hoist_adjacent_loads (basic_block, basic_block,
+ basic_block, basic_block);
+static bool gate_hoist_loads (void);
 
 /* This pass tries to replace an if-then-else block with an
assignment.  We have four kinds of transformations.  Some of these
@@ -138,12 +149,56 @@ static void replace_phi_edge_with_variable (basic_
  bb2:
x = PHI ;
 
-   A similar transformation is done for MAX_EXPR.  */
+   A similar transformation is done for MAX_EXPR.
 
+
+   This pass also performs a fifth transformation of a slightly different
+   flavor.
+
+   Adjacent Load Hoisting
+   ----------------------
+
+   This transformation replaces
+
+ bb0:
+   if (...) goto bb2; else goto bb1;
+ bb1:
+   x1 = (<expr>).field1;
+   goto bb3;
+ bb2:
+   x2 = (<expr>).field2;
+ bb3:
+   # x = PHI <x1, x2>;
+
+   with
+
+ bb0:
+   x1 = (<expr>).field1;
+   x2 = (<expr>).field2;
+   if (...) goto bb2; else goto bb1;
+ bb1:
+   goto bb3;
+ bb2:
+ bb3:
+   # x = PHI <x1, x2>;
+
+   The purpose of this transformation is to enable generation of conditional
+   move instructions such as Intel CMOVE or PowerPC ISEL.  Because one of
+   the loads is speculative, the transformation is restricted to very
+   specific cases to avoid introducing a page fault.  We are looking for
+   the common idiom:
+
+ if (...)
+   x = y->left;
+ else
+   x = y->right;
+
+   where left and right are typically adjacent pointers in a tree structure.  */
+
 static unsigned int
 tree_ssa_phiopt (void)
 {
-  return tree_ssa_phiopt_worker (false);
+  return tree_ssa_phiopt_worker (false, gate_hoist_loads ());
 }
 
 /* This pass tries to transform conditional stores into unconditional
@@ -190,7

Re: [PATCH] Hoist adjacent pointer loads

2012-06-06 Thread William J. Schmidt
On Mon, 2012-06-04 at 08:45 -0500, William J. Schmidt wrote:
> Hi Richard,
> 
> Here's a revision of the hoist-adjacent-loads patch.  I'm sorry for the
> delay since the last revision, but my performance testing has been
> blocked waiting for a fix to PR53487.  I ended up applying a test
> version of the patch to 4.7 and ran performance numbers with that
> instead, with no degradations.
> 
> In addition to addressing your comments, this patch contains one bug fix
> where local_mem_dependence was called on the wrong blocks after swapping
> def1 and def2.
> 
> Bootstrapped with no regressions on powerpc64-unknown-linux-gnu.  Is
> this version ok for trunk?  I won't commit it until I can do final
> testing on trunk in conjunction with a fix for PR53487.

Final performance tests are complete and show no degradations on SPEC
cpu2006 on powerpc64-unknown-linux-gnu.

Is the patch ok for trunk?

Thanks!
Bill
> 
> Thanks,
> Bill
> 
> 
> 2012-06-04  Bill Schmidt  
> 
>   * opts.c: Add -fhoist-adjacent-loads to -O2 and above.
>   * tree-ssa-phiopt.c (tree_ssa_phiopt_worker): Add argument to forward
>   declaration.
>   (hoist_adjacent_loads, gate_hoist_loads): New forward declarations.
>   (tree_ssa_phiopt): Call gate_hoist_loads.
>   (tree_ssa_cs_elim): Add parm to tree_ssa_phiopt_worker call.
>   (tree_ssa_phiopt_worker): Add do_hoist_loads to formal arg list; call
>   hoist_adjacent_loads.
>   (local_mem_dependence): New function.
>   (hoist_adjacent_loads): Likewise.
>   (gate_hoist_loads): Likewise.
>   * common.opt (fhoist-adjacent-loads): New switch.
>   * Makefile.in (tree-ssa-phiopt.o): Added dependencies.
>   * params.def (PARAM_MIN_CMOVE_STRUCT_ALIGN): New param.
> 
> 
> Index: gcc/opts.c
> ===
> --- gcc/opts.c(revision 187805)
> +++ gcc/opts.c(working copy)
> @@ -489,6 +489,7 @@ static const struct default_options default_option
>  { OPT_LEVELS_2_PLUS, OPT_falign_functions, NULL, 1 },
>  { OPT_LEVELS_2_PLUS, OPT_ftree_tail_merge, NULL, 1 },
>  { OPT_LEVELS_2_PLUS_SPEED_ONLY, OPT_foptimize_strlen, NULL, 1 },
> +{ OPT_LEVELS_2_PLUS, OPT_fhoist_adjacent_loads, NULL, 1 },
>  
>  /* -O3 optimizations.  */
>  { OPT_LEVELS_3_PLUS, OPT_ftree_loop_distribute_patterns, NULL, 1 },
> Index: gcc/tree-ssa-phiopt.c
> ===
> --- gcc/tree-ssa-phiopt.c (revision 187805)
> +++ gcc/tree-ssa-phiopt.c (working copy)
> @@ -37,9 +37,17 @@ along with GCC; see the file COPYING3.  If not see
>  #include "cfgloop.h"
>  #include "tree-data-ref.h"
>  #include "tree-pretty-print.h"
> +#include "gimple-pretty-print.h"
> +#include "insn-config.h"
> +#include "expr.h"
> +#include "optabs.h"
>  
> +#ifndef HAVE_conditional_move
> +#define HAVE_conditional_move (0)
> +#endif
> +
>  static unsigned int tree_ssa_phiopt (void);
> -static unsigned int tree_ssa_phiopt_worker (bool);
> +static unsigned int tree_ssa_phiopt_worker (bool, bool);
>  static bool conditional_replacement (basic_block, basic_block,
>edge, edge, gimple, tree, tree);
>  static int value_replacement (basic_block, basic_block,
> @@ -53,6 +61,9 @@ static bool cond_store_replacement (basic_block, b
>  static bool cond_if_else_store_replacement (basic_block, basic_block, 
> basic_block);
>  static struct pointer_set_t * get_non_trapping (void);
>  static void replace_phi_edge_with_variable (basic_block, edge, gimple, tree);
> +static void hoist_adjacent_loads (basic_block, basic_block,
> +   basic_block, basic_block);
> +static bool gate_hoist_loads (void);
>  
>  /* This pass tries to replace an if-then-else block with an
> assignment.  We have four kinds of transformations.  Some of these
> @@ -138,12 +149,56 @@ static void replace_phi_edge_with_variable (basic_
>   bb2:
> x = PHI ;
>  
> -   A similar transformation is done for MAX_EXPR.  */
> +   A similar transformation is done for MAX_EXPR.
>  
> +
> +   This pass also performs a fifth transformation of a slightly different
> +   flavor.
> +
> +   Adjacent Load Hoisting
> +   ----------------------
> +
> +   This transformation replaces
> +
> + bb0:
> +   if (...) goto bb2; else goto bb1;
> + bb1:
> +   x1 = (<expr>).field1;
> +   goto bb3;
> + bb2:
> +   x2 = (<expr>).field2;
> + bb3:
> +   # x = PHI <x1, x2>;
> +
> +   with
> +
> + bb0:
> +   x1 = (<expr>).field1;
> +

[PATCH] Add vector cost model density heuristic

2012-06-08 Thread William J. Schmidt
This patch adds a heuristic to the vectorizer when estimating the
minimum profitable number of iterations.  The heuristic is
target-dependent, and is currently disabled for all targets except
PowerPC.  However, the intent is to make it general enough to be useful
for other targets that want to opt in.

A previous patch addressed some PowerPC SPEC degradations by modifying
the vector cost model values for vec_perm and vec_promote_demote.  The
values were set a little higher than their natural values because the
natural values were not sufficient to prevent a poor vectorization
choice.  However, this is not the right long-term solution, since it can
unnecessarily constrain other vectorization choices involving permute
instructions.

Analysis of the badly vectorized loop (in sphinx3) showed that the
problem was overcommitment of vector resources -- too many vector
instructions issued without enough non-vector instructions available to
cover the delays.  The vector cost model assumes that instructions
always have a constant cost, and doesn't have a way of judging this kind
of "density" of vector instructions.

The present patch adds a heuristic to recognize when a loop is likely to
overcommit resources, and adds a small penalty to the inside-loop cost
to account for the expected stalls.  The heuristic is parameterized with
three target-specific values:

 * Density threshold: The heuristic will apply only when the
   percentage of inside-loop cost attributable to vectorized
   instructions exceeds this value.

 * Size threshold: The heuristic will apply only when the
   inside-loop cost exceeds this value.

 * Penalty: The inside-loop cost will be increased by this
   percentage value when the heuristic applies.

Thus only reasonably large loop bodies that are mostly vectorized
instructions will be affected.

By applying only a small percentage bump to the inside-loop cost, the
heuristic will only turn off vectorization for loops that were
considered "barely profitable" to begin with (such as the sphinx3 loop).
So the heuristic is quite conservative and should not affect the vast
majority of vectorization decisions.
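
In rough pseudo-C, the heuristic boils down to the following sketch
(using the parameter names above; this is not the patch's actual code):

static unsigned
apply_density_penalty (unsigned inside_cost, unsigned vec_cost,
                       unsigned pct_threshold, unsigned size_threshold,
                       unsigned penalty)
{
  /* Apply only to large, mostly-vector loop bodies.  */
  if (vec_cost * 100 > inside_cost * pct_threshold
      && inside_cost > size_threshold)
    inside_cost += inside_cost * penalty / 100;
  return inside_cost;
}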

Together with the new heuristic, this patch reduces the vec_perm and
vec_promote_demote costs for PowerPC to their natural values.

I've regstrapped this with no regressions on powerpc64-unknown-linux-gnu
and verified that no performance regressions occur on SPEC cpu2006.  Is
this ok for trunk?

Thanks,
Bill


2012-06-08  Bill Schmidt  

* doc/tm.texi.in: Add vectorization density hooks.
* doc/tm.texi: Regenerate.
* targhooks.c (default_density_pct_threshold): New.
(default_density_size_threshold): New.
(default_density_penalty): New.
* targhooks.h: New decls for new targhooks.c functions.
* target.def (density_pct_threshold): New DEF_HOOK.
(density_size_threshold): Likewise.
(density_penalty): Likewise.
* tree-vect-loop.c (accum_stmt_cost): New.
(vect_estimate_min_profitable_iters): Perform density test.
* config/rs6000/rs6000.c (TARGET_VECTORIZE_DENSITY_PCT_THRESHOLD):
New macro definition.
(TARGET_VECTORIZE_DENSITY_SIZE_THRESHOLD): Likewise.
(TARGET_VECTORIZE_DENSITY_PENALTY): Likewise.
(rs6000_builtin_vectorization_cost): Reduce costs of vec_perm and
vec_promote_demote to correct values.
(rs6000_density_pct_threshold): New.
(rs6000_density_size_threshold): New.
(rs6000_density_penalty): New.


Index: gcc/doc/tm.texi
===
--- gcc/doc/tm.texi (revision 188305)
+++ gcc/doc/tm.texi (working copy)
@@ -5798,6 +5798,27 @@ The default is @code{NULL_TREE} which means to not
 loads.
 @end deftypefn
 
+@deftypefn {Target Hook} int TARGET_VECTORIZE_DENSITY_PCT_THRESHOLD (void)
+This hook should return the maximum density, expressed in percent, for
+which autovectorization of loops with large bodies should be constrained.
+See also @code{TARGET_VECTORIZE_DENSITY_SIZE_THRESHOLD}.  The default
+is to return 100, which disables the density test.
+@end deftypefn
+
+@deftypefn {Target Hook} int TARGET_VECTORIZE_DENSITY_SIZE_THRESHOLD (void)
+This hook should return the minimum estimated size of a vectorized
+loop body for which the density test should apply.  See also
+@code{TARGET_VECTORIZE_DENSITY_PCT_THRESHOLD}.  The default is set
+to the unreasonable value of 100, which effectively disables 
+the density test.
+@end deftypefn
+
+@deftypefn {Target Hook} int TARGET_VECTORIZE_DENSITY_PENALTY (void)
+This hook should return the penalty, expressed in percent, to be applied
+to the inside-of-loop vectorization costs for a loop failing the density
+test.  The default is 10.
+@end deftypefn
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
Index: gcc/doc/tm.texi.in
===
--- gcc/doc/tm.te

[PATCH] Correct cost model for strided loads

2012-06-10 Thread William J. Schmidt
The fix for PR53331 caused a degradation to 187.facerec on
powerpc64-unknown-linux-gnu.  The following simple patch reverses the
degradation without otherwise affecting SPEC cpu2000 or cpu2006.
Bootstrapped and regtested on that platform with no new regressions.  Ok
for trunk?

Thanks,
Bill


2012-06-10  Bill Schmidt  

* tree-vect-stmts.c (vect_model_load_cost):  Change cost model
for strided loads.


Index: gcc/tree-vect-stmts.c
===
--- gcc/tree-vect-stmts.c   (revision 188341)
+++ gcc/tree-vect-stmts.c   (working copy)
@@ -1031,11 +1031,10 @@ vect_model_load_cost (stmt_vec_info stmt_info, int
   /* The loads themselves.  */
   if (STMT_VINFO_STRIDE_LOAD_P (stmt_info))
 {
-  /* N scalar loads plus gathering them into a vector.
- ???  scalar_to_vec isn't the cost for that.  */
+  /* N scalar loads plus gathering them into a vector.  */
   inside_cost += (vect_get_stmt_cost (scalar_load) * ncopies
  * TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info)));
-  inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec);
+  inside_cost += ncopies * vect_get_stmt_cost (vec_perm);
 }
   else
 vect_get_load_cost (first_dr, ncopies,




Re: [PATCH] Hoist adjacent pointer loads

2012-06-11 Thread William J. Schmidt
On Mon, 2012-06-11 at 13:28 +0200, Richard Guenther wrote:
> On Mon, Jun 4, 2012 at 3:45 PM, William J. Schmidt
>  wrote:
> > Hi Richard,
> >
> > Here's a revision of the hoist-adjacent-loads patch.  I'm sorry for the
> > delay since the last revision, but my performance testing has been
> > blocked waiting for a fix to PR53487.  I ended up applying a test
> > version of the patch to 4.7 and ran performance numbers with that
> > instead, with no degradations.
> >
> > In addition to addressing your comments, this patch contains one bug fix
> > where local_mem_dependence was called on the wrong blocks after swapping
> > def1 and def2.
> >
> > Bootstrapped with no regressions on powerpc64-unknown-linux-gnu.  Is
> > this version ok for trunk?  I won't commit it until I can do final
> > testing on trunk in conjunction with a fix for PR53487.
> >
> > Thanks,
> > Bill
> >
> >
> > 2012-06-04  Bill Schmidt  
> >
> >* opts.c: Add -fhoist-adjacent-loads to -O2 and above.
> >* tree-ssa-phiopt.c (tree_ssa_phiopt_worker): Add argument to forward
> >declaration.
> >(hoist_adjacent_loads, gate_hoist_loads): New forward declarations.
> >(tree_ssa_phiopt): Call gate_hoist_loads.
> >(tree_ssa_cs_elim): Add parm to tree_ssa_phiopt_worker call.
> >(tree_ssa_phiopt_worker): Add do_hoist_loads to formal arg list; call
> >hoist_adjacent_loads.
> >(local_mem_dependence): New function.
> >(hoist_adjacent_loads): Likewise.
> >(gate_hoist_loads): Likewise.
> >* common.opt (fhoist-adjacent-loads): New switch.
> >* Makefile.in (tree-ssa-phiopt.o): Added dependencies.
> >* params.def (PARAM_MIN_CMOVE_STRUCT_ALIGN): New param.
> >
> >
> > Index: gcc/opts.c
> > ===
> > --- gcc/opts.c  (revision 187805)
> > +++ gcc/opts.c  (working copy)
> > @@ -489,6 +489,7 @@ static const struct default_options default_option
> > { OPT_LEVELS_2_PLUS, OPT_falign_functions, NULL, 1 },
> > { OPT_LEVELS_2_PLUS, OPT_ftree_tail_merge, NULL, 1 },
> > { OPT_LEVELS_2_PLUS_SPEED_ONLY, OPT_foptimize_strlen, NULL, 1 },
> > +{ OPT_LEVELS_2_PLUS, OPT_fhoist_adjacent_loads, NULL, 1 },
> >
> > /* -O3 optimizations.  */
> > { OPT_LEVELS_3_PLUS, OPT_ftree_loop_distribute_patterns, NULL, 1 },
> > Index: gcc/tree-ssa-phiopt.c
> > ===
> > --- gcc/tree-ssa-phiopt.c   (revision 187805)
> > +++ gcc/tree-ssa-phiopt.c   (working copy)
> > @@ -37,9 +37,17 @@ along with GCC; see the file COPYING3.  If not see
> >  #include "cfgloop.h"
> >  #include "tree-data-ref.h"
> >  #include "tree-pretty-print.h"
> > +#include "gimple-pretty-print.h"
> > +#include "insn-config.h"
> > +#include "expr.h"
> > +#include "optabs.h"
> >
> > +#ifndef HAVE_conditional_move
> > +#define HAVE_conditional_move (0)
> > +#endif
> > +
> >  static unsigned int tree_ssa_phiopt (void);
> > -static unsigned int tree_ssa_phiopt_worker (bool);
> > +static unsigned int tree_ssa_phiopt_worker (bool, bool);
> >  static bool conditional_replacement (basic_block, basic_block,
> > edge, edge, gimple, tree, tree);
> >  static int value_replacement (basic_block, basic_block,
> > @@ -53,6 +61,9 @@ static bool cond_store_replacement (basic_block, b
> >  static bool cond_if_else_store_replacement (basic_block, basic_block, 
> > basic_block);
> >  static struct pointer_set_t * get_non_trapping (void);
> >  static void replace_phi_edge_with_variable (basic_block, edge, gimple, 
> > tree);
> > +static void hoist_adjacent_loads (basic_block, basic_block,
> > + basic_block, basic_block);
> > +static bool gate_hoist_loads (void);
> >
> >  /* This pass tries to replace an if-then-else block with an
> >assignment.  We have four kinds of transformations.  Some of these
> > @@ -138,12 +149,56 @@ static void replace_phi_edge_with_variable (basic_
> >  bb2:
> >x = PHI ;
> >
> > -   A similar transformation is done for MAX_EXPR.  */
> > +   A similar transformation is done for MAX_EXPR.
> >
> > +
> > +   This pass also performs a fifth transformation of a slightly different
> > +   flavor.
> > +
> > +   Adjacent Load Hoisting

Re: [PATCH] Correct cost model for strided loads

2012-06-11 Thread William J. Schmidt


On Mon, 2012-06-11 at 11:15 +0200, Richard Guenther wrote:
> On Sun, Jun 10, 2012 at 5:58 PM, William J. Schmidt
>  wrote:
> > The fix for PR53331 caused a degradation to 187.facerec on
> > powerpc64-unknown-linux-gnu.  The following simple patch reverses the
> > degradation without otherwise affecting SPEC cpu2000 or cpu2006.
> > Bootstrapped and regtested on that platform with no new regressions.  Ok
> > for trunk?
> 
> Well, would the real cost not be subparts * scalar_to_vec plus
> subparts * vec_perm?
> At least vec_perm isn't the cost for building up a vector from N scalar 
> elements
> either (it might be close enough if subparts == 2).  What's the case
> with facerec
> here?  Does it have subparts == 2?  

In this case, subparts == 4 (32-bit floats, 128-bit vec reg).  On
PowerPC, this requires two merge instructions and a permute instruction
to get the four 32-bit quantities into the right place in a 128-bit
register.  Currently this is modeled as a vec_perm in other parts of the
vectorizer per Ira's earlier patches, so I naively changed this to do
the same thing.

The types of vectorizer instructions aren't documented, and I can't
infer much from the i386.c cost model, so I need a little education.
What semantics are represented by scalar_to_vec?

On PowerPC, we have this mapping of the floating-point registers and
vector float registers where they overlap (the low-order half of each of
the first 32 vector float regs corresponds to a scalar float reg).  So
in this case we have four scalar loads that place things in the bottom
half of four vector registers, two vector merge instructions that
collapse the four registers into two vector registers, and a vector
permute that gets things in the right order.(*)  I wonder if what we
refer to as a merge instruction is similar to scalar_to_vec.

If so, then maybe we need something like

 subparts = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info));
 inside_cost += vect_get_stmt_cost (scalar_load) * ncopies * subparts;
 inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec) * subparts / 2;
 inside_cost += ncopies * vect_get_stmt_cost (vec_perm);

But then we'd have to change how vec_perm is modeled elsewhere for
PowerPC based on Ira's earlier patches.  As I said, it's difficult for
me to figure out all the intent of cost model decisions that have been
made in the past, using current documentation.

> I really wanted to pessimize this case
> for say AVX and char elements, thus building up a vector from 32 scalars which
> certainly does not cost a mere vec_perm.  So, maybe special-case the
> subparts == 2 case and assume vec_perm would match the cost only in that
> case.

(I'm a little confused by this as what you have at the moment is a
single scalar_to_vec per copy, which has a cost of 1 on most i386
targets (occasionally 2).  The subparts multiplier is only applied to
the loads.  So changing this to vec_perm seemed to be a no-op for i386.)

(*) There are actually a couple more instructions here to convert 64-bit
values to 32-bit values, since on PowerPC 32-bit loads are converted to
64-bit values in scalar float registers and they have to be coerced back
to 32-bit.  Very ugly.  The cost model currently doesn't represent this
at all, which I'll have to look at fixing at some point in some way that
isn't too nasty for the other targets.  The cost model for PowerPC seems
to need a lot of TLC.

Thanks,
Bill

> 
> Thanks,
> Richard.
> 
> > Thanks,
> > Bill
> >
> >
> > 2012-06-10  Bill Schmidt  
> >
> >* tree-vect-stmts.c (vect_model_load_cost):  Change cost model
> >for strided loads.
> >
> >
> > Index: gcc/tree-vect-stmts.c
> > ===
> > --- gcc/tree-vect-stmts.c   (revision 188341)
> > +++ gcc/tree-vect-stmts.c   (working copy)
> > @@ -1031,11 +1031,10 @@ vect_model_load_cost (stmt_vec_info stmt_info, int
> >   /* The loads themselves.  */
> >   if (STMT_VINFO_STRIDE_LOAD_P (stmt_info))
> > {
> > -  /* N scalar loads plus gathering them into a vector.
> > - ???  scalar_to_vec isn't the cost for that.  */
> > +  /* N scalar loads plus gathering them into a vector.  */
> >   inside_cost += (vect_get_stmt_cost (scalar_load) * ncopies
> >  * TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE 
> > (stmt_info)));
> > -  inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec);
> > +  inside_cost += ncopies * vect_get_stmt_cost (vec_perm);
> > }
> >   else
> > vect_get_load_cost (first_dr, ncopies,
> >
> >
> 



Re: [PATCH] Add vector cost model density heuristic

2012-06-11 Thread William J. Schmidt
On Mon, 2012-06-11 at 13:40 +0200, Richard Guenther wrote:
> On Fri, 8 Jun 2012, William J. Schmidt wrote:
> 
> > This patch adds a heuristic to the vectorizer when estimating the
> > minimum profitable number of iterations.  The heuristic is
> > target-dependent, and is currently disabled for all targets except
> > PowerPC.  However, the intent is to make it general enough to be useful
> > for other targets that want to opt in.
> > 
> > A previous patch addressed some PowerPC SPEC degradations by modifying
> > the vector cost model values for vec_perm and vec_promote_demote.  The
> > values were set a little higher than their natural values because the
> > natural values were not sufficient to prevent a poor vectorization
> > choice.  However, this is not the right long-term solution, since it can
> > unnecessarily constrain other vectorization choices involving permute
> > instructions.
> > 
> > Analysis of the badly vectorized loop (in sphinx3) showed that the
> > problem was overcommitment of vector resources -- too many vector
> > instructions issued without enough non-vector instructions available to
> > cover the delays.  The vector cost model assumes that instructions
> > always have a constant cost, and doesn't have a way of judging this kind
> > of "density" of vector instructions.
> > 
> > The present patch adds a heuristic to recognize when a loop is likely to
> > overcommit resources, and adds a small penalty to the inside-loop cost
> > to account for the expected stalls.  The heuristic is parameterized with
> > three target-specific values:
> > 
> >  * Density threshold: The heuristic will apply only when the
> >percentage of inside-loop cost attributable to vectorized
> >instructions exceeds this value.
> > 
> >  * Size threshold: The heuristic will apply only when the
> >inside-loop cost exceeds this value.
> > 
> >  * Penalty: The inside-loop cost will be increased by this
> >percentage value when the heuristic applies.
> > 
> > Thus only reasonably large loop bodies that are mostly vectorized
> > instructions will be affected.
> > 
> > By applying only a small percentage bump to the inside-loop cost, the
> > heuristic will only turn off vectorization for loops that were
> > considered "barely profitable" to begin with (such as the sphinx3 loop).
> > So the heuristic is quite conservative and should not affect the vast
> > majority of vectorization decisions.
> > 
> > Together with the new heuristic, this patch reduces the vec_perm and
> > vec_promote_demote costs for PowerPC to their natural values.
> > 
> > I've regstrapped this with no regressions on powerpc64-unknown-linux-gnu
> > and verified that no performance regressions occur on SPEC cpu2006.  Is
> > this ok for trunk?
> 
> Hmm.  I don't like this patch or its general idea too much.  Instead
> I'd like us to move more of the cost model detail to the target, giving
> it a chance to look at the whole loop before deciding on a cost.  ISTR
> posting the overall idea at some point, but let me repeat it here instead
> of trying to find that e-mail.
> 
> The basic interface of the cost model should be, in targetm.vectorize
> 
>   /* Tell the target to start cost analysis of a loop or a basic-block
>  (if the loop argument is NULL).  Returns an opaque pointer to
>  target-private data.  */
>   void *init_cost (struct loop *loop);
> 
>   /* Add cost for N vectorized-stmt-kind statements in vector_mode.  */
>   void add_stmt_cost (void *data, unsigned n,
> vectorized-stmt-kind,
>   enum machine_mode vector_mode);
> 
>   /* Tell the target to compute and return the cost of the accumulated
>  statements and free any target-private data.  */
>   unsigned finish_cost (void *data);
> 
> with eventually slightly different signatures for add_stmt_cost
> (like pass in the original scalar stmt?).
> 
> It allows the target, at finish_cost time, to evaluate things like
> register pressure and resource utilization.

OK, I'm trying to understand how you would want this built into the
present structure.  Taking just the loop case for now:

Judging by your suggested API, we would have to call add_stmt_cost ()
everywhere that we now call stmt_vinfo_set_inside_of_loop_cost ().  For
now this would be an additional call, not a replacement, though maybe
the other goes away eventually.  This allows the target to save more
data about the vectorized instructions than just an accumulated cost
number (order and quantity of various kinds of instructions can be
mai

Re: [PATCH] Correct cost model for strided loads

2012-06-11 Thread William J. Schmidt


On Mon, 2012-06-11 at 16:10 +0200, Richard Guenther wrote:
> On Mon, 11 Jun 2012, William J. Schmidt wrote:
> 
> > 
> > 
> > On Mon, 2012-06-11 at 11:15 +0200, Richard Guenther wrote:
> > > On Sun, Jun 10, 2012 at 5:58 PM, William J. Schmidt
> > >  wrote:
> > > > The fix for PR53331 caused a degradation to 187.facerec on
> > > > powerpc64-unknown-linux-gnu.  The following simple patch reverses the
> > > > degradation without otherwise affecting SPEC cpu2000 or cpu2006.
> > > > Bootstrapped and regtested on that platform with no new regressions.  Ok
> > > > for trunk?
> > > 
> > > Well, would the real cost not be subparts * scalar_to_vec plus
> > > subparts * vec_perm?
> > > At least vec_perm isn't the cost for building up a vector from N scalar 
> > > elements
> > > either (it might be close enough if subparts == 2).  What's the case
> > > with facerec
> > > here?  Does it have subparts == 2?  
> > 
> > In this case, subparts == 4 (32-bit floats, 128-bit vec reg).  On
> > PowerPC, this requires two merge instructions and a permute instruction
> > to get the four 32-bit quantities into the right place in a 128-bit
> > register.  Currently this is modeled as a vec_perm in other parts of the
> > vectorizer per Ira's earlier patches, so I naively changed this to do
> > the same thing.
> 
> I see.
> 
> > The types of vectorizer instructions aren't documented, and I can't
> > infer much from the i386.c cost model, so I need a little education.
> > What semantics are represented by scalar_to_vec?
> 
> It's a vector splat, thus x -> { x, x, x, ... }.  You can create
> { x, y, z, ... } by N such splats plus N - 1 permutes (if a permute,
> as VEC_PERM_EXPR, takes two input vectors).  That's by far not
> the most efficient way to build up such a vector of course (with AVX
> you could do one splat plus N - 1 inserts for example).  The cost
> is of course dependent on the number of vector elements, so a
> simple new enum vect_cost_for_stmt kind does not cover it but
> the target would have to look at the vector type passed and do
> some reasonable guess.

Ah, splat!  Yes, that's lingo I understand.  I see the intent now.
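
In other words, building an N-element vector that way costs roughly the
following (throwaway arithmetic for my own notes; names invented):

static int
splat_permute_build_cost (int n, int splat_cost, int perm_cost)
{
  /* N splats plus N - 1 two-input permutes, as described above.  */
  return n * splat_cost + (n - 1) * perm_cost;
}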

> 
> > On PowerPC, we have this mapping of the floating-point registers and
> > vector float registers where they overlap (the low-order half of each of
> > the first 32 vector float regs corresponds to a scalar float reg).  So
> > in this case we have four scalar loads that place things in the bottom
> > half of four vector registers, two vector merge instructions that
> > collapse the four registers into two vector registers, and a vector
> > permute that gets things in the right order.(*)  I wonder if what we
> > refer to as a merge instruction is similar to scalar_to_vec.
> 
> Looks similar to x86 SSE then.
> 
> > If so, then maybe we need something like
> > 
> >  subparts = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info));
> >  inside_cost += vect_get_stmt_cost (scalar_load) * ncopies * subparts;
> >  inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec) * subparts 
> > / 2;
> >  inside_cost += ncopies * vect_get_stmt_cost (vec_perm);
> > 
> > But then we'd have to change how vec_perm is modeled elsewhere for
> > PowerPC based on Ira's earlier patches.  As I said, it's difficult for
> > me to figure out all the intent of cost model decisions that have been
> > made in the past, using current documentation.
> 
> Heh, usually the intent was to make the changes simple, not to compute
> a proper cost.
> 
> I think we simply need a new scalars_to_vec cost kind.

That works.  Maybe vec_construct gets the point across a little better?
I think we need to use the full builtin_vectorization_cost interface
instead of vect_get_stmt_cost here, so the targets can parameterize on
type.  Then we can just do one cost calculation for vec_construct that
covers the full costs of getting the vector in order after the loads.

> 
> > > I really wanted to pessimize this case
> > > for say AVX and char elements, thus building up a vector from 32 scalars 
> > > which
> > > certainly does not cost a mere vec_perm.  So, maybe special-case the
> > > subparts == 2 case and assume vec_perm would match the cost only in that
> > > case.
> > 
> > (I'm a little confused by this as what you have at the moment is a
> > single scalar_to_vec per copy, which has a cost of 1 on most i386
> > targets (occasionally 2).  The subparts multiplier is only appl

Re: [PATCH] Add vector cost model density heuristic

2012-06-11 Thread William J. Schmidt
On Mon, 2012-06-11 at 16:58 +0200, Richard Guenther wrote:
> On Mon, 11 Jun 2012, Richard Guenther wrote:
> 
> > On Mon, 11 Jun 2012, William J. Schmidt wrote:
> > 
> > > On Mon, 2012-06-11 at 13:40 +0200, Richard Guenther wrote:
> > > > On Fri, 8 Jun 2012, William J. Schmidt wrote:
> > > > 
> > > > > This patch adds a heuristic to the vectorizer when estimating the
> > > > > minimum profitable number of iterations.  The heuristic is
> > > > > target-dependent, and is currently disabled for all targets except
> > > > > PowerPC.  However, the intent is to make it general enough to be 
> > > > > useful
> > > > > for other targets that want to opt in.
> > > > > 
> > > > > A previous patch addressed some PowerPC SPEC degradations by modifying
> > > > > the vector cost model values for vec_perm and vec_promote_demote.  The
> > > > > values were set a little higher than their natural values because the
> > > > > natural values were not sufficient to prevent a poor vectorization
> > > > > choice.  However, this is not the right long-term solution, since it 
> > > > > can
> > > > > unnecessarily constrain other vectorization choices involving permute
> > > > > instructions.
> > > > > 
> > > > > Analysis of the badly vectorized loop (in sphinx3) showed that the
> > > > > problem was overcommitment of vector resources -- too many vector
> > > > > instructions issued without enough non-vector instructions available 
> > > > > to
> > > > > cover the delays.  The vector cost model assumes that instructions
> > > > > always have a constant cost, and doesn't have a way of judging this 
> > > > > kind
> > > > > of "density" of vector instructions.
> > > > > 
> > > > > The present patch adds a heuristic to recognize when a loop is likely 
> > > > > to
> > > > > overcommit resources, and adds a small penalty to the inside-loop cost
> > > > > to account for the expected stalls.  The heuristic is parameterized 
> > > > > with
> > > > > three target-specific values:
> > > > > 
> > > > >  * Density threshold: The heuristic will apply only when the
> > > > >percentage of inside-loop cost attributable to vectorized
> > > > >instructions exceeds this value.
> > > > > 
> > > > >  * Size threshold: The heuristic will apply only when the
> > > > >inside-loop cost exceeds this value.
> > > > > 
> > > > >  * Penalty: The inside-loop cost will be increased by this
> > > > >percentage value when the heuristic applies.
> > > > > 
> > > > > Thus only reasonably large loop bodies that are mostly vectorized
> > > > > instructions will be affected.
> > > > > 
> > > > > By applying only a small percentage bump to the inside-loop cost, the
> > > > > heuristic will only turn off vectorization for loops that were
> > > > > considered "barely profitable" to begin with (such as the sphinx3 
> > > > > loop).
> > > > > So the heuristic is quite conservative and should not affect the vast
> > > > > majority of vectorization decisions.
> > > > > 
> > > > > Together with the new heuristic, this patch reduces the vec_perm and
> > > > > vec_promote_demote costs for PowerPC to their natural values.
> > > > > 
> > > > > I've regstrapped this with no regressions on 
> > > > > powerpc64-unknown-linux-gnu
> > > > > and verified that no performance regressions occur on SPEC cpu2006.  
> > > > > Is
> > > > > this ok for trunk?
> > > > 
> > > > Hmm.  I don't like this patch or its general idea too much.  Instead
> > > > I'd like us to move more of the cost model detail to the target, giving
> > > > it a chance to look at the whole loop before deciding on a cost.  ISTR
> > > > posting the overall idea at some point, but let me repeat it here 
> > > > instead
> > > > of trying to find that e-mail.
> > > > 
> > > > The basic interface of the cost model should be, in targetm.vectorize
> > > > 
> > > >   /* Tell the target to start cost analysis of a

Re: [PATCH] Add vector cost model density heuristic

2012-06-11 Thread William J. Schmidt


On Mon, 2012-06-11 at 11:09 -0400, David Edelsohn wrote:
> On Mon, Jun 11, 2012 at 10:55 AM, Richard Guenther  wrote:
> 
> > Well, they are at least magic numbers and heuristics that apply
> > generally and not only to the single issue in sphinx.  And in
> > fact how it works for sphinx _is_ magic.
> >
> >> Second, I suggest that you need to rephrase "I can make you" and
> >> re-send your reply.
> >
> > Sorry for my bad english.  Consider it meaning that I'd rather have
> > you think about a more proper solution.  That's what patch review
> > is about after all, no?  Sometimes a complete re-write (which gets
> > more difficult which each of the patches "enhancing" the not ideal
> > current state) is the best thing to do.
> 
> Richard,
> 
> The values of the heuristics may be "magic", but Bill believes the
> heuristics are testing the important characteristics.  The heuristics
> themselves are controlled by hooks, so the target can set the correct
> values for their own requirements.
> 
> The concern is that a general cost infrastructure is too general.
> And, based on history, all ports simply will copy the boilerplate from
> the first implementation. It also may cause more problems because the
> target has relatively little information to be able to judge
> heuristics at that point in the middle-end. If the targets start to
> get too "cute" or too complicated, it may cause more problems or more
> confusion about why more complicated heuristics are not effective and
> not producing the expected results.
> 
> I worry about creating another machine dependent reorg catch-all pass.
> 
> Maybe an incremental pre- and/or post- cost hook would be more
> effective. I will let Bill comment.

Thanks David,

I can see both sides of this, and it's hard to judge the future from
where I stand.  My belief is that the number of heuristics targets will
implement will be fairly limited, since judgments about cycle-level
costs are not accurately predictable during the middle end.  All we can
do is come up with a few things that seem to make sense.  Doing too much
in the back end seems impractical.

The interesting question to me is whether cost model heuristics are
general enough to be reusable.  What I saw in this case was what I
considered to be a somewhat target-neutral problem:  overwhelming those
assets of the processor that implement vectorization.  It seemed
reasonable to provide hooks for others to use the idea if they encounter
similar issues.  If reusing the heuristic is useful, then having to copy
the logic from one target to another isn't the best approach.  If nobody
else will ever use it, then embedding it in the back end is reasonable.
Unfortunately my crystal ball has been on the fritz for several decades,
so I can't tell you for sure which is right...

Richard, my biggest question is whether you think other targets are
likely to take advantage of a more general back-end interface, or
whether this will end up just being a PowerPC wart.  If you know of ways
this will be useful for i386, that would be helpful to know.  Perhaps
this requires your crystal ball as well; not sure how well yours
works...

If we look at just this one issue in isolation, then changing all the
code in the vectorizer that calculates inside/outside loop costs and
moving it to targetm seems more invasive than adding the few hooks.  But
if this will really be a useful feature for the community as a whole I
am certainly willing to tackle it.

Thanks,
Bill

> 
> Thanks, David
> 



Re: [PATCH] Hoist adjacent pointer loads

2012-06-11 Thread William J. Schmidt
On Mon, 2012-06-11 at 14:59 +0200, Richard Guenther wrote:
> On Mon, 11 Jun 2012, William J. Schmidt wrote:
> 
> > On Mon, 2012-06-11 at 13:28 +0200, Richard Guenther wrote:
> > > On Mon, Jun 4, 2012 at 3:45 PM, William J. Schmidt
> > >  wrote:
> > > > Hi Richard,
> > > >
> > > > Here's a revision of the hoist-adjacent-loads patch.  I'm sorry for the
> > > > delay since the last revision, but my performance testing has been
> > > > blocked waiting for a fix to PR53487.  I ended up applying a test
> > > > version of the patch to 4.7 and ran performance numbers with that
> > > > instead, with no degradations.
> > > >
> > > > In addition to addressing your comments, this patch contains one bug fix
> > > > where local_mem_dependence was called on the wrong blocks after swapping
> > > > def1 and def2.
> > > >
> > > > Bootstrapped with no regressions on powerpc64-unknown-linux-gnu.  Is
> > > > this version ok for trunk?  I won't commit it until I can do final
> > > > testing on trunk in conjunction with a fix for PR53487.
> > > >
> > > > Thanks,
> > > > Bill
> > > >
> > > >
> > > > 2012-06-04  Bill Schmidt  
> > > >
> > > >* opts.c: Add -fhoist-adjacent-loads to -O2 and above.
> > > >* tree-ssa-phiopt.c (tree_ssa_phiopt_worker): Add argument to 
> > > > forward
> > > >declaration.
> > > >(hoist_adjacent_loads, gate_hoist_loads): New forward 
> > > > declarations.
> > > >(tree_ssa_phiopt): Call gate_hoist_loads.
> > > >(tree_ssa_cs_elim): Add parm to tree_ssa_phiopt_worker call.
> > > >(tree_ssa_phiopt_worker): Add do_hoist_loads to formal arg list; 
> > > > call
> > > >hoist_adjacent_loads.
> > > >(local_mem_dependence): New function.
> > > >(hoist_adjacent_loads): Likewise.
> > > >(gate_hoist_loads): Likewise.
> > > >* common.opt (fhoist-adjacent-loads): New switch.
> > > >* Makefile.in (tree-ssa-phiopt.o): Added dependencies.
> > > >* params.def (PARAM_MIN_CMOVE_STRUCT_ALIGN): New param.
> > > >
> > > >
> > > > Index: gcc/opts.c
> > > > ===
> > > > --- gcc/opts.c  (revision 187805)
> > > > +++ gcc/opts.c  (working copy)
> > > > @@ -489,6 +489,7 @@ static const struct default_options default_option
> > > > { OPT_LEVELS_2_PLUS, OPT_falign_functions, NULL, 1 },
> > > > { OPT_LEVELS_2_PLUS, OPT_ftree_tail_merge, NULL, 1 },
> > > > { OPT_LEVELS_2_PLUS_SPEED_ONLY, OPT_foptimize_strlen, NULL, 1 },
> > > > +{ OPT_LEVELS_2_PLUS, OPT_fhoist_adjacent_loads, NULL, 1 },
> > > >
> > > > /* -O3 optimizations.  */
> > > > { OPT_LEVELS_3_PLUS, OPT_ftree_loop_distribute_patterns, NULL, 1 },
> > > > Index: gcc/tree-ssa-phiopt.c
> > > > ===
> > > > --- gcc/tree-ssa-phiopt.c   (revision 187805)
> > > > +++ gcc/tree-ssa-phiopt.c   (working copy)
> > > > @@ -37,9 +37,17 @@ along with GCC; see the file COPYING3.  If not see
> > > >  #include "cfgloop.h"
> > > >  #include "tree-data-ref.h"
> > > >  #include "tree-pretty-print.h"
> > > > +#include "gimple-pretty-print.h"
> > > > +#include "insn-config.h"
> > > > +#include "expr.h"
> > > > +#include "optabs.h"
> > > >
> > > > +#ifndef HAVE_conditional_move
> > > > +#define HAVE_conditional_move (0)
> > > > +#endif
> > > > +
> > > >  static unsigned int tree_ssa_phiopt (void);
> > > > -static unsigned int tree_ssa_phiopt_worker (bool);
> > > > +static unsigned int tree_ssa_phiopt_worker (bool, bool);
> > > >  static bool conditional_replacement (basic_block, basic_block,
> > > > edge, edge, gimple, tree, tree);
> > > >  static int value_replacement (basic_block, basic_block,
> > > > @@ -53,6 +61,9 @@ static bool cond_store_replacement (basic_block, b
> > > >  static bool cond_if_else_store_replacement (basic_block, basic_block, 
> > >

Re: [PATCH] Hoist adjacent pointer loads

2012-06-11 Thread William J. Schmidt
On Mon, 2012-06-11 at 12:11 -0500, William J. Schmidt wrote:

> I found this parameter that seems to correspond to well-predicted
> conditional jumps:
> 
> /* When branch is predicted to be taken with probability lower than this
>threshold (in percent), then it is considered well predictable. */
> DEFPARAM (PARAM_PREDICTABLE_BRANCH_OUTCOME,
> "predictable-branch-outcome",
> "Maximal estimated outcome of branch considered predictable",
> 2, 0, 50)
> 
...which has an interface predictable_edge_p () in predict.c, so that's
what I'll use.
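
Roughly, the gate I have in mind is shaped like this (a sketch only,
not the final code; bb0 is the block ending in the condition):

static bool
worth_hoisting_p (basic_block bb0)
{
  /* Give up when either outgoing edge is well predicted; a
     predictable branch is already cheap, so speculating the second
     load buys nothing.  */
  return !predictable_edge_p (EDGE_SUCC (bb0, 0))
         && !predictable_edge_p (EDGE_SUCC (bb0, 1));
}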

Thanks,
Bill




Re: [PATCH] Hoist adjacent loads

2012-06-11 Thread William J. Schmidt
OK, once more with feeling... :)

This patch differs from the previous one in two respects:  It disables
the optimization when either the then or else edge is well-predicted;
and it now uses the existing l1-cache-line-size parameter instead of a
new one (with updated commentary).
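
The cache-line test works out to something like the following sketch
(illustrative only; the committed check may differ in detail):

static bool
same_cache_line_p (unsigned HOST_WIDE_INT offset1,
                   unsigned HOST_WIDE_INT offset2)
{
  /* Both field offsets must land in the same L1 line, so the
     speculative load only touches memory the safe load touches.  */
  unsigned line_size = PARAM_VALUE (PARAM_L1_CACHE_LINE_SIZE);
  return offset1 / line_size == offset2 / line_size;
}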

Bootstraps and tests with no new regressions on
powerpc64-unknown-linux-gnu.  One last performance run is underway, but
I don't expect any surprises since both changes are more conservative.
The original benchmark issue is still resolved.

Is this version ok for trunk?

Thanks,
Bill


2012-06-11  Bill Schmidt  

* opts.c: Add -fhoist-adjacent-loads to -O2 and above.
* tree-ssa-phiopt.c (tree_ssa_phiopt_worker): Add argument to forward
declaration.
(hoist_adjacent_loads, gate_hoist_loads): New forward declarations.
(tree_ssa_phiopt): Call gate_hoist_loads.
(tree_ssa_cs_elim): Add parm to tree_ssa_phiopt_worker call.
(tree_ssa_phiopt_worker): Add do_hoist_loads to formal arg list; call
hoist_adjacent_loads.
(local_mem_dependence): New function.
(hoist_adjacent_loads): Likewise.
(gate_hoist_loads): Likewise.
* common.opt (fhoist-adjacent-loads): New switch.
* Makefile.in (tree-ssa-phiopt.o): Added dependencies.


Index: gcc/opts.c
===
--- gcc/opts.c  (revision 188390)
+++ gcc/opts.c  (working copy)
@@ -489,6 +489,7 @@ static const struct default_options default_option
 { OPT_LEVELS_2_PLUS, OPT_falign_functions, NULL, 1 },
 { OPT_LEVELS_2_PLUS, OPT_ftree_tail_merge, NULL, 1 },
 { OPT_LEVELS_2_PLUS_SPEED_ONLY, OPT_foptimize_strlen, NULL, 1 },
+{ OPT_LEVELS_2_PLUS, OPT_fhoist_adjacent_loads, NULL, 1 },
 
 /* -O3 optimizations.  */
 { OPT_LEVELS_3_PLUS, OPT_ftree_loop_distribute_patterns, NULL, 1 },
Index: gcc/tree-ssa-phiopt.c
===
--- gcc/tree-ssa-phiopt.c   (revision 188390)
+++ gcc/tree-ssa-phiopt.c   (working copy)
@@ -37,9 +37,17 @@ along with GCC; see the file COPYING3.  If not see
 #include "cfgloop.h"
 #include "tree-data-ref.h"
 #include "tree-pretty-print.h"
+#include "gimple-pretty-print.h"
+#include "insn-config.h"
+#include "expr.h"
+#include "optabs.h"
 
+#ifndef HAVE_conditional_move
+#define HAVE_conditional_move (0)
+#endif
+
 static unsigned int tree_ssa_phiopt (void);
-static unsigned int tree_ssa_phiopt_worker (bool);
+static unsigned int tree_ssa_phiopt_worker (bool, bool);
 static bool conditional_replacement (basic_block, basic_block,
 edge, edge, gimple, tree, tree);
 static int value_replacement (basic_block, basic_block,
@@ -53,6 +61,9 @@ static bool cond_store_replacement (basic_block, b
 static bool cond_if_else_store_replacement (basic_block, basic_block, 
basic_block);
 static struct pointer_set_t * get_non_trapping (void);
 static void replace_phi_edge_with_variable (basic_block, edge, gimple, tree);
+static void hoist_adjacent_loads (basic_block, basic_block,
+ basic_block, basic_block);
+static bool gate_hoist_loads (void);
 
 /* This pass tries to replace an if-then-else block with an
assignment.  We have four kinds of transformations.  Some of these
@@ -138,12 +149,56 @@ static void replace_phi_edge_with_variable (basic_
  bb2:
x = PHI ;
 
-   A similar transformation is done for MAX_EXPR.  */
+   A similar transformation is done for MAX_EXPR.
 
+
+   This pass also performs a fifth transformation of a slightly different
+   flavor.
+
+   Adjacent Load Hoisting
+   ----------------------
+
+   This transformation replaces
+
+ bb0:
+   if (...) goto bb2; else goto bb1;
+ bb1:
+   x1 = (<expr>).field1;
+   goto bb3;
+ bb2:
+   x2 = (<expr>).field2;
+ bb3:
+   # x = PHI <x1, x2>;
+
+   with
+
+ bb0:
+   x1 = (<expr>).field1;
+   x2 = (<expr>).field2;
+   if (...) goto bb2; else goto bb1;
+ bb1:
+   goto bb3;
+ bb2:
+ bb3:
+   # x = PHI <x1, x2>;
+
+   The purpose of this transformation is to enable generation of conditional
+   move instructions such as Intel CMOVE or PowerPC ISEL.  Because one of
+   the loads is speculative, the transformation is restricted to very
+   specific cases to avoid introducing a page fault.  We are looking for
+   the common idiom:
+
+ if (...)
+   x = y->left;
+ else
+   x = y->right;
+
+   where left and right are typically adjacent pointers in a tree structure.  */
+
 static unsigned int
 tree_ssa_phiopt (void)
 {
-  return tree_ssa_phiopt_worker (false);
+  return tree_ssa_phiopt_worker (false, gate_hoist_loads ());
 }
 
 /* This pass tries to transform conditional stores into unconditional
@@ -190,7 +245,7 @@ tree_ssa_phiopt (void)
 static unsigned int
 tree_ssa_cs_elim (void)
 {
-  return tree_ssa_phiopt_worker (true);
+  return tree_ssa_phiopt_worker (true, false);
 }
 
 /* R

Re: [PATCH] Correct cost model for strided loads

2012-06-12 Thread William J. Schmidt
On Tue, 2012-06-12 at 12:59 +0200, Richard Guenther wrote:

> Btw, with PR53533 I now have a case where multiplications of v4si are
> really expensive on x86 without SSE 4.1.  But we only have vect_stmt_cost
> and no further subdivision ...
> 
> Thus we'd need a tree_code argument to the cost hook.  Though it gets
> quite overloaded then, so maybe splitting it into one handling loads/stores
> (and get the misalign parameter) and one handling only vector_stmt but
> with a tree_code argument.  Or splitting it even further, seeing
> cond_branch_taken ...

Yes, I think subdividing the hook for the vector_stmt kind is pretty
much inevitable -- more situations like this expensive multiply will
arise.  I agree with the interface starting to get messy also.
Splitting it is probably the way to go -- a little painful but keeping
it all in one hook is going to get ugly.
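
For the record, the sort of split I have in mind would look roughly
like this (purely illustrative, names invented):

struct vect_cost_hooks
{
  /* Memory operations keep the misalignment parameter.  */
  int (*memory_op_cost) (enum vect_cost_for_stmt kind, tree vectype,
                         int misalign);
  /* Plain vector statements gain the tree code that produced them,
     so an expensive v4si multiply can be priced separately.  */
  int (*vector_stmt_cost) (enum tree_code code, tree vectype);
};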

Bill

> 
> Richard.
> 



[PATCH, RFC] First cut at using vec_construct for strided loads

2012-06-12 Thread William J. Schmidt
This patch is a follow-up to the discussion generated by
http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00546.html.  I've added
vec_construct to the cost model for use in vect_model_load_cost, and
implemented a cost calculation that makes sense to me for PowerPC.  I'm
less certain about the default, i386, and spu implementations.  I took a
guess at i386 from the discussions we had, and used the same calculation
for the default and for spu.  I'm hoping you or others can fill in the
blanks if I guessed badly.

The i386 cost for vec_construct is different from all the others, which
are parameterized for each processor description.  This should probably
be parameterized in some way as well, but thought you'd know better than
I how that should be.  Perhaps instead of

elements / 2 + 1

it should be

(elements / 2) * X + Y

where X and Y are taken from the processor description, and represent
the cost of a merge and a permute, respectively.  Let me know what you
think.
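
Concretely, the parameterized form might be shaped like this (a sketch;
merge_cost and perm_cost stand for the X and Y above):

static int
vec_construct_cost (tree vectype, int merge_cost, int perm_cost)
{
  /* elements / 2 merges to collapse the loaded registers, plus one
     final permute to put the lanes in order.  */
  int elements = TYPE_VECTOR_SUBPARTS (vectype);
  return (elements / 2) * merge_cost + perm_cost;
}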

Thanks,
Bill


2012-06-12  Bill Schmidt  

* targhooks.c (default_builtin_vectorization_cost): Handle
vec_construct, using vectype to base cost on subparts.
* target.h (enum vect_cost_for_stmt): Add vec_construct.
* tree-vect-stmts.c (vect_model_load_cost): Use vec_construct
instead of scalar_to_vec.
* config/spu/spu.c (spu_builtin_vectorization_cost): Handle
vec_construct in same way as default for now.
* config/i386/i386.c (ix86_builtin_vectorization_cost): Likewise.
* config/rs6000/rs6000.c (rs6000_builtin_vectorization_cost):
Handle vec_construct, including special case for 32-bit loads.


Index: gcc/targhooks.c
===
--- gcc/targhooks.c (revision 188482)
+++ gcc/targhooks.c (working copy)
@@ -499,9 +499,11 @@ default_builtin_vectorized_conversion (unsigned in
 
 int
 default_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
-tree vectype ATTRIBUTE_UNUSED,
+tree vectype,
 int misalign ATTRIBUTE_UNUSED)
 {
+  unsigned elements;
+
   switch (type_of_cost)
 {
   case scalar_stmt:
@@ -524,6 +526,11 @@ default_builtin_vectorization_cost (enum vect_cost
   case cond_branch_taken:
 return 3;
 
+  case vec_construct:
+   elements = TYPE_VECTOR_SUBPARTS (vectype);
+   gcc_assert (elements > 1);
+   return elements / 2 + 1;
+
   default:
 gcc_unreachable ();
 }
Index: gcc/target.h
===
--- gcc/target.h(revision 188482)
+++ gcc/target.h(working copy)
@@ -146,7 +146,8 @@ enum vect_cost_for_stmt
   cond_branch_not_taken,
   cond_branch_taken,
   vec_perm,
-  vec_promote_demote
+  vec_promote_demote,
+  vec_construct
 };
 
 /* The target structure.  This holds all the backend hooks.  */
Index: gcc/tree-vect-stmts.c
===
--- gcc/tree-vect-stmts.c   (revision 188482)
+++ gcc/tree-vect-stmts.c   (working copy)
@@ -1031,11 +1031,13 @@ vect_model_load_cost (stmt_vec_info stmt_info, int
   /* The loads themselves.  */
   if (STMT_VINFO_STRIDE_LOAD_P (stmt_info))
 {
-  /* N scalar loads plus gathering them into a vector.
- ???  scalar_to_vec isn't the cost for that.  */
+  /* N scalar loads plus gathering them into a vector.  */
+  tree vectype = STMT_VINFO_VECTYPE (stmt_info);
   inside_cost += (vect_get_stmt_cost (scalar_load) * ncopies
- * TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info)));
-  inside_cost += ncopies * vect_get_stmt_cost (scalar_to_vec);
+ * TYPE_VECTOR_SUBPARTS (vectype));
+  inside_cost += ncopies
+   * targetm.vectorize.builtin_vectorization_cost (vec_construct,
+   vectype, 0);
 }
   else
 vect_get_load_cost (first_dr, ncopies,
Index: gcc/config/spu/spu.c
===
--- gcc/config/spu/spu.c(revision 188482)
+++ gcc/config/spu/spu.c(working copy)
@@ -6908,9 +6908,11 @@ spu_builtin_mask_for_load (void)
 /* Implement targetm.vectorize.builtin_vectorization_cost.  */
 static int 
 spu_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
-tree vectype ATTRIBUTE_UNUSED,
+tree vectype,
 int misalign ATTRIBUTE_UNUSED)
 {
+  unsigned elements;
+
   switch (type_of_cost)
 {
   case scalar_stmt:
@@ -6937,6 +6939,11 @@ spu_builtin_vectorization_cost (enum vect_cost_for
   case cond_branch_taken:
 return 6;
 
+  case vec_construct:
+   elements = TYPE_VECTOR_SUBPARTS (vectype);
+   gcc_assert (elements > 1);
+   return elements / 2 + 1;

Re: [PATCH, RFC] First cut at using vec_construct for strided loads

2012-06-13 Thread William J. Schmidt
On Wed, 2012-06-13 at 11:26 +0200, Richard Guenther wrote:
> On Tue, 12 Jun 2012, William J. Schmidt wrote:
> 
> > This patch is a follow-up to the discussion generated by
> > http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00546.html.  I've added
> > vec_construct to the cost model for use in vect_model_load_cost, and
> > implemented a cost calculation that makes sense to me for PowerPC.  I'm
> > less certain about the default, i386, and spu implementations.  I took a
> > guess at i386 from the discussions we had, and used the same calculation
> > for the default and for spu.  I'm hoping you or others can fill in the
> > blanks if I guessed badly.
> > 
> > The i386 cost for vec_construct is different from all the others, which
> > are parameterized for each processor description.  This should probably
> > be parameterized in some way as well, but thought you'd know better than
> > I how that should be.  Perhaps instead of
> > 
> > elements / 2 + 1
> > 
> > it should be
> > 
> > (elements / 2) * X + Y
> > 
> > where X and Y are taken from the processor description, and represent
> > the cost of a merge and a permute, respectively.  Let me know what you
> > think.
> 
> Looks good to me with the gcc_asserts removed - TYPE_VECTOR_SUBPARTS
> might be 1 for V1TImode for example (heh, not that the vectorizer would
> vectorize to that).  But I don't see any possible breakage with
> elements == 1, do you?

No, that was some unnecessary sanity testing I was doing for my own
curiosity.  I'll pull them out and pop this in today.  Thanks for the
review!

Bill

> 
> Target maintainers can improve on the cost calculation if they wish,
> the default looks sensible to me.
> 
> Thanks,
> Richard.
> 
> > Thanks,
> > Bill
> > 
> > 
> > 2012-06-12  Bill Schmidt  
> > 
> > * targhooks.c (default_builtin_vectorization_cost): Handle
> > vec_construct, using vectype to base cost on subparts.
> > * target.h (enum vect_cost_for_stmt): Add vec_construct.
> > * tree-vect-stmts.c (vect_model_load_cost): Use vec_construct
> > instead of scalar_to_vec.
> > * config/spu/spu.c (spu_builtin_vectorization_cost): Handle
> > vec_construct in same way as default for now.
> > * config/i386/i386.c (ix86_builtin_vectorization_cost): Likewise.
> > * config/rs6000/rs6000.c (rs6000_builtin_vectorization_cost):
> > Handle vec_construct, including special case for 32-bit loads.
> > 
> > 
> > Index: gcc/targhooks.c
> > ===
> > --- gcc/targhooks.c (revision 188482)
> > +++ gcc/targhooks.c (working copy)
> > @@ -499,9 +499,11 @@ default_builtin_vectorized_conversion (unsigned in
> >  
> >  int
> >  default_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
> > -tree vectype ATTRIBUTE_UNUSED,
> > +tree vectype,
> >  int misalign ATTRIBUTE_UNUSED)
> >  {
> > +  unsigned elements;
> > +
> >switch (type_of_cost)
> >  {
> >case scalar_stmt:
> > @@ -524,6 +526,11 @@ default_builtin_vectorization_cost (enum vect_cost
> >case cond_branch_taken:
> >  return 3;
> >  
> > +  case vec_construct:
> > +   elements = TYPE_VECTOR_SUBPARTS (vectype);
> > +   gcc_assert (elements > 1);
> > +   return elements / 2 + 1;
> > +
> >default:
> >  gcc_unreachable ();
> >  }
> > Index: gcc/target.h
> > ===
> > --- gcc/target.h(revision 188482)
> > +++ gcc/target.h(working copy)
> > @@ -146,7 +146,8 @@ enum vect_cost_for_stmt
> >cond_branch_not_taken,
> >cond_branch_taken,
> >vec_perm,
> > -  vec_promote_demote
> > +  vec_promote_demote,
> > +  vec_construct
> >  };
> >  
> >  /* The target structure.  This holds all the backend hooks.  */
> > Index: gcc/tree-vect-stmts.c
> > ===
> > --- gcc/tree-vect-stmts.c   (revision 188482)
> > +++ gcc/tree-vect-stmts.c   (working copy)
> > @@ -1031,11 +1031,13 @@ vect_model_load_cost (stmt_vec_info stmt_info, int
> >/* The loads themselves.  */
> >if (STMT_VINFO_STRIDE_LOAD_P (stmt_info))
> >  {
> > -  /* N scalar loads plus gathering them into a vector.
> 

[PATCH, committed] Fix PR53647

2012-06-13 Thread William J. Schmidt
It turns out we have some old machine descriptions that have no L1
cache, so we must account for a zero line size.  Regstrapped on
powerpc64-unknown-linux-gnu with no new failures, committed as obvious.

Thanks,
Bill


2012-06-13  Bill Schmidt  

PR tree-optimization/53647
* tree-ssa-phiopt.c (gate_hoist_loads): Skip transformation for
targets with no defined cache line size.


Index: gcc/tree-ssa-phiopt.c
===
--- gcc/tree-ssa-phiopt.c   (revision 188482)
+++ gcc/tree-ssa-phiopt.c   (working copy)
@@ -1976,12 +1976,14 @@ hoist_adjacent_loads (basic_block bb0, basic_block
 /* Determine whether we should attempt to hoist adjacent loads out of
    diamond patterns in pass_phiopt.  Always hoist loads if
    -fhoist-adjacent-loads is specified and the target machine has
-   a conditional move instruction.  */
+   both a conditional move instruction and a defined cache line size.  */
 
 static bool
 gate_hoist_loads (void)
 {
-  return (flag_hoist_adjacent_loads == 1 && HAVE_conditional_move);
+  return (flag_hoist_adjacent_loads == 1
+          && PARAM_VALUE (PARAM_L1_CACHE_LINE_SIZE)
+          && HAVE_conditional_move);
 }
 
 /* Always do these optimizations if we have SSA




[PATCH] Fix PR50183

2011-09-13 Thread William J. Schmidt
Greetings,

The code to build scops (static control parts) for graphite first
rewrites loops into canonical loop-closed SSA form.  PR50183 identifies
a scenario where the results do not fulfill all required invariants of
this form.  In particular, a value defined inside a loop and used
outside that loop must reach exactly one definition, which must be a
single-argument PHI node called a close-phi.  When nested loops exist,
it is possible that, following the rewrite, a definition may reach two
close-phis.  This patch corrects that problem.

The problem arises because loops are processed from outside in.  While
processing a loop, duplicate close-phis are eliminated.  However,
eliminating duplicate close-phis for an inner loop can sometimes create
duplicate close-phis for an already-processed outer loop.  This patch
detects when this may have occurred and repeats the removal of duplicate
close-phis as necessary.
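
For illustration, a reduced shape of the problem (this example is
hypothetical, not taken from the PR):

/* Hypothetical example: V is defined in the inner loop and used after
   both loops, so loop-closed SSA form requires a single-argument
   close-phi for V at each loop exit.  Removing a duplicate close-phi
   at the inner loop's exit can leave the outer loop's
   already-processed exit with two close-phis for the same value.  */
int
f (int n)
{
  int i, j, v = 0;

  for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
      v += j;

  return v;  /* use of V outside both loops */
}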

The problem was noted on ibm/4_6-branch and 4_6-branch; it is apparently
latent on trunk.  The same patch can be applied to all three branches.

Bootstrapped and regression-tested on powerpc64-linux.  OK to commit to
these three branches?

Thanks,
Bill


2011-09-13  Bill Schmidt  

* graphite-scop-detection.c (make_close_phi_nodes_unique):  New
forward declaration.
(remove_duplicate_close_phi): Detect and repair creation of
duplicate close-phis for a containing loop.


Index: gcc/graphite-scop-detection.c
===
--- gcc/graphite-scop-detection.c   (revision 178829)
+++ gcc/graphite-scop-detection.c   (working copy)
@@ -30,6 +30,9 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-pass.h"
 #include "sese.h"
 
+/* Forward declarations.  */
+static void make_close_phi_nodes_unique (basic_block);
+
 #ifdef HAVE_cloog
 #include "ppl_c.h"
 #include "graphite-ppl.h"
@@ -1231,6 +1234,13 @@ remove_duplicate_close_phi (gimple phi, gimple_stm
 	    SET_USE (use_p, res);
 
 	  update_stmt (use_stmt);
+
+	  /* It is possible that we just created a duplicate close-phi
+	     for an already-processed containing loop.  Check for this
+	     case and clean it up.  */
+	  if (gimple_code (use_stmt) == GIMPLE_PHI
+	      && gimple_phi_num_args (use_stmt) == 1)
+	    make_close_phi_nodes_unique (gimple_bb (use_stmt));
 	}
 
   remove_phi_node (gsi, true);




[PING] Re: [PATCH] Fix PR50183

2011-09-28 Thread William J. Schmidt
Hi there,

Ping.  I'm seeking approval for this fix on trunk and 4_6-branch.
Thanks!

Bill

On Tue, 2011-09-13 at 17:55 -0500, William J. Schmidt wrote:
> Greetings,
> 
> The code to build scops (static control parts) for graphite first
> rewrites loops into canonical loop-closed SSA form.  PR50183 identifies
> a scenario where the results do not fulfill all required invariants of
> this form.  In particular, a value defined inside a loop and used
> outside that loop must reach exactly one definition, which must be a
> single-argument PHI node called a close-phi.  When nested loops exist,
> it is possible that, following the rewrite, a definition may reach two
> close-phis.  This patch corrects that problem.
> 
> The problem arises because loops are processed from outside in.  While
> processing a loop, duplicate close-phis are eliminated.  However,
> eliminating duplicate close-phis for an inner loop can sometimes create
> duplicate close-phis for an already-processed outer loop.  This patch
> detects when this may have occurred and repeats the removal of duplicate
> close-phis as necessary.
> 
> The problem was noted on ibm/4_6-branch and 4_6-branch; it is apparently
> latent on trunk.  The same patch can be applied to all three branches.
> 
> Bootstrapped and regression-tested on powerpc64-linux.  OK to commit to
> these three branches?
> 
> Thanks,
> Bill
> 
> 
> 2011-09-13  Bill Schmidt  
> 
>   * graphite-scop-detection.c (make_close_phi_nodes_unique):  New
>   forward declaration.
>   (remove_duplicate_close_phi): Detect and repair creation of
>   duplicate close-phis for a containing loop.
> 
> 
> Index: gcc/graphite-scop-detection.c
> ===
> --- gcc/graphite-scop-detection.c (revision 178829)
> +++ gcc/graphite-scop-detection.c (working copy)
> @@ -30,6 +30,9 @@ along with GCC; see the file COPYING3.  If not see
>  #include "tree-pass.h"
>  #include "sese.h"
> 
> +/* Forward declarations.  */
> +static void make_close_phi_nodes_unique (basic_block);
> +
>  #ifdef HAVE_cloog
>  #include "ppl_c.h"
>  #include "graphite-ppl.h"
> @@ -1231,6 +1234,13 @@ remove_duplicate_close_phi (gimple phi, gimple_stm
>  	    SET_USE (use_p, res);
> 
>  	  update_stmt (use_stmt);
> +
> +	  /* It is possible that we just created a duplicate close-phi
> +	     for an already-processed containing loop.  Check for this
> +	     case and clean it up.  */
> +	  if (gimple_code (use_stmt) == GIMPLE_PHI
> +	      && gimple_phi_num_args (use_stmt) == 1)
> +	    make_close_phi_nodes_unique (gimple_bb (use_stmt));
>  	}
> 
>    remove_phi_node (gsi, true);
> 
> 




Re: [PING] Re: [PATCH] Fix PR50183

2011-09-29 Thread William J. Schmidt
On Thu, 2011-09-29 at 10:03 +0100, Tobias Grosser wrote:
> On 09/29/2011 09:58 AM, Richard Guenther wrote:
> > On Thu, Sep 29, 2011 at 12:10 AM, William J. Schmidt
> >   wrote:
> >> Hi there,
> >>
> >> Ping.  I'm seeking approval for this fix on trunk and 4_6-branch.
> >> Thanks!
> >
> > Ok.
> Yes, also looks good to me. Though you may want to move the forward
> declaration after the "#ifdef HAVE_cloog". This makes it clearer that
> the whole file is not compiled if CLooG is not available.

Good point.  I'll make that change.

Thanks!
Bill

> 
> Cheers
> Tobi



[PATCH] Fix PR46556 (poor address generation)

2011-10-05 Thread William J. Schmidt
This patch addresses the poor code generation in PR46556 for the
following code:

struct x
{
  int a[16];
  int b[16];
  int c[16];
};

extern void foo (int, int, int);

void
f (struct x *p, unsigned int n)
{
  foo (p->a[n], p->c[n], p->b[n]);
}

Prior to the fix for PR32698, gcc calculated the offset for accessing
the array elements as:  n*4; 64+n*4; 128+n*4.

Following that fix, the offsets are calculated as:  n*4; (n+16)*4;
(n+32)*4.  This led to poor code generation on powerpc64 targets,
among others.
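
The two formulations are arithmetically identical; only their shape
differs.  A minimal standalone check of the equivalence for p->c[n]
(assuming 4-byte ints, as in the structure above):

#include <assert.h>

/* Minimal sketch: (n + 32) * 4 and 4*n + 128 compute the same byte
   offset; the restructured form exposes the shared 4*n term so CSE
   can reuse it across the a, b, and c accesses.  */
int
main (void)
{
  unsigned int n;

  for (n = 0; n < 16; n++)
    assert ((n + 32) * 4 == 4 * n + 128);

  return 0;
}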

The poor code generation was observed to not occur in loops, as the
IVOPTS code does a good job of lowering these expressions to MEM_REFs.
It was previously suggested that perhaps a general pass to lower memory
accesses to MEM_REFs in GIMPLE would solve not only this, but other
similar problems.  I spent some time looking into various approaches to
this, and reviewing some previous attempts to do similar things.  In the
end, I've concluded that this is a bad idea in practice because of the
loss of useful aliasing information.  In particular, early lowering of
component references causes us to lose the ability to disambiguate
non-overlapping references in the same structure, and there is no simple
way to carry the necessary aliasing information along with the
replacement MEM_REFs to avoid this.  While some performance gains are
available with GIMPLE lowering of memory accesses, there are also
offsetting performance losses, and I suspect this would just be a
continuous source of bug reports into the future.
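
A sketch of the aliasing concern (hypothetical example, not from the
PR): with the component references below, the compiler can see that
the a and b arrays occupy disjoint storage, so the store cannot
clobber the earlier load; once both accesses are lowered to MEM_REFs
with variable offsets from the same base, that structural
disambiguation is lost and the store must conservatively be treated
as a possible clobber.

struct s { int a[16]; int b[16]; };

int
g (struct s *p, int i, int j)
{
  int t = p->b[j];      /* known not to alias the store below */
  p->a[i] = 42;         /* distinct field, cannot clobber p->b[j] */
  return t + p->b[j];   /* second load can reuse the first */
}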

Therefore the current patch is a much simpler approach to solve the
specific problem noted in the PR.  There are two pieces to the patch:

 * The offending addressing pattern is matched in GIMPLE and transformed
into a restructured MEM_REF that distributes the multiply, so that
(n+32)*4 becomes 4*n+128 as before.  This is done during the
reassociation pass, for reasons described below.  The transformation
only occurs in non-loop blocks, since IVOPTS does a good job on such
things within loops.
 * A tweak is added to the RTL forward-propagator to avoid propagating
into memory references based on a single base register with no offset,
under certain circumstances.  This improves sharing of base registers
for accesses within the same structure and slightly lowers register
pressure.

It would be possible to separate these into two patches if that's
preferred.  I chose to combine them because together they provide the
ideal code generation that the new test cases test for.

I initially implemented the pattern matcher during expand, but I found
that the expanded code for two accesses to the same structure was often
not being CSEd well.  So I moved it back into the GIMPLE phases prior to
DOM to take advantage of its CSE.  To avoid an additional complete pass
over the IL, I chose to piggyback on the reassociation pass.  This
transformation is not technically a reassociation, but it is related
enough to not be a complete wart.

One noob question about this:  It would probably be preferable to have
this transformation only take place during the second reassociation
pass, so the ARRAY_REFs are seen by earlier optimization phases.  Is
there an easy way to detect that it's the second pass without having to
generate a separate pass entry point?

One other general question about the pattern-match transformation:  Is
this an appropriate transformation for all targets, or should it be
somehow gated on available addressing modes on the target processor?

Bootstrapped and regression tested on powerpc64-linux-gnu.  Verified no
performance degradations on that target for SPEC CPU2000 and CPU2006.

I'm looking for eventual approval for trunk after any comments are
resolved.  Thanks!

Bill


2011-10-05  Bill Schmidt  

gcc:

PR rtl-optimization/46556
* fwprop.c (fwprop_bb_aux_d): New struct.
(MEM_PLUS_REGS): New macro.
(record_mem_plus_reg): New function.
(record_mem_plus_regs): Likewise.
(single_def_use_enter_block): Record mem-plus-reg patterns.
(build_single_def_use_links): Allocate aux storage.
(locally_poor_mem_replacement): New function.
(forward_propagate_and_simplify): Call
locally_poor_mem_replacement.
(fwprop_init): Free storage.
* tree.h (copy_ref_info): Expose existing function.
* tree-ssa-loop-ivopts.c (copy_ref_info): Remove static token.
* tree-ssa-reassoc.c (restructure_base_and_offset): New function.
(restructure_mem_ref): Likewise.
(reassociate_bb): Look for opportunities to call
restructure_mem_ref; clean up immediate use lists.

gcc/testsuite:

PR rtl-optimization/46556
* gcc.target/powerpc/ppc-pr46556-1.c: New testcase.
* gcc.target/powerpc/ppc-pr46556-2.c: Likewise.
* gcc.target/powerpc/ppc-pr46556-3.c: Likewise.
* gcc.target/powerpc/ppc-pr46556-4.c: Likewise.
* gcc.dg/tree-ssa/pr46556-1.c: Likewise.
 

Re: [PATCH] Fix PR46556 (poor address generation)

2011-10-05 Thread William J. Schmidt
On Wed, 2011-10-05 at 18:29 +0200, Steven Bosscher wrote:
> On Wed, Oct 5, 2011 at 6:13 PM, William J. Schmidt
>  wrote:
> >* tree-ssa-loop-ivopts.c (copy_ref_info): Remove static token.
> 
> Rather than this, why not move the function to common code somewhere?
> 
> Ciao!
> Steven

An alternative would be to move it into tree-ssa-address.c, where there
is already a simpler version called copy_mem_ref_info.  I'm open to that
if it's preferable.

Bill



Re: [PATCH] Fix PR46556 (poor address generation)

2011-10-05 Thread William J. Schmidt
On Wed, 2011-10-05 at 18:21 +0200, Paolo Bonzini wrote:
> On 10/05/2011 06:13 PM, William J. Schmidt wrote:
> > One other general question about the pattern-match transformation:  Is
> > this an appropriate transformation for all targets, or should it be
> > somehow gated on available addressing modes on the target processor?
> >
> > Bootstrapped and regression tested on powerpc64-linux-gnu.  Verified no
> > performance degradations on that target for SPEC CPU2000 and CPU2006.
> >
> > I'm looking for eventual approval for trunk after any comments are
> > resolved.  Thanks!
> 
> What do the costs look like for the two transforms you mention in the
> head comment of locally_poor_mem_replacement?
> 
> Paolo
> 

I don't know off the top of my head -- I'll have to gather that
information.  The issue is that the profitability is really
context-sensitive, so just the isolated costs of insns aren't enough.
The forward propagation of the add into (mem (reg REG)) looks like a
slam dunk in the absence of other information, but if there are other
nearby references using nonzero offsets from REG, this just extends the
lifetimes of X and Y without eliminating the need for REG.


