[Patch] Fix PR53397

2012-10-01 Thread venkataramanan.kumar
Hi, 

The patch below fixes the FFT/SciMark regression caused by useless prefetch
generation.

The fix makes prefetching less aggressive: arrays accessed with a non-constant
step in the inner loop are prefetched only when that step is invariant in the
entire loop nest.

GCC currently prefetches references with a non-constant step when the step is
invariant in the inner loop, but it does not check whether the step varies in
the outer loops.

In the SciMark FFT case, the inner loop advances by a non-constant step that
is invariant in the inner loop but changes in the outer loop. As a result the
inner loop trip count can become small (at run time it is sometimes as small
as 1 iteration).

Prefetching x iterations ahead when the inner loop trip count is smaller than
x leads to useless prefetches.
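
To make the trip-count argument concrete, here is a minimal sketch in C. It
models an inner loop of the form "for (b = 0; b < n; b += 2 * step)" whose
step doubles in the outer loop, as in an FFT butterfly. The function names and
the prefetch distance are hypothetical and chosen for illustration only; they
are not taken from GCC or SciMark.

```c
#include <assert.h>

/* Trip count of: for (b = 0; b < n; b += 2 * step).  */
int inner_trip_count(int n, int step)
{
    assert(step > 0);
    if (n <= 0)
        return 0;
    return (n + 2 * step - 1) / (2 * step);
}

/* Nonzero when a prefetch issued `ahead` iterations in advance can never
   be consumed, because the loop ends before reaching that iteration.  */
int prefetch_never_used(int n, int step, int ahead)
{
    return inner_trip_count(n, step) <= ahead;
}

/* Count the outer iterations (step = 1, 2, 4, ... < n) whose inner loop
   is too short for a prefetch distance of `ahead` iterations.  */
int count_short_inner_loops(int n, int ahead)
{
    int count = 0;
    for (int step = 1; step < n; step *= 2)
        if (prefetch_never_used(n, step, ahead))
            count++;
    return count;
}
```

With n = 1024 and an assumed distance of 4 iterations, the last three outer
iterations run inner loops of 4, 2 and 1 iterations, so every prefetch issued
in them is wasted.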

Flags used: -O3 -march=amdfam10

Before 
**  **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to p...@nist.gov) **
**  **
Using   2.00 seconds min time per kenel.
Composite Score:  550.50
FFT Mflops:38.66(N=1024)
SOR Mflops:   617.61(100 x 100)
MonteCarlo: Mflops:   173.74
Sparse matmult  Mflops:   675.63(N=1000, nz=5000)
LU  Mflops:  1246.88(M=100, N=100)


After 
**  **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to p...@nist.gov) **
**  **
Using   2.00 seconds min time per kenel.
Composite Score:  639.20
FFT Mflops:   479.19(N=1024)
SOR Mflops:   617.61(100 x 100)
MonteCarlo: Mflops:   173.18
Sparse matmult  Mflops:   679.13(N=1000, nz=5000)
LU  Mflops:  1246.88(M=100, N=100)

GCC regression "make check -k" passes with x86_64-unknown-linux-gnu
New tests that PASS:

gcc.dg/pr53397-1.c scan-assembler prefetcht0
gcc.dg/pr53397-1.c scan-tree-dump aprefetch "Issued prefetch"
gcc.dg/pr53397-1.c (test for excess errors)
gcc.dg/pr53397-2.c scan-tree-dump aprefetch "loop variant step"
gcc.dg/pr53397-2.c scan-tree-dump aprefetch "Not prefetching"
gcc.dg/pr53397-2.c (test for excess errors)


Checked CPU2006 and polyhedron on latest AMD processor, no regressions noted.

Ok to commit in trunk?

regards,
Venkat

gcc/ChangeLog
+2012-10-01  Venkataramanan Kumar  
+
+   * tree-ssa-loop-prefetch.c (gather_memory_references_ref):
+   Perform non constant step prefetching in inner loop, only
+   when it is invariant in the entire loop nest.
+   * testsuite/gcc.dg/pr53397-1.c: New test case.
+   Checks we are prefetching for loop invariant steps.
+   * testsuite/gcc.dg/pr53397-2.c: New test case.
+   Checks we are not prefetching for loop variant steps.
+


Index: gcc/testsuite/gcc.dg/pr53397-1.c
===
--- gcc/testsuite/gcc.dg/pr53397-1.c(revision 0)
+++ gcc/testsuite/gcc.dg/pr53397-1.c(revision 0)
@@ -0,0 +1,28 @@
+/* Prefetching when the step is loop invariant.  */
+
+/* { dg-do compile } */
+/* { dg-options "-O3 -fprefetch-loop-arrays -fdump-tree-aprefetch-details --param min-insn-to-prefetch-ratio=3 --param simultaneous-prefetches=10 -fdump-tree-aprefetch-details" } */
+
+
+double data[16384];
+void prefetch_when_non_constant_step_is_invariant(int step, int n)
+{
+ int a;
+ int b;
+ for (a = 1; a < step; a++) {
+for (b = 0; b < n; b += 2 * step) {
+
+  int i = 2*(b + a);
+  int j = 2*(b + a + step);
+
+
+  data[j]   = data[i];
+  data[j+1] = data[i+1];
+}
+ }
+}
+
+/* { dg-final { scan-tree-dump "Issued prefetch" "aprefetch" } } */
+/* { dg-final { scan-assembler "prefetcht0" } } */
+
+/* { dg-final { cleanup-tree-dump "aprefetch" } } */
Index: gcc/testsuite/gcc.dg/pr53397-2.c
===
--- gcc/testsuite/gcc.dg/pr53397-2.c(revision 0)
+++ gcc/testsuite/gcc.dg/pr53397-2.c(revision 0)
@@ -0,0 +1,29 @@
+/* Not prefetching when the step is loop variant.  */
+
+/* { dg-do compile } */
+/* { dg-options "-O3 -fprefetch-loop-arrays -fdump-tree-aprefetch-details --param min-insn-to-prefetch-ratio=3 --param simultaneous-prefetches=10 -fdump-tree-aprefetch-details" } */
+
+
+double data[16384];
+void donot_prefetch_when_non_constant_step_is_variant(int step, int n)
+{ 
+ int a;
+ int b;
+ for (a = 1; a < step; a++,step*=2) {
+for (b = 0; b < n; b += 2 * step) {
+
+  int i = 2*(b + a);
+  int j = 2*(b + a + step);
+
+
+  data[j]   = data[i];
+  data[j+1] = data[i+1];
+}
+ } 
+}
+
+/* { dg-final { scan-tree-dump "Not prefetching" "aprefetch" } } */
+/* { dg-final { scan-tree-dump "loop variant step" "aprefetch" }

[Gcc.amd] [Patch 001] Document bdver1/btver1 in invoke.texi

2011-11-11 Thread venkataramanan.kumar
> Subject: Re: [Gcc.amd] [Patch 001] [x86 backend] Define march/mtune for
> upcoming AMD Bulldozer procesor.
> 
> > Hello!
> >
> > > This patch defines -march=bdver1 and -mtune=bdver1 flag for the upcoming
> > > AMD Bulldozer processor.
> Hi,
> it seems that bdver/btver is not mentioned in invoke.texi nor changes.html.
> Could you please add documentation?
> 
> Honza

Hi Honza,  

I have added documentation for bdver1/btver1 in invoke.texi.

Is it OK to commit?

Index: gcc/doc/invoke.texi
===
--- gcc/doc/invoke.texi (revision 181283)
+++ gcc/doc/invoke.texi (working copy)
@@ -12803,6 +12803,15 @@
 AMD Family 10h core based CPUs with x86-64 instruction set support.  (This
 supersets MMX, SSE, SSE2, SSE3, SSE4A, 3DNow!, enhanced 3DNow!, ABM and 64-bit
 instruction set extensions.)
+@item bdver1
+AMD Family 15h core based CPUs with x86-64 instruction set support.  (This
+supersets FMA4, AVX, XOP, LWP, AES, PCL_MUL, CX16, MMX, SSE, SSE2, SSE3, SSE4A,
+SSSE3, SSE4.1, SSE4.2, 3DNow!, enhanced 3DNow!, ABM and 64-bit
+instruction set extensions.)
+@item btver1
+AMD Family 14h core based CPUs with x86-64 instruction set support.  (This
+supersets MMX, SSE, SSE2, SSE3, SSSE3, SSE4A, CX16, ABM and 64-bit
+instruction set extensions.)
 @item winchip-c6
 IDT Winchip C6 CPU, dealt in same way as i486 with additional MMX instruction
 set support.
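
As an illustration of how the documented options surface in user code, the
sketch below reports which of a few ISA extensions were enabled when the file
was compiled. It relies on the macros GCC predefines when the corresponding
extension is on (__SSE4A__, __AVX__, __XOP__, __FMA4__); the helper name
enabled_isa is hypothetical, and which macros are actually defined depends
on the -march value and the compiler version.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Write a space-separated list of the ISA extensions enabled at compile
   time into buf and return buf.  The list is empty when none is enabled.  */
const char *enabled_isa(char *buf, size_t size)
{
    assert(buf != NULL && size > 0);
    buf[0] = '\0';
#ifdef __SSE4A__
    strncat(buf, "sse4a ", size - strlen(buf) - 1);
#endif
#ifdef __AVX__
    strncat(buf, "avx ", size - strlen(buf) - 1);
#endif
#ifdef __XOP__
    strncat(buf, "xop ", size - strlen(buf) - 1);
#endif
#ifdef __FMA4__
    strncat(buf, "fma4 ", size - strlen(buf) - 1);
#endif
    return buf;
}
```

Compiling the same file with and without -march=bdver1 and printing the
returned string is one quick way to compare the enabled extensions with the
list in the documentation.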




[Gcc.amd] [Patch 002] Document bdver1 in changes.html for GCC4.6

2011-11-11 Thread venkataramanan.kumar
> Subject: Re: [Gcc.amd] [Patch 001] [x86 backend] Define march/mtune for
> upcoming AMD Bulldozer procesor.
> 
> > Hello!
> >
> > > This patch defines -march=bdver1 and -mtune=bdver1 flag for the upcoming
> > > AMD Bulldozer processor.
> Hi,
> it seems that bdver/btver is not mentioned in invoke.texi nor changes.html.
> Could you please add documentation?
> 
> Honza

Hi Honza,  

I have added bdver1 information to changes.html for GCC 4.6.

Is it OK to commit?

Index: changes.html
===
RCS file: /cvs/gcc/wwwdocs/htdocs/gcc-4.6/changes.html,v
retrieving revision 1.136
diff -u -r1.136 changes.html
--- changes.html30 Oct 2011 12:55:43 -  1.136
+++ changes.html11 Nov 2011 12:26:03 -
@@ -813,6 +813,9 @@
 Support for AMD Bobcat (family 14) processors is now available through
the -march=btver1 and -mtune=btver1
options.
+Support for AMD Bulldozer (family 15) processors is now available
+   through the -march=bdver1 and -mtune=bdver1
+   options.
 The default setting (when not optimizing for size) for 32-bit
   GNU/Linux and Darwin x86 targets has been changed to
   -fomit-frame-pointer.  The default can be reverted




[PATCH,i386] Enable prefetchw in processor alias table for AMD targets

2012-09-11 Thread venkataramanan.kumar
Hi Maintainers,

This patch enables the "prefetchw" ISA flag (PTA_PRFCHW) in the processor
alias table for the AMD targets amdfam10, barcelona, bdver1, bdver2, btver1
and btver2.
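
For context, here is a minimal sketch of code that can benefit. The GCC
builtin __builtin_prefetch takes a read/write hint as its second argument;
a value of 1 requests a prefetch for writing, and on targets where PRFCHW is
enabled GCC may emit prefetchw for it. The function name and the prefetch
distance below are hypothetical, and whether prefetchw is actually emitted
depends on the compiler version and options.

```c
#include <assert.h>
#include <stddef.h>

/* Multiply every element of v by k, hinting that elements a few
   iterations ahead are about to be written.  */
void scale_in_place(double *v, size_t n, double k)
{
    const size_t ahead = 8;   /* hypothetical prefetch distance */

    assert(v != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (i + ahead < n)
            __builtin_prefetch(&v[i + ahead], 1, 3);   /* 1 = write hint */
        v[i] *= k;
    }
}
```

The prefetch is only a hint, so the result is the same whether or not the
target supports the instruction.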

GCC regression test passes with the patch.

Ok for trunk?

Change log:

2012-09-11  Venkataramanan Kumar  

* config/i386/i386.c (processor_alias_table): Enable PTA_PRFCHW
for targets amdfam10, barcelona, bdver1, bdver2, btver1 and btver2.

Index: gcc/config/i386/i386.c
===
--- gcc/config/i386/i386.c  (revision 190345)
+++ gcc/config/i386/i386.c  (working copy)
@@ -3151,31 +3151,33 @@
| PTA_SSE2 | PTA_NO_SAHF},
   {"amdfam10", PROCESSOR_AMDFAM10, CPU_AMDFAM10,
PTA_64BIT | PTA_MMX | PTA_3DNOW | PTA_3DNOW_A | PTA_SSE
-   | PTA_SSE2 | PTA_SSE3 | PTA_SSE4A | PTA_CX16 | PTA_ABM},
+   | PTA_SSE2 | PTA_SSE3 | PTA_SSE4A | PTA_CX16 | PTA_ABM 
+   | PTA_PRFCHW},
   {"barcelona", PROCESSOR_AMDFAM10, CPU_AMDFAM10,
PTA_64BIT | PTA_MMX | PTA_3DNOW | PTA_3DNOW_A | PTA_SSE
-   | PTA_SSE2 | PTA_SSE3 | PTA_SSE4A | PTA_CX16 | PTA_ABM},
+   | PTA_SSE2 | PTA_SSE3 | PTA_SSE4A | PTA_CX16 | PTA_ABM
+   | PTA_PRFCHW},
   {"bdver1", PROCESSOR_BDVER1, CPU_BDVER1,
PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2 | PTA_SSE3
| PTA_SSE4A | PTA_CX16 | PTA_ABM | PTA_SSSE3 | PTA_SSE4_1
| PTA_SSE4_2 | PTA_AES | PTA_PCLMUL | PTA_AVX | PTA_FMA4
-   | PTA_XOP | PTA_LWP},
+   | PTA_XOP | PTA_LWP | PTA_PRFCHW},
   {"bdver2", PROCESSOR_BDVER2, CPU_BDVER2,
PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2 | PTA_SSE3
| PTA_SSE4A | PTA_CX16 | PTA_ABM | PTA_SSSE3 | PTA_SSE4_1
| PTA_SSE4_2 | PTA_AES | PTA_PCLMUL | PTA_AVX
| PTA_XOP | PTA_LWP | PTA_BMI | PTA_TBM | PTA_F16C
-   | PTA_FMA},
+   | PTA_FMA | PTA_PRFCHW},
   {"btver1", PROCESSOR_BTVER1, CPU_GENERIC64,
 PTA_64BIT | PTA_MMX |  PTA_SSE  | PTA_SSE2 | PTA_SSE3
-| PTA_SSSE3 | PTA_SSE4A |PTA_ABM | PTA_CX16},
+| PTA_SSSE3 | PTA_SSE4A |PTA_ABM | PTA_CX16 | PTA_PRFCHW},
   {"generic32", PROCESSOR_GENERIC32, CPU_PENTIUMPRO,
PTA_HLE /* flags are only used for -march switch.  */ },
   {"btver2", PROCESSOR_BTVER2, CPU_GENERIC64,
PTA_64BIT | PTA_MMX |  PTA_SSE  | PTA_SSE2 | PTA_SSE3
| PTA_SSSE3 | PTA_SSE4A |PTA_ABM | PTA_CX16 | PTA_SSE4_1
| PTA_SSE4_2 | PTA_AES | PTA_PCLMUL | PTA_AVX
-   | PTA_BMI | PTA_F16C | PTA_MOVBE},
+   | PTA_BMI | PTA_F16C | PTA_MOVBE | PTA_PRFCHW},
   {"generic64", PROCESSOR_GENERIC64, CPU_GENERIC64,
PTA_64BIT
 | PTA_HLE /* flags are only used for -march switch.  */ },



[PATCH, i386]: AMD btver2 enablement

2012-07-20 Thread venkataramanan.kumar
Hi Maintainers,

The patch below does the basic enablement for the next-generation AMD
low-power btver2 core.
It defines -march=btver2 and -mtune=btver2, and lets -march=native correctly
recognize btver2. At the moment the tuning is mostly a copy of btver1.
The patch passed bootstrap and the x86 tests.

Is it OK to commit to trunk?

Also can I modify doc/invoke.texi now?

regards,
Venkat.

Index: gcc/ChangeLog
===
--- gcc/ChangeLog   (revision 189510)
+++ gcc/ChangeLog   (working copy)
@@ -1,3 +1,28 @@
+2012-7-18  Venkataramanan Kumar  
+
+   Jaguar Enablement
+   * config.gcc (i[34567]86-*-linux* | ...): Add btver2.
+   (case ${target}): Add btver2.
+   * config/i386/driver-i386.c (host_detect_local_cpu): Let
+   -march=native recognize btver2 processors.
+   * config/i386/i386-c.c (ix86_target_macros_internal): Add
+   btver2 def_and_undef
+   * config/i386/i386.c (struct processor_costs btver2_cost): New
+   btver2 cost table.
+   (m_BTVER2): New definition.
+   (m_AMD_MULTIPLE): Includes m_BTVER2.
+   (initial_ix86_tune_features): Add btver2 tune.
+   (processor_target_table): Add btver2 entry.
+   (static const char *const cpu_names): Add btver2 entry.
+   (software_prefetching_beneficial_p): Add btver2.
+   (ix86_option_override_internal): Add btver2 instruction sets.
+   (ix86_issue_rate): Add btver2.
+   (ix86_adjust_cost): Add btver2.
+   * config/i386/i386.h (TARGET_BTVER2): New definition.
+   (enum target_cpu_default): Add TARGET_CPU_DEFAULT_btver2.
+   (enum processor_type): Add PROCESSOR_BTVER2.
+   * config/i386/i386.md (define_attr "cpu"): Add btver2.
+
 2012-07-16  Hans-Peter Nilsson  
 
* config/cris/cris-protos.h (cris_legitimate_address_p): Declare.
Index: gcc/config.gcc
===
--- gcc/config.gcc  (revision 189510)
+++ gcc/config.gcc  (working copy)
@@ -1214,7 +1214,7 @@
TM_MULTILIB_CONFIG=`echo $TM_MULTILIB_CONFIG | sed 
's/^,//'`
need_64bit_isa=yes
case X"${with_cpu}" in
-   Xgeneric|Xatom|Xcore2|Xcorei7|Xcorei7-avx|Xnocona|Xx86-64|Xbdver2|Xbdver1|Xbtver1|Xamdfam10|Xbarcelona|Xk8|Xopteron|Xathlon64|Xathlon-fx|Xathlon64-sse3|Xk8-sse3|Xopteron-sse3)
+   Xgeneric|Xatom|Xcore2|Xcorei7|Xcorei7-avx|Xnocona|Xx86-64|Xbdver2|Xbdver1|Xbtver2|Xbtver1|Xamdfam10|Xbarcelona|Xk8|Xopteron|Xathlon64|Xathlon-fx|Xathlon64-sse3|Xk8-sse3|Xopteron-sse3)
;;
X)
if test x$with_cpu_64 = x; then
@@ -1223,7 +1223,7 @@
;;
*)
echo "Unsupported CPU used in --with-cpu=$with_cpu, supported values:" 1>&2
-   echo "generic atom core2 corei7 corei7-avx nocona x86-64 bdver2 bdver1 btver1 amdfam10 barcelona k8 opteron athlon64 athlon-fx athlon64-sse3 k8-sse3 opteron-sse3" 1>&2
+   echo "generic atom core2 corei7 corei7-avx nocona x86-64 bdver2 bdver1 btver2 btver1 amdfam10 barcelona k8 opteron athlon64 athlon-fx athlon64-sse3 k8-sse3 opteron-sse3" 1>&2
exit 1
;;
esac
@@ -1335,7 +1335,7 @@
tmake_file="$tmake_file i386/t-sol2-64"
need_64bit_isa=yes
case X"${with_cpu}" in
-   Xgeneric|Xatom|Xcore2|Xcorei7|Xcorei7-avx|Xnocona|Xx86-64|Xbdver2|Xbdver1|Xbtver1|Xamdfam10|Xbarcelona|Xk8|Xopteron|Xathlon64|Xathlon-fx|Xathlon64-sse3|Xk8-sse3|Xopteron-sse3)
+   Xgeneric|Xatom|Xcore2|Xcorei7|Xcorei7-avx|Xnocona|Xx86-64|Xbdver2|Xbdver1|Xbtver2|Xbtver1|Xamdfam10|Xbarcelona|Xk8|Xopteron|Xathlon64|Xathlon-fx|Xathlon64-sse3|Xk8-sse3|Xopteron-sse3)
;;
X)
if test x$with_cpu_64 = x; then
@@ -1344,7 +1344,7 @@
;;
*)
echo "Unsupported CPU used in --with-cpu=$with_cpu, supported values:" 1>&2
-   echo "generic atom core2 corei7 corei7-avx nocona x86-64 bdver2 bdver1 btver1 amdfam10 barcelona k8 opteron athlon64 athlon-fx athlon64-sse3 k8-sse3 opteron-sse3" 1>&2
+   echo "generic atom core2 corei7 corei7-avx nocona x86-64 bdver2 bdver1 btver2 btver1 amdfam10 barcelona k8 opteron athlon64 athlon-fx athlon64-sse3 k8-sse3 opteron-sse3" 1>&2
exit 1
;;
esac
@@ -1401,7 +1401,7 @@
if test x$enable_targets = xall; then
tm_defines="${tm_defines} TARGET_BI_ARCH=1"
case X"${with_cpu}" in
-

[PATCH, i386]: Back port Fix PR 52908 - xop-mul-1:f9 miscompiled on bulldozer (-mxop) to 4.7

2012-06-07 Thread venkataramanan.kumar
Hi Maintainers,

Please find the patch below that backports PR target/52908 to GCC 4.7.

The patch passed bootstrap and regression test.
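
As background for the selector fix in this backport, a scalar model of the
two XOP multiply-accumulate instructions may help. To my understanding,
vpmacsdql multiplies the even-indexed (0 and 2) signed 32-bit elements and
vpmacsdqh the odd-indexed (1 and 3) ones, adding each 64-bit product to the
matching accumulator lane; this agrees with the corrected vec_select indices
in the patch. The function names are hypothetical, and the sketch models only
the arithmetic, not the instruction encoding or operand constraints.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 64-bit add that wraps on overflow (done in unsigned arithmetic; the
   conversion back to int64_t wraps on GCC).  */
static int64_t wrap_add(int64_t x, int64_t y)
{
    return (int64_t)((uint64_t)x + (uint64_t)y);
}

/* Scalar model of the "low" form: elements 0 and 2 are multiplied.  */
void model_pmacsdql(int64_t dst[2], const int32_t a[4],
                    const int32_t b[4], const int64_t acc[2])
{
    assert(dst != NULL && a != NULL && b != NULL && acc != NULL);
    dst[0] = wrap_add((int64_t)a[0] * b[0], acc[0]);
    dst[1] = wrap_add((int64_t)a[2] * b[2], acc[1]);
}

/* Scalar model of the "high" form: elements 1 and 3 are multiplied.  */
void model_pmacsdqh(int64_t dst[2], const int32_t a[4],
                    const int32_t b[4], const int64_t acc[2])
{
    assert(dst != NULL && a != NULL && b != NULL && acc != NULL);
    dst[0] = wrap_add((int64_t)a[1] * b[1], acc[0]);
    dst[1] = wrap_add((int64_t)a[3] * b[3], acc[1]);
}
```

With a zero accumulator, as the expanders in the patch use, each function
reduces to a plain widening multiply of the selected elements.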

Ok to commit?

regards,
Venkat.


Index: ChangeLog
===
--- ChangeLog   (revision 187449)
+++ ChangeLog   (working copy)
@@ -1,3 +1,17 @@
+2012-06-07  Venkataramanan Kumar 
+
+   Backport from  2012-05-09 mainline r187354
+
+   PR target/52908
+   * config/i386/sse.md (vec_widen_smult_hi_v4si): Expand using
+   xop_pmacsdqh insn pattern instead of xop_mulv2div2di3_high.
+   (vec_widen_smult_lo_v4si): Expand using xop_pmacsdql insn pattern
+   instead of xop_mulv2div2di3_low.
+   (xop_pdql): Fix vec_select selector.
+   (xop_pdqh): Ditto.
+   (xop_mulv2div2di3_low): Remove insn_and_split pattern.
+   (xop_mulv2div2di3_high): Ditto.
+
 2012-05-13  Uros Bizjak  
 
Backport from mainline
Index: testsuite/gcc.target/i386/xop-imul32widen-vector.c
===
--- testsuite/gcc.target/i386/xop-imul32widen-vector.c  (revision 187449)
+++ testsuite/gcc.target/i386/xop-imul32widen-vector.c  (working copy)
@@ -32,5 +32,5 @@
   exit (0);
 }
 
-/* { dg-final { scan-assembler "vpmacsdql" } } */
+/* { dg-final { scan-assembler "vpmuldq" } } */
 /* { dg-final { scan-assembler "vpmacsdqh" } } */
Index: testsuite/ChangeLog
===
--- testsuite/ChangeLog (revision 187449)
+++ testsuite/ChangeLog (working copy)
@@ -1,3 +1,11 @@
+2012-06-07  Venkataramanan Kumar  
+
+   Back port from 2012-05-09 mainline r187354
+
+   PR target/52908
+   * gcc.target/i386/xop-imul32widen-vector.c: Update scan-assembler
+   directive to Scan for vpmuldq, not vpmacsdql.
+
 2012-05-12  Eric Botcazou  
 
* gnat.dg/null_pointer_deref3.adb: New test.
Index: config/i386/sse.md
===
--- config/i386/sse.md  (revision 187449)
+++ config/i386/sse.md  (working copy)
@@ -5743,11 +5743,15 @@
 
   if (TARGET_XOP)
 {
+  rtx t3 = gen_reg_rtx (V2DImode);
+
   emit_insn (gen_sse2_pshufd_1 (t1, op1, GEN_INT (0), GEN_INT (2),
GEN_INT (1), GEN_INT (3)));
   emit_insn (gen_sse2_pshufd_1 (t2, op2, GEN_INT (0), GEN_INT (2),
GEN_INT (1), GEN_INT (3)));
-  emit_insn (gen_xop_mulv2div2di3_high (operands[0], t1, t2));
+  emit_move_insn (t3, CONST0_RTX (V2DImode));
+
+  emit_insn (gen_xop_pmacsdqh (operands[0], t1, t2, t3));
   DONE;
 }
 
@@ -5772,11 +5776,15 @@
 
   if (TARGET_XOP)
 {
+  rtx t3 = gen_reg_rtx (V2DImode);
+
   emit_insn (gen_sse2_pshufd_1 (t1, op1, GEN_INT (0), GEN_INT (2),
GEN_INT (1), GEN_INT (3)));
   emit_insn (gen_sse2_pshufd_1 (t2, op2, GEN_INT (0), GEN_INT (2),
GEN_INT (1), GEN_INT (3)));
-  emit_insn (gen_xop_mulv2div2di3_low (operands[0], t1, t2));
+  emit_move_insn (t3, CONST0_RTX (V2DImode));
+
+  emit_insn (gen_xop_pmacsdql (operands[0], t1, t2, t3));
   DONE;
 }
 
@@ -10443,12 +10451,12 @@
  (sign_extend:V2DI
   (vec_select:V2SI
(match_operand:V4SI 1 "nonimmediate_operand" "%x")
-   (parallel [(const_int 1)
-  (const_int 3)])))
- (vec_select:V2SI
+(parallel [(const_int 0)
+   (const_int 2)])))
+  (vec_select:V2SI
   (match_operand:V4SI 2 "nonimmediate_operand" "xm")
-  (parallel [(const_int 1)
- (const_int 3)])))
+  (parallel [(const_int 0)
+ (const_int 2)])))
 (match_operand:V2DI 3 "nonimmediate_operand" "x")))]
   "TARGET_XOP"
   "vpmacssdql\t{%3, %2, %1, %0|%0, %1, %2, %3}"
@@ -10462,13 +10470,13 @@
  (sign_extend:V2DI
   (vec_select:V2SI
(match_operand:V4SI 1 "nonimmediate_operand" "%x")
-   (parallel [(const_int 0)
-  (const_int 2)])))
+   (parallel [(const_int 1)
+  (const_int 3)])))
  (sign_extend:V2DI
   (vec_select:V2SI
(match_operand:V4SI 2 "nonimmediate_operand" "xm")
-   (parallel [(const_int 0)
-  (const_int 2)]
+   (parallel [(const_int 1)
+  (const_int 3)]
 (match_operand:V2DI 3 "nonimmediate_operand" "x")))]
   "TARGET_XOP"
   "vpmacssdqh\t{%3, %2, %1, %0|%0, %1, %2, %3}"
@@ -10482,61 +10490,19 @@
  (sign_extend:V2DI
   (vec_select:V2SI
(match_operand:V4SI 1 "nonimmediate_operand" "%x")
-   (parallel [(const_int 1)
-  (const_int 3)])))
+   (parallel [(const_int 0)
+  (const_int 2)])))
  (sign_extend:V2DI
   (vec_select:V2S