Re: Partition and subpartition Analysis that helps in further vectorization and parallelization

2015-07-16 Thread Richard Biener
On Tue, Jul 14, 2015 at 1:33 PM, Ajit Kumar Agarwal wrote:
> All:
>
> I am trying to place the following analysis in the vectorizer of GCC. It
> helps in improving the vectorizer to a great extent for unit-stride,
> zero-stride and non-unit-stride accesses of memory.
>
> For the Data Dependence Graph (DDG), a topological sort is performed. On the
> topologically sorted DDG, a timestamp is assigned to each node based on the
> following algorithm.
>
> For each node in topologically sorted order in the DDG
> {
>
>    Timestamp(node) = Max(0, Timestamp of all predecessors of node) + 1;
>
> }
>
> Based on the above timestamp calculation, the partitions of the DDG are
> formed. Each partition holds the nodes with the same timestamp, so the nodes
> in each partition are independent in the DDG and can be vectorized together.
> To enable the vectorization, each partition is further sub-partitioned
> according to contiguous and non-contiguous accesses: the memory addresses of
> all the operands of each node in the partition are sorted in
> increasing/decreasing order, and based on that sorted order the
> sub-partitions are formed for unit-stride accesses, zero-stride accesses and
> the accesses that require shuffling of operands through vectorized
> instructions.
>
> The above analysis helps in performing data layout on the partitioned nodes
> of the DDG. Based on the sub-partitions formed above, more vectorization
> opportunities are enabled: data layout can be applied to the non-contiguous
> accesses, while the sub-partitions with contiguous accesses help in
> vectorization directly.
>
> Thoughts?
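
A minimal C sketch (not GCC code) of the timestamp assignment described above,
run on a toy DDG; the node structure and graph representation are invented for
illustration only:

/* Assign Timestamp(node) = Max(timestamps of predecessors) + 1 over a DDG
   given in topological order, so that nodes sharing a timestamp have no
   dependence between each other and form one partition.  */

#include <stdio.h>

#define MAX_PREDS 8

struct ddg_node
{
  int num_preds;
  int preds[MAX_PREDS];   /* indices of predecessor nodes */
  int timestamp;
};

/* NODES must be in topological order: every predecessor of nodes[i]
   appears at an index smaller than i.  */
static void
assign_timestamps (struct ddg_node *nodes, int n)
{
  for (int i = 0; i < n; i++)
    {
      int max_pred = 0;
      for (int p = 0; p < nodes[i].num_preds; p++)
        {
          int t = nodes[nodes[i].preds[p]].timestamp;
          if (t > max_pred)
            max_pred = t;
        }
      nodes[i].timestamp = max_pred + 1;
    }
}

int
main (void)
{
  /* Toy DAG: 0 -> 2, 1 -> 2, 2 -> 3.  Nodes 0 and 1 get timestamp 1 and
     therefore land in the same (independent) partition.  */
  struct ddg_node nodes[4] = {
    { 0, { 0 }, 0 },
    { 0, { 0 }, 0 },
    { 2, { 0, 1 }, 0 },
    { 1, { 2 }, 0 },
  };

  assign_timestamps (nodes, 4);
  for (int i = 0; i < 4; i++)
    printf ("node %d: timestamp %d\n", i, nodes[i].timestamp);
  return 0;
}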

As mentioned for the other DDG-related idea, we already have loop distribution.
It is not enabled by default because it lacks a good cost model.  Its current
cost model looks at data locality/reuse only.  Adding a cost model targeting
vectorization sounds interesting.

Richard.

> Thanks & Regards
> Ajit


GCC 4.9.3 build on i386-apple-darwin9.8.0

2015-07-16 Thread Andras Salamon

Mac OS X 10.5 with XCode 3.1.4, bootstrapped via gcc-4.7.4.

config.guess:
i386-apple-darwin9.8.0

gcc -v:
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/i386-apple-darwin9.8.0/4.9.3/lto-wrapper
Target: i386-apple-darwin9.8.0
Configured with: ./configure --localstatedir=/usr/local/var --prefix=/usr/local 
--sysconfdir=/usr/local/etc --enable-languages=c,c++
Thread model: posix
gcc version 4.9.3 (GCC) 


Native configuration is i386-apple-darwin9.8.0

=== g++ tests ===

Schedule of variations:
   unix

Running target unix
FAIL: g++.dg/debug/dwarf2/imported-decl-2.C -std=gnu++98  scan-assembler-times ascii 
"0".*ascii "0".*DIE .0x[0-9a-f]*. DW_TAG_imported_declaration 1
FAIL: g++.dg/debug/dwarf2/imported-decl-2.C -std=gnu++11  scan-assembler-times ascii 
"0".*ascii "0".*DIE .0x[0-9a-f]*. DW_TAG_imported_declaration 1
FAIL: g++.dg/debug/dwarf2/imported-decl-2.C -std=gnu++1y  scan-assembler-times ascii 
"0".*ascii "0".*DIE .0x[0-9a-f]*. DW_TAG_imported_declaration 1
FAIL: g++.dg/ipa/pr61160-3.C -std=gnu++98 execution test
FAIL: g++.dg/ipa/pr61160-3.C -std=gnu++11 execution test
FAIL: g++.dg/ipa/pr61160-3.C -std=gnu++1y execution test
FAIL: g++.dg/opt/vt4.C -std=gnu++98  scan-assembler-not _ZTV.A
FAIL: g++.dg/opt/vt4.C -std=gnu++11  scan-assembler-not _ZTV.A
FAIL: g++.dg/opt/vt4.C -std=gnu++1y  scan-assembler-not _ZTV.A
FAIL: g++.dg/lto/pr65549 cp_lto_pr65549_0.o assemble,  -std=gnu++14 -flto -g 
UNRESOLVED: g++.dg/lto/pr65549 cp_lto_pr65549_0.o-cp_lto_pr65549_0.o link  -std=gnu++14 -flto -g 
UNRESOLVED: g++.dg/lto/pr65549 cp_lto_pr65549_0.o-cp_lto_pr65549_0.o execute  -std=gnu++14 -flto -g 
FAIL: g++.dg/lto/pr65549 cp_lto_pr65549_0.o assemble,  -std=gnu++14 -flto -g -O2 -fno-inline -flto-partition=max 
UNRESOLVED: g++.dg/lto/pr65549 cp_lto_pr65549_0.o-cp_lto_pr65549_0.o link  -std=gnu++14 -flto -g -O2 -fno-inline -flto-partition=max 
UNRESOLVED: g++.dg/lto/pr65549 cp_lto_pr65549_0.o-cp_lto_pr65549_0.o execute  -std=gnu++14 -flto -g -O2 -fno-inline -flto-partition=max 


=== g++ Summary ===

# of expected passes            82428
# of unexpected failures        11
# of expected failures          442
# of unresolved testcases       4
# of unsupported tests          2868


Re: Reduction Pattern ( Vectorization or Parallelization)

2015-07-16 Thread Richard Biener
On Sun, Jul 5, 2015 at 1:57 PM, Ajit Kumar Agarwal wrote:
> All:
>
> The scalar and array reduction patterns can be identified if the result of a
> commutative update is applied to the same scalar or array variable on the LHS
> with +, *, Min or Max. Thus the reduction patterns identified through the
> commutative updates help in vectorization or parallelization.
>
> For the following code:
>
> for (j = 0; j <= N; j++)
> {
>    y = d[j];
>    for (i = 0; i < 8; i++)
>       X(a[i]) = X(a[i]) + c[i] * y;
> }
>
> Fig(1).
>
> The above code has a reduction pattern on X with respect to the outer loop:
> it exhibits commutative updates with +, which GCC can identify as a reduction
> with respect to the outer loop. I am wondering whether this can be recognized
> as a reduction pattern and turned into vectorized code, given that X is
> indexed by another array and the access to X is therefore not an affine
> expression.
>
> Can the above code be identified as a reduction pattern and transformed into
> vectorized or parallelized code?
>
> Thoughts?

I think the issue here is dependences of X(A[i]) as A[i] might be the same
for different i.

Richard.

> Thanks & Regards
> Ajit
>
>
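
To make the dependence issue concrete, here is a minimal, self-contained C
rendering of the loop in Fig(1); the array sizes and initial values are
invented for illustration.  When a[] contains a repeated index, two iterations
of the inner loop update the same element of X, so the inner loop carries a
dependence through memory and cannot be vectorized blindly:

/* C model of Fig(1); sizes and values are placeholders.  */
#include <stdio.h>

#define N 4

int
main (void)
{
  double X[16] = { 0.0 };
  double c[8] = { 1, 1, 1, 1, 1, 1, 1, 1 };
  double d[N + 1] = { 1, 2, 3, 4, 5 };
  int a[8] = { 0, 3, 3, 5, 7, 2, 3, 1 };   /* note the repeated index 3 */

  for (int j = 0; j <= N; j++)
    {
      double y = d[j];
      for (int i = 0; i < 8; i++)
        /* If a[i] == a[k] for i != k, these stores hit the same element of
           X, which is exactly the dependence noted above.  */
        X[a[i]] = X[a[i]] + c[i] * y;
    }

  for (int i = 0; i < 16; i++)
    printf ("X[%d] = %g\n", i, X[i]);
  return 0;
}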


GCC 5.3 Status Report (2015-07-16)

2015-07-16 Thread Richard Biener

Status
======

The GCC 5 branch is open again for regression and documentation fixes.
If nothing unusual happens you can expect GCC 5.3 in October, roughly
when people should think of stage1 ending soon ;)


Quality Data
============

Priority          #   Change from last report
--------        ---   -----------------------
P1                0   -   2
P2               91   +  20
P3               28   -  16
P4               85   +   3
P5               34   +   2
--------        ---   -----------------------
Total P1-P3     119   +   2
Total           238   +   7


Previous Report
===============

https://gcc.gnu.org/ml/gcc/2015-07/msg00071.html



GCC 5.2 Released

2015-07-16 Thread Richard Biener

The GNU Compiler Collection version 5.2 has been released.

GCC 5.2 is a bug-fix release from the GCC 5 branch
containing important fixes for regressions and serious bugs in
GCC 5.1 with more than 81 bugs fixed since the previous release.
This release is available from the FTP servers listed at:

  http://www.gnu.org/order/ftp.html

Please do not contact me directly regarding questions or comments
about this release.  Instead, use the resources available from
http://gcc.gnu.org.

As always, a vast number of people contributed to this GCC release
-- far too many to thank them individually!


Re: Reduction Pattern ( Vectorization or Parallelization)

2015-07-16 Thread Toon Moene

On 07/16/2015 12:53 PM, Richard Biener wrote:


> On Sun, Jul 5, 2015 at 1:57 PM, Ajit Kumar Agarwal wrote:
>
>> For the following code:
>>
>> for (j = 0; j <= N; j++)
>> {
>>    y = d[j];
>>    for (i = 0; i < 8; i++)
>>       X(a[i]) = X(a[i]) + c[i] * y;
>> }
>>
>> Fig(1).
>
> I think the issue here is dependences of X(A[i]) as A[i] might be the same
> for different i.


In Fortran this is not allowed on the left-hand side of an assignment. 
Does C have any restrictions here ?


--
Toon Moene - e-mail: t...@moene.org - phone: +31 346 214290
Saturnushof 14, 3738 XG  Maartensdijk, The Netherlands
At home: http://moene.org/~toon/; weather: http://moene.org/~hirlam/
Progress of GNU Fortran: http://gcc.gnu.org/wiki/GFortran#news


Re: Reduction Pattern ( Vectorization or Parallelization)

2015-07-16 Thread Richard Biener
On July 16, 2015 8:03:03 PM GMT+02:00, Toon Moene  wrote:
>On 07/16/2015 12:53 PM, Richard Biener wrote:
>
>> On Sun, Jul 5, 2015 at 1:57 PM, Ajit Kumar Agarwal
>
>>> For the following code:
>>>
>>> for (j = 0; j <= N; j++)
>>> {
>>>    y = d[j];
>>>    for (i = 0; i < 8; i++)
>>>       X(a[i]) = X(a[i]) + c[i] * y;
>>> }
>>>
>>> Fig(1).
>
>> I think the issue here is dependences of X(A[i]) as A[i] might be the
>same
>> for different i.
>
>In Fortran this is not allowed on the left-hand side of an assignment. 

Only if x(a(I)) = ... is an array expression, right?  C doesn't have those.

Richard.

>Does C have any restrictions here ?




How to express this complicated constraint in md file

2015-07-16 Thread Dmitry Grinberg
Hi,

I have not been able to find an answer in the ".md" files I've read
through, or in the docs, so I wanted to try here.

If we have an instruction on a machine that does "ADD x, y", where
register x is both input and output, in the ".md" file one can express
this easily (referencing it as "0" instead of "r"). But what if sizes
differ? Consider a hypothetical instruction on a 32-bit machine:

WUMUL x, y which will multiply 32-bit register x by 32-bit
register y, and produce a 64-bit result, storing the high bits into
register x and low bits into register y

Despite all the searching, I still do not quite know how to express
this in the ".md" file.
Are there more docs I need to read? Is there an obvious solution I am missing?


Thanks,
Dmitry
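
For reference, a minimal C model of the WUMUL semantics described above; the
question does not say whether the multiply is signed or unsigned, so unsigned
is assumed here:

/* Toy model of the hypothetical WUMUL instruction: x and y are 32-bit
   "registers", and the 64-bit product is split so that x receives the high
   half and y the low half.  */
#include <stdint.h>
#include <stdio.h>

static void
wumul (uint32_t *x, uint32_t *y)
{
  uint64_t prod = (uint64_t) *x * (uint64_t) *y;

  *x = (uint32_t) (prod >> 32);   /* high 32 bits */
  *y = (uint32_t) prod;           /* low 32 bits */
}

int
main (void)
{
  uint32_t x = 0x12345678u, y = 0x9abcdef0u;

  wumul (&x, &y);
  printf ("high = 0x%08x, low = 0x%08x\n", x, y);
  return 0;
}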


Re: How to express this complicated constraint in md file

2015-07-16 Thread Jim Wilson
On 07/16/2015 01:32 PM, Dmitry Grinberg wrote:
> WUMUL x, y which will multiply 32-bit register x by 32-bit
> register y, and produce a 64-bit result, storing the high bits into
> register x and low bits into register y

You can rewrite the RTL to make this easier.  You can use a parallel to
do for instance

[(set (reg:SI x)
      (truncate:SI (lshiftrt:DI (mult:DI (sign_extend:DI (reg:SI x))
                                         (sign_extend:DI (reg:SI y)))
                                (const_int 32))))
 (set (reg:SI y)
      (truncate:SI (mult:DI (sign_extend:DI (reg:SI x))
                            (sign_extend:DI (reg:SI y)))))]

Now you have only 32-bit regs, and you can use matching constraints to
make it work.  The truncate lshiftrt is the traditional way to write a
mulX_highpart pattern.  Some parts of the optimizer may recognize this
construct and know how to handle it.  For the second set, you might
consider just using (mult:SI ...) if that gives the correct result, or
you can use a subreg or whatever.  The optimizer is unlikely to generate
this pattern on its own, but you can have an expander and/or splitter
that generates it.  Use zero_extend instead of sign_extend if this is an
unsigned widening multiply.  You probably want to generate two SImode
temporaries in the expander, and copy the input regs into the
temporaries, as expanders aren't supposed to clobber input regs.  If you
want a 64-bit result out of this, then you would need extra instructions
to combine x and y into a 64-bit output.

Another way to do this is to arbitrarily force the result into a register
pair, then you can use a subreg to match the high part or the low part
of that register pair for the inputs.

[(set (reg:DI x)
      (mult:DI (sign_extend:DI (subreg:SI (reg:DI x) 0))
               (sign_extend:DI (subreg:SI (reg:DI x) 1))))]

The subreg numbers may be reversed if this is little word endian instead
of big word endian.  You might need extra setup instructions to create
the register pair first.  Create a DI temp for the output, move the
inputs into the high/low words of the DI temp, and then you can do the
multiply on the DI temp.

Jim



[gomp] Reuse futex barrier implementation for RTEMS

2015-07-16 Thread Sebastian Huber

Hello,

the libgomp configuration for RTEMS currently uses the POSIX 
implementation. Unfortunately the performance is unacceptably bad, so I 
am currently working on a specialized RTEMS configuration. I would like 
to reuse the code of the Linux futex barrier. On RTEMS there is no 
kernel/user-space separation. In order to make the futex management 
simpler, I would like to optionally embed a futex object in the barrier. 
Would a change like this be acceptable?


diff --git a/libgomp/config/linux/bar.h b/libgomp/config/linux/bar.h
index 3236436..b47b9f6 100644
--- a/libgomp/config/linux/bar.h
+++ b/libgomp/config/linux/bar.h
@@ -40,6 +40,9 @@ typedef struct
   unsigned generation;
   unsigned awaited __attribute__((aligned (64)));
   unsigned awaited_final;
+#ifdef HAVE_BARRIER_FUTEX
+  gomp_barrier_futex_t futex;
+#endif
 } gomp_barrier_t;

 typedef unsigned int gomp_barrier_state_t;
diff --git a/libgomp/config/linux/futex.h b/libgomp/config/linux/futex.h
index c99ea37..7aab595 100644
--- a/libgomp/config/linux/futex.h
+++ b/libgomp/config/linux/futex.h
@@ -68,3 +68,7 @@ cpu_relax (void)
 {
   __asm volatile ("" : : : "memory");
 }
+
+#define barrier_futex_wait(addr, val, futex) futex_wait (addr, val)
+
+#define barrier_futex_wake(addr, count, futex) futex_wake (addr, count)
diff --git a/libgomp/config/linux/bar.c b/libgomp/config/linux/bar.c
index 51fbd99..d776796 100644
--- a/libgomp/config/linux/bar.c
+++ b/libgomp/config/linux/bar.c
@@ -40,7 +40,7 @@ gomp_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t state)

   bar->awaited = bar->total;
   __atomic_store_n (&bar->generation, bar->generation + BAR_INCR,
                     MEMMODEL_RELEASE);
-  futex_wake ((int *) &bar->generation, INT_MAX);
+  barrier_futex_wake ((int *) &bar->generation, INT_MAX, &bar->futex);
 }
   else
 {
@@ -74,7 +74,8 @@ gomp_barrier_wait_last (gomp_barrier_t *bar)
 void
 gomp_team_barrier_wake (gomp_barrier_t *bar, int count)
 {
-  futex_wake ((int *) &bar->generation, count == 0 ? INT_MAX : count);
+  barrier_futex_wake ((int *) &bar->generation, count == 0 ? INT_MAX : count,
+                      &bar->futex);
 }

 void
@@ -100,7 +101,7 @@ gomp_team_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t state)

   state &= ~BAR_CANCELLED;
   state += BAR_INCR - BAR_WAS_LAST;
   __atomic_store_n (&bar->generation, state, MEMMODEL_RELEASE);
-  futex_wake ((int *) &bar->generation, INT_MAX);
+  barrier_futex_wake ((int *) &bar->generation, INT_MAX, &bar->futex);
   return;
 }
 }
@@ -163,7 +164,7 @@ gomp_team_barrier_wait_cancel_end (gomp_barrier_t *bar,
 {
   state += BAR_INCR - BAR_WAS_LAST;
   __atomic_store_n (&bar->generation, state, MEMMODEL_RELEASE);
-  futex_wake ((int *) &bar->generation, INT_MAX);
+  barrier_futex_wake ((int *) &bar->generation, INT_MAX, &bar->futex);
   return false;
 }
 }
@@ -199,13 +200,14 @@ gomp_team_barrier_wait_cancel (gomp_barrier_t *bar)
 void
 gomp_team_barrier_cancel (struct gomp_team *team)
 {
+  gomp_barrier_t *bar = &team->barrier;
   gomp_mutex_lock (&team->task_lock);
-  if (team->barrier.generation & BAR_CANCELLED)
+  if (bar->generation & BAR_CANCELLED)
 {
   gomp_mutex_unlock (&team->task_lock);
   return;
 }
-  team->barrier.generation |= BAR_CANCELLED;
+  bar->generation |= BAR_CANCELLED;
   gomp_mutex_unlock (&team->task_lock);
-  futex_wake ((int *) &team->barrier.generation, INT_MAX);
+  barrier_futex_wake ((int *) &bar->generation, INT_MAX, &bar->futex);
 }

--
Sebastian Huber, embedded brains GmbH

Address : Dornierstr. 4, D-82178 Puchheim, Germany
Phone   : +49 89 189 47 41-16
Fax : +49 89 189 47 41-09
E-Mail  : sebastian.hu...@embedded-brains.de
PGP : Public key available on request.

This message is not a commercial communication within the meaning of the EHUG.