Re: Partition and subpartition Analysis that helps in further vectorization and parallelization
On Tue, Jul 14, 2015 at 1:33 PM, Ajit Kumar Agarwal wrote:
> All:
>
> I am trying to place the following analysis in the vectorizer of GCC. It
> helps improve the vectorizer to a great extent by classifying memory
> accesses into unit-stride, zero-stride, and non-strided accesses.
>
> First, a topological sort is performed on the data dependence graph (DDG).
> On the topologically sorted DDG, a timestamp is assigned to each node using
> the following algorithm:
>
> for each node in topologically sorted order in the DDG
> {
>   Timestamp(node) = Max(Timestamp of all predecessors) + 1;
>   /* nodes with no predecessors get timestamp 1 */
> }
>
> Based on this timestamp calculation, the partition of the DDG is formed:
> each partition holds the nodes with the same timestamp. The nodes in each
> partition are independent in the DDG, so they can be vectorized together.
> To enable vectorization, each partition is subpartitioned by contiguous and
> non-contiguous accesses: the memory addresses of all operands of each node
> in the partition are sorted in increasing/decreasing order, and from that
> sorted order the subpartitions are formed according to unit-stride accesses,
> zero-stride accesses, and accesses that require shuffling of operands
> through vectorized instructions.
>
> This analysis helps in performing data layout on the partitioned nodes of
> the DDG: the subpartitions with non-contiguous accesses expose more
> vectorization opportunities via data layout transformations, while the
> subpartitions with contiguous accesses can be vectorized directly.
>
> Thoughts?

As mentioned for the other DDG-related idea, we already have loop
distribution. It is not enabled by default because it lacks a good cost
model; its current cost model looks after data locality/reuse only.
Adding cost modeling targeting vectorization sounds interesting.

Richard.

> Thanks & Regards
> Ajit
GCC 4.9.3 build on i386-apple-darwin9.8.0
Mac OS X 10.5 with XCode 3.1.4, bootstrapped via gcc-4.7.4.

config.guess: i386-apple-darwin9.8.0

gcc -v:
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/i386-apple-darwin9.8.0/4.9.3/lto-wrapper
Target: i386-apple-darwin9.8.0
Configured with: ./configure --localstatedir=/usr/local/var --prefix=/usr/local --sysconfdir=/usr/local/etc --enable-languages=c,c++
Thread model: posix
gcc version 4.9.3 (GCC)

Native configuration is i386-apple-darwin9.8.0

=== g++ tests ===

Schedule of variations:
    unix

Running target unix
FAIL: g++.dg/debug/dwarf2/imported-decl-2.C -std=gnu++98 scan-assembler-times ascii "0".*ascii "0".*DIE .0x[0-9a-f]*. DW_TAG_imported_declaration 1
FAIL: g++.dg/debug/dwarf2/imported-decl-2.C -std=gnu++11 scan-assembler-times ascii "0".*ascii "0".*DIE .0x[0-9a-f]*. DW_TAG_imported_declaration 1
FAIL: g++.dg/debug/dwarf2/imported-decl-2.C -std=gnu++1y scan-assembler-times ascii "0".*ascii "0".*DIE .0x[0-9a-f]*. DW_TAG_imported_declaration 1
FAIL: g++.dg/ipa/pr61160-3.C -std=gnu++98 execution test
FAIL: g++.dg/ipa/pr61160-3.C -std=gnu++11 execution test
FAIL: g++.dg/ipa/pr61160-3.C -std=gnu++1y execution test
FAIL: g++.dg/opt/vt4.C -std=gnu++98 scan-assembler-not _ZTV.A
FAIL: g++.dg/opt/vt4.C -std=gnu++11 scan-assembler-not _ZTV.A
FAIL: g++.dg/opt/vt4.C -std=gnu++1y scan-assembler-not _ZTV.A
FAIL: g++.dg/lto/pr65549 cp_lto_pr65549_0.o assemble, -std=gnu++14 -flto -g
UNRESOLVED: g++.dg/lto/pr65549 cp_lto_pr65549_0.o-cp_lto_pr65549_0.o link -std=gnu++14 -flto -g
UNRESOLVED: g++.dg/lto/pr65549 cp_lto_pr65549_0.o-cp_lto_pr65549_0.o execute -std=gnu++14 -flto -g
FAIL: g++.dg/lto/pr65549 cp_lto_pr65549_0.o assemble, -std=gnu++14 -flto -g -O2 -fno-inline -flto-partition=max
UNRESOLVED: g++.dg/lto/pr65549 cp_lto_pr65549_0.o-cp_lto_pr65549_0.o link -std=gnu++14 -flto -g -O2 -fno-inline -flto-partition=max
UNRESOLVED: g++.dg/lto/pr65549 cp_lto_pr65549_0.o-cp_lto_pr65549_0.o execute -std=gnu++14 -flto -g -O2 -fno-inline -flto-partition=max
=== g++ Summary ===

# of expected passes		82428
# of unexpected failures	11
# of expected failures		442
# of unresolved testcases	4
# of unsupported tests		2868
Re: Reduction Pattern ( Vectorization or Parallelization)
On Sun, Jul 5, 2015 at 1:57 PM, Ajit Kumar Agarwal wrote:
> All:
>
> Scalar and array reduction patterns can be identified if the result of a
> commutative update is applied to the same scalar or array variable on the
> LHS with +, *, Min, or Max. The reduction pattern identified through the
> commutative update helps in vectorization or parallelization.
>
> Consider the following code:
>
> for (j = 0; j <= N; j++)
> {
>   y = d[j];
>   for (i = 0; i < 8; i++)
>     X(a[i]) = X(a[i]) + c[i] * y;
> }
>
> Fig(1).
>
> The above code exhibits a reduction pattern on X with respect to the outer
> loop, and the commutative update on + can be identified in GCC as a
> reduction pattern with respect to the outer loop. I am wondering whether
> this can be recognized as a reduction pattern and reduced to vectorized
> code, because X is indexed by another array and thus the access to X is not
> an affine expression.
>
> Can the above code be identified as a reduction pattern and transformed
> into vectorized or parallelized code?
>
> Thoughts?

I think the issue here is dependences of X(A[i]), as A[i] might be the same
for different i.

Richard.

> Thanks & Regards
> Ajit
GCC 5.3 Status Report (2015-07-16)
Status
======

The GCC 5 branch is open again for regression and documentation fixes. If
nothing unusual happens you can expect GCC 5.3 in October, roughly when
people should think of stage1 ending soon ;)

Quality Data
============

Priority        #    Change from last report
--------      ---    -----------------------
P1              0    -  2
P2             91    + 20
P3             28    - 16
P4             85    +  3
P5             34    +  2
--------      ---    -----------------------
Total P1-P3   119    +  2
Total         238    +  7

Previous Report
===============

https://gcc.gnu.org/ml/gcc/2015-07/msg00071.html
GCC 5.2 Released
The GNU Compiler Collection version 5.2 has been released.

GCC 5.2 is a bug-fix release from the GCC 5 branch containing important fixes
for regressions and serious bugs in GCC 5.1, with more than 81 bugs fixed
since the previous release.

This release is available from the FTP servers listed at:

  http://www.gnu.org/order/ftp.html

Please do not contact me directly regarding questions or comments about this
release. Instead, use the resources available from http://gcc.gnu.org.

As always, a vast number of people contributed to this GCC release -- far too
many to thank them individually!
Re: Reduction Pattern ( Vectorization or Parallelization)
On 07/16/2015 12:53 PM, Richard Biener wrote:

> On Sun, Jul 5, 2015 at 1:57 PM, Ajit Kumar Agarwal wrote:
>
>> For the following code
>> For(j = 0; j <= N;j++)
>> {
>>   y = d[j];
>>   For( I = 0 ; I <8 ; i++)
>>     X(a[i]) = X(a[i]) + c[i] * y;
>> }
>>
>> Fig(1).
>
> I think the issue here is dependences of X(A[i]) as A[i] might be the same
> for different i.

In Fortran this is not allowed on the left-hand side of an assignment. Does C
have any restrictions here?

-- 
Toon Moene - e-mail: t...@moene.org - phone: +31 346 214290
Saturnushof 14, 3738 XG Maartensdijk, The Netherlands
At home: http://moene.org/~toon/; weather: http://moene.org/~hirlam/
Progress of GNU Fortran: http://gcc.gnu.org/wiki/GFortran#news
Re: Reduction Pattern ( Vectorization or Parallelization)
On July 16, 2015 8:03:03 PM GMT+02:00, Toon Moene wrote:

>On 07/16/2015 12:53 PM, Richard Biener wrote:
>
>> On Sun, Jul 5, 2015 at 1:57 PM, Ajit Kumar Agarwal
>
>>> For the following code
>>> For(j = 0; j <= N;j++)
>>> {
>>>   y = d[j];
>>>   For( I = 0 ; I <8 ; i++)
>>>     X(a[i]) = X(a[i]) + c[i] * y;
>>> }
>>>
>>> Fig(1).
>
>> I think the issue here is dependences of X(A[i]) as A[i] might be the
>same
>> for different i.
>
>In Fortran this is not allowed on the left-hand side of an assignment.
>Does C have any restrictions here ?

Only if x(a(I)) = ... is an array expression, right? C doesn't have those.

Richard.
How to express this complicated constraint in md file
Hi,

I have not been able to find an answer in the ".md" files I've read through,
or in the docs, so I wanted to try here.

If we have an instruction on a machine that does "ADD x, y", where register x
is both input and output, in the ".md" file one can express this easily
(referencing it as "0" instead of "r"). But what if the sizes differ?
Consider a hypothetical instruction on a 32-bit machine:

  WUMUL x, y

which will multiply 32-bit register x by 32-bit register y, and produce a
64-bit result, storing the high bits into register x and the low bits into
register y.

Despite all the searching, I still do not quite know how to express this in
the ".md" file. Are there more docs I need to read? Is there an obvious
solution I am missing?

Thanks,
Dmitry
Re: How to express this complicated constraint in md file
On 07/16/2015 01:32 PM, Dmitry Grinberg wrote:
> WUMUL x, y which will multiply 32-bit register x by 32-bit
> register y, and produce a 64-bit result, storing the high bits into
> register x and low bits into register y

You can rewrite the RTL to make this easier. You can use a parallel, for
instance

  [(set (reg:SI x)
        (truncate:SI
          (lshiftrt:DI
            (mult:DI (sign_extend:DI (reg:SI x))
                     (sign_extend:DI (reg:SI y)))
            (const_int 32))))
   (set (reg:SI y)
        (truncate:SI
          (mult:DI (sign_extend:DI (reg:SI x))
                   (sign_extend:DI (reg:SI y)))))]

Now you have only 32-bit regs, and you can use matching constraints to make
it work. The truncate of lshiftrt is the traditional way to write a
mulX_highpart pattern; some parts of the optimizer may recognize this
construct and know how to handle it. For the second set, you might consider
just using (mult:SI ...) if that gives the correct result, or you can use a
subreg or whatever. The optimizer is unlikely to generate this pattern on its
own, but you can have an expander and/or splitter that generates it. Use
zero_extend instead of sign_extend if this is an unsigned widening multiply.

You probably want to generate two SImode temporaries in the expander, and
copy the input regs into the temporaries, as expanders aren't supposed to
clobber input regs. If you want a 64-bit result out of this, then you would
need extra instructions to combine x and y into a 64-bit output.

Another way to do this is to arbitrarily force the result into a register
pair; then you can use a subreg to match the high part or the low part of
that register pair for the inputs.

  [(set (reg:DI x)
        (mult:DI (sign_extend:DI (subreg:SI (reg:DI x) 0))
                 (sign_extend:DI (subreg:SI (reg:DI x) 1))))]

The subreg numbers may be reversed if this is little word endian instead of
big word endian. You might need extra setup instructions to create the
register pair first. Create a DI temp for the output, move the inputs into
the high/low word of the DI temp, and then you can do the multiply on the DI
temp.

Jim
[gomp] Reuse futex barrier implementation for RTEMS
Hello,

the libgomp configuration for RTEMS currently uses the POSIX implementation.
Unfortunately the performance is unacceptably bad, so I am currently working
on a specialized RTEMS configuration. I would like to reuse the code of the
Linux futex barrier. On RTEMS there is no kernel/user space separation. In
order to make the futex management simpler, I would like to optionally embed
a futex object in the barrier. Would a change like this be acceptable?

diff --git a/libgomp/config/linux/bar.h b/libgomp/config/linux/bar.h
index 3236436..b47b9f6 100644
--- a/libgomp/config/linux/bar.h
+++ b/libgomp/config/linux/bar.h
@@ -40,6 +40,9 @@ typedef struct
   unsigned generation;
   unsigned awaited __attribute__((aligned (64)));
   unsigned awaited_final;
+#ifdef HAVE_BARRIER_FUTEX
+  gomp_barrier_futex_t futex;
+#endif
 } gomp_barrier_t;
 
 typedef unsigned int gomp_barrier_state_t;
diff --git a/libgomp/config/linux/futex.h b/libgomp/config/linux/futex.h
index c99ea37..7aab595 100644
--- a/libgomp/config/linux/futex.h
+++ b/libgomp/config/linux/futex.h
@@ -68,3 +68,7 @@ cpu_relax (void)
 {
   __asm volatile ("" : : : "memory");
 }
+
+#define barrier_futex_wait(addr, val, futex) futex_wait (addr, val)
+
+#define barrier_futex_wake(addr, count, futex) futex_wake (addr, count)
diff --git a/libgomp/config/linux/bar.c b/libgomp/config/linux/bar.c
index 51fbd99..d776796 100644
--- a/libgomp/config/linux/bar.c
+++ b/libgomp/config/linux/bar.c
@@ -40,7 +40,7 @@ gomp_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t state)
       bar->awaited = bar->total;
       __atomic_store_n (&bar->generation, bar->generation + BAR_INCR,
 			MEMMODEL_RELEASE);
-      futex_wake ((int *) &bar->generation, INT_MAX);
+      barrier_futex_wake ((int *) &bar->generation, INT_MAX, &bar->futex);
     }
   else
     {
@@ -74,7 +74,8 @@ gomp_barrier_wait_last (gomp_barrier_t *bar)
 void
 gomp_team_barrier_wake (gomp_barrier_t *bar, int count)
 {
-  futex_wake ((int *) &bar->generation, count == 0 ? INT_MAX : count);
+  barrier_futex_wake ((int *) &bar->generation, count == 0 ? INT_MAX : count,
+		      &bar->futex);
 }
 
 void
@@ -100,7 +101,7 @@ gomp_team_barrier_wait_end (gomp_barrier_t *bar, gomp_barrier_state_t state)
 	  state &= ~BAR_CANCELLED;
 	  state += BAR_INCR - BAR_WAS_LAST;
 	  __atomic_store_n (&bar->generation, state, MEMMODEL_RELEASE);
-	  futex_wake ((int *) &bar->generation, INT_MAX);
+	  barrier_futex_wake ((int *) &bar->generation, INT_MAX, &bar->futex);
 	  return;
 	}
     }
@@ -163,7 +164,7 @@ gomp_team_barrier_wait_cancel_end (gomp_barrier_t *bar,
 	{
 	  state += BAR_INCR - BAR_WAS_LAST;
 	  __atomic_store_n (&bar->generation, state, MEMMODEL_RELEASE);
-	  futex_wake ((int *) &bar->generation, INT_MAX);
+	  barrier_futex_wake ((int *) &bar->generation, INT_MAX, &bar->futex);
 	  return false;
 	}
     }
@@ -199,13 +200,14 @@ gomp_team_barrier_wait_cancel (gomp_barrier_t *bar)
 void
 gomp_team_barrier_cancel (struct gomp_team *team)
 {
+  gomp_barrier_t *bar = &team->barrier;
   gomp_mutex_lock (&team->task_lock);
-  if (team->barrier.generation & BAR_CANCELLED)
+  if (bar->generation & BAR_CANCELLED)
     {
       gomp_mutex_unlock (&team->task_lock);
       return;
     }
-  team->barrier.generation |= BAR_CANCELLED;
+  bar->generation |= BAR_CANCELLED;
   gomp_mutex_unlock (&team->task_lock);
-  futex_wake ((int *) &team->barrier.generation, INT_MAX);
+  barrier_futex_wake ((int *) &bar->generation, INT_MAX, &bar->futex);
 }

-- 
Sebastian Huber, embedded brains GmbH
Address : Dornierstr. 4, D-82178 Puchheim, Germany
Phone : +49 89 189 47 41-16
Fax : +49 89 189 47 41-09
E-Mail : sebastian.hu...@embedded-brains.de
PGP : Public key available on request.

This message is not a business communication within the meaning of the EHUG.