On 24/11/15 13:24, Tom de Vries wrote:
On 16/11/15 12:59, Tom de Vries wrote:
On 09/11/15 20:52, Tom de Vries wrote:
On 09/11/15 16:35, Tom de Vries wrote:
Hi,
this patch series for stage1 trunk adds support to:
- parallelize oacc kernels regions using parloops, and
- map the loops onto the oacc gang dimension.
The patch series contains these patches:
1 Insert new exit block only when needed in
transform_to_exit_first_loop_alt
2 Make create_parallel_loop return void
3 Ignore reduction clause on kernels directive
4 Implement -foffload-alias
5 Add in_oacc_kernels_region in struct loop
6 Add pass_oacc_kernels
7 Add pass_dominator_oacc_kernels
8 Add pass_ch_oacc_kernels
9 Add pass_parallelize_loops_oacc_kernels
10 Add pass_oacc_kernels pass group in passes.def
11 Update testcases after adding kernels pass group
12 Handle acc loop directive
13 Add c-c++-common/goacc/kernels-*.c
14 Add gfortran.dg/goacc/kernels-*.f95
15 Add libgomp.oacc-c-c++-common/kernels-*.c
16 Add libgomp.oacc-fortran/kernels-*.f95
The first 9 patches are more or less independent, but patches 10-16 are
intended to be committed at the same time.
Bootstrapped and reg-tested on x86_64.
Build and reg-tested with nvidia accelerator, in combination with a
patch that enables accelerator testing (which is submitted at
https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
I'll post the individual patches in reply to this message.
This patch adds pass_parallelize_loops_oacc_kernels.
There's a number of things we do differently in parloops for oacc
kernels:
- in normal parloops, we generate code to choose between a parallel
version of the loop, and a sequential (low iteration count) version.
Since the code in oacc kernels region is supposed to run on the
accelerator anyway, we skip this check, and don't add a low iteration
count loop.
- in normal parloops, we generate an #pragma omp parallel /
GIMPLE_OMP_RETURN pair to delimit the region which will we split off
into a thread function. Since the oacc kernels region is already
split off, we don't add this pair.
- we indicate the parallelization factor by setting the oacc function
attributes
- we generate an #pragma oacc loop instead of an #pragma omp for, and
we add the gang clause
- in normal parloops, we rewrite the variable accesses in the loop in
terms into accesses relative to a thread function parameter. For the
oacc kernels region, that rewrite has already been done at omp-lower,
so we skip this.
- we need to ensure that the entire kernels region can be run in
parallel. The loop independence check is already present, so for oacc
kernels we add a check between blocks outside the loop and the entire
region.
- we guard stores in the blocks outside the loop with gang_pos == 0.
There's no need for each gang to write to a single location, we can
do this in just one gang. (Typically this is the write of the final
value of the iteration variable if that one is copied back to the
host).
Reposting with loop optimizer init added in
pass_parallelize_loops_oacc_kernels::execute.
Reposting with loop_optimizer_finalize,scev_initialize and scev_finalize
added in pass_parallelize_loops_oacc_kernels::execute.
Ping.
Anything I can do to facilitate the review?
Thanks,
Tom