On Mon, 16 Nov 2015, Tom de Vries wrote:

> On 11/11/15 12:02, Richard Biener wrote:
> > On Mon, 9 Nov 2015, Tom de Vries wrote:
> >
> > > On 09/11/15 16:35, Tom de Vries wrote:
> > > > Hi,
> > > >
> > > > this patch series for stage1 trunk adds support to:
> > > > - parallelize oacc kernels regions using parloops, and
> > > > - map the loops onto the oacc gang dimension.
> > > >
> > > > The patch series contains these patches:
> > > >
> > > >       1  Insert new exit block only when needed in
> > > >          transform_to_exit_first_loop_alt
> > > >       2  Make create_parallel_loop return void
> > > >       3  Ignore reduction clause on kernels directive
> > > >       4  Implement -foffload-alias
> > > >       5  Add in_oacc_kernels_region in struct loop
> > > >       6  Add pass_oacc_kernels
> > > >       7  Add pass_dominator_oacc_kernels
> > > >       8  Add pass_ch_oacc_kernels
> > > >       9  Add pass_parallelize_loops_oacc_kernels
> > > >      10  Add pass_oacc_kernels pass group in passes.def
> > > >      11  Update testcases after adding kernels pass group
> > > >      12  Handle acc loop directive
> > > >      13  Add c-c++-common/goacc/kernels-*.c
> > > >      14  Add gfortran.dg/goacc/kernels-*.f95
> > > >      15  Add libgomp.oacc-c-c++-common/kernels-*.c
> > > >      16  Add libgomp.oacc-fortran/kernels-*.f95
> > > >
> > > > The first 9 patches are more or less independent, but patches 10-16 are
> > > > intended to be committed at the same time.
> > > >
> > > > Bootstrapped and reg-tested on x86_64.
> > > >
> > > > Build and reg-tested with nvidia accelerator, in combination with a
> > > > patch that enables accelerator testing (which is submitted at
> > > > https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
> > > >
> > > > I'll post the individual patches in reply to this message.
> > > >
> > >
> > > This patch adds the pass_oacc_kernels pass group to the pass list in
> > > passes.def.
> > >
> > > Note the repetition of pass_lim/pass_copy_prop. The first pair is for an
> > > inner loop in a loop nest, the second for an outer loop in a loop nest.
> >
> > @@ -86,6 +86,27 @@ along with GCC; see the file COPYING3.  If not see
> >            /* pass_build_ealias is a dummy pass that ensures that we
> >               execute TODO_rebuild_alias at this point.  */
> >            NEXT_PASS (pass_build_ealias);
> > +          /* Pass group that runs when there are oacc kernels in the
> > +             function.  */
> > +          NEXT_PASS (pass_oacc_kernels);
> > +          PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
> > +              NEXT_PASS (pass_dominator_oacc_kernels);
> > +              NEXT_PASS (pass_ch_oacc_kernels);
> > +              NEXT_PASS (pass_dominator_oacc_kernels);
> > +              NEXT_PASS (pass_tree_loop_init);
> > +              NEXT_PASS (pass_lim);
> > +              NEXT_PASS (pass_copy_prop);
> > +              NEXT_PASS (pass_lim);
> > +              NEXT_PASS (pass_copy_prop);
> >
> > iterate lim/copyprop twice?!  Why's that needed?
> >
>
> I've managed to eliminate the last pass_copy_prop, but not pass_lim. I've
> added a comment:
> ...
>           /* We use pass_lim to rewrite in-memory iteration and reduction
>              variable accesses in loops into local variable accesses.
>              However, a single pass instantiation manages to do this only for
>              one loop level, so we use pass_lim twice to at least be able to
>              handle a loop nest with a depth of two.  */
>           NEXT_PASS (pass_lim);
>           NEXT_PASS (pass_copy_prop);
>           NEXT_PASS (pass_lim);
> ...

Huh.  Testcase?  LIM is perfectly able to handle nests.

> > +              NEXT_PASS (pass_scev_cprop);
> >
> > What's that for?  It's supposed to help removing loops - I don't
> > expect kernels to vanish.
>
> I'm using pass_scev_cprop for the "final value replacement" functionality.
> Added comment.

That functionality is intended to enable loop removal.

> >
> > +          NEXT_PASS (pass_tree_loop_done);
> > +          NEXT_PASS (pass_dominator_oacc_kernels);
> >
> > Three times DOM?  No please.  I wonder why you don't run oacc_kernels
> > after FRE and drop the initial DOM(s).
> >
>
> Done. There's just one pass_dominator_oacc_kernels left now.
>
> > +              NEXT_PASS (pass_dce);
> > +              NEXT_PASS (pass_tree_loop_init);
> > +              NEXT_PASS (pass_parallelize_loops_oacc_kernels);
> > +              NEXT_PASS (pass_expand_omp_ssa);
> > +              NEXT_PASS (pass_tree_loop_done);
> >
> > The switches into/out of tree_loop also look odd to me, but well
> > (they'll be controlled by -ftree-loop-optimize).
> >
>
> I've eliminated all the uses for pass_tree_loop_init/pass_tree_loop_done in
> the pass group. Instead, I've added conditional loop optimizer setup in:
> - pass_lim and pass_scev_cprop (added in this patch), and
> - pass_parallelize_loops_oacc_kernels (added in patch "Add
>   pass_parallelize_loops_oacc_kernels").

You miss calling scev_finalize ().

Much better otherwise.  I still wonder about scev_cprop and LIM two times.

Richard.
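
For readers following the thread, here is a minimal source-level sketch of the two transformations under discussion. It is not taken from the patch or from GCC; all function names are hypothetical, and the "after" versions are hand-written approximations of what the passes do on GIMPLE. The first pair shows an in-memory reduction variable in a depth-two loop nest rewritten into a local variable (the store-motion effect attributed to pass_lim above), assuming the reduction variable does not alias the array. The second pair shows "final value replacement": a use of an induction variable after the loop is replaced by a closed-form expression, which is what makes the loop itself removable.

```c
#include <assert.h>

#define N 4
#define M 3

/* Before: the reduction variable lives in memory, so every iteration
   of the nest loads and stores *sum.  */
static void
sum_in_memory (const int a[N][M], int *sum)
{
  for (int i = 0; i < N; i++)
    for (int j = 0; j < M; j++)
      *sum += a[i][j];
}

/* After (hand-written, assumes sum does not alias a): the accesses are
   rewritten into a local variable for the whole nest, with one load
   before the outer loop and one store after it.  */
static void
sum_in_local (const int a[N][M], int *sum)
{
  int tmp = *sum;
  for (int i = 0; i < N; i++)
    for (int j = 0; j < M; j++)
      tmp += a[i][j];
  *sum = tmp;
}

/* Before: the value of i after the loop is only known by running it.  */
static int
count_with_loop (int n)
{
  int i;
  for (i = 0; i < n; i++)
    ;
  return i;
}

/* After (hand-written): the final value of i in closed form; the loop
   has no remaining uses and can be deleted.  */
static int
count_closed_form (int n)
{
  return n > 0 ? n : 0;
}

/* Checks that each "after" version agrees with its "before" version
   and returns the computed sum.  */
int
check_sketch (void)
{
  static const int a[N][M] = {
    { 1, 2, 3 }, { 4, 5, 6 }, { 7, 8, 9 }, { 10, 11, 12 }
  };
  int s1 = 0, s2 = 0;

  sum_in_memory (a, &s1);
  sum_in_local (a, &s2);
  assert (s1 == s2);

  for (int n = -2; n <= 5; n++)
    assert (count_with_loop (n) == count_closed_form (n));

  return s2;
}
```

Whether a single LIM instance handles both loop levels of such a nest is exactly the open question in the thread; the sketch only shows the intended end result, not how many pass instances are needed to reach it.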