Re: [gomp4] openacc kernels directive support

Tom de Vries Tue, 16 Sep 2014 08:34:49 -0700

On 09-09-14 12:56, Richard Biener wrote:

On Tue, 9 Sep 2014, Tom de Vries wrote:

On 18-08-14 14:16, Tom de Vries wrote:

On 06-08-14 17:10, Tom de Vries wrote:

We could insert a pass-group here that only deals with functions that have the
kernels directive, and do the auto-par thing in a pass_oacc_kernels (which
should share the majority of the infrastructure with the parloops pass):
...
            NEXT_PASS (pass_build_ealias);
            INSERT_PASSES_AFTER/WITHIN (passes_oacc_kernels)
               NEXT_PASS (pass_ch);
               NEXT_PASS (pass_ccp);
               NEXT_PASS (pass_lim_aux);
               NEXT_PASS (pass_oacc_par);
            POP_INSERT_PASSES ()
...

Any comments, ideas or suggestions ?


I've experimented with implementing this on top of gomp-4_0-branch, and I ran
into PR46032.

PR46032 is about vectorization failure on a function split off by omp
parallelization. The vectorization fails due to aliasing constraints in the
split off function, which are not present in the original code.


Heh.  At least the omp-low.c parts from comment #1 should be pushed
to trunk...


Hi Richard,

Right, but the intra_create_variable_infos part does not apply cleanly, and I
don't know yet how to resolve that.

In the gomp-4_0-branch, the code marked by the openacc kernels directive is
split off during omp_expand. The generated code has the same additional
aliasing constraints, and in pass_oacc_par the parallelization fails.

PR46032 contains a tentative patch by Richard Biener, which applies cleanly
on top of 4.6 (I haven't yet reached a level of understanding of
tree-ssa-structalias.c to be able to resolve the conflict in
intra_create_variable_infos when applying on 4.7). The tentative patch
involves running ipa-pta, which is also a pass run after the point where we
write out the lto stream. I'm not sure whether it makes sense to run the
ipa-pta pass as part of the pass_oacc_kernels pass list.


No, that's not even possible I think.


OK, thanks for confirming that.

I see three ways of continuing from here:
- take the tentative patch and make it work, including running ipa-pta during
    passes_oacc_kernels
- same, but try somehow to manage without running ipa-pta.
- try to postpone splitting of the function until the end of pass_oacc_par.


I don't understand the last option?  What is the actual issue you run
into?  You split oacc kernels off and _then_ run "autopar" on the
split-off function (and get additional kernels)?


Let me try to reiterate the problem in more detail.

We're trying to implement the auto-parallelization part of the oacc kernels
directive using the existing parloops pass. The source starting point is the
gomp-4_0-branch. The gomp-4_0-branch has a dummy implementation of the oacc
kernels directive, analogous to the oacc parallel directive.

So the current gomp-4_0-branch does the following steps for oacc
parallel/kernels directives:

1. pass_lower_omp/scan_omp:
   - create record type with rewrite vars (.omp_data_t).
   - declare function with arg with type pointer to .omp_data_t.
2. pass_lower_omp/lower_omp:
   - rewrite region in terms of rewrite vars
   - add omp_return at end
3. pass_expand_omp:
   - split off the region into a separate function
   - replace region with call to GOACC_parallel/GOACC_kernels, with function
     pointer as argument
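
The three steps above can be sketched by hand in plain C. This is only an
illustration of the shape of the transformation, not the code the branch
generates: struct omp_data_t, kernels_region_fn and launch_region are
hypothetical stand-ins for the .omp_data_t record, the split-off function and
the GOACC_kernels runtime call, and the stand-in launcher simply calls the
function on the host.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the compiler-generated .omp_data_t record:
   one field per variable referenced inside the region (step 1).  */
struct omp_data_t
{
  const float *a;
  const float *b;
  float *c;
  int n;
};

/* Hypothetical stand-in for the split-off function (step 3).  The region
   body is rewritten in terms of the record fields (step 2).  */
static void
kernels_region_fn (void *data)
{
  struct omp_data_t *omp_data_i = data;
  for (int i = 0; i < omp_data_i->n; i++)
    omp_data_i->c[i] = omp_data_i->a[i] + omp_data_i->b[i];
}

/* Hypothetical stand-in for the runtime entry point.  A real runtime
   would offload; this sketch just runs the function on the host.  */
static void
launch_region (void (*fn) (void *), void *data)
{
  assert (fn != NULL);
  fn (data);
}

/* The original function after expansion: the region is replaced by a
   call that passes the function pointer and the filled-in record.  */
void
vec_add (const float *a, const float *b, float *c, int n)
{
  struct omp_data_t omp_data_o = { a, b, c, n };
  launch_region (kernels_region_fn, &omp_data_o);
}
```

Note that after the split the loop only sees its arrays through pointers
loaded from the record, which is what the rest of this mail is about.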

I wrote an example with a single oacc kernels region containing a simple vector
addition loop, and tried to make auto-parallelization work.

The first problem I ran into was that the parloops pass failed to analyze the
dependencies in the vector addition example, due to the fact that the region
was already split off into a separate function, similar to PR46032.
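
A small sketch of why the split-off function is harder to analyze: once the
arrays are only reachable through pointers stored in a record, the function
by itself gives no guarantee that they do not overlap. The code below is an
illustration only, not GCC's dependence analysis; struct region_args and the
function names are hypothetical. It runs the same loop forwards and
backwards, and the two orders agree only when the iterations are independent.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical record through which the split-off region sees its data.  */
struct region_args
{
  int *a;
  int *b;
  int *c;
  int n;
};

/* The region body, iterating forwards.  */
void
region_forward (struct region_args *d)
{
  for (int i = 0; i < d->n; i++)
    d->c[i] = d->a[i] + d->b[i];
}

/* The same iterations in reverse order.  The result matches the forward
   version only if no iteration depends on another one.  */
void
region_backward (struct region_args *d)
{
  for (int i = d->n - 1; i >= 0; i--)
    d->c[i] = d->a[i] + d->b[i];
}

/* Returns 1 if both orders give the same result when the output array
   starts c_offset elements into the buffer that also holds the input.  */
int
orders_agree (int c_offset)
{
  int ones[4] = { 1, 1, 1, 1 };
  int fwd[8], bwd[8];

  assert (c_offset >= 0 && c_offset <= 4);
  for (int i = 0; i < 8; i++)
    fwd[i] = bwd[i] = 1;

  struct region_args f = { fwd, ones, fwd + c_offset, 4 };
  struct region_args r = { bwd, ones, bwd + c_offset, 4 };
  region_forward (&f);
  region_backward (&r);
  return memcmp (fwd, bwd, sizeof fwd) == 0;
}
```

With c_offset 4 the input and output halves are disjoint and the orders
agree; with c_offset 1 they overlap and the orders disagree. A pass that
cannot rule out the overlapping case has to refuse to parallelize.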

I looked briefly into the patch set in PR46032, but I realized that even if I
fix it, the next problem I run into will be that the parloops pass is run after
the lto stream read/write point. So any changes the parloops pass makes at that
point are in the accelerator compile flow, in other words we're talking about
launching an accelerator kernel from the accelerator. While that is possible
with recent CUDA accelerators, I guess in general we should not expect that to
be possible.

[ I also thought of a fancy scheme where we don't split off a new function, but
manipulate the body of the already split off function, and emit a c file from
the accelerator compiler containing the parameters that the host compiler should
use to launch the accelerator kernel... but I guess that would be a last
resort. ]

So in order to solve the lto stream read/write point problem, I moved the
parloops pass (well, a copy called pass_oacc_par or similar) up in the pass
list, to before lto stream read/write point. That precludes solving the alias
problem with the PR46032 patch set, since we need ipa for that.

I solved (well, rather prevented) the alias problem by disabling pass_expand_omp
for GIMPLE_OACC_KERNELS, in other words disabling the function-split-off in
pass_expand_omp and letting pass_oacc_par take care of that (this is what I
meant with: 'postpone splitting of the function until the end of
pass_oacc_par'). Doing so required me to write a patch to handle omp-lowered
code conservatively in ccp and forwprop, otherwise the 'rewrite region in terms
of rewrite vars' would be undone by the time we arrive at pass_oacc_par.

Some advice on how to continue from here would be *highly* appreciated. My
hunch atm is to investigate the last option.


Jakub,
Richard,

I've investigated the last option, and published the current state in git-only
branch vries/oacc-kernels (
https://gcc.gnu.org/git/?p=gcc.git;a=shortlog;h=refs/heads/vries/oacc-kernels
).

The current state at commit 9255cadc5b6f8f7f4e4506e65a6be7fb3c00cd35 is that:
- a simple loop marked with the oacc kernels directive is analyzed for
    parallelization,
- the loop is then rewritten using oacc parallel and oacc loop directives
- these oacc directives are expanded using omp_expand_local
- this results in the loop being split off into a separate function, while
    the loop is replaced with a GOACC_parallel call
- all this is done before writing out the lto stream
- no support yet for reductions, nested loops, more than one loop nest in
   kernels region
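
At source level, the rewrite described in the list above corresponds roughly
to the following two forms of the vector addition example. This is a sketch
for orientation only: the branch performs the rewrite on the intermediate
representation, not on source, and the function names and data clauses here
are assumptions. Without OpenACC support enabled the pragmas are ignored and
both functions run as ordinary host loops.

```c
#include <assert.h>

/* The form the user writes: the compiler has to find the parallelism.  */
void
vec_add_kernels (const float *a, const float *b, float *c, int n)
{
  assert (n >= 0);
#pragma acc kernels copyin (a[0:n], b[0:n]) copyout (c[0:n])
  {
    for (int i = 0; i < n; i++)
      c[i] = a[i] + b[i];
  }
}

/* Roughly what the loop looks like once it has been analyzed as
   parallelizable and rewritten using oacc parallel and oacc loop.  */
void
vec_add_parallel (const float *a, const float *b, float *c, int n)
{
  assert (n >= 0);
#pragma acc parallel copyin (a[0:n], b[0:n]) copyout (c[0:n])
  {
#pragma acc loop
    for (int i = 0; i < n; i++)
      c[i] = a[i] + b[i];
  }
}
```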

At toplevel, the added pass list looks like this:
...
           NEXT_PASS (pass_build_ealias);
           /* Pass group that runs when there are oacc kernels in the
              function.  */


Not sure why pass_oacc_kernels runs before all the other local
cleanups?  I would have put it after pass_cd_dce at least.

My focus was on running pass_oacc_kernels ASAP, in order not to have to adapt
more passes to leave the omp-lowered code alone. I'll give your suggestion a
try.

           NEXT_PASS (pass_oacc_kernels);
           PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
               NEXT_PASS (pass_ch_oacc_kernels);
               NEXT_PASS (pass_tree_loop_init);
               NEXT_PASS (pass_lim);
               NEXT_PASS (pass_ccp);
               NEXT_PASS (pass_parallelize_loops_oacc_kernels);
               NEXT_PASS (pass_tree_loop_done);
           POP_INSERT_PASSES ()
  ...

The main question I'm currently facing is the following: when to do lowering
(in other words, rewriting of variable access in terms of .omp_data) of the
kernels region. There are basically 2 passes that contain code to do this:
- pass_lower_omp (on pre-ssa code)
- pass_parallelize_loops (on ssa code)


Both use the same utilities.

I think you mean that both passes use the same utilities to do omp-expand (in
other words, pass_parallelize_loops uses omp_expand_local).

But AFAIU, the omp-lowering in pass_parallelize_loops (in particular, the
rewrite of the region in terms of rewrite vars) shares no code with the omp
pass.

Atm I'm using pass_lower_omp, and I've added a patch that handles omp-lowered
code conservatively in ccp and forwprop in order for the lowering to remain
until arriving at pass_parallelize_loops_oacc_kernels.


You mean omp-_un_-lowered code?

No, I mean pass_lower_omp lowers the code into omp-lowered code, and the patch
in question prevents ccp and forwprop from undoing the lowering before arriving
at the point where we split off the function.

But it might turn out to be easier/necessary to handle this in
pass_parallelize_loops_oacc_kernels instead.


I'd do it similar to how autopar does it

OK, I'll try then to do the lowering for the kernels region in
pass_parallelize_loops_oacc_kernels, not in pass_lower_omp.


FWIW, I'm looking now into reductions, and started thinking in the same 
direction.

(not that autopar is a great
example for a GCC pass these days...).

For my understanding, could you briefly elaborate on that (or give a reference
to an earlier discussion)?


Thanks,
- Tom

Richard.

Any advice on this issue, and on the current implementation is welcome.

Thanks,
- Tom
