On 09-09-14 12:56, Richard Biener wrote:
On Tue, 9 Sep 2014, Tom de Vries wrote:
On 18-08-14 14:16, Tom de Vries wrote:
On 06-08-14 17:10, Tom de Vries wrote:
We could insert a pass-group here that only deals with functions that have the
kernels directive, and do the auto-par thing in a pass_oacc_kernels (which
should share the majority of the infrastructure with the parloops pass):
...
NEXT_PASS (pass_build_ealias);
INSERT_PASSES_AFTER/WITHIN (passes_oacc_kernels)
NEXT_PASS (pass_ch);
NEXT_PASS (pass_ccp);
NEXT_PASS (pass_lim_aux);
NEXT_PASS (pass_oacc_par);
POP_INSERT_PASSES ()
...
Any comments, ideas or suggestions ?
I've experimented with implementing this on top of gomp-4_0-branch, and I ran
into PR46032.
PR46032 is about vectorization failure on a function split off by omp
parallelization. The vectorization fails due to aliasing constraints in the
split off function, which are not present in the original code.
Heh. At least the omp-low.c parts from comment #1 should be pushed
to trunk...
Hi Richard,
Right, but the intra_create_variable_infos part does not apply cleanly, and I
don't know yet how to resolve that.
In the gomp-4_0-branch, the code marked by the openacc kernels directive is
split off during omp_expand. The generated code has the same additional aliasing
constraints, and in pass_oacc_par the parallelization fails.
PR46032 contains a tentative patch by Richard Biener, which applies cleanly
on top of 4.6 (I haven't yet reached a level of understanding of
tree-ssa-structalias.c to be able to resolve the conflict in
intra_create_variable_infos when applying on 4.7). The tentative patch involves
running ipa-pta, which is also a pass run after the point where we write out the
lto stream. I'm not sure whether it makes sense to run the ipa-pta pass as part
of the pass_oacc_kernels pass list.
No, that's not even possible I think.
OK, thanks for confirming that.
I see three ways of continuing from here:
- take the tentative patch and make it work, including running ipa-pta during
passes_oacc_kernels
- same, but try somehow to manage without running ipa-pta.
- try to postpone splitting of the function until the end of pass_oacc_par.
I don't understand the last option? What is the actual issue you run
into? You split oacc kernels off and _then_ run "autopar" on the
split-off function (and get additional kernels)?
Let me try to reiterate the problem in more detail.
We're trying to implement the auto-parallelization part of the oacc kernels
directive using the existing parloops pass. The source starting point is the
gomp-4_0-branch. The gomp-4_0-branch has a dummy implementation of the oacc
kernels directive, analogous to the oacc parallel directive.
So the current gomp-4_0-branch does the following steps for oacc
parallel/kernels directives:
1. pass_lower_omp/scan_omp:
- create record type with rewrite vars (.omp_data_t).
- declare function with arg with type pointer to .omp_data_t.
2. pass_lower_omp/lower_omp:
- rewrite region in terms of rewrite vars
- add omp_return at end
3. pass_expand_omp:
- split off the region into a separate function
- replace region with call to GOACC_parallel/GOACC_kernels, with function
pointer as argument
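[ Editorial illustration: steps 1-3 can be pictured with a hand-written C
sketch for a vector-addition region. This is not compiler output. The names
omp_data_t, kernels_fn and launch_region are hypothetical stand-ins for the
generated .omp_data_t record, the split-off function and the
GOACC_parallel/GOACC_kernels call; the real runtime entry points take more
arguments than shown here. ]

```c
#define N 8

/* Hypothetical stand-in for the record type created in step 1
   (.omp_data_t): one field per variable used in the region.  */
struct omp_data_t
{
  int *a;
  int *b;
  int *c;
};

/* Hypothetical stand-in for the function declared in step 1 and filled in
   by step 3.  As in step 2, every access goes through the record.  */
static void
kernels_fn (void *data)
{
  struct omp_data_t *d = data;

  for (int i = 0; i < N; i++)
    d->c[i] = d->a[i] + d->b[i];
}

/* Hypothetical launcher standing in for GOACC_parallel/GOACC_kernels.
   Here it simply calls the function on the host.  */
static void
launch_region (void (*fn) (void *), void *data)
{
  fn (data);
}

/* What the original function looks like after step 3: the region is
   replaced by a call that passes the function pointer and the record.  */
void
vector_add (int *a, int *b, int *c)
{
  struct omp_data_t data = { a, b, c };

  launch_region (kernels_fn, &data);
}
```

The point of the sketch is only that after step 3 the loop lives in a function
whose sole input is a pointer to the record.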
I wrote an example with a single oacc kernels region containing a simple vector
addition loop, and tried to make auto-parallelization work.
The first problem I ran into was that the parloops pass failed to analyze the
dependencies in a vector addition example, due to the fact that the region was
already split off into a separate function, similar to PR46032.
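[ Editorial illustration: a minimal C sketch of why the split-off form is
harder to analyze. The names region_data, region_fn, call_distinct and
call_overlapping are hypothetical. Seen in isolation, the split-off function
only receives pointers through a record, so nothing rules out a caller that
passes overlapping memory, and in that case the loop iterations are no longer
independent. ]

```c
#include <stddef.h>

/* Hypothetical stand-in for the .omp_data_t record.  */
struct region_data
{
  int *a;
  int *b;
  int *c;
  size_t n;
};

/* Hypothetical split-off region body.  Without information about the
   caller, d->c may overlap d->a or d->b.  */
void
region_fn (struct region_data *d)
{
  for (size_t i = 0; i < d->n; i++)
    d->c[i] = d->a[i] + d->b[i];
}

/* The situation in the original code: three distinct arrays, so every
   iteration is independent of the others.  */
int
call_distinct (void)
{
  int a[4] = { 1, 1, 1, 1 }, b[4] = { 1, 1, 1, 1 }, c[4] = { 0, 0, 0, 0 };
  struct region_data d = { a, b, c, 4 };

  region_fn (&d);
  return c[3];
}

/* A call that region_fn alone cannot exclude: c is a shifted by one
   element, so iteration i reads what iteration i - 1 wrote.  */
int
call_overlapping (void)
{
  int buf[5] = { 1, 1, 1, 1, 1 }, b[4] = { 1, 1, 1, 1 };
  struct region_data d = { buf, b, buf + 1, 4 };

  region_fn (&d);
  return buf[4];
}
```

With distinct arrays the last element is 1 + 1 = 2. With the overlapping call
the sequential loop propagates the sum and the last element becomes 5, which a
parallel execution would not reproduce.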
I looked briefly into the patch set in PR46032, but I realized that even if I
fix it, the next problem I run into will be that the parloops pass is run after
the lto stream read/write point. So any changes the parloops pass makes at that
point are in the accelerator compile flow, in other words we're talking about
launching an accelerator kernel from the accelerator. While that is possible
with recent cuda accelerators, I guess in general we should not expect that to
be possible.
[ I also thought of a fancy scheme where we don't split off a new function, but
manipulate the body of the already split off function, and emit a c file from
the accelerator compiler containing the parameters that the host compiler should
use to launch the accelerator kernel... but I guess that would be a last resort. ]
So in order to solve the lto stream read/write point problem, I moved the
parloops pass (well, a copy called pass_oacc_par or similar) up in the pass
list, to before lto stream read/write point. That precludes solving the alias
problem with the PR46032 patch set, since we need ipa for that.
I solved (well, rather prevented) the alias problem by disabling pass_expand_omp
for GIMPLE_OACC_KERNELS, in other words disabling the function-split-off in
pass_expand_omp and letting pass_oacc_par take care of that (This is what I
meant with: 'postpone splitting of the function until the end of pass_oacc_par').
Doing so required me to write a patch to handle omp-lowered code conservatively
in ccp and forwprop, otherwise the 'rewrite region in terms of rewrite vars'
would be undone by the time we arrive at pass_oacc_par.
Some advice on how to continue from here would be *highly* appreciated. My hunch
atm is to investigate the last option.
Jakub,
Richard,
I've investigated the last option, and published the current state in git-only
branch vries/oacc-kernels
( https://gcc.gnu.org/git/?p=gcc.git;a=shortlog;h=refs/heads/vries/oacc-kernels ).
The current state at commit 9255cadc5b6f8f7f4e4506e65a6be7fb3c00cd35 is that:
- a simple loop marked with the oacc kernels directive is analyzed for
parallelization,
- the loop is then rewritten using oacc parallel and oacc loop directives
- these oacc directives are expanded using omp_expand_local
- this results in the loop being split off into a separate function, while
the loop is replaced with a GOACC_parallel call
- all this is done before writing out the lto stream
- no support yet for reductions, nested loops, more than one loop nest in
kernels region
At toplevel, the added pass list looks like this:
...
NEXT_PASS (pass_build_ealias);
/* Pass group that runs when there are oacc kernels in the
function. */
Not sure why pass_oacc_kernels runs before all the other local
cleanups? I would have put it after pass_cd_dce at least.
My focus was on running pass_oacc_kernels ASAP, in order not to have to adapt
more passes to leave the omp-lowered code alone. I'll give your suggestion a try.
NEXT_PASS (pass_oacc_kernels);
PUSH_INSERT_PASSES_WITHIN (pass_oacc_kernels)
NEXT_PASS (pass_ch_oacc_kernels);
NEXT_PASS (pass_tree_loop_init);
NEXT_PASS (pass_lim);
NEXT_PASS (pass_ccp);
NEXT_PASS (pass_parallelize_loops_oacc_kernels);
NEXT_PASS (pass_tree_loop_done);
POP_INSERT_PASSES ()
...
The main question I'm currently facing is the following: when to do lowering
(in other words, rewriting of variable access in terms of .omp_data) of the
kernels region. There are basically 2 passes that contain code to do this:
- pass_lower_omp (on pre-ssa code)
- pass_parallelize_loops (on ssa code)
Both use the same utilities.
I think you mean that both passes use the same utilities to do omp-expand (in
other words, pass_parallelize_loops uses omp_expand_local).
But AFAIU, the omp-lowering in pass_parallelize_loops (in particular, the
rewrite of the region in terms of rewrite vars) shares no code with the omp pass.
Atm I'm using pass_lower_omp, and I've added a patch that handles omp-lowered
code conservatively in ccp and forwprop in order for the lowering to remain
until arriving at pass_parallelize_loops_oacc_kernels.
You mean omp-_un_-lowered code?
No, I mean pass_lower_omp lowers the code into omp-lowered code, and the patch
in question prevents ccp and forwprop from undoing the lowering before arriving
at the point where we split off the function.
But it might turn out to be easier/necessary to handle this in
pass_parallelize_loops_oacc_kernels instead.
I'd do it similar to how autopar does it
OK, I'll try then to do the lowering for the kernels region in
pass_parallelize_loops_oacc_kernels, not in pass_lower_omp.
FWIW, I'm looking now into reductions, and started thinking in the same
direction.
(not that autopar is a great
example for a GCC pass these days...).
For my understanding, could you briefly elaborate on that (or give a reference
to an earlier discussion)?
Thanks,
- Tom
Richard.
Any advice on this issue, and on the current implementation is welcome.
Thanks,
- Tom