On 22-09-14 10:28, Richard Biener wrote:
On Tue, 16 Sep 2014, Tom de Vries wrote:
On 09-09-14 12:56, Richard Biener wrote:
On Tue, 9 Sep 2014, Tom de Vries wrote:
On 18-08-14 14:16, Tom de Vries wrote:
On 06-08-14 17:10, Tom de Vries wrote:
We could insert a pass-group here that only deals with functions that have the
kernels directive, and do the auto-par thing in a pass_oacc_kernels (which
should share the majority of the infrastructure with the parloops pass):
...
NEXT_PASS (pass_build_ealias);
INSERT_PASSES_AFTER/WITHIN (passes_oacc_kernels)
NEXT_PASS (pass_ch);
NEXT_PASS (pass_ccp);
NEXT_PASS (pass_lim_aux);
NEXT_PASS (pass_oacc_par);
POP_INSERT_PASSES ()
...
Any comments, ideas or suggestions ?
I've experimented with implementing this on top of gomp-4_0-branch, and I ran
into PR46032.
PR46032 is about vectorization failure on a function split off by omp
parallelization. The vectorization fails due to aliasing constraints in the
split off function, which are not present in the original code.
Heh. At least the omp-low.c parts from comment #1 should be pushed
to trunk...
Hi Richard,
Right, but the intra_create_variable_infos part does not apply cleanly, and I
don't know yet how to resolve that.
In the gomp-4_0-branch, the code marked by the openacc kernels directive is
split off during omp_expand. The generated code has the same additional
aliasing constraints, and in pass_oacc_par the parallelization fails.
PR46032 contains a tentative patch by Richard Biener, which applies cleanly
on top of 4.6 (I haven't yet reached a level of understanding of
tree-ssa-structalias.c to be able to resolve the conflict in
intra_create_variable_infos when applying on 4.7). The tentative patch
involves running ipa-pta, which is also a pass run after the point where we
write out the lto stream. I'm not sure whether it makes sense to run the
ipa-pta pass as part of the pass_oacc_kernels pass list.
No, that's not even possible I think.
OK, thanks for confirming that.
I see three ways of continuing from here:
- take the tentative patch and make it work, including running ipa-pta during
  passes_oacc_kernels.
- same, but try somehow to manage without running ipa-pta.
- try to postpone splitting of the function until the end of pass_oacc_par.
I don't understand the last option? What is the actual issue you run
into? You split oacc kernels off and _then_ run "autopar" on the
split-off function (and get additional kernels)?
Let me try to reiterate the problem in more detail.
We're trying to implement the auto-parallelization part of the oacc kernels
directive using the existing parloops pass. The source starting point is the
gomp-4_0-branch. The gomp-4_0-branch has a dummy implementation of the oacc
kernels directive, analogous to the oacc parallel directive.
So the current gomp-4_0-branch does the following steps for oacc
parallel/kernels directives:
1. pass_lower_omp/scan_omp:
- create record type with rewrite vars (.omp_data_t).
- declare function with arg with type pointer to .omp_data_t.
2. pass_lower_omp/lower_omp:
- rewrite region in terms of rewrite vars
- add omp_return at end
3. pass_expand_omp:
- split off the region into a separate function
- replace region with call to GOACC_parallel/GOACC_kernels, with function
pointer as argument
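[ As a rough illustration of steps 1-3, the result for a vector addition
region could look something like the hand-written C below. This is only a
sketch, not actual compiler output: the struct, kernels_region_fn and
launch_stub are hypothetical names, and launch_stub merely stands in for the
GOACC_kernels call (the real entry point has a different signature and would
offload the region rather than call it on the host). ]

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the generated .omp_data_t record (step 1).  */
struct omp_data_t
{
  const int *a;
  const int *b;
  int *c;
  int n;
};

/* Hypothetical outlined region (step 3).  The body is written in terms of
   the record fields (step 2), not the original variables.  */
static void
kernels_region_fn (void *data)
{
  struct omp_data_t *d = data;
  assert (d != NULL);
  for (int i = 0; i < d->n; i++)
    d->c[i] = d->a[i] + d->b[i];
}

/* Hypothetical stand-in for the runtime entry point.  This stub simply
   calls the outlined function on the host.  */
static void
launch_stub (void (*fn) (void *), void *data)
{
  fn (data);
}

/* The original function after expansion: the region has been replaced by
   a call that passes the outlined function and a pointer to its data.  */
void
vector_add (const int *a, const int *b, int *c, int n)
{
  struct omp_data_t data = { a, b, c, n };
  launch_stub (kernels_region_fn, &data);
}
```

Note that after this transformation the loop only sees pointers loaded from
the record, which is where the additional aliasing constraints come from.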
I wrote an example with a single oacc kernels region containing a simple
vector addition loop, and tried to make auto-parallelization work.
Ah, so the "target" OACC directive tells it to vectorize only, not to
parallelize?
Hi Richard,
I'm trying to make auto-parallelization work, not vectorization.
And we split off the kernel only because we have to
ship it to the accelerator.
The first problem I ran into was that the parloops pass failed to analyze the
dependencies in a vector addition example, due to the fact that the region
was already split off into a separate function, similar to PR46032.
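[ To make the dependence problem concrete, here is a small sketch in plain C.
It assumes a hypothetical record and split-off function (region_data and
region_fn are made-up names, not what the compiler generates). Looking at
region_fn alone, nothing rules out that the output pointer overlaps an input,
and in that case the iterations are not independent. ]

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record standing in for .omp_data_t.  */
struct region_data
{
  int *a;
  int *b;
  int *c;
  int n;
};

/* Hypothetical split-off loop.  Seen in isolation, d->a, d->b and d->c
   may point into the same object, so iteration i may read a value that
   iteration i - 1 wrote.  */
void
region_fn (struct region_data *d)
{
  assert (d != NULL && d->n >= 0);
  for (int i = 0; i < d->n; i++)
    d->c[i] = d->a[i] + d->b[i];
}

/* Convenience wrapper that fills in the record and calls the loop.  */
void
call_region (int *a, int *b, int *c, int n)
{
  struct region_data d = { a, b, c, n };
  region_fn (&d);
}
```

With three distinct arrays the loop is trivially parallel. Calling
call_region (buf, b, buf + 1, n) instead makes each iteration depend on the
previous one, and a compiler that only sees region_fn has to allow for that
unless points-to information from the caller is available.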
I looked briefly into the patch set in PR46032, but I realized that even if
I fix it, the next problem I run into will be that the parloops pass is run
after the lto stream read/write point. So any changes the parloops pass makes
at that point are in the accelerator compile flow, in other words we're
talking about launching an accelerator kernel from the accelerator. While that
is possible with recent CUDA accelerators, I guess in general we should not
expect that to be possible.
HSA also supports that btw.
OK, good to know.
[ I also thought of a fancy scheme where we don't split off a new function,
but manipulate the body of the already split off function, and emit a C file
from the accelerator compiler containing the parameters that the host compiler
should use to launch the accelerator kernel... but I guess that would be a
last resort. ]
So in order to solve the lto stream read/write point problem,