I'll be posting a patch series for trunk, which implements the core of the
OpenACC execution model. This is split into the following patches:
01-trunk-unique.patch
Internal function with a 'uniqueness' property
02-trunk-nvptx-partition.patch
NVPTX backend patch set for partitioned execution
03-trunk-hook.patch
OpenACC hook
04-trunk-c.patch
C FE changes
05-trunk-cxx.patch
C++ FE changes
06-trunk-red-init.patch
Placeholder to keep reductions functioning
07-trunk-loop-mark.patch
Annotate OpenACC loops in device-agnostic manner
08-trunk-dev-lower.patch
Device-specific lowering of loop markers
09-trunk-lower-gate.patch
Run oacc_device_lower pass regardless of errors
10-trunk-libgomp.patch
Libgomp change (remove dimension check)
11-trunk-tests.patch
Initial set of execution tests
With the exception of patch 6, these are all on the gomp4 branch. This patch
set does not change reduction handling, which will be dealt with in a subsequent
set.
An offloaded region is spawned on a set of execution engines. These are
organized as a cube, with specific axes controlled by the programmer. The
engines may operate in a 'partitioned' mode, where each engine executes as a
separate thread, or they may operate in a 'single' mode, where one engine of a
particular set executes the program and the other engines are idled (in an
implementation-specific manner).
A driving example is the following:
#pragma acc parallel ...
{
// single mode here
#pragma acc loop ...
for (i = 0; i < N; i++) // loop 1
... // partitioned mode here
if (expr) // single mode here
#pragma acc loop ...
for (i = 0; i < N; i++) // loop 2
... // partitioned mode here
}
While it's clear all paths lead to loop 1, it's not statically determinable
whether loop 2 is executed or not.
This implementation marks the head and tail of partitioned execution regions
with builtin functions indicating the axes of partitioning. After
device-specific lowering, these will eventually make it to RTL expansion time,
where they get expanded to backend-specific RTL. In the PTX implementation
'single' mode is implemented by a 'neutering' mechanism, where the non-active
execution engines skip each basic block and 'follow along' conditional branches
to get to a subsequent block. In this manner all engines can reach a
dynamically determinable partitioned region.
On entry to a partitioned region, we execute a 'fork' operation, cloning live
state from the single active engine before the region, into the other threads
that become activated.
This patchset has been tested on x86_64-linux & ptx accelerator.
nathan