I'll be posting a patch series for trunk, which implements the core of the OpenACC execution model. This is split into the following patches:

01-trunk-unique.patch
  Internal function with a 'uniqueness' property
02-trunk-nvptx-partition.patch
  NVPTX backend patch set for partitioned execution
03-trunk-hook.patch
  OpenACC hook
04-trunk-c.patch
  C FE changes
05-trunk-cxx.patch
  C++ FE changes
06-trunk-red-init.patch
  Placeholder to keep reductions functioning
07-trunk-loop-mark.patch
  Annotate OpenACC loops in device-agnostic manner
08-trunk-dev-lower.patch
  Device-specific lowering of loop markers
09-trunk-lower-gate.patch
  Run oacc_device_lower pass regardless of errors
10-trunk-libgomp.patch
  Libgomp change (remove dimension check)
11-trunk-tests.patch
  Initial set of execution tests

With the exception of patch 6, these are all on the gomp4 branch. This patch set does not change reduction handling, which will be dealt with in a subsequent set.

An offloaded region is spawned on a set of execution engines. These are organized as a cube, with specific axes controlled by the programmer. The engines may operate in a 'partitioned' mode, where each engine executes as a separate thread, or they may operate in a 'single' mode, where one engine of a particular set executes the program and the other engines are idled (in an implementation-specific manner).

A driving example is the following:
#pragma acc parallel ...
{
 // single mode here
#pragma acc loop ...
for (i = 0; i < N; i++) // loop 1
  ... // partitioned mode here

if (expr) // single mode here
#pragma acc loop ...
  for (i = 0; i < N; i++) // loop 2
    ... // partitioned mode here
}

While it's clear all paths lead to loop 1, it's not statically determinable whether loop 2 is executed or not.

This implementation marks the head and tail of partitioned execution regions with builtin functions indicating the axes of partitioning. After device-specific lowering, these will eventually make it to RTL expansion time, where they get expanded to backend-specific RTL. In the PTX implementation 'single' mode is implemented by a 'neutering' mechanism, where the non-active execution engines skip each basic block and 'follow along' conditional branches to get to a subsequent block. In this manner all engines can reach a dynamically determinable partitioned region.

On entry to a partitioned region, we execute a 'fork' operation, cloning live state from the single active engine before the region, into the other threads that become activated.

This patchset has been tested on x86_64-linux & ptx accelerator.

nathan

Reply via email to