[OpenACC 0/11] execution model

Nathan Sidwell Wed, 21 Oct 2015 11:46:23 -0700

I'll be posting a patch series for trunk, which implements the core of theOpenACC execution model. This is split into the following patches:


01-trunk-unique.patch
  Internal function with a 'uniqueness' property
02-trunk-nvptx-partition.patch
  NVPTX backend patch set for partitioned execution
03-trunk-hook.patch
  OpenACC hook
04-trunk-c.patch
  C FE changes
05-trunk-cxx.patch
  C++ FE changes
06-trunk-red-init.patch
  Placeholder to keep reductions functioning
07-trunk-loop-mark.patch
  Annotate OpenACC loops in device-agnostic manner
08-trunk-dev-lower.patch
  Device-specific lowering of loop markers
09-trunk-lower-gate.patch
  Run oacc_device_lower pass regardless of errors
10-trunk-libgomp.patch
  Libgomp change (remove dimension check)
11-trunk-tests.patch
  Initial set of execution tests

With the exception of patch 6, these are all on the gomp4 branch. This patchset does not change reduction handling, which will be dealt with in a subsequentset.

An offloaded region is spawned on a set of execution engines. These areorganized as a cube, with specific axes controlled by the programmer. Theengines may operate in a 'partitioned' mode, where each engine executes as aseparate thread, or they may operate in a 'single' mode, where one engine of aparticular set executes the program and the other engines are idled (in animplementation-specific manner).


A driving example is the following:
#pragma acc parallel ...
{
 // single mode here
#pragma acc loop ...
for (i = 0; i < N; i++) // loop 1
  ... // partitioned mode here

if (expr) // single mode here
#pragma acc loop ...
  for (i = 0; i < N; i++) // loop 2
    ... // partitioned mode here
}

While it's clear all paths lead to loop 1, it's not statically determinablewhether loop 2 is executed or not.

This implementation marks the head and tail of partitioned execution regionswith builtin functions indicating the axes of partitioning. Afterdevice-specific lowering, these will eventually make it to RTL expansion time,where they get expanded to backend-specific RTL. In the PTX implementation'single' mode is implemented by a 'neutering' mechanism, where the non-activeexecution engines skip each basic block and 'follow along' conditional branchesto get to a subsequent block. In this manner all engines can reach adynamically determinable partitioned region.

On entry to a partitioned region, we execute a 'fork' operation, cloning livestate from the single active engine before the region, into the other threadsthat become activated.


This patchset has been tested on x86_64-linux & ptx accelerator.

nathan

[OpenACC 0/11] execution model

Reply via email to