Hi all, I have found digging into OpenACC meaningful. This is a late start for a GSoC proposal, and any suggestions from the community are appreciated! Feel free to comment on any part of it. To save readers time, I've outlined my key understandings here; they still need expansion and polishing.
I'm not sure whether my understanding of each task is correct, especially for the cache directive and the device_type clause. Here is my current understanding:

*3. Make the OpenACC **cache** Directive Actually Do Something*

Currently, the cache directive in OpenACC is parsed at the front end but not used for any optimization purpose, such as data prefetching or moving data to low-latency memory (e.g., the L1/L2/L3 caches or GPU cache [5]).

*TODO:* My current understanding is that after the OACC_CACHE directive is lowered to GIMPLE via gimplify_oacc_cache, a new tree-SSA optimization pass could be added. This pass might be similar to the existing aprefetch pass, or the OpenACC prefetch logic could be integrated into that pass. The goal would be to emit prefetch instructions by inserting suitable built-in functions and relying on the back end to map them to runtime API calls or device-specific instructions through RTL templates. However, several questions remain:
- Since OpenACC supports both accelerators (e.g., GPUs) and multicore CPUs, should we handle both cases?
- For CPUs, we can refer to each vendor's ISA (e.g., x86_64, Arm) to decide which prefetch instructions to generate.
- For GPUs, are we expected to use prefetch instructions from the GPU ISA, or should we manually manage the data with runtime API routines like acc_memcpy_device?
- Additional considerations include choosing a suitable prefetch distance, which may differ by device type or architecture.

*5. OpenACC **device_type** Clause*

*TODO:* Is the device_type clause designed to allow users to manually specify the target platform in the source code, rather than via compiler options such as -foffload=amdgcn-amdhsa="-march=gfx900", or compiler build options such as --target=nvptx-none?

My understanding of the other tasks is as follows:

*1. OpenACC **acc_memcpy_device** Runtime API Routine*

The acc_memcpy_device routine is currently missing from GCC's OpenACC runtime implementation. According to the specification, this routine copies a specified number of bytes from one device address (data_dev_src) to another device address (data_dev_dest); both addresses must reside in the current device's memory. There is also an asynchronous variant that performs the transfer on a specified async queue (async_arg). The routine must handle error cases such as null pointers and invalid async argument values. It shares a similar implementation pattern with acc_memcpy_to_device and acc_memcpy_from_device, which transfer data between host and device memory. Implementation will mainly involve modifying the following files:
- libgomp/libgomp.map
- libgomp/openacc.h
- libgomp/openacc_lib.h
- libgomp/openacc.f90
- libgomp/oacc-mem.c

The existing functions such as memcpy_tofrom_device, gomp_copy_dev2host, and gomp_device_copy were primarily designed for acc_memcpy_to_device and acc_memcpy_from_device, which handle host-device transfers. For acc_memcpy_device, which performs device-to-device transfers, we will need to design similar logic; further investigation is needed to structure and implement this functionality effectively.

*2. Support for **init**, **shutdown**, and **set** Directives*

These directives are currently unsupported at the front-end level in GCC, even though their corresponding runtime APIs (acc_init, acc_shutdown, acc_set_device_type, and their async-queue variants) are implemented. The goal here is to add parsing support in the front end so that these directives are mapped to the appropriate built-in functions.
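For illustration, here is a minimal sketch of the relationship I have in mind between the directive forms and the existing runtime entry points. The directive-to-built-in mapping, the NVIDIA device type, and the clause spellings are my assumptions and still need to be checked against the specification; running it also assumes an NVIDIA offload target is configured.

#include <openacc.h>

int
main (void)
{
  /* Directive forms from the specification that GCC's front ends do not
     accept yet (exact clause spellings to be verified against the spec):

       #pragma acc init device_type (nvidia)
       #pragma acc set device_type (nvidia)
       #pragma acc shutdown device_type (nvidia)

     The corresponding runtime routines below already exist in libgomp, so
     the work is mainly parsing the directives and lowering them to the
     matching GOACC built-ins.  */
  acc_init (acc_device_nvidia);
  acc_set_device_type (acc_device_nvidia);
  acc_shutdown (acc_device_nvidia);
  return 0;
}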
In GCC, front ends map OpenACC directives to BUILT_IN_GOACC_* entries defined in gcc/omp-builtins.def, and the back end expands these into runtime API calls. This task involves analyzing and extending front-end source files, taking inspiration from the implementation of the wait directive. Relevant files include:
- gcc/c-family/c-omp.cc
- gcc/c/c-parser.cc
- gcc/cp/parser.cc
- gcc/fortran/trans-openmp.cc
- gcc/omp-builtins.def
- gcc/omp-oacc-kernels-decompose.cc

*4. OpenACC **bind** Clause Support*

The bind clause appears on the routine directive and applies at the sequential level of parallelism. The following restrictions must be enforced:
- A routine may not bind to a name that already has a visible bind clause.
- If a procedure has a bind clause on both its declaration and definition, they must bind to the same name.
- When compiling for multicore host CPUs, any bind clause should be ignored.
- A bind clause must not appear after a device_type(host) clause.

These cases should be carefully validated during semantic analysis. We can also use internal control variables (ICVs) such as default-device-var to inform bind behavior.
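To make my reading of the clause concrete, here is a small usage sketch. The function names are invented for illustration, and whether this matches the intended semantics of bind still needs to be confirmed against the specification.

/* When compiling for an accelerator, device-side calls to scale() would be
   bound to the alternative name given in the bind clause; when compiling
   for the multicore host, the clause is ignored and the host definition of
   scale() is used.  */

#pragma acc routine seq
float
scale_device (float x)  /* Device-specific implementation.  */
{
  return 2.0f * x;
}

#pragma acc routine bind(scale_device) seq
float
scale (float x)         /* Host implementation.  */
{
  return 2.0f * x;
}

float
sum (const float *a, int n)
{
  float s = 0.0f;
  #pragma acc parallel loop reduction(+:s) copyin(a[0:n])
  for (int i = 0; i < n; i++)
    s += scale (a[i]);  /* Resolves to scale_device in device code.  */
  return s;
}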
My understanding of the background is as follows:

Introduction

OpenACC is a directive-based parallel programming model designed for heterogeneous HPC hardware. However, GCC currently only partially supports the features specified in OpenACC 2.6: some directives are not parsed at all, some are parsed at the front end but are not lowered to the appropriate runtime API calls, and the runtime library implementation in GCC is also incomplete. This project aims to address these gaps by proposing feasible solutions that bring GCC's OpenACC support more closely in line with the official specification.

Background

OpenACC is a parallel programming model for heterogeneous HPC hardware, abstracted into two parts: the host and the attached parallel accelerator, such as a GPU. It provides compiler directives (e.g., in C/C++: #pragma acc directive-name [clause-list]) that allow users to specify compute-intensive regions of a program to be offloaded to an accelerator or executed on multiple host cores under the control of a host thread. The *execution model* is host-directed: the host thread manages memory allocation on the accelerator, initiates data transfers, sends code and arguments to the device, queues tasks, waits for completion, retrieves results, and deallocates memory. A key aspect of OpenACC is its *memory model*: accelerator memory is often separate from host memory, requiring explicit data transfers handled by the OpenACC runtime through underlying system calls such as direct memory access (DMA) transfers. Nowadays, most accelerators include caches, and OpenACC requires the compiler to manage these caches [1].

GCC parses code containing OpenACC directives written in C/C++ or Fortran and uses the OpenMP runtime API routines from the libgomp library, developed by GCC, to implement the functionality of each directive. At runtime, libgomp can look up and launch an offload function when given a target function address [3]. These target functions are linked to libgomp plugins, which are loaded from the standard dynamic linker path. For example, the plugin for Intel MIC devices uses liboffloadmic.so, while the plugin for NVIDIA PTX devices uses libcuda.so [2]. These plugins rely on third-party, target-specific libraries to perform the low-level interaction with accelerator devices.

In short, libgomp is designed to be independent of specific accelerator architectures: it exposes a generic interface and delegates all target-dependent functionality to plugins. These plugins are developed collaboratively by the GNU community and hardware vendors.

References

[1] OpenACC Specification: https://www.openacc.org/specification
[2] OpenACC Host Compiler Compilation Process: https://gcc.gnu.org/wiki/Offloading#Compilation_process
[3] Improving OpenACC kernels support in GCC: https://gcc.gnu.org/wiki/cauldron2017?action=AttachFile&do=get&target=OpenACC+kernels.pdf
[4] Issue with acc_memcpy_device: https://forums.developer.nvidia.com/t/issue-with-acc-memcpy-device/135977
[5] NVIDIA Kepler GK110/GK210 Architecture Whitepaper: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/tesla-product-literature/NVIDIA-Kepler-GK110-GK210-Architecture-Whitepaper.pdf

Best regards,
Chenlu