Hi all, I have found digging into OpenACC meaningful. This is a late start for a GSoC proposal, and any suggestions from the community are appreciated! Feel free to comment on any part of it. To save readers time, I've outlined my key understandings here; they still need expansion and polishing.
I'm not sure whether my understanding of each task is correct, especially for the cache directive and the device_type clause. Here is my current understanding:

*3. Make the OpenACC **cache** Directive Actually Do Something*

Currently, the cache directive in OpenACC is parsed at the front end but not used for any optimization purpose, such as data prefetching or moving data to low-latency memory (e.g., the L1/L2/L3 caches or GPU cache [5]).

*TODO:* My current understanding is that after the OACC_CACHE directive is lowered to GIMPLE via gimplify_oacc_cache, a new tree-SSA optimization pass could be added. This pass might be similar to the existing aprefetch pass, or the OpenACC prefetch logic could be integrated into that pass. The goal would be to emit prefetch instructions by inserting suitable built-in functions and relying on the back end to map them to runtime API calls or device-specific instructions through RTL templates. However, several questions remain:
- Since OpenACC supports both accelerators (e.g., GPUs) and multicore CPUs, should we handle both cases?
- For CPUs, we can refer to each vendor's ISA (e.g., x86_64, Arm) to decide which prefetch instructions to generate.
- For GPUs, are we expected to use prefetch instructions from the GPU ISA, or should we manually manage the data with runtime API routines like acc_memcpy_device?
- Additional considerations include choosing a suitable prefetch distance, which may differ by device type or architecture.

*5. OpenACC **device_type** Clause*

*TODO:* Is the device_type clause designed to allow users to manually specify the target platform in the source code, rather than via compiler options such as -foffload=amdgcn-amdhsa="-march=gfx900", or compiler build options such as --target=nvptx-none?

My understanding of the other tasks is as follows:

*1. OpenACC **acc_memcpy_device** Runtime API Routine*

The acc_memcpy_device routine is currently missing from GCC's OpenACC runtime implementation. According to the specification, this routine copies a specified number of bytes from one device address (data_dev_src) to another device address (data_dev_dest); both addresses must reside in the current device's memory. There is also an asynchronous variant that performs the transfer on a specified async queue (async_arg). The routine must handle error cases such as null pointers and invalid async argument values. It shares a similar implementation pattern with acc_memcpy_to_device and acc_memcpy_from_device, which transfer data between host and device memory. Implementation will mainly involve modifying the following files:
- libgomp/libgomp.map
- libgomp/openacc.h
- libgomp/openacc_lib.h
- libgomp/openacc.f90
- libgomp/oacc-mem.c

The existing functions such as memcpy_tofrom_device, gomp_copy_dev2host, and gomp_device_copy were primarily designed for acc_memcpy_to_device and acc_memcpy_from_device, which handle host-device transfers. For acc_memcpy_device, which performs device-to-device transfers, we will need to design similar logic; further investigation is needed to structure and implement this functionality effectively.

*2. Support for **init**, **shutdown**, and **set** Directives*

These directives are currently unsupported at the front-end level in GCC, even though their corresponding runtime APIs (acc_init, acc_shutdown, acc_set_device_type, and their async-queue variants) are implemented. The goal here is to add parsing support in the front end so that these directives are mapped to the appropriate built-in functions.
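For illustration, here is a minimal sketch of the relationship I have in mind between the directive forms and the existing runtime entry points. The directive-to-built-in mapping, the NVIDIA device type, and the clause spellings are my assumptions and still need to be checked against the specification; running it also assumes an NVIDIA offload target is configured.

#include <openacc.h>

int
main (void)
{
  /* Directive forms from the specification that GCC's front ends do not
     accept yet (exact clause spellings to be verified against the spec):

       #pragma acc init device_type (nvidia)
       #pragma acc set device_type (nvidia)
       #pragma acc shutdown device_type (nvidia)

     The corresponding runtime routines below already exist in libgomp, so
     the work is mainly parsing the directives and lowering them to the
     matching GOACC built-ins.  */
  acc_init (acc_device_nvidia);
  acc_set_device_type (acc_device_nvidia);
  acc_shutdown (acc_device_nvidia);
  return 0;
}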
In GCC, front ends map OpenACC directives to BUILT_IN_GOACC_* entries defined in gcc/omp-builtins.def, and the back end expands these into runtime API calls. This task involves analyzing and extending front-end source files, taking inspiration from the implementation of the wait directive. Relevant files include:
- gcc/c-family/c-omp.cc
- gcc/c/c-parser.cc
- gcc/cp/parser.cc
- gcc/fortran/trans-openmp.cc
- gcc/omp-builtins.def
- gcc/omp-oacc-kernels-decompose.cc

*4. OpenACC **bind** Clause Support*

The bind clause appears on the routine directive and applies at the sequential level of parallelism. The following restrictions must be enforced:
- A routine may not bind to a name that already has a visible bind clause.
- If a procedure has a bind clause on both its declaration and definition, they must bind to the same name.
- When compiling for multicore host CPUs, any bind clause should be ignored.
- A bind clause must not appear after a device_type(host) clause.

These cases should be carefully validated during semantic analysis. We can also use internal control variables (ICVs) such as default-device-var to inform bind behavior.
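To make my reading of the clause concrete, here is a small usage sketch. The function names are invented for illustration, and whether this matches the intended semantics of bind still needs to be confirmed against the specification.

/* When compiling for an accelerator, device-side calls to scale() would be
   bound to the alternative name given in the bind clause; when compiling
   for the multicore host, the clause is ignored and the host definition of
   scale() is used.  */

#pragma acc routine seq
float
scale_device (float x)  /* Device-specific implementation.  */
{
  return 2.0f * x;
}

#pragma acc routine bind(scale_device) seq
float
scale (float x)         /* Host implementation.  */
{
  return 2.0f * x;
}

float
sum (const float *a, int n)
{
  float s = 0.0f;
  #pragma acc parallel loop reduction(+:s) copyin(a[0:n])
  for (int i = 0; i < n; i++)
    s += scale (a[i]);  /* Resolves to scale_device in device code.  */
  return s;
}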
My understanding of the background is as follows:

Introduction

OpenACC is a directive-based parallel programming model designed for heterogeneous HPC hardware. However, GCC currently only partially supports the features specified in OpenACC 2.6: some directives are not parsed at all, some are parsed at the front end but are not lowered to the appropriate runtime API calls, and the runtime library implementation in GCC is also incomplete. This project aims to address these gaps by proposing feasible solutions that bring GCC's OpenACC support more closely in line with the official specification.

Background

OpenACC is a parallel programming model for heterogeneous HPC hardware, abstracted into two parts: the host and the attached parallel accelerator, such as a GPU. It provides compiler directives (e.g., in C/C++: #pragma acc directive-name [clause-list]) that allow users to specify compute-intensive regions of a program to be offloaded to an accelerator or executed on multiple host cores under the control of a host thread. The *execution model* is host-directed: the host thread manages memory allocation on the accelerator, initiates data transfers, sends code and arguments to the device, queues tasks, waits for completion, retrieves results, and deallocates memory. A key aspect of OpenACC is its *memory model*: accelerator memory is often separate from host memory, requiring explicit data transfers handled by the OpenACC runtime through underlying system calls such as direct memory access (DMA) transfers. Nowadays, most accelerators include caches, and OpenACC requires the compiler to manage these caches [1].

GCC parses code containing OpenACC directives written in C/C++ or Fortran and uses the OpenMP runtime API routines from the libgomp library, developed by GCC, to implement the functionality of each directive. At runtime, libgomp can look up and launch an offload function when given a target function address [3]. These target functions are linked to libgomp plugins, which are loaded from the standard dynamic linker path. For example, the plugin for Intel MIC devices uses liboffloadmic.so, while the plugin for NVIDIA PTX devices uses libcuda.so [2]. These plugins rely on third-party, target-specific libraries to perform the low-level interaction with accelerator devices.

In short, libgomp is designed to be independent of specific accelerator architectures: it exposes a generic interface and delegates all target-dependent functionality to plugins. These plugins are developed collaboratively by the GNU community and hardware vendors.

References

[1] OpenACC Specification: https://www.openacc.org/specification
[2] OpenACC Host Compiler Compilation Process: https://gcc.gnu.org/wiki/Offloading#Compilation_process
[3] Improving OpenACC kernels support in GCC: https://gcc.gnu.org/wiki/cauldron2017?action=AttachFile&do=get&target=OpenACC+kernels.pdf
[4] Issue with acc_memcpy_device: https://forums.developer.nvidia.com/t/issue-with-acc-memcpy-device/135977
[5] NVIDIA Kepler GK110/GK210 Architecture Whitepaper: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/tesla-product-literature/NVIDIA-Kepler-GK110-GK210-Architecture-Whitepaper.pdf

Best regards,
Chenlu