[clang] [compiler-rt] [llvm] [openmp] [compiler-rt] Rework profile data handling for GPU targets (PR #187136)

Joseph Huber via cfe-commits Tue, 17 Mar 2026 15:01:27 -0700

https://github.com/jhuber6 created https://github.com/llvm/llvm-project/pull/187136


Summary:
Currently, the GPU iterates through all of the present symbols and
copies them by prefix. This is inefficient as it requires a lot of small
high-latency data transfers rather than a few large ones. Additionally,
we force every single profiling symbol to have protected visibility.
This means potentially hundreds of unnecessary symbols in the symbol
table.

This PR changes the interface to move towards the start / stop section
handling. AMDGPU supports this natively as an ELF target, so few changes
are needed. Instead of overriding visibility, we use a single table
to define the bounds that we can obtain with one contiguous load.

Using a table interface should also work for the in-progress HIP
implementation for this, as it wraps the start / stop sections into
standard void pointers which will be inside of an already mapped region
of memory, so they should be accessible from the HIP API.

NVPTX is more difficult as it is an ELF platform without this support. I
have hooked up the 'Other' handling to work around this, but even then
it's a bit of a stretch. I could remove this support here, but I wanted
to demonstrate that we can share the ABI. However, NVPTX will only work
if we force LTO and change the backend to emit variables in the same
section next to each other. Maybe this will be easier if NVIDIA ever
provides a SASS target.

TL;DR, we now do this:
```c
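/* Device: one table holding every section bound plus the version. */
struct { start1, stop1, start2, stop2, start3, stop3, version; } device;
/* Host: one lookup and one copy fetch the table, then one copy per section. */
struct host = DtoH(lookup("device"));
counters = DtoH(host.stop - host.start);
version = DtoH(host.version);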
```
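
As a rough illustration of the table-based flow above, here is a minimal host-only sketch in C. Everything in it is an assumption made for the example: `gpu_sections` is a hypothetical mirror of the bounds table, `device_mem` stands in for device memory, and `dtoh` models one device-to-host transfer. It is not the offload plugin's actual code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical mirror of the device-side bounds table. */
typedef struct {
  uintptr_t names_start, names_stop;
  uintptr_t counters_start, counters_stop;
  uintptr_t data_start, data_stop;
  uintptr_t version_var;
} gpu_sections;

/* Stand-in for device memory; addresses are offsets into this buffer. */
static unsigned char device_mem[256];
static unsigned transfers;

/* Hypothetical device-to-host copy: each call models one transfer. */
static void dtoh(void *dst, uintptr_t device_addr, size_t size) {
  assert(device_addr + size <= sizeof(device_mem));
  memcpy(dst, device_mem + device_addr, size);
  ++transfers;
}

/* Populate the fake device: the table at offset 0, four counters at 128. */
static void setup_device(void) {
  const uint64_t counters[4] = {3, 1, 4, 1};
  gpu_sections table = {0};
  table.counters_start = 128;
  table.counters_stop = 128 + sizeof(counters);
  memcpy(device_mem + 128, counters, sizeof(counters));
  memcpy(device_mem, &table, sizeof(table));
  transfers = 0;
}

/* Host side: one transfer for the table, one for the whole counter section.
 * Returns the number of counters copied into out, or 0 if out is too small. */
static size_t read_counters(uint64_t *out, size_t capacity) {
  gpu_sections host;
  dtoh(&host, 0, sizeof(host));
  size_t bytes = (size_t)(host.counters_stop - host.counters_start);
  if (bytes > capacity * sizeof(uint64_t))
    return 0;
  dtoh(out, host.counters_start, bytes);
  return bytes / sizeof(uint64_t);
}
```

Calling `setup_device()` and then `read_counters(out, 8)` copies all four counters in two transfers, regardless of how many functions contributed counters to the section.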


From 6fcd15606004d89212aae808d24e639937e3e8d3 Mon Sep 17 00:00:00 2001
From: Joseph Huber <[email protected]>
Date: Tue, 10 Mar 2026 15:47:47 -0500
Subject: [PATCH 1/3] [compiler-rt] Define GPU specific handling of profiling
 functions

Summary:
The changes in github.com/llvm/llvm-project/pull/185552 allowed us to
start building the standard `libclang_rt.profile.a` for GPU targets.
This PR expands this by adding an optimized GPU routine for counter
increment and removing the special-case handling of these functions in
the OpenMP runtime.

The vast majority of these functions are boilerplate, but we should be able
to do more interesting things with this in the future, like value or
memory profiling.
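
For readers unfamiliar with the wave-cooperative idea, the following host-only C sketch simulates it. The helper names (`active_lanes`, `leader_lane`, `simulate_wave`) are hypothetical, and the plain addition stands in for the relaxed device-scope atomic add used by the real routine; the sketch only shows that one elected lane can account for every active lane.

```c
#include <assert.h>
#include <stdint.h>

/* Count the active lanes in a wave mask (portable popcount). */
static unsigned active_lanes(uint64_t mask) {
  unsigned n = 0;
  for (; mask; mask &= mask - 1)
    ++n;
  return n;
}

/* Index of the lowest active lane, which plays the elected leader. */
static unsigned leader_lane(uint64_t mask) {
  assert(mask != 0);
  unsigned lane = 0;
  while (!(mask & 1)) {
    mask >>= 1;
    ++lane;
  }
  return lane;
}

/* Per-lane view: every active lane calls this, but only the leader writes.
 * A real GPU routine would use an atomic add here; this is a simulation. */
static void increment_cooperative(uint64_t *counter, uint64_t step,
                                  uint64_t mask, unsigned lane) {
  if (lane == leader_lane(mask))
    *counter += step * active_lanes(mask);
}

/* Run the increment for each active lane of one simulated 64-lane wave and
 * return how many lanes actually changed the counter. */
static unsigned simulate_wave(uint64_t *counter, uint64_t step,
                              uint64_t mask) {
  unsigned writes = 0;
  for (unsigned lane = 0; lane < 64; ++lane) {
    if (!((mask >> lane) & 1))
      continue;
    uint64_t before = *counter;
    increment_cooperative(counter, step, mask, lane);
    writes += (*counter != before);
  }
  return writes;
}
```

With a mask of `0xF0` and a step of 1, four lanes are active, the counter grows by 4, and only one write happens instead of four contended atomics.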
---
 compiler-rt/lib/profile/CMakeLists.txt        |  1 +
 compiler-rt/lib/profile/InstrProfiling.h      |  8 +++++
 .../lib/profile/InstrProfilingPlatformGPU.c   | 35 +++++++++++++++++++
 .../Instrumentation/InstrProfiling.cpp        | 13 +++++--
 offload/test/CMakeLists.txt                   |  2 +-
 offload/test/lit.cfg                          | 16 +++++++--
 openmp/device/CMakeLists.txt                  |  1 -
 openmp/device/include/Profiling.h             | 21 -----------
 openmp/device/src/Profiling.cpp               | 18 ----------
 9 files changed, 70 insertions(+), 45 deletions(-)
 create mode 100644 compiler-rt/lib/profile/InstrProfilingPlatformGPU.c
 delete mode 100644 openmp/device/include/Profiling.h
 delete mode 100644 openmp/device/src/Profiling.cpp

diff --git a/compiler-rt/lib/profile/CMakeLists.txt b/compiler-rt/lib/profile/CMakeLists.txt
index 4cc2610cec870..86328b4c13922 100644
--- a/compiler-rt/lib/profile/CMakeLists.txt
+++ b/compiler-rt/lib/profile/CMakeLists.txt
@@ -74,6 +74,7 @@ set(PROFILE_SOURCES
   InstrProfilingPlatformLinux.c
   InstrProfilingPlatformOther.c
   InstrProfilingPlatformWindows.c
+  InstrProfilingPlatformGPU.c
   )
 
 if (NOT COMPILER_RT_PROFILE_BAREMETAL)
diff --git a/compiler-rt/lib/profile/InstrProfiling.h b/compiler-rt/lib/profile/InstrProfiling.h
index 187ef55ef3784..f01cbec44be64 100644
--- a/compiler-rt/lib/profile/InstrProfiling.h
+++ b/compiler-rt/lib/profile/InstrProfiling.h
@@ -166,6 +166,14 @@ void __llvm_profile_instrument_target_value(uint64_t TargetValue, void *Data,
                                             uint32_t CounterIndex,
                                             uint64_t CounterValue);
 
+/*!
+ * \brief Wave-cooperative counter increment for GPU targets.
+ *
+ * Reduces per-lane atomic contention by electing a single lane per wave to
+ * perform the counter update.
+ */
+void __llvm_profile_instrument_gpu(uint64_t *Counter, uint64_t Step);
+
 /*!
  * \brief Write instrumentation data to the current file.
  *
diff --git a/compiler-rt/lib/profile/InstrProfilingPlatformGPU.c b/compiler-rt/lib/profile/InstrProfilingPlatformGPU.c
new file mode 100644
index 0000000000000..6d8dacf030ff2
--- /dev/null
+++ b/compiler-rt/lib/profile/InstrProfilingPlatformGPU.c
@@ -0,0 +1,35 @@
+/*===- InstrProfilingPlatformGPU.c - GPU profiling support ----------------===*\
+|*
+|* Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+|* See https://llvm.org/LICENSE.txt for license information.
+|* SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+|*
+\*===----------------------------------------------------------------------===*/
+
+// GPU-specific profiling functions for AMDGPU and NVPTX targets. This file
+// provides:
+//
+// Platform plumbing (section boundaries, binary IDs, VNodes) are handled by
+// InstrProfilingPlatformLinux.c via the COMPILER_RT_PROFILE_BAREMETAL path.
+
+#if defined(__NVPTX__) || defined(__AMDGPU__)
+
+#include "InstrProfiling.h"
+#include <gpuintrin.h>
+
+// Wave-cooperative counter increment. The instrumentation pass emits calls to
+// this in place of the default non-atomic load/add/store or atomicrmw sequence.
+COMPILER_RT_VISIBILITY void __llvm_profile_instrument_gpu(uint64_t *counter,
+                                                          uint64_t step) {
+  uint64_t mask = __gpu_lane_mask();
+  if (__gpu_is_first_in_lane(mask))
+    __scoped_atomic_fetch_add(counter, step * __builtin_popcountg(mask),
+                              __ATOMIC_RELAXED, __MEMORY_SCOPE_DEVICE);
+}
+
+// InstrProfilingValue.c is excluded from GPU builds but passes may still emit
+// calls to this for memory intrinsics. provide a no-op to prevent link errors.
+COMPILER_RT_VISIBILITY void
+__llvm_profile_instrument_memop(int64_t i, void *ptr, int32_t i2) {}
+
+#endif
diff --git a/llvm/lib/Transforms/Instrumentation/InstrProfiling.cpp b/llvm/lib/Transforms/Instrumentation/InstrProfiling.cpp
index 199b7357fa860..c60426439c910 100644
--- a/llvm/lib/Transforms/Instrumentation/InstrProfiling.cpp
+++ b/llvm/lib/Transforms/Instrumentation/InstrProfiling.cpp
@@ -1192,8 +1192,17 @@ void InstrLowerer::lowerIncrement(InstrProfIncrementInst *Inc) {
   auto *Addr = getCounterAddress(Inc);
 
   IRBuilder<> Builder(Inc);
-  if (Options.Atomic || AtomicCounterUpdateAll ||
-      (Inc->getIndex()->isNullValue() && AtomicFirstCounter)) {
+  if (isGPUProfTarget(M)) {
+    auto *I64Ty = Builder.getInt64Ty();
+    auto *PtrTy = Builder.getPtrTy();
+    auto *CalleeTy = FunctionType::get(Type::getVoidTy(M.getContext()),
+                                       {PtrTy, I64Ty}, false);
+    auto Callee =
+        M.getOrInsertFunction("__llvm_profile_instrument_gpu", CalleeTy);
+    Value *CastAddr = Builder.CreatePointerBitCastOrAddrSpaceCast(Addr, PtrTy);
+    Builder.CreateCall(Callee, {CastAddr, Inc->getStep()});
+  } else if (Options.Atomic || AtomicCounterUpdateAll ||
+             (Inc->getIndex()->isNullValue() && AtomicFirstCounter)) {
     Builder.CreateAtomicRMW(AtomicRMWInst::Add, Addr, Inc->getStep(),
                             MaybeAlign(), AtomicOrdering::Monotonic);
   } else {
diff --git a/offload/test/CMakeLists.txt b/offload/test/CMakeLists.txt
index 711621de9075d..69b4979177f54 100644
--- a/offload/test/CMakeLists.txt
+++ b/offload/test/CMakeLists.txt
@@ -12,7 +12,7 @@ else()
   set(LIBOMPTARGET_DEBUG False)
 endif()
 
-if (NOT OPENMP_STANDALONE_BUILD AND "compiler-rt" IN_LIST LLVM_ENABLE_RUNTIMES)
+if (NOT OPENMP_STANDALONE_BUILD)
   set(LIBOMPTARGET_TEST_GPU_PGO True)
 else()
   set(LIBOMPTARGET_TEST_GPU_PGO False)
diff --git a/offload/test/lit.cfg b/offload/test/lit.cfg
index 2d5d69167109d..d2ecc2524f1db 100644
--- a/offload/test/lit.cfg
+++ b/offload/test/lit.cfg
@@ -2,6 +2,7 @@
 # Configuration file for the 'lit' test runner.
 
 import os
+import glob
 import lit.formats
 
 # Tell pylint that we know config and lit_config exist somewhere.
@@ -133,8 +134,19 @@ if config.libomptarget_has_libc:
 
 profdata_path = os.path.join(config.bin_llvm_tools_dir, "llvm-profdata")
 if config.libomptarget_test_pgo:
-  config.available_features.add('pgo')
-  config.substitutions.append(("%profdata", profdata_path))
+  target = config.libomptarget_current_target
+  for suffix in ['-JIT-LTO', '-LTO']:
+    if target.endswith(suffix):
+      target = target[:-len(suffix)]
+      break
+  has_profile_rt = True
+  if target.startswith('amdgcn') or target.startswith('nvptx'):
+    has_profile_rt = bool(glob.glob(os.path.join(
+        config.llvm_lib_directory, 'clang', '*', 'lib', target,
+        'libclang_rt.profile.a')))
+  if has_profile_rt:
+    config.available_features.add('pgo')
+    config.substitutions.append(("%profdata", profdata_path))
 
 # Determine whether the test system supports unified memory.
 # For CUDA, this is the case with compute capability 70 (Volta) or higher.
diff --git a/openmp/device/CMakeLists.txt b/openmp/device/CMakeLists.txt
index 096a6fe0b6e7e..ff5a64fdd2f0f 100644
--- a/openmp/device/CMakeLists.txt
+++ b/openmp/device/CMakeLists.txt
@@ -16,7 +16,6 @@ set(src_files
   ${CMAKE_CURRENT_SOURCE_DIR}/src/Mapping.cpp
   ${CMAKE_CURRENT_SOURCE_DIR}/src/Misc.cpp
   ${CMAKE_CURRENT_SOURCE_DIR}/src/Parallelism.cpp
-  ${CMAKE_CURRENT_SOURCE_DIR}/src/Profiling.cpp
   ${CMAKE_CURRENT_SOURCE_DIR}/src/Reduction.cpp
   ${CMAKE_CURRENT_SOURCE_DIR}/src/State.cpp
   ${CMAKE_CURRENT_SOURCE_DIR}/src/Synchronization.cpp
diff --git a/openmp/device/include/Profiling.h b/openmp/device/include/Profiling.h
deleted file mode 100644
index d994752254121..0000000000000
--- a/openmp/device/include/Profiling.h
+++ /dev/null
@@ -1,21 +0,0 @@
-//===-------- Profiling.h - OpenMP interface ---------------------- C++ -*-===//
-//
-// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
-// See https://llvm.org/LICENSE.txt for license information.
-// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-//
-//===----------------------------------------------------------------------===//
-//
-//
-//===----------------------------------------------------------------------===//
-
-#ifndef OMPTARGET_DEVICERTL_PROFILING_H
-#define OMPTARGET_DEVICERTL_PROFILING_H
-
-extern "C" {
-void __llvm_profile_register_function(void *Ptr);
-void __llvm_profile_register_names_function(void *Ptr, long int I);
-void __llvm_profile_instrument_memop(long int I, void *Ptr, int I2);
-}
-
-#endif
diff --git a/openmp/device/src/Profiling.cpp b/openmp/device/src/Profiling.cpp
deleted file mode 100644
index df141af5ebeea..0000000000000
--- a/openmp/device/src/Profiling.cpp
+++ /dev/null
@@ -1,18 +0,0 @@
-//===------- Profiling.cpp ---------------------------------------- C++ ---===//
-//
-// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
-// See https://llvm.org/LICENSE.txt for license information.
-// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-//
-//===----------------------------------------------------------------------===//
-
-#include "Profiling.h"
-
-extern "C" {
-
-// Provides empty implementations for certain functions in compiler-rt
-// that are emitted by the PGO instrumentation.
-void __llvm_profile_register_function(void *Ptr) {}
-void __llvm_profile_register_names_function(void *Ptr, long int I) {}
-void __llvm_profile_instrument_memop(long int I, void *Ptr, int I2) {}
-}

From b872ed20770233d1a89af103c7808f3c112e0a36 Mon Sep 17 00:00:00 2001
From: Joseph Huber <[email protected]>
Date: Tue, 10 Mar 2026 16:17:09 -0500
Subject: [PATCH 2/3] [Clang] Correctly link and handle PGO options on the GPU

Summary:
Currently, the GPU targets ignore the standard profiling arguments. This
PR changes the behavior to use the standard handling, which links in the
now-present `libclang_rt.profile.a` if the user built with the
compiler-rt support enabled. If it is not present, this is a linker error,
and we can always suppress it with `-Xarch_host` and `-Xarch_device`.
Hopefully this doesn't cause some people pain if they're used to doing
`-fprofile-generate` on a CPU unguarded, since it was a strange mix of a
no-op and not a no-op on the GPU until now.
---
 clang/lib/Driver/ToolChains/AMDGPU.cpp   | 2 ++
 clang/lib/Driver/ToolChains/Clang.cpp    | 6 +++++-
 clang/lib/Driver/ToolChains/Cuda.cpp     | 2 ++
 clang/test/Driver/amdgpu-toolchain.c     | 4 ++++
 clang/test/Driver/cuda-cross-compiling.c | 5 +++++
 clang/test/Driver/openmp-offload-gpu.c   | 9 +++++++++
 6 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/clang/lib/Driver/ToolChains/AMDGPU.cpp b/clang/lib/Driver/ToolChains/AMDGPU.cpp
index 7bbdb71b1e24f..54fbd86168602 100644
--- a/clang/lib/Driver/ToolChains/AMDGPU.cpp
+++ b/clang/lib/Driver/ToolChains/AMDGPU.cpp
@@ -632,6 +632,8 @@ void amdgpu::Linker::ConstructJob(Compilation &C, const JobAction &JA,
         Args.MakeArgString("-plugin-opt=-mattr=" + llvm::join(Features, ",")));
   }
 
+  getToolChain().addProfileRTLibs(Args, CmdArgs);
+
   if (Args.hasArg(options::OPT_stdlib))
     CmdArgs.append({"-lc", "-lm"});
   if (Args.hasArg(options::OPT_startfiles)) {
diff --git a/clang/lib/Driver/ToolChains/Clang.cpp b/clang/lib/Driver/ToolChains/Clang.cpp
index 3b852528d92c4..5f9edb205a0bb 100644
--- a/clang/lib/Driver/ToolChains/Clang.cpp
+++ b/clang/lib/Driver/ToolChains/Clang.cpp
@@ -9357,7 +9357,11 @@ void LinkerWrapper::ConstructJob(Compilation &C, const JobAction &JA,
       OPT_flto_partitions_EQ,
       OPT_flto_EQ,
       OPT_hipspv_pass_plugin_EQ,
-      OPT_use_spirv_backend};
+      OPT_use_spirv_backend,
+      OPT_fprofile_generate,
+      OPT_fprofile_generate_EQ,
+      OPT_fprofile_instr_generate,
+      OPT_fprofile_instr_generate_EQ};
   const llvm::DenseSet<unsigned> LinkerOptions{OPT_mllvm, OPT_Zlinker_input};
   auto ShouldForwardForToolChain = [&](Arg *A, const ToolChain &TC) {
     // Don't forward -mllvm to toolchains that don't support LLVM.
diff --git a/clang/lib/Driver/ToolChains/Cuda.cpp b/clang/lib/Driver/ToolChains/Cuda.cpp
index e0020176800fd..2ca8886936f6c 100644
--- a/clang/lib/Driver/ToolChains/Cuda.cpp
+++ b/clang/lib/Driver/ToolChains/Cuda.cpp
@@ -643,6 +643,8 @@ void NVPTX::Linker::ConstructJob(Compilation &C, const JobAction &JA,
   llvm::sys::path::append(DefaultLibPath, CLANG_INSTALL_LIBDIR_BASENAME);
   CmdArgs.push_back(Args.MakeArgString(Twine("-L") + DefaultLibPath));
 
+  getToolChain().addProfileRTLibs(Args, CmdArgs);
+
   if (Args.hasArg(options::OPT_stdlib))
     CmdArgs.append({"-lc", "-lm"});
   if (Args.hasArg(options::OPT_startfiles)) {
diff --git a/clang/test/Driver/amdgpu-toolchain.c b/clang/test/Driver/amdgpu-toolchain.c
index 459c1bdac246f..384a7617f8859 100644
--- a/clang/test/Driver/amdgpu-toolchain.c
+++ b/clang/test/Driver/amdgpu-toolchain.c
@@ -46,3 +46,7 @@
 // RUN:  --rocm-device-lib-path=%S/Inputs/rocm/amdgcn/bitcode 2>&1 \
 // RUN: | FileCheck -check-prefix=DEVICE-LIBS %s
 // DEVICE-LIBS: "-mlink-builtin-bitcode" "[[ROCM_PATH:.+]]ockl.bc"
+
+// RUN: %clang -### --target=amdgcn-amd-amdhsa -mcpu=gfx906 -nogpulib \
+// RUN:   -fprofile-generate %s 2>&1 | FileCheck -check-prefixes=PROFILE %s
+// PROFILE: ld.lld{{.*}}libclang_rt.profile.a
diff --git a/clang/test/Driver/cuda-cross-compiling.c b/clang/test/Driver/cuda-cross-compiling.c
index ed2853cae3ccc..10323408a3732 100644
--- a/clang/test/Driver/cuda-cross-compiling.c
+++ b/clang/test/Driver/cuda-cross-compiling.c
@@ -112,3 +112,8 @@
 // RUN:   -nogpulib -nogpuinc -### %s 2>&1 | FileCheck -check-prefix=PATH %s
 
 // PATH: clang-nvlink-wrapper{{.*}}"--cuda-path={{.*}}/Inputs/CUDA/usr/local/cuda"
+
+// RUN: %clang -### --target=nvptx64-nvidia-cuda -march=sm_89 -nogpulib \
+// RUN:   -fprofile-generate %s 2>&1 | FileCheck -check-prefixes=PROFILE %s
+
+// PROFILE: clang-nvlink-wrapper{{.*}}libclang_rt.profile.a
diff --git a/clang/test/Driver/openmp-offload-gpu.c b/clang/test/Driver/openmp-offload-gpu.c
index fb1bc9ffdbbd4..727d2387a99a0 100644
--- a/clang/test/Driver/openmp-offload-gpu.c
+++ b/clang/test/Driver/openmp-offload-gpu.c
@@ -410,3 +410,12 @@
 // RUN:   | FileCheck --check-prefix=SHOULD-EXTRACT %s
 //
 // SHOULD-EXTRACT: clang-linker-wrapper{{.*}}"--should-extract=gfx906"
+
+//
+// Check that `-fprofile-generate` flags are forwarded to link in the runtime.
+//
+// RUN:   %clang -### --target=x86_64-unknown-linux-gnu -fopenmp=libomp \
+// RUN:     --offload-arch=gfx906 -fprofile-generate -nogpulib -nogpuinc %s 2>&1 \
+// RUN:   | FileCheck --check-prefix=PROFILE %s
+//
+// PROFILE: clang-linker-wrapper{{.*}}--device-compiler=amdgcn-amd-amdhsa=-fprofile-generate

From 27a56a2ed14dc1317fd98f3f6f5ad4f74bc7dd16 Mon Sep 17 00:00:00 2001
From: Joseph Huber <[email protected]>
Date: Mon, 16 Mar 2026 17:11:02 -0500
Subject: [PATCH 3/3] [compiler-rt] Rework profile data handling for GPU
 targets

Summary:
Currently, the GPU iterates through all of the present symbols and
copies them by prefix. This is inefficient as it requires a lot of small
high-latency data transfers rather than a few large ones. Additionally,
we force every single profiling symbol to have protected visibility.
This means potentially hundreds of unnecessary symbols in the symbol
table.

This PR changes the interface to move towards the start / stop section
handling. AMDGPU supports this natively as an ELF target, so few changes
are needed. Instead of overriding visibility, we use a single table
to define the bounds that we can obtain with one contiguous load.

Using a table interface should also work for the in-progress HIP
implementation for this, as it wraps the start / stop sections into
standard void pointers which will be inside of an already mapped region
of memory, so they should be accessible from the HIP API.

NVPTX is more difficult as it is an ELF platform without this support. I
have hooked up the 'Other' handling to work around this, but even then
it's a bit of a stretch. I could remove this support here, but I wanted
to demonstrate that we can share the ABI. However, NVPTX will only work
if we force LTO and change the backend to emit variables in the same
section next to each other. Maybe this will be easier if NVIDIA ever
provides a SASS target.

This provides the same output that the tests expect.
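
To make the NVPTX caveat concrete, here is a small C sketch of registration-based bounds tracking in the spirit of the 'Other' handling. The `record` type and the helper names are hypothetical, and the records are drawn from a single array so that the pointer comparisons are well defined in standard C. The point is that a min/max range only matches the registered records when they sit next to each other.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one profile data record. */
typedef struct {
  unsigned long hash;
  unsigned long counter_offset;
} record;

static const record *data_first = NULL;
static const record *data_last = NULL;

/* Forget all registered records. */
static void reset_bounds(void) {
  data_first = NULL;
  data_last = NULL;
}

/* Widen [data_first, data_last) so it covers the newly registered record.
 * All records are assumed to come from the same array. */
static void register_record(const record *r) {
  assert(r != NULL);
  if (!data_first) {
    data_first = r;
    data_last = r + 1;
    return;
  }
  if (r < data_first)
    data_first = r;
  if (r + 1 > data_last)
    data_last = r + 1;
}

/* Number of record-sized slots a host would copy for the tracked range. */
static size_t slots_in_range(void) {
  return (size_t)(data_last - data_first);
}
```

Registering three adjacent records yields a three-slot range, while registering only the first and last entries of an eight-entry pool yields an eight-slot range that drags in six unrelated slots, which is why the backend would need to emit the variables contiguously.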
---
 compiler-rt/include/profile/InstrProfData.inc |  29 +++
 compiler-rt/lib/profile/InstrProfiling.h      |   5 +
 .../lib/profile/InstrProfilingPlatformGPU.c   |  47 ++++-
 .../lib/profile/InstrProfilingPlatformLinux.c |   2 +-
 .../lib/profile/InstrProfilingPlatformOther.c |  57 ++++--
 .../llvm/ProfileData/InstrProfData.inc        |  29 +++
 llvm/lib/ProfileData/InstrProf.cpp            |  11 +-
 .../Instrumentation/InstrProfiling.cpp        |  38 ++--
 .../Instrumentation/PGOInstrumentation.cpp    |   3 -
 .../common/include/GlobalHandler.h            |  14 +-
 .../common/src/GlobalHandler.cpp              | 172 +++++++-----------
 11 files changed, 253 insertions(+), 154 deletions(-)

diff --git a/compiler-rt/include/profile/InstrProfData.inc b/compiler-rt/include/profile/InstrProfData.inc
index 46d6bb5bd8896..f56f3a34510bc 100644
--- a/compiler-rt/include/profile/InstrProfData.inc
+++ b/compiler-rt/include/profile/InstrProfData.inc
@@ -142,6 +142,31 @@ INSTR_PROF_VALUE_NODE(PtrToNodeT, llvm::PointerType::getUnqual(Ctx), Next, \
 #undef INSTR_PROF_VALUE_NODE
 /* INSTR_PROF_VALUE_NODE end. */
 
+/* INSTR_PROF_GPU_SECT start. */
+/* Fields of the GPU profile section bounds structure, populated by the
+ * compiler runtime and read by the host to extract profiling data. */
+#ifndef INSTR_PROF_GPU_SECT
+#define INSTR_PROF_GPU_SECT(Type, LLVMType, Name, Initializer)
+#else
+#define INSTR_PROF_DATA_DEFINED
+#endif
+INSTR_PROF_GPU_SECT(void *, llvm::PointerType::getUnqual(Ctx), NamesStart, \
+                    ConstantPointerNull::get(llvm::PointerType::getUnqual(Ctx)))
+INSTR_PROF_GPU_SECT(void *, llvm::PointerType::getUnqual(Ctx), NamesStop, \
+                    ConstantPointerNull::get(llvm::PointerType::getUnqual(Ctx)))
+INSTR_PROF_GPU_SECT(void *, llvm::PointerType::getUnqual(Ctx), CountersStart, \
+                    ConstantPointerNull::get(llvm::PointerType::getUnqual(Ctx)))
+INSTR_PROF_GPU_SECT(void *, llvm::PointerType::getUnqual(Ctx), CountersStop, \
+                    ConstantPointerNull::get(llvm::PointerType::getUnqual(Ctx)))
+INSTR_PROF_GPU_SECT(void *, llvm::PointerType::getUnqual(Ctx), DataStart, \
+                    ConstantPointerNull::get(llvm::PointerType::getUnqual(Ctx)))
+INSTR_PROF_GPU_SECT(void *, llvm::PointerType::getUnqual(Ctx), DataStop, \
+                    ConstantPointerNull::get(llvm::PointerType::getUnqual(Ctx)))
+INSTR_PROF_GPU_SECT(void *, llvm::PointerType::getUnqual(Ctx), VersionVar, \
+                    ConstantPointerNull::get(llvm::PointerType::getUnqual(Ctx)))
+#undef INSTR_PROF_GPU_SECT
+/* INSTR_PROF_GPU_SECT end. */
+
 /* INSTR_PROF_RAW_HEADER  start */
 /* Definition of member fields of the raw profile header data structure. */
 /* Please update llvm/docs/InstrProfileFormat.rst as appropriate when updating
@@ -761,6 +786,10 @@ serializeValueProfDataFrom(ValueProfRecordClosure *Closure,
  * specified via command line. */
 #define INSTR_PROF_PROFILE_NAME_VAR __llvm_profile_filename
 
+/* GPU profiling section bounds structure, populated by the compiler runtime
+ * and read by the host to extract profiling data. */
+#define INSTR_PROF_SECT_BOUNDS_TABLE __llvm_profile_sections
+
 /* section name strings common to all targets other
    than WIN32 */
 #define INSTR_PROF_DATA_COMMON __llvm_prf_data
diff --git a/compiler-rt/lib/profile/InstrProfiling.h b/compiler-rt/lib/profile/InstrProfiling.h
index f01cbec44be64..53cc10b342d09 100644
--- a/compiler-rt/lib/profile/InstrProfiling.h
+++ b/compiler-rt/lib/profile/InstrProfiling.h
@@ -57,6 +57,11 @@ typedef struct COMPILER_RT_ALIGNAS(INSTR_PROF_DATA_ALIGNMENT) VTableProfData {
 #include "profile/InstrProfData.inc"
 } VTableProfData;
 
+typedef struct __llvm_profile_gpu_sections {
+#define INSTR_PROF_GPU_SECT(Type, LLVMType, Name, Initializer) Type Name;
+#include "profile/InstrProfData.inc"
+} __llvm_profile_gpu_sections;
+
 typedef struct COMPILER_RT_ALIGNAS(INSTR_PROF_DATA_ALIGNMENT)
     __llvm_gcov_init_func_struct {
 #define COVINIT_FUNC(Type, LLVMType, Name, Initializer) Type Name;
diff --git a/compiler-rt/lib/profile/InstrProfilingPlatformGPU.c b/compiler-rt/lib/profile/InstrProfilingPlatformGPU.c
index 6d8dacf030ff2..3100608eeb6d0 100644
--- a/compiler-rt/lib/profile/InstrProfilingPlatformGPU.c
+++ b/compiler-rt/lib/profile/InstrProfilingPlatformGPU.c
@@ -17,6 +17,9 @@
 #include "InstrProfiling.h"
 #include <gpuintrin.h>
 
+// Symbols exported to the GPU runtime need to be visible in the .dynsym table.
+#define COMPILER_RT_GPU_VISIBILITY __attribute__((visibility("protected")))
+
 // Wave-cooperative counter increment. The instrumentation pass emits calls to
 // this in place of the default non-atomic load/add/store or atomicrmw sequence.
 COMPILER_RT_VISIBILITY void __llvm_profile_instrument_gpu(uint64_t *counter,
@@ -27,9 +30,45 @@ COMPILER_RT_VISIBILITY void __llvm_profile_instrument_gpu(uint64_t *counter,
                               __ATOMIC_RELAXED, __MEMORY_SCOPE_DEVICE);
 }
 
-// InstrProfilingValue.c is excluded from GPU builds but passes may still emit
-// calls to this for memory intrinsics. provide a no-op to prevent link errors.
-COMPILER_RT_VISIBILITY void
-__llvm_profile_instrument_memop(int64_t i, void *ptr, int32_t i2) {}
+#if defined(__AMDGPU__)
+
+#define PROF_NAME_START INSTR_PROF_SECT_START(INSTR_PROF_NAME_COMMON)
+#define PROF_NAME_STOP INSTR_PROF_SECT_STOP(INSTR_PROF_NAME_COMMON)
+#define PROF_CNTS_START INSTR_PROF_SECT_START(INSTR_PROF_CNTS_COMMON)
+#define PROF_CNTS_STOP INSTR_PROF_SECT_STOP(INSTR_PROF_CNTS_COMMON)
+#define PROF_DATA_START INSTR_PROF_SECT_START(INSTR_PROF_DATA_COMMON)
+#define PROF_DATA_STOP INSTR_PROF_SECT_STOP(INSTR_PROF_DATA_COMMON)
+
+extern char PROF_NAME_START[] COMPILER_RT_VISIBILITY COMPILER_RT_WEAK;
+extern char PROF_NAME_STOP[] COMPILER_RT_VISIBILITY COMPILER_RT_WEAK;
+extern char PROF_CNTS_START[] COMPILER_RT_VISIBILITY COMPILER_RT_WEAK;
+extern char PROF_CNTS_STOP[] COMPILER_RT_VISIBILITY COMPILER_RT_WEAK;
+extern __llvm_profile_data PROF_DATA_START[] COMPILER_RT_VISIBILITY
+    COMPILER_RT_WEAK;
+extern __llvm_profile_data PROF_DATA_STOP[] COMPILER_RT_VISIBILITY
+    COMPILER_RT_WEAK;
+
+// AMDGPU is a proper ELF target and exports the linker-defined section bounds.
+COMPILER_RT_GPU_VISIBILITY
+__llvm_profile_gpu_sections INSTR_PROF_SECT_BOUNDS_TABLE = {
+    PROF_NAME_START,
+    PROF_NAME_STOP,
+    PROF_CNTS_START,
+    PROF_CNTS_STOP,
+    PROF_DATA_START,
+    PROF_DATA_STOP,
+    &INSTR_PROF_RAW_VERSION_VAR};
+
+#elif defined(__NVPTX__)
+
+// NVPTX supports neither sections nor ELF symbols, we rely on the handling in
+// the 'InstrProfilingPlatformOther.c' file to fill this at initialization time.
+// FIXME: This will not work until we make the NVPTX backend emit section
+//        globals next to each other.
+COMPILER_RT_GPU_VISIBILITY
+__llvm_profile_gpu_sections INSTR_PROF_SECT_BOUNDS_TABLE = {
+    NULL, NULL, NULL, NULL, NULL, NULL, &INSTR_PROF_RAW_VERSION_VAR};
+
+#endif
 
 #endif
diff --git a/compiler-rt/lib/profile/InstrProfilingPlatformLinux.c b/compiler-rt/lib/profile/InstrProfilingPlatformLinux.c
index acdb222004fd4..7a22be6bb5861 100644
--- a/compiler-rt/lib/profile/InstrProfilingPlatformLinux.c
+++ b/compiler-rt/lib/profile/InstrProfilingPlatformLinux.c
@@ -23,7 +23,7 @@
 #if defined(__linux__) || defined(__FreeBSD__) || defined(__Fuchsia__) ||      \
     (defined(__sun__) && defined(__svr4__)) || defined(__NetBSD__) ||          \
     defined(_AIX) || defined(__wasm__) || defined(__HAIKU__) ||                \
-    defined(COMPILER_RT_PROFILE_BAREMETAL)
+    (defined(COMPILER_RT_PROFILE_BAREMETAL) && !defined(__NVPTX__))
 
 #if !defined(_AIX) && !defined(__wasm__) &&                                    \
     !defined(COMPILER_RT_PROFILE_BAREMETAL)
diff --git a/compiler-rt/lib/profile/InstrProfilingPlatformOther.c b/compiler-rt/lib/profile/InstrProfilingPlatformOther.c
index f5d1c74f10115..4b21283279422 100644
--- a/compiler-rt/lib/profile/InstrProfilingPlatformOther.c
+++ b/compiler-rt/lib/profile/InstrProfilingPlatformOther.c
@@ -13,28 +13,40 @@
 // This implementation expects the compiler instrumentation pass to define a
 // constructor in each file which calls into this file.
 
-#if !defined(__APPLE__) && !defined(__linux__) && !defined(__FreeBSD__) &&     \
-    !defined(__Fuchsia__) && !(defined(__sun__) && defined(__svr4__)) &&       \
-    !defined(__NetBSD__) && !defined(_WIN32) && !defined(_AIX) &&              \
-    !defined(__wasm__) && !defined(__HAIKU__) &&                               \
-    !defined(COMPILER_RT_PROFILE_BAREMETAL)
-
-#include <stdlib.h>
-#include <stdio.h>
+#if (!defined(__APPLE__) && !defined(__linux__) && !defined(__FreeBSD__) &&    \
+     !defined(__Fuchsia__) && !(defined(__sun__) && defined(__svr4__)) &&      \
+     !defined(__NetBSD__) && !defined(_WIN32) && !defined(_AIX) &&             \
+     !defined(__wasm__) && !defined(__HAIKU__) &&                              \
+     !defined(COMPILER_RT_PROFILE_BAREMETAL)) ||                               \
+    defined(__NVPTX__)
 
 #include "InstrProfiling.h"
 #include "InstrProfilingInternal.h"
 
+#if defined(__NVPTX__)
+extern __llvm_profile_gpu_sections INSTR_PROF_SECT_BOUNDS_TABLE;
+#define DataFirst                                                              \
+  (*(const __llvm_profile_data **)&INSTR_PROF_SECT_BOUNDS_TABLE.DataStart)
+#define DataLast                                                               \
+  (*(const __llvm_profile_data **)&INSTR_PROF_SECT_BOUNDS_TABLE.DataStop)
+#define NamesFirst (*(const char **)&INSTR_PROF_SECT_BOUNDS_TABLE.NamesStart)
+#define NamesLast (*(const char **)&INSTR_PROF_SECT_BOUNDS_TABLE.NamesStop)
+#define CountersFirst (*(char **)&INSTR_PROF_SECT_BOUNDS_TABLE.CountersStart)
+#define CountersLast (*(char **)&INSTR_PROF_SECT_BOUNDS_TABLE.CountersStop)
+#else
 static const __llvm_profile_data *DataFirst = NULL;
 static const __llvm_profile_data *DataLast = NULL;
-static const VTableProfData *VTableProfDataFirst = NULL;
-static const VTableProfData *VTableProfDataLast = NULL;
 static const char *NamesFirst = NULL;
 static const char *NamesLast = NULL;
-static const char *VNamesFirst = NULL;
-static const char *VNamesLast = NULL;
 static char *CountersFirst = NULL;
 static char *CountersLast = NULL;
+#endif
+static const VTableProfData *VTableProfDataFirst = NULL;
+static const VTableProfData *VTableProfDataLast = NULL;
+static const char *VNamesFirst = NULL;
+static const char *VNamesLast = NULL;
+static char *BitmapFirst = NULL;
+static char *BitmapLast = NULL;
 
 static const void *getMinAddr(const void *A1, const void *A2) {
   return A1 < A2 ? A1 : A2;
@@ -55,6 +67,23 @@ COMPILER_RT_VISIBILITY
 void __llvm_profile_register_function(void *Data_) {
   /* TODO: Only emit this function if we can't use linker magic. */
   const __llvm_profile_data *Data = (__llvm_profile_data *)Data_;
+
+#if defined(__NVPTX__)
+  // NVPTX stores absolute counter/bitmap addresses to avoid circular
+  // dependencies in PTX global variable initializers. Convert to relative
+  // offsets so the host-side profile reader sees the standard format.
+  {
+    uintptr_t Rel = (uintptr_t)Data->CounterPtr - (uintptr_t)Data_;
+    __builtin_memcpy((char *)Data_ +
+                         __builtin_offsetof(__llvm_profile_data, CounterPtr),
+                     &Rel, sizeof(Rel));
+    Rel = (uintptr_t)Data->BitmapPtr - (uintptr_t)Data_;
+    __builtin_memcpy((char *)Data_ +
+                         __builtin_offsetof(__llvm_profile_data, BitmapPtr),
+                     &Rel, sizeof(Rel));
+  }
+#endif
+
   if (!DataFirst) {
     DataFirst = Data;
     DataLast = Data + 1;
@@ -117,9 +146,7 @@ COMPILER_RT_VISIBILITY
 char *__llvm_profile_end_bitmap(void) { return BitmapLast; }
 
 COMPILER_RT_VISIBILITY
-ValueProfNode *__llvm_profile_begin_vnodes(void) {
-  return 0;
-}
+ValueProfNode *__llvm_profile_begin_vnodes(void) { return 0; }
 COMPILER_RT_VISIBILITY
 ValueProfNode *__llvm_profile_end_vnodes(void) { return 0; }
 
diff --git a/llvm/include/llvm/ProfileData/InstrProfData.inc b/llvm/include/llvm/ProfileData/InstrProfData.inc
index 46d6bb5bd8896..f56f3a34510bc 100644
--- a/llvm/include/llvm/ProfileData/InstrProfData.inc
+++ b/llvm/include/llvm/ProfileData/InstrProfData.inc
@@ -142,6 +142,31 @@ INSTR_PROF_VALUE_NODE(PtrToNodeT, llvm::PointerType::getUnqual(Ctx), Next, \
 #undef INSTR_PROF_VALUE_NODE
 /* INSTR_PROF_VALUE_NODE end. */
 
+/* INSTR_PROF_GPU_SECT start. */
+/* Fields of the GPU profile section bounds structure, populated by the
+ * compiler runtime and read by the host to extract profiling data. */
+#ifndef INSTR_PROF_GPU_SECT
+#define INSTR_PROF_GPU_SECT(Type, LLVMType, Name, Initializer)
+#else
+#define INSTR_PROF_DATA_DEFINED
+#endif
+INSTR_PROF_GPU_SECT(void *, llvm::PointerType::getUnqual(Ctx), NamesStart, \
+                    ConstantPointerNull::get(llvm::PointerType::getUnqual(Ctx)))
+INSTR_PROF_GPU_SECT(void *, llvm::PointerType::getUnqual(Ctx), NamesStop, \
+                    ConstantPointerNull::get(llvm::PointerType::getUnqual(Ctx)))
+INSTR_PROF_GPU_SECT(void *, llvm::PointerType::getUnqual(Ctx), CountersStart, \
+                    ConstantPointerNull::get(llvm::PointerType::getUnqual(Ctx)))
+INSTR_PROF_GPU_SECT(void *, llvm::PointerType::getUnqual(Ctx), CountersStop, \
+                    ConstantPointerNull::get(llvm::PointerType::getUnqual(Ctx)))
+INSTR_PROF_GPU_SECT(void *, llvm::PointerType::getUnqual(Ctx), DataStart, \
+                    ConstantPointerNull::get(llvm::PointerType::getUnqual(Ctx)))
+INSTR_PROF_GPU_SECT(void *, llvm::PointerType::getUnqual(Ctx), DataStop, \
+                    ConstantPointerNull::get(llvm::PointerType::getUnqual(Ctx)))
+INSTR_PROF_GPU_SECT(void *, llvm::PointerType::getUnqual(Ctx), VersionVar, \
+                    ConstantPointerNull::get(llvm::PointerType::getUnqual(Ctx)))
+#undef INSTR_PROF_GPU_SECT
+/* INSTR_PROF_GPU_SECT end. */
+
 /* INSTR_PROF_RAW_HEADER  start */
 /* Definition of member fields of the raw profile header data structure. */
 /* Please update llvm/docs/InstrProfileFormat.rst as appropriate when updating
@@ -761,6 +786,10 @@ serializeValueProfDataFrom(ValueProfRecordClosure *Closure,
  * specified via command line. */
 #define INSTR_PROF_PROFILE_NAME_VAR __llvm_profile_filename
 
+/* GPU profiling section bounds structure, populated by the compiler runtime
+ * and read by the host to extract profiling data. */
+#define INSTR_PROF_SECT_BOUNDS_TABLE __llvm_profile_sections
+
 /* section name strings common to all targets other
    than WIN32 */
 #define INSTR_PROF_DATA_COMMON __llvm_prf_data
diff --git a/llvm/lib/ProfileData/InstrProf.cpp b/llvm/lib/ProfileData/InstrProf.cpp
index 82469481881c0..b96db851fa6bd 100644
--- a/llvm/lib/ProfileData/InstrProf.cpp
+++ b/llvm/lib/ProfileData/InstrProf.cpp
@@ -486,25 +486,18 @@ bool isGPUProfTarget(const Module &M) {
 }
 
 void setPGOFuncVisibility(Module &M, GlobalVariable *FuncNameVar) {
-  // If the target is a GPU, make the symbol protected so it can
-  // be read from the host device
-  if (isGPUProfTarget(M))
-    FuncNameVar->setVisibility(GlobalValue::ProtectedVisibility);
   // Hide the symbol so that we correctly get a copy for each executable.
-  else if (!GlobalValue::isLocalLinkage(FuncNameVar->getLinkage()))
+  if (!GlobalValue::isLocalLinkage(FuncNameVar->getLinkage()))
     FuncNameVar->setVisibility(GlobalValue::HiddenVisibility);
 }
 
 GlobalVariable *createPGOFuncNameVar(Module &M,
                                      GlobalValue::LinkageTypes Linkage,
                                      StringRef PGOFuncName) {
-  // Ensure profiling variables on GPU are visible to be read from host
-  if (isGPUProfTarget(M))
-    Linkage = GlobalValue::ExternalLinkage;
   // We generally want to match the function's linkage, but available_externally
   // and extern_weak both have the wrong semantics, and anything that doesn't
   // need to link across compilation units doesn't need to be visible at all.
-  else if (Linkage == GlobalValue::ExternalWeakLinkage)
+  if (Linkage == GlobalValue::ExternalWeakLinkage)
     Linkage = GlobalValue::LinkOnceAnyLinkage;
   else if (Linkage == GlobalValue::AvailableExternallyLinkage)
     Linkage = GlobalValue::LinkOnceODRLinkage;
diff --git a/llvm/lib/Transforms/Instrumentation/InstrProfiling.cpp b/llvm/lib/Transforms/Instrumentation/InstrProfiling.cpp
index c60426439c910..e4a0dbe0ab550 100644
--- a/llvm/lib/Transforms/Instrumentation/InstrProfiling.cpp
+++ b/llvm/lib/Transforms/Instrumentation/InstrProfiling.cpp
@@ -1423,6 +1423,10 @@ static inline Constant *getFuncAddrForProfData(Function *Fn) {
 }
 
 static bool needsRuntimeRegistrationOfSectionRange(const Triple &TT) {
+  // NVPTX is an ELF target but PTX does not expose sections or linker symbols.
+  if (TT.isNVPTX())
+    return true;
+
   // compiler-rt uses linker support to get data/counters/name start/end for
   // ELF, COFF, Mach-O, XCOFF, and Wasm.
   if (TT.isOSBinFormatELF() || TT.isOSBinFormatCOFF() ||
@@ -1813,10 +1817,6 @@ void InstrLowerer::createDataVariable(InstrProfCntrInstBase *Inc) {
   for (uint32_t Kind = IPVK_First; Kind <= IPVK_Last; ++Kind)
     Int16ArrayVals[Kind] = ConstantInt::get(Int16Ty, PD.NumValueSites[Kind]);
 
-  if (isGPUProfTarget(M)) {
-    Linkage = GlobalValue::ExternalLinkage;
-    Visibility = GlobalValue::ProtectedVisibility;
-  }
   // If the data variable is not referenced by code (if we don't emit
   // @llvm.instrprof.value.profile, NS will be 0), and the counter keeps the
   // data variable live under linker GC, the data variable can be private. This
@@ -1828,9 +1828,9 @@ void InstrLowerer::createDataVariable(InstrProfCntrInstBase *Inc) {
   // If profd is in a deduplicate comdat, NS==0 with a hash suffix guarantees
   // that other copies must have the same CFG and cannot have value profiling.
   // If no hash suffix, other profd copies may be referenced by code.
-  else if (NS == 0 && !(DataReferencedByCode && NeedComdat && !Renamed) &&
-           (TT.isOSBinFormatELF() ||
-            (!DataReferencedByCode && TT.isOSBinFormatCOFF()))) {
+  if (NS == 0 && !(DataReferencedByCode && NeedComdat && !Renamed) &&
+      (TT.isOSBinFormatELF() ||
+       (!DataReferencedByCode && TT.isOSBinFormatCOFF()))) {
     Linkage = GlobalValue::PrivateLinkage;
     Visibility = GlobalValue::DefaultVisibility;
   }
@@ -1847,6 +1847,14 @@ void InstrLowerer::createDataVariable(InstrProfCntrInstBase *Inc) {
     RelativeCounterPtr = ConstantExpr::getPtrToInt(CounterPtr, IntPtrTy);
     if (BitmapPtr != nullptr)
       RelativeBitmapPtr = ConstantExpr::getPtrToInt(BitmapPtr, IntPtrTy);
+  } else if (TT.isNVPTX()) {
+    // The NVPTX target cannot handle self-referencing constant expressions in
+    // global initializers at all. Use absolute pointers and have the runtime
+    // registration convert them to relative offsets.
+    DataSectionKind = IPSK_data;
+    RelativeCounterPtr = ConstantExpr::getPtrToInt(CounterPtr, IntPtrTy);
+    if (BitmapPtr != nullptr)
+      RelativeBitmapPtr = ConstantExpr::getPtrToInt(BitmapPtr, IntPtrTy);
   } else {
     // Reference the counter variable with a label difference (link-time
     // constant).
@@ -1951,10 +1959,6 @@ void InstrLowerer::emitNameData() {
   NamesVar = new GlobalVariable(M, NamesVal->getType(), true,
                                 GlobalValue::PrivateLinkage, NamesVal,
                                 getInstrProfNamesVarName());
-  if (isGPUProfTarget(M)) {
-    NamesVar->setLinkage(GlobalValue::ExternalLinkage);
-    NamesVar->setVisibility(GlobalValue::ProtectedVisibility);
-  }
 
   NamesSize = CompressedNameStr.size();
   setGlobalVariableLargeSection(TT, *NamesVar);
@@ -2046,9 +2050,14 @@ void InstrLowerer::emitRegistration() {
 }
 
 bool InstrLowerer::emitRuntimeHook() {
+  // GPU profiling data is read directly by the host offload runtime, not the
+  // standard runtime hook.
+  if (TT.isGPU())
+    return false;
+
   // We expect the linker to be invoked with -u<hook_var> flag for Linux
   // in which case there is no need to emit the external variable.
-  if (TT.isOSLinux() || TT.isOSAIX())
+  if (TT.isOSLinux() || TT.isOSAIX() || TT.isGPU())
     return false;
 
   // If the module's provided its own runtime, we don't need to do anything.
@@ -2060,10 +2069,7 @@ bool InstrLowerer::emitRuntimeHook() {
   auto *Var =
       new GlobalVariable(M, Int32Ty, false, GlobalValue::ExternalLinkage,
                          nullptr, getInstrProfRuntimeHookVarName());
-  if (isGPUProfTarget(M))
-    Var->setVisibility(GlobalValue::ProtectedVisibility);
-  else
-    Var->setVisibility(GlobalValue::HiddenVisibility);
+  Var->setVisibility(GlobalValue::HiddenVisibility);
 
   if (TT.isOSBinFormatELF() && !TT.isPS()) {
     // Mark the user variable as used so that it isn't stripped out.
diff --git a/llvm/lib/Transforms/Instrumentation/PGOInstrumentation.cpp b/llvm/lib/Transforms/Instrumentation/PGOInstrumentation.cpp
index 0232d45e5b7bb..db032d6fcad45 100644
--- a/llvm/lib/Transforms/Instrumentation/PGOInstrumentation.cpp
+++ b/llvm/lib/Transforms/Instrumentation/PGOInstrumentation.cpp
@@ -469,9 +469,6 @@ createIRLevelProfileFlagVar(Module &M,
       M, IntTy64, true, GlobalValue::WeakAnyLinkage,
       Constant::getIntegerValue(IntTy64, APInt(64, ProfileVersion)), VarName);
   IRLevelVersionVariable->setVisibility(GlobalValue::HiddenVisibility);
-  if (isGPUProfTarget(M))
-    IRLevelVersionVariable->setVisibility(
-        llvm::GlobalValue::ProtectedVisibility);
 
   Triple TT(M.getTargetTriple());
   if (TT.supportsCOMDAT()) {
diff --git a/offload/plugins-nextgen/common/include/GlobalHandler.h b/offload/plugins-nextgen/common/include/GlobalHandler.h
index af7dac66ca85d..529d697d355b2 100644
--- a/offload/plugins-nextgen/common/include/GlobalHandler.h
+++ b/offload/plugins-nextgen/common/include/GlobalHandler.h
@@ -65,6 +65,12 @@ struct __llvm_profile_data {
 #include "llvm/ProfileData/InstrProfData.inc"
 };
 
+struct __llvm_profile_gpu_sections {
+#define INSTR_PROF_GPU_SECT(Type, LLVMType, Name, Initializer)                 \
+  std::remove_const<Type>::type Name;
+#include "llvm/ProfileData/InstrProfData.inc"
+};
+
 extern "C" {
 extern int __attribute__((weak)) __llvm_write_custom_profile(
     const char *Target, const __llvm_profile_data *DataBegin,
@@ -72,11 +78,11 @@ extern int __attribute__((weak)) __llvm_write_custom_profile(
     const char *CountersEnd, const char *NamesBegin, const char *NamesEnd,
     const uint64_t *VersionOverride);
 }
-/// PGO profiling data extracted from a GPU device
+/// PGO profiling data extracted from a GPU device via __llvm_profile_sections.
 struct GPUProfGlobals {
-  SmallVector<int64_t> Counts;
-  SmallVector<__llvm_profile_data> Data;
-  SmallVector<uint8_t> NamesData;
+  SmallVector<char> NamesSection;
+  SmallVector<char> CountersSection;
+  SmallVector<char> DataSection;
   Triple TargetTriple;
   uint64_t Version = INSTR_PROF_RAW_VERSION;
 
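`__llvm_profile_gpu_sections` above is generated by including `InstrProfData.inc` with `INSTR_PROF_GPU_SECT` defined. A self-contained sketch of the same X-macro technique follows. The field list is inlined as a hypothetical `GPU_SECT_FIELDS` macro instead of coming from the `.inc` file; that macro, `SectionBounds`, and `NumSectionFields` are assumptions for illustration:

```cpp
#include <cstddef>
#include <type_traits>

// Hypothetical inline copy of the field list; in the patch the entries come
// from including llvm/ProfileData/InstrProfData.inc.
#define GPU_SECT_FIELDS(X)                                                     \
  X(void *, NamesStart)                                                        \
  X(void *, NamesStop)                                                         \
  X(void *, CountersStart)                                                     \
  X(void *, CountersStop)                                                      \
  X(void *, DataStart)                                                         \
  X(void *, DataStop)                                                          \
  X(void *, VersionVar)

// Expand each entry into one struct member.
struct SectionBounds {
#define DECLARE_FIELD(Type, Name) std::remove_const<Type>::type Name;
  GPU_SECT_FIELDS(DECLARE_FIELD)
#undef DECLARE_FIELD
};

// Expand the same list a second time to count the entries.
#define COUNT_FIELD(Type, Name) +1
constexpr std::size_t NumSectionFields = 0 GPU_SECT_FIELDS(COUNT_FIELD);
#undef COUNT_FIELD

// All members are pointers, so the struct has no padding and the host can
// fetch the whole table with one copy of sizeof(SectionBounds) bytes.
static_assert(sizeof(SectionBounds) == NumSectionFields * sizeof(void *),
              "table is a dense array of pointers");
```

Keeping the list in one place means the device-side definition and the host-side struct cannot drift apart when a field is added.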
diff --git a/offload/plugins-nextgen/common/src/GlobalHandler.cpp b/offload/plugins-nextgen/common/src/GlobalHandler.cpp
index b92c606d14da1..7db497d3ee7f5 100644
--- a/offload/plugins-nextgen/common/src/GlobalHandler.cpp
+++ b/offload/plugins-nextgen/common/src/GlobalHandler.cpp
@@ -16,6 +16,7 @@
 
 #include "Shared/Utils.h"
 
+#include "llvm/ProfileData/InstrProf.h"
 #include "llvm/ProfileData/InstrProfData.inc"
 #include "llvm/Support/Error.h"
 
@@ -179,67 +180,61 @@ Error GenericGlobalHandlerTy::readGlobalFromImage(GenericDeviceTy &Device,
 Expected<GPUProfGlobals>
 GenericGlobalHandlerTy::readProfilingGlobals(GenericDeviceTy &Device,
                                              DeviceImageTy &Image) {
-  GPUProfGlobals DeviceProfileData;
+  const char *TableName = INSTR_PROF_QUOTE(INSTR_PROF_SECT_BOUNDS_TABLE);
+  if (!isSymbolInImage(Device, Image, TableName))
+    return GPUProfGlobals{};
+
+  GPUProfGlobals ProfData;
   auto ObjFile = getELFObjectFile(Image);
   if (!ObjFile)
     return ObjFile.takeError();
 
   std::unique_ptr<ELFObjectFileBase> ELFObj(
       static_cast<ELFObjectFileBase *>(ObjFile->release()));
-  DeviceProfileData.TargetTriple = ELFObj->makeTriple();
-
-  // Iterate through elf symbols
-  for (auto &Sym : ELFObj->symbols()) {
-    auto NameOrErr = Sym.getName();
-    if (!NameOrErr)
-      return NameOrErr.takeError();
-
-    // Check if given current global is a profiling global based
-    // on name
-    if (*NameOrErr == getInstrProfNamesVarName()) {
-      // Read in profiled function names from ELF
-      auto SectionOrErr = Sym.getSection();
-      if (!SectionOrErr)
-        return SectionOrErr.takeError();
-
-      auto ContentsOrErr = (*SectionOrErr)->getContents();
-      if (!ContentsOrErr)
-        return ContentsOrErr.takeError();
-
-      SmallVector<uint8_t> NameBytes(ContentsOrErr->bytes());
-      DeviceProfileData.NamesData = NameBytes;
-    } else if (NameOrErr->starts_with(getInstrProfCountersVarPrefix())) {
-      // Read global variable profiling counts
-      SmallVector<int64_t> Counts(Sym.getSize() / sizeof(int64_t), 0);
-      GlobalTy CountGlobal(NameOrErr->str(), Sym.getSize(), Counts.data());
-      if (auto Err = readGlobalFromDevice(Device, Image, CountGlobal))
-        return Err;
-      DeviceProfileData.Counts.append(std::move(Counts));
-    } else if (NameOrErr->starts_with(getInstrProfDataVarPrefix())) {
-      // Read profiling data for this global variable
-      __llvm_profile_data Data{};
-      GlobalTy DataGlobal(NameOrErr->str(), Sym.getSize(), &Data);
-      if (auto Err = readGlobalFromDevice(Device, Image, DataGlobal))
-        return Err;
-      DeviceProfileData.Data.push_back(std::move(Data));
-    } else if (*NameOrErr == INSTR_PROF_QUOTE(INSTR_PROF_RAW_VERSION_VAR)) {
-      uint64_t RawVersionData;
-      GlobalTy RawVersionGlobal(NameOrErr->str(), Sym.getSize(),
-                                &RawVersionData);
-      if (auto Err = readGlobalFromDevice(Device, Image, RawVersionGlobal))
-        return Err;
-      DeviceProfileData.Version = RawVersionData;
-    }
-  }
-  return DeviceProfileData;
+  ProfData.TargetTriple = ELFObj->makeTriple();
+
+  __llvm_profile_gpu_sections Table = {};
+  GlobalTy TableGlobal(TableName, sizeof(Table), &Table);
+  if (auto Err = readGlobalFromDevice(Device, Image, TableGlobal))
+    return Err;
+
+  // Read the contiguous data from one of the profiling sections on the device.
+  auto ReadSection = [&](void *Start, void *Stop,
+                         SmallVector<char> &Out) -> Error {
+    uintptr_t Begin = reinterpret_cast<uintptr_t>(Start);
+    uintptr_t End = reinterpret_cast<uintptr_t>(Stop);
+    size_t Size = End - Begin;
+    Out.resize_for_overwrite(Size);
+    return Device.dataRetrieve(Out.data(), Start, Size, /*AsyncInfo=*/nullptr);
+  };
+
+  if (auto Err =
+          ReadSection(Table.NamesStart, Table.NamesStop, ProfData.NamesSection))
+    return Err;
+  if (auto Err = ReadSection(Table.CountersStart, Table.CountersStop,
+                             ProfData.CountersSection))
+    return Err;
+  if (auto Err =
+          ReadSection(Table.DataStart, Table.DataStop, ProfData.DataSection))
+    return Err;
+
+  // Get the profiling version from the device.
+  if (auto Err = Device.dataRetrieve(&ProfData.Version, Table.VersionVar,
+                                     sizeof(uint64_t),
+                                     /*AsyncInfo=*/nullptr))
+    return Err;
+
+  return ProfData;
 }
 
 void GPUProfGlobals::dump() const {
   outs() << "======= GPU Profile =======\nTarget: " << TargetTriple.str()
          << "\n";
 
-  outs() << "======== Counters =========\n";
-  for (size_t i = 0; i < Counts.size(); i++) {
+  size_t NumCounters = CountersSection.size() / sizeof(int64_t);
+  outs() << "======== Counters (" << NumCounters << ") =========\n";
+  auto *Counts = reinterpret_cast<const int64_t *>(CountersSection.data());
+  for (size_t i = 0; i < NumCounters; i++) {
     if (i > 0 && i % 10 == 0)
       outs() << "\n";
     else if (i != 0)
@@ -248,33 +243,14 @@ void GPUProfGlobals::dump() const {
   }
   outs() << "\n";
 
-  outs() << "========== Data ===========\n";
-  for (const auto &ProfData : Data) {
-    outs() << "{ ";
-// The ProfData.Name maybe array, eg: NumValueSites[IPVK_Last+1] .
-// If we print out it directly, we are accessing out of bound data.
-// Skip dumping the array for now.
-#define INSTR_PROF_DATA(Type, LLVMType, Name, Initializer)                     \
-  if (sizeof(#Name) > 2 && #Name[sizeof(#Name) - 2] == ']') {                  \
-    outs() << "[...] ";                                                        \
-  } else {                                                                     \
-    outs() << ProfData.Name << " ";                                            \
-  }
-#include "llvm/ProfileData/InstrProfData.inc"
-    outs() << "}\n";
-  }
+  size_t NumDataEntries = DataSection.size() / sizeof(__llvm_profile_data);
+  outs() << "========== Data (" << NumDataEntries << ") ===========\n";
 
   outs() << "======== Functions ========\n";
-  std::string s;
-  s.reserve(NamesData.size());
-  for (uint8_t Name : NamesData) {
-    s.push_back((char)Name);
-  }
-
   InstrProfSymtab Symtab;
-  if (Error Err = Symtab.create(StringRef(s))) {
+  if (Error Err =
+          Symtab.create(StringRef(NamesSection.data(), NamesSection.size())))
     consumeError(std::move(Err));
-  }
   Symtab.dumpNames(outs());
   outs() << "===========================\n";
 }
@@ -286,35 +262,27 @@ Error GPUProfGlobals::write() const {
                          "The compiler-rt profiling library must be linked for "
                          "GPU PGO to work.");
 
-  size_t DataSize = Data.size() * sizeof(__llvm_profile_data),
-         CountsSize = Counts.size() * sizeof(int64_t);
-  __llvm_profile_data *DataBegin, *DataEnd;
-  char *CountersBegin, *CountersEnd, *NamesBegin, *NamesEnd;
-
-  // Initialize array of contiguous data. We need to make sure each section is
-  // contiguous so that the PGO library can compute deltas properly
-  SmallVector<uint8_t> ContiguousData(NamesData.size() + DataSize + CountsSize);
-
-  // Compute region pointers
-  DataBegin = (__llvm_profile_data *)(ContiguousData.data() + CountsSize);
-  DataEnd =
-      (__llvm_profile_data *)(ContiguousData.data() + CountsSize + DataSize);
-  CountersBegin = (char *)ContiguousData.data();
-  CountersEnd = (char *)(ContiguousData.data() + CountsSize);
-  NamesBegin = (char *)(ContiguousData.data() + CountsSize + DataSize);
-  NamesEnd = (char *)(ContiguousData.data() + CountsSize + DataSize +
-                      NamesData.size());
-
-  // Copy data to contiguous buffer
-  memcpy(DataBegin, Data.data(), DataSize);
-  memcpy(CountersBegin, Counts.data(), CountsSize);
-  memcpy(NamesBegin, NamesData.data(), NamesData.size());
-
-  // Invoke compiler-rt entrypoint
-  int result = __llvm_write_custom_profile(
-      TargetTriple.str().c_str(), DataBegin, DataEnd, CountersBegin,
-      CountersEnd, NamesBegin, NamesEnd, &Version);
-  if (result != 0)
+  // The sections must be laid out contiguously so that lprofWriteDataImpl
+  // computes the correct CountersDelta from the pointer arithmetic.
+  // TODO: Move this interface to compiler-rt.
+  SmallVector<char> Buffer(CountersSection.size() + DataSection.size() +
+                           NamesSection.size());
+  char *CountersBegin = Buffer.data();
+  char *DataBegin = CountersBegin + CountersSection.size();
+  char *NamesBegin = DataBegin + DataSection.size();
+
+  memcpy(CountersBegin, CountersSection.data(), CountersSection.size());
+  memcpy(DataBegin, DataSection.data(), DataSection.size());
+  memcpy(NamesBegin, NamesSection.data(), NamesSection.size());
+
+  int Result = __llvm_write_custom_profile(
+      TargetTriple.str().c_str(),
+      reinterpret_cast<const __llvm_profile_data *>(DataBegin),
+      reinterpret_cast<const __llvm_profile_data *>(DataBegin +
+                                                    DataSection.size()),
+      CountersBegin, CountersBegin + CountersSection.size(), NamesBegin,
+      NamesBegin + NamesSection.size(), &Version);
+  if (Result != 0)
     return Plugin::error(ErrorCode::HOST_IO,
                          "error writing GPU PGO data to file");
 
@@ -322,5 +290,5 @@ Error GPUProfGlobals::write() const {
 }
 
 bool GPUProfGlobals::empty() const {
-  return Counts.empty() && Data.empty() && NamesData.empty();
+  return CountersSection.empty() && DataSection.empty() && NamesSection.empty();
 }

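The new `readProfilingGlobals` path above fetches the bounds table once and then issues one copy per section. A host-only sketch of that flow follows. A hypothetical `deviceCopy` function stands in for `Device.dataRetrieve`, and ordinary host arrays stand in for device memory. All names here are assumptions for illustration, not the offload plugin API, and the inverted-bounds check is an addition that the patch does not contain:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical bounds table with the same seven fields as the patch.
struct SectionBounds {
  void *NamesStart, *NamesStop;
  void *CountersStart, *CountersStop;
  void *DataStart, *DataStop;
  void *VersionVar;
};

// Host-side copies of the three sections plus the version word.
struct ProfileSections {
  std::vector<char> Names, Counters, Data;
  uint64_t Version = 0;
};

// Stand-in for a device-to-host transfer; returns false on failure.
inline bool deviceCopy(void *Dst, const void *Src, std::size_t Size) {
  if (Size != 0)
    std::memcpy(Dst, Src, Size);
  return true;
}

// Copy the half-open range [Start, Stop) with a single transfer.
inline bool readSection(const void *Start, const void *Stop,
                        std::vector<char> &Out) {
  uintptr_t Begin = reinterpret_cast<uintptr_t>(Start);
  uintptr_t End = reinterpret_cast<uintptr_t>(Stop);
  if (End < Begin) // Reject inverted bounds (not part of the patch).
    return false;
  Out.resize(End - Begin);
  return deviceCopy(Out.data(), Start, Out.size());
}

// The table itself is assumed to have been fetched already; this performs
// one transfer per section plus one for the version word.
inline bool readProfile(const SectionBounds &Table, ProfileSections &Out) {
  return readSection(Table.NamesStart, Table.NamesStop, Out.Names) &&
         readSection(Table.CountersStart, Table.CountersStop, Out.Counters) &&
         readSection(Table.DataStart, Table.DataStop, Out.Data) &&
         deviceCopy(&Out.Version, Table.VersionVar, sizeof(Out.Version));
}
```

With this shape the number of transfers is fixed at five (table, three sections, version) regardless of how many functions were instrumented, which is the point of the change.
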
_______________________________________________
cfe-commits mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
