llvmbot wrote:

<!--LLVM PR SUMMARY COMMENT-->
@llvm/pr-subscribers-clang-driver

@llvm/pr-subscribers-backend-amdgpu

Author: Yaxun (Sam) Liu (yxsamliu)

<details>
<summary>Changes</summary>

This patch adds device-side Profile Guided Optimization (PGO) support
for HIP/AMDGPU, enabling profile-guided compiler optimizations for GPU
kernels.

Key features:
- Wave-aggregated counter increments to reduce atomic contention
- Per-TU contiguous counter allocation to avoid linker reordering issues
- Uniformity detection to identify wave-uniform vs divergent branches
- Uniformity-aware spill placement to prevent PGO regressions on GPUs

The uniformity detection is critical because standard PGO can cause
severe performance regressions on GPUs. When PGO moves register spills
to "cold" paths, but those paths are entered divergently (different
threads take different paths), partial-wave memory accesses cause poor
coalescing and up to 3.7x slowdown. By detecting uniformity at profile
collection time and gating spill placement decisions, we achieve:
- 12-14% speedup on uniform branches
- No regression on divergent branches (gating prevents the issue)

Implementation spans:
- LLVM instrumentation (InstrProfiling.cpp): AMDGPU-specific lowering,
  contiguous counter arrays, uniform counter instrumentation
- Profile format (InstrProfData.inc): NumOffloadProfilingThreads field,
  UniformityBits in indexed profile (Version 14)
- Profile reader/writer: Handle expanded counters, uniformity bits
- compiler-rt (InstrProfilingPlatformROCm.c): Device profile collection
  via HIP APIs, .unifcnts file for uniform counters
- Clang driver (HIPAMD.cpp): Profile filename rewriting for device
- SpillPlacement.cpp: Frequency flattening for divergent blocks
- llvm-profdata: Read .unifcnts, compute uniformity during merge

Tested with lit tests and end-to-end benchmarks on gfx1100.

---

Patch is 158.20 KiB, truncated to 20.00 KiB below, full version: 
https://github.com/llvm/llvm-project/pull/177665.diff


30 Files Affected:

- (modified) clang/docs/UsersManual.rst (+55) 
- (modified) clang/lib/Driver/ToolChains/HIPAMD.cpp (+14) 
- (modified) compiler-rt/include/profile/InstrProfData.inc (+6-2) 
- (modified) compiler-rt/lib/profile/CMakeLists.txt (+1) 
- (modified) compiler-rt/lib/profile/InstrProfiling.h (+20) 
- (modified) compiler-rt/lib/profile/InstrProfilingFile.c (+2) 
- (modified) compiler-rt/lib/profile/InstrProfilingInternal.h (+7) 
- (added) compiler-rt/lib/profile/InstrProfilingPlatformROCm.c (+702) 
- (modified) llvm/docs/CommandGuide/llvm-profdata.rst (+11) 
- (modified) llvm/docs/InstrProfileFormat.rst (+73) 
- (modified) llvm/include/llvm/ProfileData/InstrProf.h (+46-5) 
- (modified) llvm/include/llvm/ProfileData/InstrProfData.inc (+12-3) 
- (modified) llvm/include/llvm/ProfileData/InstrProfWriter.h (+10) 
- (modified) llvm/include/llvm/Transforms/Instrumentation/CFGMST.h (+73-52) 
- (modified) llvm/lib/CodeGen/SpillPlacement.cpp (+49-1) 
- (modified) llvm/lib/Passes/StandardInstrumentations.cpp (+8-2) 
- (modified) llvm/lib/ProfileData/InstrProf.cpp (+113-26) 
- (modified) llvm/lib/ProfileData/InstrProfReader.cpp (+98-9) 
- (modified) llvm/lib/ProfileData/InstrProfWriter.cpp (+61-11) 
- (modified) llvm/lib/Transforms/Instrumentation/InstrProfiling.cpp (+1023-6) 
- (modified) llvm/lib/Transforms/Instrumentation/PGOInstrumentation.cpp (+62-8) 
- (added) 
llvm/test/Instrumentation/InstrProfiling/amdgpu-contiguous-counters.ll (+41) 
- (added) llvm/test/Instrumentation/InstrProfiling/amdgpu-uniform-counters.ll 
(+31) 
- (modified) llvm/test/Instrumentation/InstrProfiling/coverage.ll (+4-4) 
- (modified) llvm/test/Instrumentation/InstrProfiling/inline-data-var-create.ll 
(+12-11) 
- (modified) llvm/test/Instrumentation/InstrProfiling/platform.ll (+16) 
- (modified) llvm/test/Transforms/PGOProfile/comdat_internal.ll (+2-2) 
- (modified) llvm/test/Transforms/PGOProfile/instrprof_burst_sampling_fast.ll 
(+1-1) 
- (modified) llvm/test/tools/llvm-profdata/profile-version.test (+1-1) 
- (modified) llvm/tools/llvm-profdata/llvm-profdata.cpp (+120-5) 


``````````diff
diff --git a/clang/docs/UsersManual.rst b/clang/docs/UsersManual.rst
index c624efb26f67d..cea46f5719066 100644
--- a/clang/docs/UsersManual.rst
+++ b/clang/docs/UsersManual.rst
@@ -3500,6 +3500,61 @@ generation and profile use (which can lead to discarded 
counters in such
 functions). Using these APIs in the program's cold regions introduces less
 overhead and leads to more optimized code.
 
+GPU/Device Profiling
+^^^^^^^^^^^^^^^^^^^^
+
+Clang supports profile-guided optimization for GPU device code when targeting
+AMD GPUs with HIP. The same ``-fprofile-generate`` and ``-fprofile-use`` flags
+work for both host and device code.
+
+The device PGO workflow differs from CPU PGO in several ways:
+
+1. **Profile collection**: Device counters are stored in GPU memory and
+   transferred to host memory at program exit by the compiler-rt profiling
+   runtime. The runtime uses HIP APIs (``hipGetSymbolAddress``, ``hipMemcpy``)
+   to read device profile data.
+
+2. **Counter expansion**: GPU kernels execute with massive parallelism
+   (thousands of concurrent threads). To reduce atomic contention, each
+   counter is expanded into multiple slots. Threads hash their position
+   to select a slot, and ``llvm-profdata merge`` sums the slots.
+
+3. **Wave uniformity**: GPU threads execute in groups called "waves" (or
+   "warps"). When all threads in a wave take the same branch, the branch
+   is "uniform". The ``--wave-size=N`` option to ``llvm-profdata merge``
+   computes uniformity information that helps the compiler make better
+   optimization decisions for GPU code.
+
+Here is an example workflow for HIP device PGO:
+
+.. code-block:: console
+
+   # Step 1: Build instrumented version
+   $ hipcc -O2 -fprofile-generate kernel.hip -o kernel
+
+   # Step 2: Run to collect profiles (device + host)
+   $ ./kernel
+   # Creates default.profraw with both host and device profiles
+
+   # Step 3: Merge profiles with wave size for uniformity detection
+   $ llvm-profdata merge --wave-size=32 -o kernel.profdata default.profraw
+
+   # Step 4: Build optimized version
+   $ hipcc -O2 -fprofile-use=kernel.profdata kernel.hip -o kernel_optimized
+
+.. note::
+
+   Device PGO requires a ROCm installation with HIP runtime support.
+   The ``--wave-size`` should match the target GPU's wave size (32 for
+   RDNA/gfx10+, 64 for GCN/gfx9).
+
+**Limitations**
+
+- Device PGO is currently supported only for AMDGPU targets
+- Sampling-based profiling is not supported for device code
+- Context-sensitive profiling (``-fcs-profile-generate``) is not supported
+  for device code
+
 Disabling Instrumentation
 ^^^^^^^^^^^^^^^^^^^^^^^^^
 
diff --git a/clang/lib/Driver/ToolChains/HIPAMD.cpp 
b/clang/lib/Driver/ToolChains/HIPAMD.cpp
index 617809285c165..d487e05909f59 100644
--- a/clang/lib/Driver/ToolChains/HIPAMD.cpp
+++ b/clang/lib/Driver/ToolChains/HIPAMD.cpp
@@ -303,6 +303,20 @@ HIPAMDToolChain::TranslateArgs(const 
llvm::opt::DerivedArgList &Args,
   const OptTable &Opts = getDriver().getOpts();
 
   for (Arg *A : Args) {
+    // Handle device-side profile data file for PGO
+    if (A->getOption().matches(options::OPT_fprofile_use_EQ)) {
+      StringRef ProfileFile = A->getValue();
+      std::string DeviceProfileFile = std::string(ProfileFile);
+      const char *Extension = strrchr(ProfileFile.data(), '.');
+      if (Extension) {
+        size_t BaseLen = Extension - ProfileFile.data();
+        DeviceProfileFile.insert(BaseLen, ".amdgcn-amd-amdhsa");
+      }
+      DAL->AddJoinedArg(A, Opts.getOption(options::OPT_fprofile_instr_use_EQ),
+                        DeviceProfileFile);
+      A->claim();
+      continue;
+    }
     // Filter unsupported sanitizers passed from the HostTC.
     if (!handleSanitizeOption(*this, *DAL, Args, BoundArch, A))
       DAL->append(A);
diff --git a/compiler-rt/include/profile/InstrProfData.inc 
b/compiler-rt/include/profile/InstrProfData.inc
index 46d6bb5bd8896..fffe5a26b1cb9 100644
--- a/compiler-rt/include/profile/InstrProfData.inc
+++ b/compiler-rt/include/profile/InstrProfData.inc
@@ -90,8 +90,12 @@ INSTR_PROF_DATA(IntPtrT, llvm::PointerType::getUnqual(Ctx), 
Values, \
 INSTR_PROF_DATA(const uint32_t, llvm::Type::getInt32Ty(Ctx), NumCounters, \
                 ConstantInt::get(llvm::Type::getInt32Ty(Ctx), NumCounters))
 INSTR_PROF_DATA(const uint16_t, Int16ArrayTy, NumValueSites[IPVK_Last+1], \
-                ConstantArray::get(Int16ArrayTy, Int16ArrayVals)) \
-INSTR_PROF_DATA(const uint32_t, llvm::Type::getInt32Ty(Ctx), NumBitmapBytes, \
+                ConstantArray::get(Int16ArrayTy, Int16ArrayVals))
+INSTR_PROF_DATA(const uint16_t, llvm::Type::getInt16Ty(Ctx),
+                NumOffloadProfilingThreads,
+                ConstantInt::get(llvm::Type::getInt16Ty(Ctx),
+                                 NumOffloadProfilingThreadsVal))
+INSTR_PROF_DATA(const uint32_t, llvm::Type::getInt32Ty(Ctx), NumBitmapBytes,
                 ConstantInt::get(llvm::Type::getInt32Ty(Ctx), NumBitmapBytes))
 #undef INSTR_PROF_DATA
 /* INSTR_PROF_DATA end. */
diff --git a/compiler-rt/lib/profile/CMakeLists.txt 
b/compiler-rt/lib/profile/CMakeLists.txt
index 7c8473cc5f200..d4f64dcb2c6c7 100644
--- a/compiler-rt/lib/profile/CMakeLists.txt
+++ b/compiler-rt/lib/profile/CMakeLists.txt
@@ -73,6 +73,7 @@ set(PROFILE_SOURCES
   InstrProfilingPlatformFuchsia.c
   InstrProfilingPlatformLinux.c
   InstrProfilingPlatformOther.c
+  InstrProfilingPlatformROCm.c
   InstrProfilingPlatformWindows.c
   )
 
diff --git a/compiler-rt/lib/profile/InstrProfiling.h 
b/compiler-rt/lib/profile/InstrProfiling.h
index 187ef55ef3784..e604df72d2044 100644
--- a/compiler-rt/lib/profile/InstrProfiling.h
+++ b/compiler-rt/lib/profile/InstrProfiling.h
@@ -349,4 +349,24 @@ extern char INSTR_PROF_PROFILE_NAME_VAR[1]; /* 
__llvm_profile_filename. */
 
 const __llvm_gcov_init_func_struct *__llvm_profile_begin_covinit();
 const __llvm_gcov_init_func_struct *__llvm_profile_end_covinit();
+
+/* A struct to hold the device pointers and sizes for the profile sections. */
+typedef struct HIPProfileSectionInfo {
+  void *CountersBegin;
+  size_t CountersSize;
+  void *DataBegin;
+  size_t DataSize;
+  void *NamesBegin;
+  size_t NamesSize;
+} HIPProfileSectionInfo;
+
+/*!
+ * \brief Register a HIP module's device-side profile data sections.
+ *
+ * This function is called by the host-side instrumentation code to provide
+ * the runtime with the necessary information to collect profile data from
+ * the device.
+ */
+void __llvm_profile_hip_register_module(HIPProfileSectionInfo *Info);
+
 #endif /* PROFILE_INSTRPROFILING_H_ */
diff --git a/compiler-rt/lib/profile/InstrProfilingFile.c 
b/compiler-rt/lib/profile/InstrProfilingFile.c
index 71127b05aafb8..aa9d567a1d17f 100644
--- a/compiler-rt/lib/profile/InstrProfilingFile.c
+++ b/compiler-rt/lib/profile/InstrProfilingFile.c
@@ -1198,6 +1198,8 @@ int __llvm_profile_write_file(void) {
   if (rc)
     PROF_ERR("Failed to write file \"%s\": %s\n", Filename, strerror(errno));
 
+  __llvm_profile_hip_collect_device_data();
+
   // Restore SIGKILL.
   if (PDeathSig == 1)
     lprofRestoreSigKill();
diff --git a/compiler-rt/lib/profile/InstrProfilingInternal.h 
b/compiler-rt/lib/profile/InstrProfilingInternal.h
index 5647782527eb7..be6d2627dd100 100644
--- a/compiler-rt/lib/profile/InstrProfilingInternal.h
+++ b/compiler-rt/lib/profile/InstrProfilingInternal.h
@@ -212,5 +212,12 @@ int __llvm_write_binary_ids(ProfDataWriter *Writer);
 int lprofWriteOneBinaryId(ProfDataWriter *Writer, uint64_t BinaryIdLen,
                           const uint8_t *BinaryIdData,
                           uint64_t BinaryIdPadding);
+#ifdef __cplusplus
+extern "C" {
+#endif
+COMPILER_RT_VISIBILITY int __llvm_profile_hip_collect_device_data(void);
+#ifdef __cplusplus
+}
+#endif
 
 #endif
diff --git a/compiler-rt/lib/profile/InstrProfilingPlatformROCm.c 
b/compiler-rt/lib/profile/InstrProfilingPlatformROCm.c
new file mode 100644
index 0000000000000..9b429cf8e8b22
--- /dev/null
+++ b/compiler-rt/lib/profile/InstrProfilingPlatformROCm.c
@@ -0,0 +1,702 @@
+//===- InstrProfilingPlatformROCm.c - Profile data ROCm platform ---------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM 
Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#include "InstrProfiling.h"
+#include "InstrProfilingInternal.h"
+#include "InstrProfilingPort.h"
+#include <dlfcn.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+static int ProcessDeviceOffloadPrf(void *DeviceOffloadPrf);
+
+static int IsVerboseMode() {
+  static int IsVerbose = -1;
+  if (IsVerbose == -1) {
+    if (getenv("LLVM_PROFILE_VERBOSE"))
+      IsVerbose = 1;
+    else
+      IsVerbose = 0;
+  }
+  return IsVerbose;
+}
+
+/* -------------------------------------------------------------------------- 
*/
+/*  Dynamic loading of HIP runtime symbols                                   */
+/* -------------------------------------------------------------------------- 
*/
+
+typedef int (*hipMemcpyFromSymbolTy)(void *, const void *, size_t, size_t, 
int);
+typedef int (*hipGetSymbolAddressTy)(void **, const void *);
+typedef int (*hipMemcpyTy)(void *, void *, size_t, int);
+typedef int (*hipModuleGetGlobalTy)(void **, size_t *, void *, const char *);
+
+static hipMemcpyFromSymbolTy pHipMemcpyFromSymbol = NULL;
+static hipGetSymbolAddressTy pHipGetSymbolAddress = NULL;
+static hipMemcpyTy pHipMemcpy = NULL;
+static hipModuleGetGlobalTy pHipModuleGetGlobal = NULL;
+
+/* -------------------------------------------------------------------------- 
*/
+/*  HSA types and function pointers for direct memory copies                  
*/
+/*  This bypasses CLR's memory tracking, allowing copies from any device ptr  
*/
+/* -------------------------------------------------------------------------- 
*/
+
+typedef uint32_t hsa_status_t;
+typedef struct {
+  uint64_t handle;
+} hsa_agent_t;
+typedef struct {
+  uint64_t handle;
+} hsa_signal_t;
+
+#define HSA_STATUS_SUCCESS 0
+#define HSA_AGENT_INFO_NAME 0
+#define HSA_AGENT_INFO_DEVICE 17
+#define HSA_DEVICE_TYPE_GPU 1
+#define HSA_SIGNAL_CONDITION_LT 0
+
+typedef hsa_status_t (*hsa_init_ty)(void);
+typedef hsa_status_t (*hsa_iterate_agents_ty)(hsa_status_t (*)(hsa_agent_t,
+                                                               void *),
+                                              void *);
+typedef hsa_status_t (*hsa_agent_get_info_ty)(hsa_agent_t, uint32_t, void *);
+typedef hsa_status_t (*hsa_signal_create_ty)(int64_t, uint32_t,
+                                             const hsa_agent_t *,
+                                             hsa_signal_t *);
+typedef hsa_status_t (*hsa_signal_destroy_ty)(hsa_signal_t);
+typedef void (*hsa_signal_store_relaxed_ty)(hsa_signal_t, int64_t);
+typedef int64_t (*hsa_signal_wait_scacquire_ty)(hsa_signal_t, uint32_t, 
int64_t,
+                                                uint64_t, uint32_t);
+typedef hsa_status_t (*hsa_amd_memory_lock_ty)(void *, size_t, hsa_agent_t *,
+                                               int, void **);
+typedef hsa_status_t (*hsa_amd_memory_unlock_ty)(void *);
+typedef hsa_status_t (*hsa_amd_memory_async_copy_ty)(void *, hsa_agent_t,
+                                                     const void *, hsa_agent_t,
+                                                     size_t, uint32_t,
+                                                     const hsa_signal_t *,
+                                                     hsa_signal_t);
+
+static hsa_init_ty pHsaInit = NULL;
+static hsa_iterate_agents_ty pHsaIterateAgents = NULL;
+static hsa_agent_get_info_ty pHsaAgentGetInfo = NULL;
+static hsa_signal_create_ty pHsaSignalCreate = NULL;
+static hsa_signal_destroy_ty pHsaSignalDestroy = NULL;
+static hsa_signal_store_relaxed_ty pHsaSignalStoreRelaxed = NULL;
+static hsa_signal_wait_scacquire_ty pHsaSignalWaitScacquire = NULL;
+static hsa_amd_memory_lock_ty pHsaAmdMemoryLock = NULL;
+static hsa_amd_memory_unlock_ty pHsaAmdMemoryUnlock = NULL;
+static hsa_amd_memory_async_copy_ty pHsaAmdMemoryAsyncCopy = NULL;
+
+static hsa_agent_t GpuAgent = {0};
+static hsa_agent_t CpuAgent = {0};
+static int HsaInitialized = 0;
+
+static hsa_status_t FindAgentCallback(hsa_agent_t Agent, void *Data) {
+  (void)Data;
+  uint32_t DeviceType = 0;
+  if (pHsaAgentGetInfo(Agent, HSA_AGENT_INFO_DEVICE, &DeviceType) ==
+      HSA_STATUS_SUCCESS) {
+    if (DeviceType == HSA_DEVICE_TYPE_GPU && GpuAgent.handle == 0) {
+      GpuAgent = Agent;
+    } else if (DeviceType != HSA_DEVICE_TYPE_GPU && CpuAgent.handle == 0) {
+      CpuAgent = Agent;
+    }
+  }
+  return HSA_STATUS_SUCCESS;
+}
+
+static int EnsureHsaLoaded(void) {
+  static int HsaLoadAttempted = 0;
+  if (HsaLoadAttempted)
+    return HsaInitialized;
+  HsaLoadAttempted = 1;
+
+  void *Handle = dlopen("libhsa-runtime64.so", RTLD_LAZY | RTLD_LOCAL);
+  if (!Handle) {
+    if (IsVerboseMode())
+      PROF_NOTE("HSA not available: %s\n", dlerror());
+    return 0;
+  }
+
+  pHsaInit = (hsa_init_ty)dlsym(Handle, "hsa_init");
+  pHsaIterateAgents =
+      (hsa_iterate_agents_ty)dlsym(Handle, "hsa_iterate_agents");
+  pHsaAgentGetInfo = (hsa_agent_get_info_ty)dlsym(Handle, 
"hsa_agent_get_info");
+  pHsaSignalCreate = (hsa_signal_create_ty)dlsym(Handle, "hsa_signal_create");
+  pHsaSignalDestroy =
+      (hsa_signal_destroy_ty)dlsym(Handle, "hsa_signal_destroy");
+  pHsaSignalStoreRelaxed =
+      (hsa_signal_store_relaxed_ty)dlsym(Handle, "hsa_signal_store_relaxed");
+  pHsaSignalWaitScacquire =
+      (hsa_signal_wait_scacquire_ty)dlsym(Handle, "hsa_signal_wait_scacquire");
+  pHsaAmdMemoryLock =
+      (hsa_amd_memory_lock_ty)dlsym(Handle, "hsa_amd_memory_lock");
+  pHsaAmdMemoryUnlock =
+      (hsa_amd_memory_unlock_ty)dlsym(Handle, "hsa_amd_memory_unlock");
+  pHsaAmdMemoryAsyncCopy =
+      (hsa_amd_memory_async_copy_ty)dlsym(Handle, "hsa_amd_memory_async_copy");
+
+  if (!pHsaInit || !pHsaIterateAgents || !pHsaAgentGetInfo ||
+      !pHsaSignalCreate || !pHsaSignalDestroy || !pHsaSignalStoreRelaxed ||
+      !pHsaSignalWaitScacquire || !pHsaAmdMemoryLock || !pHsaAmdMemoryUnlock ||
+      !pHsaAmdMemoryAsyncCopy) {
+    if (IsVerboseMode())
+      PROF_NOTE("%s", "HSA: some symbols not found\n");
+    return 0;
+  }
+
+  /* HSA is typically already initialized by HIP, but call init anyway */
+  /* Note: hsa_init is reference-counted, so this is safe */
+  if (pHsaInit() != HSA_STATUS_SUCCESS) {
+    if (IsVerboseMode())
+      PROF_NOTE("%s", "HSA init failed\n");
+    return 0;
+  }
+
+  /* Find GPU and CPU agents */
+  pHsaIterateAgents(FindAgentCallback, NULL);
+  if (GpuAgent.handle == 0 || CpuAgent.handle == 0) {
+    if (IsVerboseMode())
+      PROF_NOTE("%s", "HSA: GPU or CPU agent not found\n");
+    return 0;
+  }
+
+  HsaInitialized = 1;
+  if (IsVerboseMode())
+    PROF_NOTE("HSA initialized: GPU agent=%lx, CPU agent=%lx\n",
+              (unsigned long)GpuAgent.handle, (unsigned long)CpuAgent.handle);
+  return 1;
+}
+
+/* Copy from device to host using HSA APIs (bypasses CLR memory tracking) */
+static int hsaMemcpyDtoH(void *Dst, const void *Src, size_t Size) {
+  if (!EnsureHsaLoaded())
+    return -1;
+
+  void *PinnedDst = NULL;
+  hsa_signal_t Signal = {0};
+  int Result = -1;
+
+  /* Pin host memory */
+  if (pHsaAmdMemoryLock(Dst, Size, NULL, 0, &PinnedDst) != HSA_STATUS_SUCCESS) 
{
+    if (IsVerboseMode())
+      PROF_NOTE("%s", "HSA: failed to lock host memory\n");
+    return -1;
+  }
+
+  /* Create completion signal */
+  if (pHsaSignalCreate(1, 0, NULL, &Signal) != HSA_STATUS_SUCCESS) {
+    if (IsVerboseMode())
+      PROF_NOTE("%s", "HSA: failed to create signal\n");
+    pHsaAmdMemoryUnlock(Dst);
+    return -1;
+  }
+
+  /* Async copy from device to host */
+  if (pHsaAmdMemoryAsyncCopy(PinnedDst, CpuAgent, Src, GpuAgent, Size, 0, NULL,
+                             Signal) != HSA_STATUS_SUCCESS) {
+    if (IsVerboseMode())
+      PROF_NOTE("%s", "HSA: async copy failed\n");
+    goto cleanup;
+  }
+
+  /* Wait for completion (timeout: 30 seconds) */
+  if (pHsaSignalWaitScacquire(Signal, HSA_SIGNAL_CONDITION_LT, 1,
+                              30000000000ULL, 0) < 0) {
+    if (IsVerboseMode())
+      PROF_NOTE("%s", "HSA: wait failed or timeout\n");
+    goto cleanup;
+  }
+
+  Result = 0; /* Success */
+
+cleanup:
+  pHsaSignalDestroy(Signal);
+  pHsaAmdMemoryUnlock(Dst);
+  return Result;
+}
+
+static void EnsureHipLoaded(void) {
+  static int Initialized = 0;
+  if (Initialized)
+    return;
+  Initialized = 1;
+
+  void *Handle = dlopen("libamdhip64.so", RTLD_LAZY | RTLD_LOCAL);
+  if (!Handle) {
+    fprintf(stderr, "compiler-rt: failed to open libamdhip64.so: %s\n",
+            dlerror());
+    return;
+  }
+
+  pHipMemcpyFromSymbol =
+      (hipMemcpyFromSymbolTy)dlsym(Handle, "hipMemcpyFromSymbol");
+  pHipGetSymbolAddress =
+      (hipGetSymbolAddressTy)dlsym(Handle, "hipGetSymbolAddress");
+  pHipMemcpy = (hipMemcpyTy)dlsym(Handle, "hipMemcpy");
+  pHipModuleGetGlobal =
+      (hipModuleGetGlobalTy)dlsym(Handle, "hipModuleGetGlobal");
+}
+
+/* -------------------------------------------------------------------------- 
*/
+/*  Public wrappers that forward to the loaded HIP symbols                   */
+/* -------------------------------------------------------------------------- 
*/
+
+static int hipMemcpyFromSymbol(void *dst, const void *symbol, size_t sizeBytes,
+                               size_t offset, int kind) {
+  EnsureHipLoaded();
+  return pHipMemcpyFromSymbol
+             ? pHipMemcpyFromSymbol(dst, symbol, sizeBytes, offset, kind)
+             : -1;
+}
+
+static int hipGetSymbolAddress(void **devPtr, const void *symbol) {
+  EnsureHipLoaded();
+  return pHipGetSymbolAddress ? pHipGetSymbolAddress(devPtr, symbol) : -1;
+}
+
+static int hipMemcpy(void *dest, void *src, size_t len, int kind /*2=DToH*/) {
+  EnsureHipLoaded();
+  return pHipMemcpy ? pHipMemcpy(dest, src, len, kind) : -1;
+}
+
+/* Copy from device to host - tries HSA first (bypasses CLR), falls back to 
HIP.
+ * This is needed because hipMemcpy may fail on device pointers that are not
+ * registered with CLR (e.g., profile counter sections obtained from
+ * __llvm_offload_prf structure). HSA APIs work with any device pointer. */
+static int memcpyDeviceToHost(void *Dst, void *Src, size_t Size) {
+  /* Try HSA first - this works with unregistered device pointers */
+  if (hsaMemcpyDtoH(Dst, Src, Size) == 0) {
+    return 0;
+  }
+
+  /* Fall back to HIP if HSA is not available */
+  if (IsVerboseMode())
+    PROF_NOTE("%s", "HSA copy failed, falling back to HIP\n");
+  return hipMemcpy(Dst, Src, Size, 2 /* DToH */);
+}
+
+static int hipModuleGetGlobal(void **DevPtr, size_t *Bytes, void *Module,
+                              const char *Name) {
+  EnsureHipLoaded();
+  return pHipModuleGetGlobal ? pHipModuleGetGlobal(DevPtr, Bytes, Module, Name)
+                             : -1;
+}
+
+/* -------------------------------------------------------------------------- 
*/
+/*  Dynamic module tracking                                                   
*/
+/* -------------------------------------------------------------------------- 
*/
+
+#define MAX_DYNAMIC_MODULES 256
+
+typedef struct {
+  void *ModulePtr; /* hipModule_t returned by hipModuleLoad            */
+  void *DeviceVar; /* address of __llvm_offload_prf in this module     */
+  int Processed;   /* 0 = not yet collected, 1 = data already copied   */
+} HipDynamicModuleInfo;
+
+static HipDynamicModuleInfo DynamicModules[MAX_DYNAMIC_MODULES];
+static int NumDynamicModules = 0;
+
+/* -------------------------------------------------------------------------- 
*/
+/*  Registration / un-registration helpers                                   */
+/* -------------------------------------------------------------------------- 
*/
+
+void __llvm_profile_hip_register_dynamic_module(int ModuleLoadRc, void **Ptr) {
+  if (IsVerboseMode())
+    PROF_NOTE("Registering loaded module %d: rc=%d, module=%p\n",
+              NumDynamicModules, ModuleLoadRc, *Ptr);
+
+  if (ModuleLoadRc)
+    return;
+
+  if (NumDynamicModules >= MAX_DYNAMIC_MODULES) {
+    PROF_ERR("Too many dynamic modules registered. Maximum is %d.\n",
+             MAX_DYNAMIC_MODULES);
+    return;
+  }
+
+  HipDynamicModule...
[truncated]

``````````

</details>


https://github.com/llvm/llvm-project/pull/177665
_______________________________________________
cfe-commits mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

Reply via email to