[clang] [llvm] [Clang][NVVM] Support `-f[no-]cuda-prec-sqrt` and propagate precision flag to `NVVMReflect` (PR #134244)

via cfe-commits Thu, 03 Apr 2025 06:34:57 -0700

https://github.com/Lai-YT created 
https://github.com/llvm/llvm-project/pull/134244


This PR demonstrates a potential solution to the issue raised in #131749. The 
goal is to provide a more concrete standpoint, and I'm open to feedback. Major 
changes are welcome, and if this isn't the direction we want to take, I'm 
completely fine with not merging this PR. 😊

## What's Changed?

The flag `-fcuda-prec-sqrt` has been added to the Clang driver and Clang 
frontend. This sets the `CodeGenOpts.CudaPreciseSqrt` option, which influences 
the value of the module flag `"nvvm-reflect-prec-sqrt"`. This flag is then 
resolved by the _NVVMReflect_ pass for` __nvvm_reflect("__CUDA_PREC_SQRT")`.

I'm unsure about the ideal location for this flag, so I might have made some 
mistakes. Any reviews or suggestions are greatly appreciated. 🙏

### Misc.

The module flag `"nvvm-reflect-prec-sqrt"` is added before 
`"nvvm-reflect-ftz"`. This ordering is intentional because one of the tests for 
`"nvvm-reflect-ftz"` relies on it being the last module flag. You can see the 
related test here:

https://github.com/llvm/llvm-project/blob/52f3cad9ffa35a472699d541736bd72dd01d6e62/clang/test/CodeGenCUDA/flush-denormals.cu#L47-L51

---

This is my first PR for LLVM, so if I come across as impolite in any way, 
please let me know, and I will address it right away. ✨

>From 523ad696c4b8377b10238ac96a91d866232f4b59 Mon Sep 17 00:00:00 2001
From: Lai-YT <381xvmv...@gmail.com>
Date: Thu, 3 Apr 2025 15:19:36 +0800
Subject: [PATCH 1/3] [Clang] Add `-f[no-]cuda-prec-sqrt` flag

NVCC provides the `-prec-sqrt` flag to control whether a precise or
approximate square root function is used. However, LLVM previously
always use the approximated version.

With this change, Clang introduces the `-f[no-]cuda-prec-sqrt` flag,
allowing users to specify precision behavior. The default is set to
false to maintain existing behavior.
---
 clang/include/clang/Basic/CodeGenOptions.h | 4 ++++
 clang/include/clang/Driver/Options.td      | 5 +++++
 clang/lib/Driver/ToolChains/Cuda.cpp       | 5 +++++
 clang/test/Driver/cuda-prec-sqrt.cu        | 6 ++++++
 4 files changed, 20 insertions(+)
 create mode 100644 clang/test/Driver/cuda-prec-sqrt.cu

diff --git a/clang/include/clang/Basic/CodeGenOptions.h 
b/clang/include/clang/Basic/CodeGenOptions.h
index e39a73bdb13ac..1ca4360d67820 100644
--- a/clang/include/clang/Basic/CodeGenOptions.h
+++ b/clang/include/clang/Basic/CodeGenOptions.h
@@ -317,6 +317,10 @@ class CodeGenOptions : public CodeGenOptionsBase {
   /// CUDA runtime back-end for incorporating them into host-side object file.
   std::string CudaGpuBinaryFileName;
 
+  /// Whether a precise or approximate square root should be used for CUDA
+  /// device code.
+  bool CudaPreciseSqrt;
+
   /// List of filenames passed in using the -fembed-offload-object option. 
These
   /// are offloading binaries containing device images and metadata.
   std::vector<std::string> OffloadObjects;
diff --git a/clang/include/clang/Driver/Options.td 
b/clang/include/clang/Driver/Options.td
index e69b804de63b5..88ec378222840 100644
--- a/clang/include/clang/Driver/Options.td
+++ b/clang/include/clang/Driver/Options.td
@@ -1279,6 +1279,11 @@ def fcuda_flush_denormals_to_zero : Flag<["-"], 
"fcuda-flush-denormals-to-zero">
   Alias<fgpu_flush_denormals_to_zero>;
 def fno_cuda_flush_denormals_to_zero : Flag<["-"], 
"fno-cuda-flush-denormals-to-zero">,
   Alias<fno_gpu_flush_denormals_to_zero>;
+defm cuda_prec_sqrt : BoolFOption<"cuda-prec-sqrt",
+  CodeGenOpts<"CudaPreciseSqrt">, DefaultFalse,
+  PosFlag<SetTrue, [], [ClangOption, CC1Option], "Enable">,
+  NegFlag<SetFalse, [], [ClangOption], "Disable">,
+  BothFlags<[], [ClangOption], " precise square root for CUDA device code.">>;
 def : Flag<["-"], "fcuda-rdc">, Alias<fgpu_rdc>;
 def : Flag<["-"], "fno-cuda-rdc">, Alias<fno_gpu_rdc>;
 defm cuda_short_ptr : BoolFOption<"cuda-short-ptr",
diff --git a/clang/lib/Driver/ToolChains/Cuda.cpp 
b/clang/lib/Driver/ToolChains/Cuda.cpp
index 06b0b0913d24e..00048e9217518 100644
--- a/clang/lib/Driver/ToolChains/Cuda.cpp
+++ b/clang/lib/Driver/ToolChains/Cuda.cpp
@@ -19,6 +19,7 @@
 #include "llvm/ADT/StringExtras.h"
 #include "llvm/Config/llvm-config.h" // for LLVM_HOST_TRIPLE
 #include "llvm/Option/ArgList.h"
+#include "llvm/Option/Option.h"
 #include "llvm/Support/FileSystem.h"
 #include "llvm/Support/FormatAdapters.h"
 #include "llvm/Support/FormatVariadic.h"
@@ -862,6 +863,10 @@ void CudaToolChain::addClangTargetOptions(
   if (CudaInstallation.version() >= CudaVersion::CUDA_90)
     CC1Args.push_back("-fcuda-allow-variadic-functions");
 
+  if (DriverArgs.hasFlag(options::OPT_fcuda_prec_sqrt,
+                         options::OPT_fno_cuda_prec_sqrt, false))
+    CC1Args.append({"-fcuda-prec-sqrt"});
+
   if (DriverArgs.hasFlag(options::OPT_fcuda_short_ptr,
                          options::OPT_fno_cuda_short_ptr, false))
     CC1Args.append({"-mllvm", "--nvptx-short-ptr"});
diff --git a/clang/test/Driver/cuda-prec-sqrt.cu 
b/clang/test/Driver/cuda-prec-sqrt.cu
new file mode 100644
index 0000000000000..563c41b75d49a
--- /dev/null
+++ b/clang/test/Driver/cuda-prec-sqrt.cu
@@ -0,0 +1,6 @@
+// Checks that the -fcuda-prec-sqrt flag is passed to the cc1 frontend.
+
+// RUN: %clang -### --target=x86_64-linux-gnu -c -fcuda-prec-sqrt -nocudainc 
-nocudalib --cuda-path=%S/Inputs/CUDA/usr/local/cuda %s 2>&1 | FileCheck %s
+
+// CHECK: "-triple" "nvptx64-nvidia-cuda"
+// CHECK-SAME: "-fcuda-prec-sqrt"

>From 203f061ae015bd47939123c442435d59a6756f2a Mon Sep 17 00:00:00 2001
From: Lai-YT <381xvmv...@gmail.com>
Date: Thu, 3 Apr 2025 15:22:00 +0800
Subject: [PATCH 2/3] [Clang][CodeGen] Add module flag for square root
 precision

A module flag is now set based on the `-f[no]-cuda-prec-sqrt` flag,
allowing the NVVMReflect pass to recognize and apply the specified
square root precision.
---
 clang/lib/CodeGen/CodeGenModule.cpp |  4 ++++
 clang/test/CodeGenCUDA/prec-sqrt.cu | 15 +++++++++++++++
 2 files changed, 19 insertions(+)
 create mode 100644 clang/test/CodeGenCUDA/prec-sqrt.cu

diff --git a/clang/lib/CodeGen/CodeGenModule.cpp 
b/clang/lib/CodeGen/CodeGenModule.cpp
index 8f9cf965af2b9..7f99a951ab97f 100644
--- a/clang/lib/CodeGen/CodeGenModule.cpp
+++ b/clang/lib/CodeGen/CodeGenModule.cpp
@@ -1286,6 +1286,10 @@ void CodeGenModule::Release() {
   }
 
   if (LangOpts.CUDAIsDevice && getTriple().isNVPTX()) {
+    // Indicate whether __nvvm_reflect should be configured to use precise
+    // square root.  (This corresponds to its "__CUDA_PREC_SQRT" property.)
+    getModule().addModuleFlag(llvm::Module::Override, "nvvm-reflect-prec-sqrt",
+                              CodeGenOpts.CudaPreciseSqrt);
     // Indicate whether __nvvm_reflect should be configured to flush denormal
     // floating point values to 0.  (This corresponds to its "__CUDA_FTZ"
     // property.)
diff --git a/clang/test/CodeGenCUDA/prec-sqrt.cu 
b/clang/test/CodeGenCUDA/prec-sqrt.cu
new file mode 100644
index 0000000000000..88c7692e8bb0a
--- /dev/null
+++ b/clang/test/CodeGenCUDA/prec-sqrt.cu
@@ -0,0 +1,15 @@
+// RUN: %clang_cc1 -fcuda-is-device \
+// RUN:   -triple nvptx-nvidia-cuda -emit-llvm -o - %s | \
+// RUN:   FileCheck -check-prefixes=NO-PREC-SQRT %s
+
+// RUN: %clang_cc1 -fcuda-is-device -fcuda-prec-sqrt \
+// RUN:   -triple nvptx-nvidia-cuda -emit-llvm -o - %s | \
+// RUN:   FileCheck -check-prefixes=PREC-SQRT %s
+
+#include "Inputs/cuda.h"
+
+extern "C" __device__ void foo() {}
+
+
+// NO-PREC-SQRT: !{i32 4, !"nvvm-reflect-prec-sqrt", i32 0}
+// PREC-SQRT: !{i32 4, !"nvvm-reflect-prec-sqrt", i32 1}

>From 281a6c113785e321dce6c0883801ba248093fc60 Mon Sep 17 00:00:00 2001
From: Lai-YT <381xvmv...@gmail.com>
Date: Thu, 3 Apr 2025 16:14:04 +0800
Subject: [PATCH 3/3] [NVVMReflect] Recognize `__CUDA_PREC_SQRT`

The `__nv_sqrtf` intrinsic in libdevice.bc, defined by NVIDIA, depends
not only on `__nvvm_reflect("__CUDA_FTZ")` but also on
`__nvvm_reflect("__CUDA_PREC_SQRT")`. However, the NVVMReflect pass
previously failed to recognize `__CUDA_PREC_SQRT`, causing its value to
default to `0`.

This change enables the NVVMReflect pass to correctly pick up the module
flag "nvvm-reflect-prec-sqrt", which Clang sets based on the
`-fcuda-prec-sqrt` flag, ensuring proper behavior.
---
 llvm/lib/Target/NVPTX/NVVMReflect.cpp        |  6 +++++
 llvm/test/CodeGen/NVPTX/nvvm-reflect-sqrt.ll | 28 ++++++++++++++++++++
 2 files changed, 34 insertions(+)
 create mode 100644 llvm/test/CodeGen/NVPTX/nvvm-reflect-sqrt.ll

diff --git a/llvm/lib/Target/NVPTX/NVVMReflect.cpp 
b/llvm/lib/Target/NVPTX/NVVMReflect.cpp
index 20b8bef1899b4..593c98ea036c5 100644
--- a/llvm/lib/Target/NVPTX/NVVMReflect.cpp
+++ b/llvm/lib/Target/NVPTX/NVVMReflect.cpp
@@ -173,6 +173,12 @@ static bool runNVVMReflect(Function &F, unsigned 
SmVersion) {
       if (auto *Flag = mdconst::extract_or_null<ConstantInt>(
               F.getParent()->getModuleFlag("nvvm-reflect-ftz")))
         ReflectVal = Flag->getSExtValue();
+    } else if (ReflectArg == "__CUDA_PREC_SQRT") {
+      // Try to pull __CUDA_PREC_SQRT from the nvvm-reflect-prec-sqrt module
+      // flag.
+      if (auto *Flag = mdconst::extract_or_null<ConstantInt>(
+              F.getParent()->getModuleFlag("nvvm-reflect-prec-sqrt")))
+        ReflectVal = Flag->getSExtValue();
     } else if (ReflectArg == "__CUDA_ARCH") {
       ReflectVal = SmVersion * 10;
     }
diff --git a/llvm/test/CodeGen/NVPTX/nvvm-reflect-sqrt.ll 
b/llvm/test/CodeGen/NVPTX/nvvm-reflect-sqrt.ll
new file mode 100644
index 0000000000000..5b584547f836b
--- /dev/null
+++ b/llvm/test/CodeGen/NVPTX/nvvm-reflect-sqrt.ll
@@ -0,0 +1,28 @@
+; We run nvvm-reflect (and then optimize) this module twice, once with metadata
+; that enables precise sqrt, and again with metadata that disables it.
+
+; RUN: cat %s > %t.noprec
+; RUN: echo '!0 = !{i32 4, !"nvvm-reflect-prec-sqrt", i32 0}' >> %t.noprec
+; RUN: opt %t.noprec -S -mtriple=nvptx-nvidia-cuda -passes='nvvm-reflect' \
+; RUN:   | FileCheck %s --check-prefix=PREC_SQRT_0 --check-prefix=CHECK
+
+; RUN: cat %s > %t.prec
+; RUN: echo '!0 = !{i32 4, !"nvvm-reflect-prec-sqrt", i32 1}' >> %t.prec
+; RUN: opt %t.prec -S -mtriple=nvptx-nvidia-cuda -passes='nvvm-reflect' \
+; RUN:   | FileCheck %s --check-prefix=PREC_SQRT_1 --check-prefix=CHECK
+
+@.str = private unnamed_addr constant [17 x i8] c"__CUDA_PREC_SQRT\00", align 1
+
+declare i32 @__nvvm_reflect(ptr)
+
+; CHECK-LABEL: @foo
+define i32 @foo() {
+  ; CHECK-NOT: call i32 @__nvvm_reflect
+  %reflect = call i32 @__nvvm_reflect(ptr @.str)
+  ; PREC_SQRT_0: ret i32 0
+  ; PREC_SQRT_1: ret i32 1
+  ret i32 %reflect
+}
+
+!llvm.module.flags = !{!0}
+; A module flag is added to the end of this file by the RUN lines at the top.

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

[clang] [llvm] [Clang][NVVM] Support `-f[no-]cuda-prec-sqrt` and propagate precision flag to `NVVMReflect` (PR #134244)

Reply via email to