[PATCH] D137154: Adding nvvm_reflect clang builtin

2022-11-01 Thread Hugh Delaney via Phabricator via cfe-commits
hdelan created this revision.
Herald added subscribers: mattd, gchakrabarti, asavonic, hiraditya.
Herald added a project: All.
hdelan requested review of this revision.
Herald added subscribers: llvm-commits, cfe-commits, jholewinski.
Herald added projects: clang, LLVM.

This patch adds __nvvm_reflect as a clang builtin for the NVPTX backend. This
means that __nvvm_reflect can be used in source code in order to enable
conditional compilation based on compute capability and FTZ properties.


Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D137154

Files:
  clang/include/clang/Basic/BuiltinsNVPTX.def
  clang/test/CodeGenCUDA/nvvm-reflect.cu
  llvm/include/llvm/IR/IntrinsicsNVVM.td
  llvm/lib/Target/NVPTX/NVVMReflect.cpp
  llvm/test/CodeGen/NVPTX/nvvm-reflect-opaque.ll
  llvm/test/CodeGen/NVPTX/nvvm-reflect.ll

Index: llvm/test/CodeGen/NVPTX/nvvm-reflect.ll
===
--- llvm/test/CodeGen/NVPTX/nvvm-reflect.ll
+++ llvm/test/CodeGen/NVPTX/nvvm-reflect.ll
@@ -41,7 +41,7 @@
   ret float %ret
 }
 
-declare i32 @llvm.nvvm.reflect.p0i8(i8*)
+declare i32 @llvm.nvvm.reflect(i8*)
 
 ; CHECK-LABEL: define i32 @intrinsic
 define i32 @intrinsic() {
@@ -49,7 +49,7 @@
 ; USE_FTZ_0: ret i32 0
 ; USE_FTZ_1: ret i32 1
   %ptr = tail call i8* @llvm.nvvm.ptr.constant.to.gen.p0i8.p4i8(i8 addrspace(4)* getelementptr inbounds ([11 x i8], [11 x i8] addrspace(4)* @str, i32 0, i32 0))
-  %reflect = tail call i32 @llvm.nvvm.reflect.p0i8(i8* %ptr)
+  %reflect = tail call i32 @llvm.nvvm.reflect(i8* %ptr)
   ret i32 %reflect
 }
 
Index: llvm/test/CodeGen/NVPTX/nvvm-reflect-opaque.ll
===
--- llvm/test/CodeGen/NVPTX/nvvm-reflect-opaque.ll
+++ llvm/test/CodeGen/NVPTX/nvvm-reflect-opaque.ll
@@ -41,7 +41,7 @@
   ret float %ret
 }
 
-declare i32 @llvm.nvvm.reflect.p0i8(ptr)
+declare i32 @llvm.nvvm.reflect(ptr)
 
 ; CHECK-LABEL: define i32 @intrinsic
 define i32 @intrinsic() {
@@ -49,7 +49,7 @@
 ; USE_FTZ_0: ret i32 0
 ; USE_FTZ_1: ret i32 1
   %ptr = tail call ptr @llvm.nvvm.ptr.constant.to.gen.p0i8.p4i8(ptr addrspace(4) @str)
-  %reflect = tail call i32 @llvm.nvvm.reflect.p0i8(ptr %ptr)
+  %reflect = tail call i32 @llvm.nvvm.reflect(ptr %ptr)
   ret i32 %reflect
 }
 
Index: llvm/lib/Target/NVPTX/NVVMReflect.cpp
===
--- llvm/lib/Target/NVPTX/NVVMReflect.cpp
+++ llvm/lib/Target/NVPTX/NVVMReflect.cpp
@@ -40,6 +40,7 @@
 #include 
 #include 
 #define NVVM_REFLECT_FUNCTION "__nvvm_reflect"
+#define NVVM_REFLECT_LLVM_INTRINSIC_NAME "llvm.nvvm.reflect"
 
 using namespace llvm;
 
@@ -119,6 +120,7 @@
   continue;
 Function *Callee = Call->getCalledFunction();
 if (!Callee || (Callee->getName() != NVVM_REFLECT_FUNCTION &&
+Callee->getName() != NVVM_REFLECT_LLVM_INTRINSIC_NAME &&
 Callee->getIntrinsicID() != Intrinsic::nvvm_reflect))
   continue;
 
Index: llvm/include/llvm/IR/IntrinsicsNVVM.td
===
--- llvm/include/llvm/IR/IntrinsicsNVVM.td
+++ llvm/include/llvm/IR/IntrinsicsNVVM.td
@@ -1578,7 +1578,8 @@
 Intrinsic<[], [llvm_anyptr_ty], [], "llvm.nvvm.compiler.warn">;
 
 def int_nvvm_reflect :
-  Intrinsic<[llvm_i32_ty], [llvm_anyptr_ty], [IntrNoMem], "llvm.nvvm.reflect">;
+  Intrinsic<[llvm_i32_ty], [llvm_ptr_ty], [IntrNoMem], "llvm.nvvm.reflect">,
+  ClangBuiltin<"__nvvm_reflect">;
 
 // isspacep.{const, global, local, shared}
 def int_nvvm_isspacep_const
Index: clang/test/CodeGenCUDA/nvvm-reflect.cu
===
--- /dev/null
+++ clang/test/CodeGenCUDA/nvvm-reflect.cu
@@ -0,0 +1,81 @@
+// REQUIRES: nvptx-registered-target
+
+// Checking to see that __nvvm_reflect resolves to the correct llvm.nvvm.reflect
+// intrinsic
+// RUN: %clang_cc1 -fcuda-is-device -triple nvptx64-nvidia-cuda -disable-llvm-passes -S -emit-llvm -x c++ %s -o - | FileCheck %s --check-prefix=NO_NVVM_REFLECT_PASS
+
+// Prepare bitcode file to link with
+// RUN: %clang_cc1 -triple nvptx-unknown-cuda -emit-llvm-bc \
+// RUN:-disable-llvm-passes -o %t.bc %s
+
+// Checking to see if the correct values are substituted for the nvvm_reflect
+// call when llvm passes are enabled.
+// RUN: %clang_cc1 -fcuda-is-device -triple nvptx64-nvidia-cuda -target-cpu \
+// RUN:sm_50 -S -o /dev/null %s -mllvm -print-after-all 2>&1 \
+// RUN:| FileCheck %s --check-prefix=ARCH_REFLECT_1
+// RUN: %clang_cc1 -fcuda-is-device -triple nvptx64-nvidia-cuda -target-cpu \
+// RUN:sm_52 -S -o /dev/null %s -mllvm -print-after-all 2>&1 \
+// RUN:| FileCheck %s --check-prefix=ARCH_REFLECT_2
+// RUN: %clang_cc1 -fcuda-is-device -triple nvptx64-nvidia-cuda -target-cpu \
+// RUN:sm_53 -S -o /dev/null %s -mllvm -print-after-all 2>&1 \
+// RUN:| FileCheck %s --check-prefix=ARCH_REFLECT_3
+// RUN: %clang_cc1 -fcuda-is-device -

[PATCH] D137154: Adding nvvm_reflect clang builtin

2022-11-01 Thread Hugh Delaney via Phabricator via cfe-commits
hdelan added inline comments.



Comment at: llvm/include/llvm/IR/IntrinsicsNVVM.td:1581
 def int_nvvm_reflect :
-  Intrinsic<[llvm_i32_ty], [llvm_anyptr_ty], [IntrNoMem], "llvm.nvvm.reflect">;
+  Intrinsic<[llvm_i32_ty], [llvm_ptr_ty], [IntrNoMem], "llvm.nvvm.reflect">,
+  ClangBuiltin<"__nvvm_reflect">;

asavonic wrote:
> Is there a reason to change the intrinsic?
The clang builtin doesn't link up properly with the llvm intrinsic if 
`llvm_anyptr_ty` is used. The `llvm_anyptr_ty` makes clang look for a generic 
`arg` parameter instead of a `Pointer`, which results in a segfault. Changing 
to `llvm_ptr_ty` fixes the issue.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D137154/new/

https://reviews.llvm.org/D137154

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D137154: Adding nvvm_reflect clang builtin

2022-11-01 Thread Hugh Delaney via Phabricator via cfe-commits
hdelan updated this revision to Diff 472310.
hdelan added a comment.

Removing redundant check


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D137154/new/

https://reviews.llvm.org/D137154

Files:
  llvm/lib/Target/NVPTX/NVVMReflect.cpp


Index: llvm/lib/Target/NVPTX/NVVMReflect.cpp
===
--- llvm/lib/Target/NVPTX/NVVMReflect.cpp
+++ llvm/lib/Target/NVPTX/NVVMReflect.cpp
@@ -40,7 +40,6 @@
 #include 
 #include 
 #define NVVM_REFLECT_FUNCTION "__nvvm_reflect"
-#define NVVM_REFLECT_LLVM_INTRINSIC_NAME "llvm.nvvm.reflect"
 
 using namespace llvm;
 
@@ -120,7 +119,6 @@
   continue;
 Function *Callee = Call->getCalledFunction();
 if (!Callee || (Callee->getName() != NVVM_REFLECT_FUNCTION &&
-Callee->getName() != NVVM_REFLECT_LLVM_INTRINSIC_NAME &&
 Callee->getIntrinsicID() != Intrinsic::nvvm_reflect))
   continue;
 




[PATCH] D137154: Adding nvvm_reflect clang builtin

2022-11-01 Thread Hugh Delaney via Phabricator via cfe-commits
hdelan updated this revision to Diff 472311.
hdelan added a comment.

Removing redundant check


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D137154/new/

https://reviews.llvm.org/D137154

Files:
  clang/include/clang/Basic/BuiltinsNVPTX.def
  clang/test/CodeGenCUDA/nvvm-reflect.cu
  llvm/include/llvm/IR/IntrinsicsNVVM.td
  llvm/test/CodeGen/NVPTX/nvvm-reflect-opaque.ll
  llvm/test/CodeGen/NVPTX/nvvm-reflect.ll

Index: llvm/test/CodeGen/NVPTX/nvvm-reflect.ll
===
--- llvm/test/CodeGen/NVPTX/nvvm-reflect.ll
+++ llvm/test/CodeGen/NVPTX/nvvm-reflect.ll
@@ -41,7 +41,7 @@
   ret float %ret
 }
 
-declare i32 @llvm.nvvm.reflect.p0i8(i8*)
+declare i32 @llvm.nvvm.reflect(i8*)
 
 ; CHECK-LABEL: define i32 @intrinsic
 define i32 @intrinsic() {
@@ -49,7 +49,7 @@
 ; USE_FTZ_0: ret i32 0
 ; USE_FTZ_1: ret i32 1
   %ptr = tail call i8* @llvm.nvvm.ptr.constant.to.gen.p0i8.p4i8(i8 addrspace(4)* getelementptr inbounds ([11 x i8], [11 x i8] addrspace(4)* @str, i32 0, i32 0))
-  %reflect = tail call i32 @llvm.nvvm.reflect.p0i8(i8* %ptr)
+  %reflect = tail call i32 @llvm.nvvm.reflect(i8* %ptr)
   ret i32 %reflect
 }
 
Index: llvm/test/CodeGen/NVPTX/nvvm-reflect-opaque.ll
===
--- llvm/test/CodeGen/NVPTX/nvvm-reflect-opaque.ll
+++ llvm/test/CodeGen/NVPTX/nvvm-reflect-opaque.ll
@@ -41,7 +41,7 @@
   ret float %ret
 }
 
-declare i32 @llvm.nvvm.reflect.p0i8(ptr)
+declare i32 @llvm.nvvm.reflect(ptr)
 
 ; CHECK-LABEL: define i32 @intrinsic
 define i32 @intrinsic() {
@@ -49,7 +49,7 @@
 ; USE_FTZ_0: ret i32 0
 ; USE_FTZ_1: ret i32 1
   %ptr = tail call ptr @llvm.nvvm.ptr.constant.to.gen.p0i8.p4i8(ptr addrspace(4) @str)
-  %reflect = tail call i32 @llvm.nvvm.reflect.p0i8(ptr %ptr)
+  %reflect = tail call i32 @llvm.nvvm.reflect(ptr %ptr)
   ret i32 %reflect
 }
 
Index: llvm/include/llvm/IR/IntrinsicsNVVM.td
===
--- llvm/include/llvm/IR/IntrinsicsNVVM.td
+++ llvm/include/llvm/IR/IntrinsicsNVVM.td
@@ -1578,7 +1578,8 @@
 Intrinsic<[], [llvm_anyptr_ty], [], "llvm.nvvm.compiler.warn">;
 
 def int_nvvm_reflect :
-  Intrinsic<[llvm_i32_ty], [llvm_anyptr_ty], [IntrNoMem], "llvm.nvvm.reflect">;
+  Intrinsic<[llvm_i32_ty], [llvm_ptr_ty], [IntrNoMem], "llvm.nvvm.reflect">,
+  ClangBuiltin<"__nvvm_reflect">;
 
 // isspacep.{const, global, local, shared}
 def int_nvvm_isspacep_const
Index: clang/test/CodeGenCUDA/nvvm-reflect.cu
===
--- /dev/null
+++ clang/test/CodeGenCUDA/nvvm-reflect.cu
@@ -0,0 +1,81 @@
+// REQUIRES: nvptx-registered-target
+
+// Checking to see that __nvvm_reflect resolves to the correct llvm.nvvm.reflect
+// intrinsic
+// RUN: %clang_cc1 -fcuda-is-device -triple nvptx64-nvidia-cuda -disable-llvm-passes -S -emit-llvm -x c++ %s -o - | FileCheck %s --check-prefix=NO_NVVM_REFLECT_PASS
+
+// Prepare bitcode file to link with
+// RUN: %clang_cc1 -triple nvptx-unknown-cuda -emit-llvm-bc \
+// RUN:-disable-llvm-passes -o %t.bc %s
+
+// Checking to see if the correct values are substituted for the nvvm_reflect
+// call when llvm passes are enabled.
+// RUN: %clang_cc1 -fcuda-is-device -triple nvptx64-nvidia-cuda -target-cpu \
+// RUN:sm_50 -S -o /dev/null %s -mllvm -print-after-all 2>&1 \
+// RUN:| FileCheck %s --check-prefix=ARCH_REFLECT_1
+// RUN: %clang_cc1 -fcuda-is-device -triple nvptx64-nvidia-cuda -target-cpu \
+// RUN:sm_52 -S -o /dev/null %s -mllvm -print-after-all 2>&1 \
+// RUN:| FileCheck %s --check-prefix=ARCH_REFLECT_2
+// RUN: %clang_cc1 -fcuda-is-device -triple nvptx64-nvidia-cuda -target-cpu \
+// RUN:sm_53 -S -o /dev/null %s -mllvm -print-after-all 2>&1 \
+// RUN:| FileCheck %s --check-prefix=ARCH_REFLECT_3
+// RUN: %clang_cc1 -fcuda-is-device -triple nvptx64-nvidia-cuda -target-cpu \
+// RUN:sm_60 -S -o /dev/null %s -mllvm -print-after-all 2>&1 \
+// RUN:| FileCheck %s --check-prefix=ARCH_REFLECT_4
+// RUN: %clang_cc1 -fcuda-is-device -triple nvptx64-nvidia-cuda -target-cpu \
+// RUN:sm_61 -S -o /dev/null %s -mllvm -print-after-all 2>&1 \
+// RUN:| FileCheck %s --check-prefix=ARCH_REFLECT_5
+// RUN: %clang_cc1 -fcuda-is-device -triple nvptx64-nvidia-cuda -target-cpu \
+// RUN:sm_62 -S -o /dev/null %s -mllvm -print-after-all 2>&1 \
+// RUN:| FileCheck %s --check-prefix=ARCH_REFLECT_6
+// RUN: %clang_cc1 -fcuda-is-device -triple nvptx64-nvidia-cuda -target-cpu \
+// RUN:sm_70 -S -o /dev/null %s -mllvm -print-after-all 2>&1 \
+// RUN:| FileCheck %s --check-prefix=ARCH_REFLECT_7
+// RUN: %clang_cc1 -fcuda-is-device -triple nvptx64-nvidia-cuda -target-cpu \
+// RUN:sm_72 -S -o /dev/null %s -mllvm -print-after-all 2>&1 \
+// RUN:| FileCheck %s --check-prefix=ARCH_REFLECT_8
+// RUN: %clang_cc1 -fcuda-is-device -triple nvptx64-nvidia-cuda -target-cpu \
+// RUN: 

[PATCH] D137154: Adding nvvm_reflect clang builtin

2022-11-03 Thread Hugh Delaney via Phabricator via cfe-commits
hdelan marked an inline comment as not done.
hdelan added a comment.

In DPC++ for CUDA we use libclc as a wrapper around CUDA SDK's libdevice. Like 
libdevice we want to precompile libclc to bc for the CUDA backend without 
specializing for a particular arch, so that we can call different __nv funcs 
based on the arch. For this reason we use the `__nvvm_reflect` llvm intrinsic.

Without a clang builtin a number of workarounds are necessary to get the 
desired behaviour from `__nvvm_reflect`.

1. We must declare `__nvvm_reflect` in source code everywhere it is used.
2. Some addrspace casting is necessary in OpenCL, since strings are in the 
`__constant` address space. We must use an IR wrapper function to ensure the 
correct behaviour. See 
https://github.com/intel/llvm/blob/sycl/libclc/ptx-nvidiacl/libspirv/reflect.ll

We agree that `__nvvm_reflect` is not a perfect solution, but it is something 
we must use because of libdevice. Adding the clang builtin would enable us to 
write libdevice-like libraries without having to resort to the methods 
mentioned above.
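A rough sketch of the kind of IR wrapper described above (simplified; the wrapper name is illustrative, and addrspace(4) is NVPTX's constant address space). It casts an OpenCL `__constant` string to the generic address space before forwarding it to the reflect intrinsic:

```llvm
declare i32 @llvm.nvvm.reflect(i8*)

; Illustrative wrapper: accept a string in the constant address space,
; cast it to generic, and forward it to the reflect intrinsic.
define i32 @reflect_wrapper(i8 addrspace(4)* %str) {
entry:
  %gen = addrspacecast i8 addrspace(4)* %str to i8*
  %ret = call i32 @llvm.nvvm.reflect(i8* %gen)
  ret i32 %ret
}
```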


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D137154/new/

https://reviews.llvm.org/D137154



[PATCH] D137154: Adding nvvm_reflect clang builtin

2022-11-04 Thread Hugh Delaney via Phabricator via cfe-commits
hdelan added inline comments.



Comment at: clang/include/clang/Basic/BuiltinsNVPTX.def:827
 
+BUILTIN(__nvvm_reflect, "icC*", "nc")
+

tra wrote:
> Do we actually need this patch at all?
> 
> `extern "C" int __nvvm_reflect(const char *);` appears to work just fine 
> already:
> https://godbolt.org/z/caGenxn1e
> 
> 
It does work fine unless we are using OpenCL, which requires the addrspace 
casting mentioned above.
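For context, the declaration-based workaround from tra's godbolt link looks roughly like this in CUDA device code (a sketch only; `arch_bucket` is an illustrative function, not from the patch):

```cpp
// Without the builtin, device code must declare the function by hand; the
// NVVMReflect pass later replaces each call with a compile-time constant.
extern "C" __device__ int __nvvm_reflect(const char *);

__device__ int arch_bucket() {
  // Illustrative: select a code path by compute capability. The reflect
  // pass folds "__CUDA_ARCH" to the target's SM version times 10.
  return (__nvvm_reflect("__CUDA_ARCH") >= 600) ? 1 : 0;
}
```

This pattern works for CUDA, but as noted above it breaks down in OpenCL, where string literals live in the `__constant` address space.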


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D137154/new/

https://reviews.llvm.org/D137154



[PATCH] D137154: Adding nvvm_reflect clang builtin

2022-11-09 Thread Hugh Delaney via Phabricator via cfe-commits
hdelan added a comment.

Thanks for the feedback. Instead of adding `__nvvm_reflect` as a clang builtin, 
would it be acceptable if I modified the NVVMReflect pass so that it works with 
addrspace casting as well? This would allow us to use `__nvvm_reflect` in OpenCL.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D137154/new/

https://reviews.llvm.org/D137154



[PATCH] D116583: Change the default optimisation level of PTXAS from -O0 to -O3. This makes the optimisation levels of PTXAS and the ptxjitcompiler equal (ptxjitcompiler defaults to -O3).

2022-01-31 Thread Hugh Delaney via Phabricator via cfe-commits
hdelan added inline comments.



Comment at: clang/lib/Driver/ToolChains/Cuda.cpp:433
   } else {
-// If no -O was passed, pass -O0 to ptxas -- no opt flag should correspond
-// to no optimizations, but ptxas's default is -O3.
-CmdArgs.push_back("-O0");
+// If no -O was passed, pass -O3 to ptxas -- this makes ptxas's
+// optimization level the same as the ptxjitcompiler.

tra wrote:
> hdelan wrote:
> > tra wrote:
> > > I think this would be contrary to the expectation that lack of `-O` in 
> > > clang means - `do not optimize` and it generally implies the whole 
> > > compilation chain, including assembler. Matching whatever nvidia tools do 
> > > is an insufficient reason for breaking this assumption, IMO. 
> > > 
> > > If you do want to run optimized ptxas on unoptimized PTX, you can use 
> > > `-Xcuda-ptxas -O3`.
> > I think for the average user, consistency across the `ptxjitcompiler` and 
> > `ptxas` is far more important than assuming that no `-O` means no 
> > optimization. I think most users will assume that with no `-O`, whatever 
> > tools being used will take their default optimization level, which in the 
> > case of clang is `-O0` and in the case of `ptxas` is `-O3`. 
> > 
> > We have had a few bugs with `ptxas`/`ptxjitcompiler` at higher optimization 
> > levels, which were quite hard to pin down since offline `ptxas` and 
> > `ptxjitcompiler` were using different optimisation levels, making bugs 
> > appear in one and not the other. Of course we are aware of this now but 
> > this inconsistency can result in bugs that are difficult to diagnose. 
> > Having consistency between the `ptxjitcompiler` and `ptxas` is therefore of 
> > practical benefit. Whereas if we are to leave it as is, with `ptxas` 
> > defaulting to `-O0`, the benefit is purely semantic and not practical.
> > I think for the average user, consistency across the ptxjitcompiler and 
> > ptxas is far more important than assuming that no -O means no optimization. 
> 
> The default is intended to provide the least amount of surprises for the most 
> users. There are more users of clang as a CUDA compiler than users of clang 
> as a CUDA compiler who care about consistency with ptxjitcompiler. My point 
> is that the improvements for a subset of users should be balanced vs 
> usability in the common case. In this case the benefit does not justify the 
> downsides, IMO.
> 
> Please add me as a reviewer when the patch is ready for public review and 
> we'll discuss it in a wider LLVM community.
We have come to the same conclusion that it is best to leave this unchanged 
upstream. However, this change has been made locally in `intel/llvm`. 


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D116583/new/

https://reviews.llvm.org/D116583



[PATCH] D116583: Change the default optimisation level of PTXAS from -O0 to -O3. This makes the optimisation levels of PTXAS and the ptxjitcompiler equal (ptxjitcompiler defaults to -O3).

2022-01-31 Thread Hugh Delaney via Phabricator via cfe-commits
hdelan abandoned this revision.
hdelan added a comment.

Closing revision


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D116583/new/

https://reviews.llvm.org/D116583



[PATCH] D116583: Change the default optimisation level of PTXAS from -O0 to -O3. This makes the optimisation levels of PTXAS and the ptxjitcompiler equal (ptxjitcompiler defaults to -O3).

2022-01-04 Thread Hugh Delaney via Phabricator via cfe-commits
hdelan created this revision.
Herald added a subscriber: asavonic.
hdelan requested review of this revision.
Herald added a project: clang.
Herald added a subscriber: cfe-commits.

Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D116583

Files:
  clang/lib/Driver/ToolChains/Cuda.cpp
  clang/test/Driver/cuda-external-tools.cu


Index: clang/test/Driver/cuda-external-tools.cu
===
--- clang/test/Driver/cuda-external-tools.cu
+++ clang/test/Driver/cuda-external-tools.cu
@@ -40,10 +40,10 @@
 // RUN:   --no-cuda-noopt-device-debug -O2 -c %s 2>&1 \
 // RUN: | FileCheck -check-prefixes=CHECK,ARCH64,SM35,OPT2 %s
 
-// Regular compile without -O.  This should result in us passing -O0 to ptxas.
+// Regular compile without -O.  This should result in us passing -O3 to ptxas.
 // RUN: %clang -### -target x86_64-linux-gnu -c %s 2>&1 \
 // RUN:   --offload-arch=sm_35 --cuda-path=%S/Inputs/CUDA/usr/local/cuda \
-// RUN: | FileCheck -check-prefixes=CHECK,ARCH64,SM35,OPT0 %s
+// RUN: | FileCheck -check-prefixes=CHECK,ARCH64,SM35,OPT3 %s
 
 // Regular compiles with -Os and -Oz.  For lack of a better option, we map
 // these to ptxas -O3.
@@ -75,7 +75,7 @@
 // Compile with -fintegrated-as.  This should still cause us to invoke ptxas.
 // RUN: %clang -### -target x86_64-linux-gnu -fintegrated-as -c %s 2>&1 \
 // RUN:   --offload-arch=sm_35 --cuda-path=%S/Inputs/CUDA/usr/local/cuda \
-// RUN: | FileCheck -check-prefixes=CHECK,ARCH64,SM35,OPT0 %s
+// RUN: | FileCheck -check-prefixes=CHECK,ARCH64,SM35,OPT3 %s
 // Check that we still pass -c when generating relocatable device code.
// RUN: %clang -### -target x86_64-linux-gnu -fintegrated-as -fgpu-rdc -c %s 2>&1 \
 // RUN:   --offload-arch=sm_35 --cuda-path=%S/Inputs/CUDA/usr/local/cuda \
Index: clang/lib/Driver/ToolChains/Cuda.cpp
===
--- clang/lib/Driver/ToolChains/Cuda.cpp
+++ clang/lib/Driver/ToolChains/Cuda.cpp
@@ -409,9 +409,6 @@
 CmdArgs.push_back("--return-at-end");
   } else if (Arg *A = Args.getLastArg(options::OPT_O_Group)) {
 // Map the -O we received to -O{0,1,2,3}.
-//
-// TODO: Perhaps we should map host -O2 to ptxas -O3. -O3 is ptxas's
-// default, so it may correspond more closely to the spirit of clang -O2.
 
 // -O3 seems like the least-bad option when -Osomething is specified to
 // clang but it isn't handled below.
@@ -433,9 +430,9 @@
 }
 CmdArgs.push_back(Args.MakeArgString(llvm::Twine("-O") + OOpt));
   } else {
-// If no -O was passed, pass -O0 to ptxas -- no opt flag should correspond
-// to no optimizations, but ptxas's default is -O3.
-CmdArgs.push_back("-O0");
+// If no -O was passed, pass -O3 to ptxas -- this makes ptxas's
+// optimization level the same as the ptxjitcompiler.
+CmdArgs.push_back("-O3");
   }
   if (DIKind == DebugDirectivesOnly)
 CmdArgs.push_back("-lineinfo");


[PATCH] D116583: Change the default optimisation level of PTXAS from -O0 to -O3. This makes the optimisation levels of PTXAS and the ptxjitcompiler equal (ptxjitcompiler defaults to -O3).

2022-01-05 Thread Hugh Delaney via Phabricator via cfe-commits
hdelan added inline comments.



Comment at: clang/lib/Driver/ToolChains/Cuda.cpp:433
   } else {
-// If no -O was passed, pass -O0 to ptxas -- no opt flag should correspond
-// to no optimizations, but ptxas's default is -O3.
-CmdArgs.push_back("-O0");
+// If no -O was passed, pass -O3 to ptxas -- this makes ptxas's
+// optimization level the same as the ptxjitcompiler.

tra wrote:
> I think this would be contrary to the expectation that lack of `-O` in clang 
> means - `do not optimize` and it generally implies the whole compilation 
> chain, including assembler. Matching whatever nvidia tools do is an 
> insufficient reason for breaking this assumption, IMO. 
> 
> If you do want to run optimized ptxas on unoptimized PTX, you can use 
> `-Xcuda-ptxas -O3`.
I think for the average user, consistency across the `ptxjitcompiler` and 
`ptxas` is far more important than assuming that no `-O` means no optimization. 
I think most users will assume that with no `-O`, whatever tools being used 
will take their default optimization level, which in the case of clang is 
`-O0` and in the case of `ptxas` is `-O3`. 

We have had a few bugs with `ptxas`/`ptxjitcompiler` at higher optimization 
levels, which were quite hard to pin down since offline `ptxas` and 
`ptxjitcompiler` were using different optimisation levels, making bugs appear 
in one and not the other. Of course we are aware of this now but this 
inconsistency can result in bugs that are difficult to diagnose. Having 
consistency between the `ptxjitcompiler` and `ptxas` is therefore of practical 
benefit. Whereas if we are to leave it as is, with `ptxas` defaulting to `-O0`, 
the benefit is purely semantic and not practical.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D116583/new/

https://reviews.llvm.org/D116583
