[PATCH] D137154: Adding nvvm_reflect clang builtin
hdelan created this revision.
Herald added subscribers: mattd, gchakrabarti, asavonic, hiraditya.
Herald added a project: All.
hdelan requested review of this revision.
Herald added subscribers: llvm-commits, cfe-commits, jholewinski.
Herald added projects: clang, LLVM.

This patch adds __nvvm_reflect as a clang builtin for the NVPTX backend. This
means that __nvvm_reflect can be used in source code to enable conditional
compilation based on compute capability and FTZ properties.

Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D137154

Files:
  clang/include/clang/Basic/BuiltinsNVPTX.def
  clang/test/CodeGenCUDA/nvvm-reflect.cu
  llvm/include/llvm/IR/IntrinsicsNVVM.td
  llvm/lib/Target/NVPTX/NVVMReflect.cpp
  llvm/test/CodeGen/NVPTX/nvvm-reflect-opaque.ll
  llvm/test/CodeGen/NVPTX/nvvm-reflect.ll


Index: llvm/test/CodeGen/NVPTX/nvvm-reflect.ll
===
--- llvm/test/CodeGen/NVPTX/nvvm-reflect.ll
+++ llvm/test/CodeGen/NVPTX/nvvm-reflect.ll
@@ -41,7 +41,7 @@
   ret float %ret
 }

-declare i32 @llvm.nvvm.reflect.p0i8(i8*)
+declare i32 @llvm.nvvm.reflect(i8*)

 ; CHECK-LABEL: define i32 @intrinsic
 define i32 @intrinsic() {
@@ -49,7 +49,7 @@
 ; USE_FTZ_0: ret i32 0
 ; USE_FTZ_1: ret i32 1
   %ptr = tail call i8* @llvm.nvvm.ptr.constant.to.gen.p0i8.p4i8(i8 addrspace(4)* getelementptr inbounds ([11 x i8], [11 x i8] addrspace(4)* @str, i32 0, i32 0))
-  %reflect = tail call i32 @llvm.nvvm.reflect.p0i8(i8* %ptr)
+  %reflect = tail call i32 @llvm.nvvm.reflect(i8* %ptr)
   ret i32 %reflect
 }
Index: llvm/test/CodeGen/NVPTX/nvvm-reflect-opaque.ll
===
--- llvm/test/CodeGen/NVPTX/nvvm-reflect-opaque.ll
+++ llvm/test/CodeGen/NVPTX/nvvm-reflect-opaque.ll
@@ -41,7 +41,7 @@
   ret float %ret
 }

-declare i32 @llvm.nvvm.reflect.p0i8(ptr)
+declare i32 @llvm.nvvm.reflect(ptr)

 ; CHECK-LABEL: define i32 @intrinsic
 define i32 @intrinsic() {
@@ -49,7 +49,7 @@
 ; USE_FTZ_0: ret i32 0
 ; USE_FTZ_1: ret i32 1
   %ptr = tail call ptr @llvm.nvvm.ptr.constant.to.gen.p0i8.p4i8(ptr addrspace(4) @str)
-  %reflect = tail call i32 @llvm.nvvm.reflect.p0i8(ptr %ptr)
+  %reflect = tail call i32 @llvm.nvvm.reflect(ptr %ptr)
   ret i32 %reflect
 }
Index: llvm/lib/Target/NVPTX/NVVMReflect.cpp
===
--- llvm/lib/Target/NVPTX/NVVMReflect.cpp
+++ llvm/lib/Target/NVPTX/NVVMReflect.cpp
@@ -40,6 +40,7 @@
 #include
 #include
 #define NVVM_REFLECT_FUNCTION "__nvvm_reflect"
+#define NVVM_REFLECT_LLVM_INTRINSIC_NAME "llvm.nvvm.reflect"

 using namespace llvm;

@@ -119,6 +120,7 @@
       continue;
     Function *Callee = Call->getCalledFunction();
     if (!Callee || (Callee->getName() != NVVM_REFLECT_FUNCTION &&
+                    Callee->getName() != NVVM_REFLECT_LLVM_INTRINSIC_NAME &&
                     Callee->getIntrinsicID() != Intrinsic::nvvm_reflect))
       continue;
Index: llvm/include/llvm/IR/IntrinsicsNVVM.td
===
--- llvm/include/llvm/IR/IntrinsicsNVVM.td
+++ llvm/include/llvm/IR/IntrinsicsNVVM.td
@@ -1578,7 +1578,8 @@
               Intrinsic<[], [llvm_anyptr_ty], [], "llvm.nvvm.compiler.warn">;
 def int_nvvm_reflect :
-  Intrinsic<[llvm_i32_ty], [llvm_anyptr_ty], [IntrNoMem], "llvm.nvvm.reflect">;
+  Intrinsic<[llvm_i32_ty], [llvm_ptr_ty], [IntrNoMem], "llvm.nvvm.reflect">,
+  ClangBuiltin<"__nvvm_reflect">;

 // isspacep.{const, global, local, shared}
 def int_nvvm_isspacep_const
Index: clang/test/CodeGenCUDA/nvvm-reflect.cu
===
--- /dev/null
+++ clang/test/CodeGenCUDA/nvvm-reflect.cu
@@ -0,0 +1,81 @@
+// REQUIRES: nvptx-registered-target
+
+// Checking to see that __nvvm_reflect resolves to the correct llvm.nvvm.reflect
+// intrinsic
+// RUN: %clang_cc1 -fcuda-is-device -triple nvptx64-nvidia-cuda -disable-llvm-passes -S -emit-llvm -x c++ %s -o - | FileCheck %s --check-prefix=NO_NVVM_REFLECT_PASS
+
+// Prepare bitcode file to link with
+// RUN: %clang_cc1 -triple nvptx-unknown-cuda -emit-llvm-bc \
+// RUN:    -disable-llvm-passes -o %t.bc %s
+
+// Checking to see if the correct values are substituted for the nvvm_reflect
+// call when llvm passes are enabled.
+// RUN: %clang_cc1 -fcuda-is-device -triple nvptx64-nvidia-cuda -target-cpu \
+// RUN:    sm_50 -S -o /dev/null %s -mllvm -print-after-all 2>&1 \
+// RUN:    | FileCheck %s --check-prefix=ARCH_REFLECT_1
+// RUN: %clang_cc1 -fcuda-is-device -triple nvptx64-nvidia-cuda -target-cpu \
+// RUN:    sm_52 -S -o /dev/null %s -mllvm -print-after-all 2>&1 \
+// RUN:    | FileCheck %s --check-prefix=ARCH_REFLECT_2
+// RUN: %clang_cc1 -fcuda-is-device -triple nvptx64-nvidia-cuda -target-cpu \
+// RUN:    sm_53 -S -o /dev/null %s -mllvm -print-after-all 2>&1 \
+// RUN:    | FileCheck %s --check-prefix=ARCH_REFLECT_3
+// RUN: %clang_cc1 -fcuda-is-device -
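For context, the conditional-compilation pattern the builtin enables looks roughly like the following. This is a host-side C++ mock for illustration only: the real `__nvvm_reflect` is an NVPTX device builtin whose calls the NVVMReflect pass folds to constants, and the mock's return values and the `pick_impl` dispatch are invented assumptions, not code from the patch.

```cpp
#include <cstring>

// Mock of __nvvm_reflect. On the device, a call such as
// __nvvm_reflect("__CUDA_ARCH") is folded to a compile-time constant by the
// NVVMReflect pass; here we hard-code sample values (sm_70, FTZ disabled).
extern "C" int nvvm_reflect_mock(const char *query) {
  if (std::strcmp(query, "__CUDA_ARCH") == 0)
    return 700; // pretend we are compiling for sm_70
  if (std::strcmp(query, "__CUDA_FTZ") == 0)
    return 0;   // pretend flush-to-zero is disabled
  return 0;     // unknown queries fold to 0
}

// Libdevice-style dispatch: choose an implementation by compute capability.
const char *pick_impl() {
  return nvvm_reflect_mock("__CUDA_ARCH") >= 600 ? "native_fp64_atomics"
                                                 : "cas_loop_fallback";
}
```

Because the reflect call folds to a constant, the dead branch is eliminated, which is how one precompiled bitcode library can serve several architectures.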
[PATCH] D137154: Adding nvvm_reflect clang builtin
hdelan added inline comments.


================
Comment at: llvm/include/llvm/IR/IntrinsicsNVVM.td:1581
 def int_nvvm_reflect :
-  Intrinsic<[llvm_i32_ty], [llvm_anyptr_ty], [IntrNoMem], "llvm.nvvm.reflect">;
+  Intrinsic<[llvm_i32_ty], [llvm_ptr_ty], [IntrNoMem], "llvm.nvvm.reflect">,
+  ClangBuiltin<"__nvvm_reflect">;
----------------
asavonic wrote:
> Is there a reason to change the intrinsic?

The clang builtin doesn't link up properly with the llvm intrinsic if
`llvm_anyptr_ty` is used. The `llvm_anyptr_ty` makes clang look for a generic
`arg` parameter instead of a `Pointer`, which results in a segfault. Changing
to `llvm_ptr_ty` fixes the issue.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D137154/new/

https://reviews.llvm.org/D137154

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[PATCH] D137154: Adding nvvm_reflect clang builtin
hdelan updated this revision to Diff 472310.
hdelan added a comment.

Removing redundant check


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D137154/new/

https://reviews.llvm.org/D137154

Files:
  llvm/lib/Target/NVPTX/NVVMReflect.cpp


Index: llvm/lib/Target/NVPTX/NVVMReflect.cpp
===
--- llvm/lib/Target/NVPTX/NVVMReflect.cpp
+++ llvm/lib/Target/NVPTX/NVVMReflect.cpp
@@ -40,7 +40,6 @@
 #include
 #include
 #define NVVM_REFLECT_FUNCTION "__nvvm_reflect"
-#define NVVM_REFLECT_LLVM_INTRINSIC_NAME "llvm.nvvm.reflect"

 using namespace llvm;

@@ -120,7 +119,6 @@
       continue;
     Function *Callee = Call->getCalledFunction();
     if (!Callee || (Callee->getName() != NVVM_REFLECT_FUNCTION &&
-                    Callee->getName() != NVVM_REFLECT_LLVM_INTRINSIC_NAME &&
                     Callee->getIntrinsicID() != Intrinsic::nvvm_reflect))
       continue;
[PATCH] D137154: Adding nvvm_reflect clang builtin
hdelan updated this revision to Diff 472311.
hdelan added a comment.

Removing redundant check


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D137154/new/

https://reviews.llvm.org/D137154

Files:
  clang/include/clang/Basic/BuiltinsNVPTX.def
  clang/test/CodeGenCUDA/nvvm-reflect.cu
  llvm/include/llvm/IR/IntrinsicsNVVM.td
  llvm/test/CodeGen/NVPTX/nvvm-reflect-opaque.ll
  llvm/test/CodeGen/NVPTX/nvvm-reflect.ll


Index: llvm/test/CodeGen/NVPTX/nvvm-reflect.ll
===
--- llvm/test/CodeGen/NVPTX/nvvm-reflect.ll
+++ llvm/test/CodeGen/NVPTX/nvvm-reflect.ll
@@ -41,7 +41,7 @@
   ret float %ret
 }

-declare i32 @llvm.nvvm.reflect.p0i8(i8*)
+declare i32 @llvm.nvvm.reflect(i8*)

 ; CHECK-LABEL: define i32 @intrinsic
 define i32 @intrinsic() {
@@ -49,7 +49,7 @@
 ; USE_FTZ_0: ret i32 0
 ; USE_FTZ_1: ret i32 1
   %ptr = tail call i8* @llvm.nvvm.ptr.constant.to.gen.p0i8.p4i8(i8 addrspace(4)* getelementptr inbounds ([11 x i8], [11 x i8] addrspace(4)* @str, i32 0, i32 0))
-  %reflect = tail call i32 @llvm.nvvm.reflect.p0i8(i8* %ptr)
+  %reflect = tail call i32 @llvm.nvvm.reflect(i8* %ptr)
   ret i32 %reflect
 }
Index: llvm/test/CodeGen/NVPTX/nvvm-reflect-opaque.ll
===
--- llvm/test/CodeGen/NVPTX/nvvm-reflect-opaque.ll
+++ llvm/test/CodeGen/NVPTX/nvvm-reflect-opaque.ll
@@ -41,7 +41,7 @@
   ret float %ret
 }

-declare i32 @llvm.nvvm.reflect.p0i8(ptr)
+declare i32 @llvm.nvvm.reflect(ptr)

 ; CHECK-LABEL: define i32 @intrinsic
 define i32 @intrinsic() {
@@ -49,7 +49,7 @@
 ; USE_FTZ_0: ret i32 0
 ; USE_FTZ_1: ret i32 1
   %ptr = tail call ptr @llvm.nvvm.ptr.constant.to.gen.p0i8.p4i8(ptr addrspace(4) @str)
-  %reflect = tail call i32 @llvm.nvvm.reflect.p0i8(ptr %ptr)
+  %reflect = tail call i32 @llvm.nvvm.reflect(ptr %ptr)
   ret i32 %reflect
 }
Index: llvm/include/llvm/IR/IntrinsicsNVVM.td
===
--- llvm/include/llvm/IR/IntrinsicsNVVM.td
+++ llvm/include/llvm/IR/IntrinsicsNVVM.td
@@ -1578,7 +1578,8 @@
               Intrinsic<[], [llvm_anyptr_ty], [], "llvm.nvvm.compiler.warn">;
 def int_nvvm_reflect :
-  Intrinsic<[llvm_i32_ty], [llvm_anyptr_ty], [IntrNoMem], "llvm.nvvm.reflect">;
+  Intrinsic<[llvm_i32_ty], [llvm_ptr_ty], [IntrNoMem], "llvm.nvvm.reflect">,
+  ClangBuiltin<"__nvvm_reflect">;

 // isspacep.{const, global, local, shared}
 def int_nvvm_isspacep_const
Index: clang/test/CodeGenCUDA/nvvm-reflect.cu
===
--- /dev/null
+++ clang/test/CodeGenCUDA/nvvm-reflect.cu
@@ -0,0 +1,81 @@
+// REQUIRES: nvptx-registered-target
+
+// Checking to see that __nvvm_reflect resolves to the correct llvm.nvvm.reflect
+// intrinsic
+// RUN: %clang_cc1 -fcuda-is-device -triple nvptx64-nvidia-cuda -disable-llvm-passes -S -emit-llvm -x c++ %s -o - | FileCheck %s --check-prefix=NO_NVVM_REFLECT_PASS
+
+// Prepare bitcode file to link with
+// RUN: %clang_cc1 -triple nvptx-unknown-cuda -emit-llvm-bc \
+// RUN:    -disable-llvm-passes -o %t.bc %s
+
+// Checking to see if the correct values are substituted for the nvvm_reflect
+// call when llvm passes are enabled.
+// RUN: %clang_cc1 -fcuda-is-device -triple nvptx64-nvidia-cuda -target-cpu \
+// RUN:    sm_50 -S -o /dev/null %s -mllvm -print-after-all 2>&1 \
+// RUN:    | FileCheck %s --check-prefix=ARCH_REFLECT_1
+// RUN: %clang_cc1 -fcuda-is-device -triple nvptx64-nvidia-cuda -target-cpu \
+// RUN:    sm_52 -S -o /dev/null %s -mllvm -print-after-all 2>&1 \
+// RUN:    | FileCheck %s --check-prefix=ARCH_REFLECT_2
+// RUN: %clang_cc1 -fcuda-is-device -triple nvptx64-nvidia-cuda -target-cpu \
+// RUN:    sm_53 -S -o /dev/null %s -mllvm -print-after-all 2>&1 \
+// RUN:    | FileCheck %s --check-prefix=ARCH_REFLECT_3
+// RUN: %clang_cc1 -fcuda-is-device -triple nvptx64-nvidia-cuda -target-cpu \
+// RUN:    sm_60 -S -o /dev/null %s -mllvm -print-after-all 2>&1 \
+// RUN:    | FileCheck %s --check-prefix=ARCH_REFLECT_4
+// RUN: %clang_cc1 -fcuda-is-device -triple nvptx64-nvidia-cuda -target-cpu \
+// RUN:    sm_61 -S -o /dev/null %s -mllvm -print-after-all 2>&1 \
+// RUN:    | FileCheck %s --check-prefix=ARCH_REFLECT_5
+// RUN: %clang_cc1 -fcuda-is-device -triple nvptx64-nvidia-cuda -target-cpu \
+// RUN:    sm_62 -S -o /dev/null %s -mllvm -print-after-all 2>&1 \
+// RUN:    | FileCheck %s --check-prefix=ARCH_REFLECT_6
+// RUN: %clang_cc1 -fcuda-is-device -triple nvptx64-nvidia-cuda -target-cpu \
+// RUN:    sm_70 -S -o /dev/null %s -mllvm -print-after-all 2>&1 \
+// RUN:    | FileCheck %s --check-prefix=ARCH_REFLECT_7
+// RUN: %clang_cc1 -fcuda-is-device -triple nvptx64-nvidia-cuda -target-cpu \
+// RUN:    sm_72 -S -o /dev/null %s -mllvm -print-after-all 2>&1 \
+// RUN:    | FileCheck %s --check-prefix=ARCH_REFLECT_8
+// RUN: %clang_cc1 -fcuda-is-device -triple nvptx64-nvidia-cuda -target-cpu \
+// RUN:
[PATCH] D137154: Adding nvvm_reflect clang builtin
hdelan marked an inline comment as not done.
hdelan added a comment.

In DPC++ for CUDA we use libclc as a wrapper around the CUDA SDK's libdevice.
Like libdevice, we want to precompile libclc to bitcode for the CUDA backend
without specializing for a particular arch, so that we can call different __nv
funcs based on the arch. For this reason we use the `__nvvm_reflect` llvm
intrinsic.

Without a clang builtin, a number of workarounds are necessary to get the
desired behaviour from `__nvvm_reflect`:

1. We must declare `__nvvm_reflect` in source code everywhere it is used.
2. Some addrspace casting is necessary in OpenCL, since strings are in the
   __constant address space. We must use an IR wrapper function to ensure the
   correct behaviour. See
   https://github.com/intel/llvm/blob/sycl/libclc/ptx-nvidiacl/libspirv/reflect.ll

We agree that `__nvvm_reflect` is not a perfect solution, but it is something
we must use because of libdevice. Adding the clang builtin would enable us to
write libdevice-like libraries without having to resort to the methods
mentioned above.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D137154/new/

https://reviews.llvm.org/D137154
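For reference, the substitution the NVVMReflect pass performs can be modeled as a simple lookup. This is a rough sketch rather than the pass itself — the real pass rewrites IR call sites — and the fallback value and query set here are assumptions based on the behaviour described in this thread (`__CUDA_ARCH` reported as the SM version times ten, `__CUDA_FTZ` following the module's FTZ setting):

```cpp
#include <string>

// Rough model of the constant NVVMReflect folds a reflect call to.
// sm_version is, e.g., 50 for sm_50, so "__CUDA_ARCH" becomes 500.
int reflect_value(const std::string &query, unsigned sm_version, bool ftz) {
  if (query == "__CUDA_ARCH")
    return sm_version * 10; // e.g. sm_50 -> 500, sm_70 -> 700
  if (query == "__CUDA_FTZ")
    return ftz ? 1 : 0;     // module's flush-to-zero setting
  return 0;                 // unrecognized queries fold to 0
}
```

This is what the ARCH_REFLECT_N FileCheck prefixes in the test file are probing: each -target-cpu value should fold the reflect call to a different constant.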
[PATCH] D137154: Adding nvvm_reflect clang builtin
hdelan added inline comments.


================
Comment at: clang/include/clang/Basic/BuiltinsNVPTX.def:827
+BUILTIN(__nvvm_reflect, "icC*", "nc")
+
----------------
tra wrote:
> Do we actually need this patch at all?
>
> `extern "C" int __nvvm_reflect(const char *);` appears to work just fine
> already: https://godbolt.org/z/caGenxn1e

It does work fine, unless we are using OpenCL, which requires the addrspace
casting as mentioned above.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D137154/new/

https://reviews.llvm.org/D137154
[PATCH] D137154: Adding nvvm_reflect clang builtin
hdelan added a comment.

Thanks for the feedback. Instead of adding `__nvvm_reflect` as a clang builtin,
would it be acceptable if I modified the NVVMReflect pass so that it works with
addrspace casting as well? This would allow us to use `__nvvm_reflect` in
OpenCL.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D137154/new/

https://reviews.llvm.org/D137154
[PATCH] D116583: Change the default optimisation level of PTXAS from -O0 to -O3. This makes the optimisation levels of PTXAS and the ptxjitcompiler equal (ptxjitcompiler defaults to -O3).
hdelan added inline comments.


================
Comment at: clang/lib/Driver/ToolChains/Cuda.cpp:433
   } else {
-    // If no -O was passed, pass -O0 to ptxas -- no opt flag should correspond
-    // to no optimizations, but ptxas's default is -O3.
-    CmdArgs.push_back("-O0");
+    // If no -O was passed, pass -O3 to ptxas -- this makes ptxas's
+    // optimization level the same as the ptxjitcompiler.
----------------
tra wrote:
> hdelan wrote:
> > tra wrote:
> > > I think this would be contrary to the expectation that lack of `-O` in
> > > clang means `do not optimize`, and it generally implies the whole
> > > compilation chain, including the assembler. Matching whatever nvidia
> > > tools do is an insufficient reason for breaking this assumption, IMO.
> > >
> > > If you do want to run optimized ptxas on unoptimized PTX, you can use
> > > `-Xcuda-ptxas -O3`.
> > I think for the average user, consistency across the `ptxjitcompiler` and
> > `ptxas` is far more important than assuming that no `-O` means no
> > optimization. I think most users will assume that, with no `-O`, whatever
> > tool is being used will take its default optimization level, which in the
> > case of clang is `-O0` and in the case of `ptxas` is `-O3`.
> >
> > We have had a few bugs with `ptxas`/`ptxjitcompiler` at higher
> > optimization levels, which were quite hard to pin down since offline
> > `ptxas` and `ptxjitcompiler` were using different optimization levels,
> > making bugs appear in one and not the other. Of course we are aware of
> > this now, but this inconsistency can result in bugs that are difficult to
> > diagnose. Having consistency between the `ptxjitcompiler` and `ptxas` is
> > therefore of practical benefit, whereas if we are to leave it as is, with
> > `ptxas` defaulting to `-O0`, the benefit is purely semantic and not
> > practical.
> The default is intended to provide the least amount of surprises for most
> users. There are more users of clang as a CUDA compiler than users who care
> about consistency with the ptxjitcompiler. My point is that the improvements
> for a subset of users should be balanced vs usability in the common case. In
> this case the benefit does not justify the downsides, IMO.
>
> Please add me as a reviewer when the patch is ready for public review and
> we'll discuss it in a wider LLVM community.

We have come to the same conclusion that it is best to leave this unchanged
upstream. However, this change has been made locally in `intel/llvm`.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D116583/new/

https://reviews.llvm.org/D116583
[PATCH] D116583: Change the default optimisation level of PTXAS from -O0 to -O3. This makes the optimisation levels of PTXAS and the ptxjitcompiler equal (ptxjitcompiler defaults to -O3).
hdelan abandoned this revision.
hdelan added a comment.

Closing revision


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D116583/new/

https://reviews.llvm.org/D116583
[PATCH] D116583: Change the default optimisation level of PTXAS from -O0 to -O3. This makes the optimisation levels of PTXAS and the ptxjitcompiler equal (ptxjitcompiler defaults to -O3).
hdelan created this revision.
Herald added a subscriber: asavonic.
hdelan requested review of this revision.
Herald added a project: clang.
Herald added a subscriber: cfe-commits.

Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D116583

Files:
  clang/lib/Driver/ToolChains/Cuda.cpp
  clang/test/Driver/cuda-external-tools.cu


Index: clang/test/Driver/cuda-external-tools.cu
===
--- clang/test/Driver/cuda-external-tools.cu
+++ clang/test/Driver/cuda-external-tools.cu
@@ -40,10 +40,10 @@
 // RUN:   --no-cuda-noopt-device-debug -O2 -c %s 2>&1 \
 // RUN: | FileCheck -check-prefixes=CHECK,ARCH64,SM35,OPT2 %s

-// Regular compile without -O. This should result in us passing -O0 to ptxas.
+// Regular compile without -O. This should result in us passing -O3 to ptxas.
 // RUN: %clang -### -target x86_64-linux-gnu -c %s 2>&1 \
 // RUN:   --offload-arch=sm_35 --cuda-path=%S/Inputs/CUDA/usr/local/cuda \
-// RUN: | FileCheck -check-prefixes=CHECK,ARCH64,SM35,OPT0 %s
+// RUN: | FileCheck -check-prefixes=CHECK,ARCH64,SM35,OPT3 %s

 // Regular compiles with -Os and -Oz. For lack of a better option, we map
 // these to ptxas -O3.
@@ -75,7 +75,7 @@
 // Compile with -fintegrated-as. This should still cause us to invoke ptxas.
 // RUN: %clang -### -target x86_64-linux-gnu -fintegrated-as -c %s 2>&1 \
 // RUN:   --offload-arch=sm_35 --cuda-path=%S/Inputs/CUDA/usr/local/cuda \
-// RUN: | FileCheck -check-prefixes=CHECK,ARCH64,SM35,OPT0 %s
+// RUN: | FileCheck -check-prefixes=CHECK,ARCH64,SM35,OPT3 %s

 // Check that we still pass -c when generating relocatable device code.
 // RUN: %clang -### -target x86_64-linux-gnu -fintegrated-as -fgpu-rdc -c %s 2>&1 \
 // RUN:   --offload-arch=sm_35 --cuda-path=%S/Inputs/CUDA/usr/local/cuda \
Index: clang/lib/Driver/ToolChains/Cuda.cpp
===
--- clang/lib/Driver/ToolChains/Cuda.cpp
+++ clang/lib/Driver/ToolChains/Cuda.cpp
@@ -409,9 +409,6 @@
       CmdArgs.push_back("--return-at-end");
   } else if (Arg *A = Args.getLastArg(options::OPT_O_Group)) {
     // Map the -O we received to -O{0,1,2,3}.
-    //
-    // TODO: Perhaps we should map host -O2 to ptxas -O3. -O3 is ptxas's
-    // default, so it may correspond more closely to the spirit of clang -O2.
     // -O3 seems like the least-bad option when -Osomething is specified to
     // clang but it isn't handled below.
@@ -433,9 +430,9 @@
     }
     CmdArgs.push_back(Args.MakeArgString(llvm::Twine("-O") + OOpt));
   } else {
-    // If no -O was passed, pass -O0 to ptxas -- no opt flag should correspond
-    // to no optimizations, but ptxas's default is -O3.
-    CmdArgs.push_back("-O0");
+    // If no -O was passed, pass -O3 to ptxas -- this makes ptxas's
+    // optimization level the same as the ptxjitcompiler.
+    CmdArgs.push_back("-O3");
   }
   if (DIKind == DebugDirectivesOnly)
     CmdArgs.push_back("-lineinfo");
[PATCH] D116583: Change the default optimisation level of PTXAS from -O0 to -O3. This makes the optimisation levels of PTXAS and the ptxjitcompiler equal (ptxjitcompiler defaults to -O3).
hdelan added inline comments.


================
Comment at: clang/lib/Driver/ToolChains/Cuda.cpp:433
   } else {
-    // If no -O was passed, pass -O0 to ptxas -- no opt flag should correspond
-    // to no optimizations, but ptxas's default is -O3.
-    CmdArgs.push_back("-O0");
+    // If no -O was passed, pass -O3 to ptxas -- this makes ptxas's
+    // optimization level the same as the ptxjitcompiler.
----------------
tra wrote:
> I think this would be contrary to the expectation that lack of `-O` in clang
> means `do not optimize`, and it generally implies the whole compilation
> chain, including the assembler. Matching whatever nvidia tools do is an
> insufficient reason for breaking this assumption, IMO.
>
> If you do want to run optimized ptxas on unoptimized PTX, you can use
> `-Xcuda-ptxas -O3`.

I think for the average user, consistency across the `ptxjitcompiler` and
`ptxas` is far more important than assuming that no `-O` means no optimization.
I think most users will assume that, with no `-O`, whatever tool is being used
will take its default optimization level, which in the case of clang is `-O0`
and in the case of `ptxas` is `-O3`.

We have had a few bugs with `ptxas`/`ptxjitcompiler` at higher optimization
levels, which were quite hard to pin down since offline `ptxas` and
`ptxjitcompiler` were using different optimization levels, making bugs appear
in one and not the other. Of course we are aware of this now, but this
inconsistency can result in bugs that are difficult to diagnose. Having
consistency between the `ptxjitcompiler` and `ptxas` is therefore of practical
benefit, whereas if we are to leave it as is, with `ptxas` defaulting to
`-O0`, the benefit is purely semantic and not practical.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D116583/new/

https://reviews.llvm.org/D116583
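The -O plumbing under discussion can be summarized as a small mapping function. This is a simplified sketch of the logic in clang/lib/Driver/ToolChains/Cuda.cpp as described in this thread, not the actual driver code; the function name and parameters are invented for illustration.

```cpp
#include <string>

// Sketch: which -O flag the CUDA toolchain passes to ptxas, given clang's -O.
// has_o_arg: whether any -O option was passed; o_char: its value ('0'..'3',
// 's', 'z'). Upstream keeps -O0 as the no-flag default; this revision
// proposed -O3 (matching the ptxjitcompiler) before being abandoned.
std::string ptxas_opt_level(bool has_o_arg, char o_char) {
  if (!has_o_arg)
    return "-O0"; // no -O in clang -> no optimizations in ptxas (upstream)
  switch (o_char) {
  case '0': case '1': case '2': case '3':
    return std::string("-O") + o_char; // direct mapping
  case 's': case 'z':
    return "-O3"; // -Os/-Oz map to ptxas -O3 for lack of a better option
  default:
    return "-O3"; // least-bad option for other -O spellings
  }
}
```

As tra notes above, a user who wants optimized ptxas output without changing the default can pass the flag explicitly with `-Xcuda-ptxas -O3`.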